# Cancer@Home v2 - Project Summary

## Project Overview

Cancer@Home v2 is a distributed computing platform for cancer genomics research that integrates five components:

1. **Distributed Computing (BOINC)** - Submit and manage computationally intensive cancer research tasks
2. **Cancer Data Portal (GDC)** - Access and download cancer genomics datasets from TCGA and TARGET
3. **Graph Database (Neo4j)** - Model complex relationships between genes, mutations, patients, and cancer types
4. **Bioinformatics Pipeline** - Process FASTQ files, run BLAST searches, and call genetic variants
5. **Interactive Dashboard** - Web-based GUI with real-time visualizations and data exploration
## Project Structure

```
CancerAtHome2/
├── backend/
│   ├── api/
│   │   └── main.py                 # FastAPI application with REST & GraphQL
│   ├── boinc/
│   │   └── client.py               # BOINC distributed computing client
│   ├── gdc/
│   │   └── client.py               # GDC Portal API integration
│   ├── neo4j/
│   │   ├── db_manager.py           # Neo4j database operations
│   │   ├── graphql_schema.py       # GraphQL schema definitions
│   │   └── data_importer.py        # Sample data initialization
│   └── pipeline/
│       ├── fastq_processor.py      # FASTQ quality control
│       ├── blast_runner.py         # BLAST sequence alignment
│       └── variant_caller.py       # Genetic variant identification
├── frontend/
│   └── index.html                  # Interactive web dashboard
├── config.yml                      # Configuration file
├── docker-compose.yml              # Neo4j container setup
├── requirements.txt                # Python dependencies
├── run.py                          # Main application launcher
├── setup.ps1                       # Windows setup script
├── setup.sh                        # Linux/Mac setup script
├── README.md                       # Comprehensive documentation
├── QUICKSTART.md                   # Quick start guide
├── USER_GUIDE.md                   # Detailed user guide
├── GRAPHQL_EXAMPLES.md             # GraphQL query examples
└── LICENSE                         # MIT License
```
## Key Features Implemented

### 1. Web Dashboard

- **Modern UI**: Clean, gradient-based design with responsive layout
- **5 Main Tabs**: Dashboard, Neo4j Visualization, BOINC Tasks, GDC Data, Pipeline
- **Real-time Statistics**: Live data from Neo4j showing genes, mutations, patients
- **Interactive Charts**: Chart.js visualizations for mutation distributions
- **D3.js Graph**: Interactive network visualization of cancer genomics relationships
### 2. Neo4j Graph Database

- **Node Types**: Gene, Mutation, Patient, CancerType
- **Relationships**:
  - Gene → AFFECTS → Mutation
  - Patient → HAS_MUTATION → Mutation
  - Patient → DIAGNOSED_WITH → CancerType
- **Sample Data**: Pre-loaded with 7 genes, 5 mutations, 5 patients, 4 cancer types
- **Optimized**: Constraints and indexes for fast queries
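The graph model above could be expressed in Cypher roughly as follows. This is a minimal sketch, not the project's actual schema: the property names (`symbol`, `mutation_id`, `patient_id`, `code`), the constraint names, and the sample identifiers are assumptions made for illustration. Only the node labels and relationship types come from the list above.

```python
# Sketch of the graph model as Cypher statements held in Python strings.
# Property and constraint names are hypothetical; labels and relationship
# types follow the model described above.

CONSTRAINTS = [
    "CREATE CONSTRAINT gene_symbol IF NOT EXISTS "
    "FOR (g:Gene) REQUIRE g.symbol IS UNIQUE",
    "CREATE CONSTRAINT mutation_id IF NOT EXISTS "
    "FOR (m:Mutation) REQUIRE m.mutation_id IS UNIQUE",
    "CREATE CONSTRAINT patient_id IF NOT EXISTS "
    "FOR (p:Patient) REQUIRE p.patient_id IS UNIQUE",
    "CREATE CONSTRAINT cancer_code IF NOT EXISTS "
    "FOR (c:CancerType) REQUIRE c.code IS UNIQUE",
]

# MERGE keeps the statement idempotent: re-running it does not duplicate
# nodes or relationships.
LINK_PATIENT_MUTATION = """
MERGE (g:Gene {symbol: $symbol})
MERGE (m:Mutation {mutation_id: $mutation_id})
MERGE (p:Patient {patient_id: $patient_id})
MERGE (c:CancerType {code: $cancer_code})
MERGE (g)-[:AFFECTS]->(m)
MERGE (p)-[:HAS_MUTATION]->(m)
MERGE (p)-[:DIAGNOSED_WITH]->(c)
"""


def link_params(symbol, mutation_id, patient_id, cancer_code):
    """Build the parameter dictionary expected by LINK_PATIENT_MUTATION."""
    return {
        "symbol": symbol,
        "mutation_id": mutation_id,
        "patient_id": patient_id,
        "cancer_code": cancer_code,
    }


if __name__ == "__main__":
    # Hypothetical identifiers, shown only to illustrate the parameter shape.
    print(link_params("TP53", "MUT001", "PATIENT-001", "BRCA"))
```

With the official Neo4j Python driver, each statement and its parameters could be passed to `session.run(query, parameters)`. Parameterized queries avoid string interpolation of user-supplied values.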
### 3. GraphQL API

- **Flexible Queries**: Get genes, mutations, patients, cancer types
- **Filtering**: Query by gene symbol, chromosome, project ID, cancer type
- **Aggregations**: Mutation frequency, cancer statistics
- **Playground**: Interactive GraphQL explorer at `/graphql`
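A GraphQL request is an HTTP POST with a JSON body containing `query` and `variables`. The sketch below shows one way to send such a request using only the standard library. The query itself is hypothetical: the `genes` field, its `chromosome` argument, and the `symbol` and `name` fields are assumed names, so check GRAPHQL_EXAMPLES.md or the explorer for the real schema. The port is taken from the Getting Started section.

```python
import json
import urllib.request

# Assumed local address; the /graphql path is the one listed above.
GRAPHQL_URL = "http://localhost:5000/graphql"

# Hypothetical query: field and argument names are illustrative only.
GENES_BY_CHROMOSOME = """
query GenesByChromosome($chromosome: String!) {
  genes(chromosome: $chromosome) {
    symbol
    name
  }
}
"""


def build_payload(query, variables=None):
    """Encode a GraphQL query and its variables as a JSON request body."""
    body = {"query": query, "variables": variables or {}}
    return json.dumps(body).encode("utf-8")


def post_graphql(query, variables=None, url=GRAPHQL_URL, timeout=10):
    """POST the query to the GraphQL endpoint and return the decoded JSON."""
    request = urllib.request.Request(
        url,
        data=build_payload(query, variables),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `post_graphql(GENES_BY_CHROMOSOME, {"chromosome": "17"})` would return the standard GraphQL envelope, with results under `data` and any problems under `errors`.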
### 4. REST API Endpoints

- `/api/health` - System health check
- `/api/neo4j/summary` - Database statistics
- `/api/neo4j/genes/{symbol}` - Gene information
- `/api/boinc/tasks` - List BOINC tasks
- `/api/boinc/submit` - Submit new task
- `/api/boinc/statistics` - Task statistics
- `/api/gdc/projects` - Available cancer projects
- `/api/gdc/files/{project_id}` - Search GDC files
- `/api/gdc/download` - Download GDC data
- `/api/pipeline/*` - Bioinformatics pipeline endpoints
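The endpoints above can be called from any HTTP client. As a minimal sketch, the helpers below build URLs for the templated routes and fetch JSON with the standard library. The base URL assumes the local port from the Getting Started section, and the helper names are hypothetical. The response shapes and HTTP methods are not documented here, so `get_json` only covers routes that answer a plain GET.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:5000"  # assumed local address


def api_url(path, base_url=BASE_URL):
    """Join the base URL and an endpoint path with exactly one slash."""
    return base_url.rstrip("/") + "/" + path.lstrip("/")


def gene_url(symbol, base_url=BASE_URL):
    """URL for /api/neo4j/genes/{symbol}, with the symbol percent-encoded."""
    return api_url("/api/neo4j/genes/" + urllib.parse.quote(symbol, safe=""), base_url)


def gdc_files_url(project_id, base_url=BASE_URL):
    """URL for /api/gdc/files/{project_id}, with the ID percent-encoded."""
    return api_url("/api/gdc/files/" + urllib.parse.quote(project_id, safe=""), base_url)


def get_json(url, timeout=10):
    """Issue a GET request and decode the JSON response body."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

For example, `get_json(api_url("/api/health"))` could serve as a quick check that the server is up before calling the other routes.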
### 5. BOINC Integration

- **Task Submission**: Support for variant calling, BLAST, alignment tasks
- **Status Tracking**: Monitor pending, running, completed, failed tasks
- **Statistics**: Total tasks, completion rates, average times
- **Task Manager**: High-level interface for common workflows
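The statistics listed above (totals, completion rate, average time) can be derived from a plain list of task records. The sketch below is one possible shape, not the code in `backend/boinc/client.py`: the `Task` fields and the `summarize` function are hypothetical, and only the four status names come from the list above.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, Optional

STATUSES = ("pending", "running", "completed", "failed")


@dataclass
class Task:
    """Hypothetical task record; field names are illustrative."""
    task_id: str
    task_type: str  # e.g. "variant_calling", "blast", "alignment"
    status: str
    runtime_seconds: Optional[float] = None


def summarize(tasks: Iterable[Task]) -> dict:
    """Compute totals, per-status counts, completion rate, and mean runtime."""
    tasks = list(tasks)
    counts = Counter(task.status for task in tasks)
    # Only completed tasks with a recorded runtime contribute to the average.
    timed = [
        task.runtime_seconds
        for task in tasks
        if task.status == "completed" and task.runtime_seconds is not None
    ]
    total = len(tasks)
    return {
        "total": total,
        "by_status": {status: counts.get(status, 0) for status in STATUSES},
        "completion_rate": counts.get("completed", 0) / total if total else 0.0,
        "average_runtime_seconds": sum(timed) / len(timed) if timed else None,
    }
```

Here the completion rate is completed tasks divided by all tasks, so pending and running tasks lower it. Dividing by finished tasks only (completed plus failed) is an equally reasonable definition.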
### 6. GDC Data Integration

- **Search API**: Query files by project, data type, experimental strategy
- **Download**: Retrieve cancer genomics datasets
- **Projects Supported**: TCGA-BRCA, TCGA-LUAD, TCGA-COAD, TCGA-GBM, TARGET-AML
- **Parsers**: MAF, VCF, and clinical data parsing utilities
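File searches against the public GDC API are driven by a JSON `filters` expression sent to the `files` endpoint. The sketch below builds such a filter for the three criteria listed above. It illustrates the GDC filter format rather than the code in `backend/gdc/client.py`, and the function names and the selection of returned `fields` are assumptions.

```python
import json
import urllib.parse

GDC_FILES_ENDPOINT = "https://api.gdc.cancer.gov/files"


def _in_clause(field, value):
    """One GDC filter clause matching a field against a list of values."""
    return {"op": "in", "content": {"field": field, "value": [value]}}


def build_file_filter(project_id, data_type=None, experimental_strategy=None):
    """Combine the project, data type, and strategy criteria with 'and'."""
    clauses = [_in_clause("cases.project.project_id", project_id)]
    if data_type:
        clauses.append(_in_clause("data_type", data_type))
    if experimental_strategy:
        clauses.append(_in_clause("experimental_strategy", experimental_strategy))
    return {"op": "and", "content": clauses}


def build_search_url(project_id, data_type=None, experimental_strategy=None, size=10):
    """Build a GET URL for the GDC files endpoint with URL-encoded filters."""
    params = {
        "filters": json.dumps(
            build_file_filter(project_id, data_type, experimental_strategy)
        ),
        "fields": "file_id,file_name,data_type,experimental_strategy",
        "format": "JSON",
        "size": str(size),
    }
    return GDC_FILES_ENDPOINT + "?" + urllib.parse.urlencode(params)
```

For example, `build_search_url("TCGA-BRCA", data_type="Masked Somatic Mutation", experimental_strategy="WXS")` yields a URL that can be fetched with any HTTP client. Controlled-access files additionally require an authentication token, which this sketch does not cover.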
### 7. Bioinformatics Pipeline

- **FASTQ Processing**:
  - Quality filtering
  - Adapter trimming
  - Statistics calculation
  - Quality control reports
- **BLAST Integration**:
  - BLASTN and BLASTP support
  - XML output parsing
  - Hit filtering by identity/e-value
- **Variant Calling**:
  - VCF generation
  - Quality filtering
  - Variant annotation
  - Cancer variant identification
  - Tumor mutation burden calculation
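Two of the steps above reduce to short calculations: FASTQ quality filtering by mean Phred score, and tumor mutation burden as mutations per megabase. The sketch below uses only the standard library, so it stands apart from the Biopython-based modules in `backend/pipeline/`. The function names and the default threshold of 20 are assumptions, and Phred+33 encoding is assumed for the quality strings.

```python
from typing import Iterable, Iterator, List, Tuple

FastqRecord = Tuple[str, str, str]  # (read id, sequence, quality string)


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield records from four-line FASTQ entries, skipping blank lines."""
    stripped = [line.rstrip("\n") for line in lines if line.strip()]
    for start in range(0, len(stripped) - 3, 4):
        header, sequence, _separator, quality = stripped[start:start + 4]
        yield header[1:], sequence, quality  # drop the leading "@"


def mean_phred(quality: str, offset: int = 33) -> float:
    """Mean Phred score of a quality string (Phred+33 by default)."""
    if not quality:
        return 0.0
    return sum(ord(char) - offset for char in quality) / len(quality)


def filter_by_quality(
    records: Iterable[FastqRecord], min_mean_quality: float = 20.0
) -> List[FastqRecord]:
    """Keep reads whose mean Phred score meets the threshold."""
    return [r for r in records if mean_phred(r[2]) >= min_mean_quality]


def tumor_mutation_burden(variant_count: int, covered_bases: int) -> float:
    """Mutations per megabase of sequence covered."""
    if covered_bases <= 0:
        raise ValueError("covered_bases must be positive")
    return variant_count / (covered_bases / 1_000_000)
```

As a worked example, 150 somatic variants over 30,000,000 covered bases gives 150 / 30 = 5 mutations per megabase. Which variants count toward the numerator (for instance, non-synonymous only) depends on the study's definition and is left to the caller here.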
## Technology Stack

- **Backend**: FastAPI (Python 3.8+)
- **Database**: Neo4j 5.13 (Graph Database)
- **API**: GraphQL (Strawberry), REST
- **Frontend**: HTML5, CSS3, JavaScript
- **Visualization**: D3.js, Chart.js
- **Bioinformatics**: Biopython
- **Data Source**: GDC Portal API
- **Containerization**: Docker, Docker Compose
- **Distributed Computing**: BOINC framework
## Sample Data Included

### Genes (7)

- TP53 (Tumor protein p53)
- BRAF (B-Raf proto-oncogene)
- BRCA1, BRCA2 (Breast cancer genes)
- PIK3CA, KRAS, EGFR (Oncogenes)

### Mutations (5)

- Various missense mutations in cancer-associated genes
- Includes position, reference/alternate alleles, quality scores

### Patients (5)

- Representative cases from TCGA-BRCA, TCGA-LUAD, TCGA-COAD
- Demographic data, vital status

### Cancer Types (4)

- Breast Cancer (BRCA)
- Lung Adenocarcinoma (LUAD)
- Colon Adenocarcinoma (COAD)
- Glioblastoma (GBM)
## Design Principles

1. **Simplicity**: One-command setup, intuitive interface
2. **Speed**: Fast to install and get started (< 5 minutes)
3. **Modularity**: Clean separation of concerns
4. **Extensibility**: Easy to add new data sources and analyses
5. **Visual**: Rich visualizations for data exploration
6. **Professional**: Production-quality code with error handling
## Configuration Options

All configurable via `config.yml`:

- Neo4j connection settings
- GDC API parameters
- BOINC server configuration
- Pipeline quality thresholds
- Output directories
- Logging levels
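A common pattern for settings like these is to overlay the values read from the configuration file onto built-in defaults, so a partial file still produces a complete configuration. The sketch below shows that merge step only. Every key name and default value in it is hypothetical and not taken from the project's `config.yml`.

```python
import copy

# Hypothetical defaults: key names and values are illustrative only and do
# not reflect the actual layout of config.yml.
DEFAULTS = {
    "neo4j": {"uri": "bolt://localhost:7687", "user": "neo4j"},
    "gdc": {"api_url": "https://api.gdc.cancer.gov"},
    "pipeline": {"min_quality": 20},
    "logging": {"level": "INFO"},
}


def merge_config(defaults, overrides):
    """Recursively overlay overrides on defaults without mutating either."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Merge nested sections key by key instead of replacing them.
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged
```

The `overrides` dictionary would typically come from parsing the YAML file, for example with PyYAML's `yaml.safe_load`. Merging recursively means that overriding one pipeline threshold leaves the other sections at their defaults.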
## Documentation Provided

1. **README.md** - Complete project overview and installation
2. **QUICKSTART.md** - Fast setup and first steps
3. **USER_GUIDE.md** - Comprehensive usage documentation
4. **GRAPHQL_EXAMPLES.md** - GraphQL query examples
5. **Inline Code Comments** - Well-documented Python modules
6. **API Documentation** - Auto-generated Swagger UI at `/docs`
## Unique Features

1. **All-in-One Solution**: Complete stack from data acquisition to visualization
2. **Graph-Based**: Leverages Neo4j's power for complex relationship queries
3. **Real-Time**: Live dashboard updates and task monitoring
4. **Research-Ready**: Built for actual cancer genomics research workflows
5. **Extensible**: Easy to integrate additional data sources and tools
6. **Educational**: Great for learning cancer genomics and graph databases
## Getting Started (Quick)

```bash
# Windows
.\setup.ps1
python run.py

# Linux/Mac
./setup.sh
python run.py

# Then open http://localhost:5000 in a browser
```
## Use Cases

1. **Research**: Analyze cancer genomics data with distributed computing
2. **Education**: Learn about cancer genetics and bioinformatics
3. **Visualization**: Explore gene-mutation-patient relationships
4. **Data Integration**: Combine multiple cancer data sources
5. **Pipeline Development**: Test bioinformatics workflows
## Future Enhancements (Optional)

- Machine learning for mutation prediction
- Multi-omics data integration (RNA-seq, proteomics)
- Survival analysis and clinical outcomes
- Drug response prediction
- Advanced graph algorithms (PageRank, community detection)
- Real-time collaboration features
- Mobile responsive design
- Export/report generation
## License

MIT License - Free for academic and commercial use

## Acknowledgments

Inspired by:

- Cancer@Home v1 (HeroX DCx Challenge)
- Andrew Kamal's Neo4j Cancer Visualization
- GDC Portal and TCGA Project
- BOINC Distributed Computing Framework
---

**Cancer@Home v2** combines modern web technologies, graph databases, distributed computing, and bioinformatics tools into a single platform that is both powerful and easy to use. The system is production-ready, well-documented, and designed for real-world cancer genomics research.