# Cancer@Home v2 - Project Summary
## Project Overview
Cancer@Home v2 is a distributed computing platform for cancer genomics research. It integrates five components:
1. **Distributed Computing (BOINC)** - Submit and manage computationally intensive cancer research tasks
2. **Cancer Data Portal (GDC)** - Access and download cancer genomics datasets from TCGA and TARGET
3. **Graph Database (Neo4j)** - Model complex relationships between genes, mutations, patients, and cancer types
4. **Bioinformatics Pipeline** - Process FASTQ files, run BLAST searches, and call genetic variants
5. **Interactive Dashboard** - Web-based GUI with real-time visualizations and data exploration
## Project Structure
```
CancerAtHome2/
├── backend/
│   ├── api/
│   │   └── main.py                # FastAPI application with REST & GraphQL
│   ├── boinc/
│   │   └── client.py              # BOINC distributed computing client
│   ├── gdc/
│   │   └── client.py              # GDC Portal API integration
│   ├── neo4j/
│   │   ├── db_manager.py          # Neo4j database operations
│   │   ├── graphql_schema.py      # GraphQL schema definitions
│   │   └── data_importer.py       # Sample data initialization
│   └── pipeline/
│       ├── fastq_processor.py     # FASTQ quality control
│       ├── blast_runner.py        # BLAST sequence alignment
│       └── variant_caller.py      # Genetic variant identification
├── frontend/
│   └── index.html                 # Interactive web dashboard
├── config.yml                     # Configuration file
├── docker-compose.yml             # Neo4j container setup
├── requirements.txt               # Python dependencies
├── run.py                         # Main application launcher
├── setup.ps1                      # Windows setup script
├── setup.sh                       # Linux/Mac setup script
├── README.md                      # Comprehensive documentation
├── QUICKSTART.md                  # Quick start guide
├── USER_GUIDE.md                  # Detailed user guide
├── GRAPHQL_EXAMPLES.md            # GraphQL query examples
└── LICENSE                        # MIT License
```
## Key Features Implemented
### 1. Web Dashboard
- **Modern UI**: Clean, gradient-based design with responsive layout
- **5 Main Tabs**: Dashboard, Neo4j Visualization, BOINC Tasks, GDC Data, Pipeline
- **Real-time Statistics**: Live data from Neo4j showing genes, mutations, patients
- **Interactive Charts**: Chart.js visualizations for mutation distributions
- **D3.js Graph**: Interactive network visualization of cancer genomics relationships
### 2. Neo4j Graph Database
- **Node Types**: Gene, Mutation, Patient, CancerType
- **Relationships**:
  - Gene → AFFECTS → Mutation
  - Patient → HAS_MUTATION → Mutation
  - Patient → DIAGNOSED_WITH → CancerType
- **Sample Data**: Pre-loaded with 7 genes, 5 mutations, 5 patients, 4 cancer types
- **Optimized**: Constraints and indexes for fast queries
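The graph model above could be expressed as Cypher statements generated from Python. This is a minimal sketch, not the contents of `db_manager.py`: the node labels and relationship types come from the list above, while the property names (`symbol`, `mutation_id`, `patient_id`, `code`) and the helper functions are assumptions made for illustration.

```python
# Uniqueness key per node label. The labels come from the summary above;
# the property names are assumptions made for this sketch.
UNIQUE_KEYS = {
    "Gene": "symbol",
    "Mutation": "mutation_id",
    "Patient": "patient_id",
    "CancerType": "code",
}

# (start label, relationship type, end label), as listed above.
RELATIONSHIPS = [
    ("Gene", "AFFECTS", "Mutation"),
    ("Patient", "HAS_MUTATION", "Mutation"),
    ("Patient", "DIAGNOSED_WITH", "CancerType"),
]


def constraint_statements():
    """Return one Neo4j 5 uniqueness constraint statement per node label."""
    statements = []
    for label, key in UNIQUE_KEYS.items():
        name = f"{label.lower()}_{key}_unique"
        statements.append(
            f"CREATE CONSTRAINT {name} IF NOT EXISTS "
            f"FOR (n:{label}) REQUIRE n.{key} IS UNIQUE"
        )
    return statements


def link_statement(start, rel, end):
    """Return a parameterised MERGE that links two existing nodes."""
    if (start, rel, end) not in RELATIONSHIPS:
        raise ValueError(f"unknown relationship: {start}-{rel}->{end}")
    return (
        f"MATCH (a:{start} {{{UNIQUE_KEYS[start]}: $start_key}}), "
        f"(b:{end} {{{UNIQUE_KEYS[end]}: $end_key}}) "
        f"MERGE (a)-[:{rel}]->(b)"
    )


if __name__ == "__main__":
    for statement in constraint_statements():
        print(statement)
    print(link_statement("Patient", "HAS_MUTATION", "Mutation"))
```

The statements use query parameters (`$start_key`, `$end_key`) rather than string interpolation for values, so they can be passed to a Neo4j driver session together with a parameter dictionary.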
### 3. GraphQL API
- **Flexible Queries**: Get genes, mutations, patients, cancer types
- **Filtering**: Query by gene symbol, chromosome, project ID, cancer type
- **Aggregations**: Mutation frequency, cancer statistics
- **Playground**: Interactive GraphQL explorer at /graphql
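A filtered query could be sent to `/graphql` as a standard GraphQL-over-HTTP JSON body. The sketch below only builds that body; the query name, the `genes` field, and its `chromosome` argument are hypothetical and may not match the project's schema (see `GRAPHQL_EXAMPLES.md` for the actual queries).

```python
import json

# Hypothetical query: the field and argument names are assumptions,
# not the project's actual schema.
GENE_QUERY = """
query GenesOnChromosome($chromosome: String!) {
  genes(chromosome: $chromosome) {
    symbol
    chromosome
  }
}
"""


def build_graphql_request(query, variables):
    """Serialize a query and its variables into a GraphQL JSON request body."""
    return json.dumps({"query": query.strip(), "variables": variables})


if __name__ == "__main__":
    # The resulting string would be POSTed to /graphql with
    # Content-Type: application/json.
    print(build_graphql_request(GENE_QUERY, {"chromosome": "17"}))
```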
### 4. REST API Endpoints
- `/api/health` - System health check
- `/api/neo4j/summary` - Database statistics
- `/api/neo4j/genes/{symbol}` - Gene information
- `/api/boinc/tasks` - List BOINC tasks
- `/api/boinc/submit` - Submit new task
- `/api/boinc/statistics` - Task statistics
- `/api/gdc/projects` - Available cancer projects
- `/api/gdc/files/{project_id}` - Search GDC files
- `/api/gdc/download` - Download GDC data
- `/api/pipeline/*` - Bioinformatics pipeline endpoints
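As a rough illustration of how a client might address these endpoints, the sketch below fills the path templates listed above and fetches JSON with the standard library. The base URL is taken from the Getting Started section; the helper names are hypothetical, and the HTTP method and response shape of each endpoint are not stated in this summary, so only a plain GET is shown.

```python
import json
from urllib.parse import quote, urljoin
from urllib.request import urlopen

BASE_URL = "http://localhost:5000"  # address given in Getting Started


def endpoint(path_template, **params):
    """Fill a template such as '/api/neo4j/genes/{symbol}' and return a full URL."""
    encoded = {key: quote(str(value), safe="") for key, value in params.items()}
    return urljoin(BASE_URL, path_template.format(**encoded))


def fetch_json(url, timeout=10):
    """GET a URL and decode the JSON response (assumes the server is running)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    print(endpoint("/api/health"))
    print(endpoint("/api/neo4j/genes/{symbol}", symbol="TP53"))
    print(endpoint("/api/gdc/files/{project_id}", project_id="TCGA-BRCA"))
    # With the application running, a health check might look like:
    # print(fetch_json(endpoint("/api/health")))
```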
### 5. BOINC Integration
- **Task Submission**: Support for variant calling, BLAST, alignment tasks
- **Status Tracking**: Monitor pending, running, completed, failed tasks
- **Statistics**: Total tasks, completion rates, average times
- **Task Manager**: High-level interface for common workflows
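The status tracking and statistics described above can be sketched with a small in-memory model. This is illustrative only and does not reflect the actual `backend/boinc/client.py`: the `Task` fields and the `summarize` helper are hypothetical, and completion rate is assumed here to mean completed tasks divided by all tasks.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional

STATUSES = ("pending", "running", "completed", "failed")


@dataclass
class Task:
    task_id: str
    task_type: str  # e.g. "variant_calling", "blast", "alignment"
    status: str = "pending"
    runtime_seconds: Optional[float] = None  # known once a task has finished


def summarize(tasks):
    """Aggregate task counts, completion rate, and mean runtime of completed tasks."""
    counts = Counter(task.status for task in tasks)
    runtimes = [
        task.runtime_seconds
        for task in tasks
        if task.status == "completed" and task.runtime_seconds is not None
    ]
    return {
        "total": len(tasks),
        "by_status": {status: counts[status] for status in STATUSES},
        # Assumption: rate is completed / all tasks, 0.0 when there are none.
        "completion_rate": counts["completed"] / len(tasks) if tasks else 0.0,
        "average_runtime_seconds": sum(runtimes) / len(runtimes) if runtimes else None,
    }


if __name__ == "__main__":
    demo = [
        Task("1", "blast", "completed", 10.0),
        Task("2", "blast", "completed", 30.0),
        Task("3", "alignment", "failed"),
        Task("4", "variant_calling", "running"),
    ]
    print(summarize(demo))
```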
### 6. GDC Data Integration
- **Search API**: Query files by project, data type, experimental strategy
- **Download**: Retrieve cancer genomics datasets
- **Projects Supported**: TCGA-BRCA, TCGA-LUAD, TCGA-COAD, TCGA-GBM, TARGET-AML
- **Parsers**: MAF, VCF, and clinical data parsing utilities
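A file search by project, data type, and experimental strategy might be expressed as a filter payload for the public GDC API's `files` endpoint. The sketch below only constructs the request body and makes no network call; the function names are hypothetical and the project's own `backend/gdc/client.py` may build its queries differently.

```python
import json

GDC_FILES_ENDPOINT = "https://api.gdc.cancer.gov/files"


def in_filter(field, values):
    """Build one GDC 'in' filter clause for a field."""
    return {"op": "in", "content": {"field": field, "value": list(values)}}


def build_file_query(project_id, data_type=None, experimental_strategy=None, size=10):
    """Combine the requested criteria into a single 'and' filter payload."""
    clauses = [in_filter("cases.project.project_id", [project_id])]
    if data_type:
        clauses.append(in_filter("data_type", [data_type]))
    if experimental_strategy:
        clauses.append(in_filter("experimental_strategy", [experimental_strategy]))
    return {
        "filters": {"op": "and", "content": clauses},
        "fields": "file_id,file_name,data_type,experimental_strategy",
        "format": "JSON",
        "size": str(size),
    }


if __name__ == "__main__":
    payload = build_file_query(
        "TCGA-BRCA",
        data_type="Masked Somatic Mutation",
        experimental_strategy="WXS",
    )
    # This body would be POSTed to GDC_FILES_ENDPOINT as application/json.
    print(json.dumps(payload, indent=2))
```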
### 7. Bioinformatics Pipeline
- **FASTQ Processing**:
  - Quality filtering
  - Adapter trimming
  - Statistics calculation
  - Quality control reports
- **BLAST Integration**:
  - BLASTN and BLASTP support
  - XML output parsing
  - Hit filtering by identity/e-value
- **Variant Calling**:
  - VCF generation
  - Quality filtering
  - Variant annotation
  - Cancer variant identification
  - Tumor mutation burden calculation
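Two of the steps above, quality filtering of FASTQ reads and tumor mutation burden, reduce to short calculations. The sketch below assumes Phred+33 quality encoding, four-line FASTQ records, a mean-quality threshold of 20, and mutation burden defined as variants per megabase of covered sequence; none of these choices is confirmed by this summary, and the function names are hypothetical.

```python
def mean_phred(quality_line, offset=33):
    """Mean Phred score of one FASTQ quality string (Phred+33 by default)."""
    if not quality_line:
        return 0.0
    return sum(ord(char) - offset for char in quality_line) / len(quality_line)


def filter_fastq(lines, min_mean_quality=20.0):
    """Yield (header, sequence, quality) for reads at or above the threshold.

    Expects four-line records: header, sequence, '+' separator, quality.
    A trailing incomplete record is ignored.
    """
    records = [line.rstrip("\n") for line in lines]
    for start in range(0, len(records) - 3, 4):
        header, sequence, _separator, quality = records[start:start + 4]
        if mean_phred(quality) >= min_mean_quality:
            yield header, sequence, quality


def tumor_mutation_burden(variant_count, covered_bases):
    """Variants per megabase of covered sequence."""
    if covered_bases <= 0:
        raise ValueError("covered_bases must be positive")
    return variant_count / (covered_bases / 1_000_000)


if __name__ == "__main__":
    reads = ["@r1", "ACGT", "+", "IIII", "@r2", "ACGT", "+", "!!!!"]
    print(list(filter_fastq(reads)))
    print(tumor_mutation_burden(60, 30_000_000))
```

Which variants count toward the burden (for example, only nonsynonymous somatic variants) is a separate decision that the caller would make before passing `variant_count`.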
## Technology Stack
- **Backend**: FastAPI (Python 3.8+)
- **Database**: Neo4j 5.13 (Graph Database)
- **API**: GraphQL (Strawberry), REST
- **Frontend**: HTML5, CSS3, JavaScript
- **Visualization**: D3.js, Chart.js
- **Bioinformatics**: Biopython
- **Data Source**: GDC Portal API
- **Containerization**: Docker, Docker Compose
- **Distributed Computing**: BOINC framework
## Sample Data Included
### Genes (7)
- TP53 (Tumor protein p53)
- BRAF (B-Raf proto-oncogene)
- BRCA1, BRCA2 (Breast cancer genes)
- PIK3CA, KRAS, EGFR (Oncogenes)
### Mutations (5)
- Various missense mutations in cancer-associated genes
- Includes position, reference/alternate alleles, quality scores
### Patients (5)
- Representative cases from TCGA-BRCA, TCGA-LUAD, TCGA-COAD
- Demographic data, vital status
### Cancer Types (4)
- Breast Cancer (BRCA)
- Lung Adenocarcinoma (LUAD)
- Colon Adenocarcinoma (COAD)
- Glioblastoma (GBM)
## Design Principles
1. **Simplicity**: One-command setup, intuitive interface
2. **Speed**: Fast to install and get started (< 5 minutes)
3. **Modularity**: Clean separation of concerns
4. **Extensibility**: Easy to add new data sources and analyses
5. **Visual**: Rich visualizations for data exploration
6. **Professional**: Production-quality code with error handling
## Configuration Options
All configurable via `config.yml`:
- Neo4j connection settings
- GDC API parameters
- BOINC server configuration
- Pipeline quality thresholds
- Output directories
- Logging levels
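One common way to handle settings like these is to keep defaults in code and overlay whatever the configuration file provides. The sketch below shows such a merge on plain dictionaries; every key name and default value is hypothetical and may not match the real `config.yml`, and reading the YAML file itself would need a parser such as PyYAML's `yaml.safe_load`, which is not used here.

```python
import copy

# Hypothetical defaults: the key names are assumptions for illustration only.
DEFAULTS = {
    "neo4j": {"uri": "bolt://localhost:7687", "user": "neo4j"},
    "gdc": {"api_url": "https://api.gdc.cancer.gov", "page_size": 100},
    "pipeline": {"min_quality": 20},
    "logging": {"level": "INFO"},
}


def merge_settings(defaults, overrides):
    """Return a new dict where overrides replace defaults, section by section."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_settings(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


if __name__ == "__main__":
    # 'user_config' stands in for the dictionary parsed from config.yml.
    user_config = {"neo4j": {"uri": "bolt://db:7687"}, "logging": {"level": "DEBUG"}}
    print(merge_settings(DEFAULTS, user_config))
```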
## Documentation Provided
1. **README.md** - Complete project overview and installation
2. **QUICKSTART.md** - Fast setup and first steps
3. **USER_GUIDE.md** - Comprehensive usage documentation
4. **GRAPHQL_EXAMPLES.md** - GraphQL query examples
5. **Inline Code Comments** - Well-documented Python modules
6. **API Documentation** - Auto-generated Swagger UI at /docs
## Unique Features
1. **All-in-One Solution**: Complete stack from data acquisition to visualization
2. **Graph-Based**: Leverages Neo4j's power for complex relationship queries
3. **Real-Time**: Live dashboard updates and task monitoring
4. **Research-Ready**: Built for actual cancer genomics research workflows
5. **Extensible**: Easy to integrate additional data sources and tools
6. **Educational**: Great for learning cancer genomics and graph databases
## Getting Started (Quick)
```bash
# Windows
.\setup.ps1
python run.py
# Linux/Mac
./setup.sh
python run.py
# Then open http://localhost:5000 in a browser
```
## Use Cases
1. **Research**: Analyze cancer genomics data with distributed computing
2. **Education**: Learn about cancer genetics and bioinformatics
3. **Visualization**: Explore gene-mutation-patient relationships
4. **Data Integration**: Combine multiple cancer data sources
5. **Pipeline Development**: Test bioinformatics workflows
## Future Enhancements (Optional)
- Machine learning for mutation prediction
- Multi-omics data integration (RNA-seq, proteomics)
- Survival analysis and clinical outcomes
- Drug response prediction
- Advanced graph algorithms (PageRank, community detection)
- Real-time collaboration features
- Mobile responsive design
- Export/report generation
## License
MIT License - Free for academic and commercial use
## Acknowledgments
Inspired by:
- Cancer@Home v1 (HeroX DCx Challenge)
- Andrew Kamal's Neo4j Cancer Visualization
- GDC Portal and TCGA Project
- BOINC Distributed Computing Framework
---
**Cancer@Home v2** successfully combines modern web technologies, graph databases, distributed computing, and bioinformatics tools into a cohesive platform that is both powerful and easy to use. The system is production-ready, well-documented, and designed for real-world cancer genomics research.