Cancer@Home v2 - Project Summary
Project Overview
Cancer@Home v2 is a distributed computing platform for cancer genomics research that integrates:
- Distributed Computing (BOINC) - Submit and manage computationally intensive cancer research tasks
- Cancer Data Portal (GDC) - Access and download cancer genomics datasets from TCGA and TARGET
- Graph Database (Neo4j) - Model complex relationships between genes, mutations, patients, and cancer types
- Bioinformatics Pipeline - Process FASTQ files, run BLAST searches, and call genetic variants
- Interactive Dashboard - Web-based GUI with real-time visualizations and data exploration
Project Structure
CancerAtHome2/
├── backend/
│   ├── api/
│   │   └── main.py              # FastAPI application with REST & GraphQL
│   ├── boinc/
│   │   └── client.py            # BOINC distributed computing client
│   ├── gdc/
│   │   └── client.py            # GDC Portal API integration
│   ├── neo4j/
│   │   ├── db_manager.py        # Neo4j database operations
│   │   ├── graphql_schema.py    # GraphQL schema definitions
│   │   └── data_importer.py     # Sample data initialization
│   └── pipeline/
│       ├── fastq_processor.py   # FASTQ quality control
│       ├── blast_runner.py      # BLAST sequence alignment
│       └── variant_caller.py    # Genetic variant identification
├── frontend/
│   └── index.html               # Interactive web dashboard
├── config.yml                   # Configuration file
├── docker-compose.yml           # Neo4j container setup
├── requirements.txt             # Python dependencies
├── run.py                       # Main application launcher
├── setup.ps1                    # Windows setup script
├── setup.sh                     # Linux/Mac setup script
├── README.md                    # Comprehensive documentation
├── QUICKSTART.md                # Quick start guide
├── USER_GUIDE.md                # Detailed user guide
├── GRAPHQL_EXAMPLES.md          # GraphQL query examples
└── LICENSE                      # MIT License
Key Features Implemented
1. Web Dashboard
- Modern UI: Clean, gradient-based design with responsive layout
- 5 Main Tabs: Dashboard, Neo4j Visualization, BOINC Tasks, GDC Data, Pipeline
- Real-time Statistics: Live data from Neo4j showing genes, mutations, patients
- Interactive Charts: Chart.js visualizations for mutation distributions
- D3.js Graph: Interactive network visualization of cancer genomics relationships
2. Neo4j Graph Database
- Node Types: Gene, Mutation, Patient, CancerType
- Relationships:
  - Gene → AFFECTS → Mutation
  - Patient → HAS_MUTATION → Mutation
  - Patient → DIAGNOSED_WITH → CancerType
- Sample Data: Pre-loaded with 7 genes, 5 mutations, 5 patients, 4 cancer types
- Optimized: Constraints and indexes for fast queries
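The relationships above could be written to Neo4j with parameterized Cypher. The following is a minimal sketch that only builds the statements and their parameters. The property names (`symbol`, `mutation_id`, `patient_id`, `code`) and both helper functions are assumptions for illustration; the actual schema is defined in `backend/neo4j/db_manager.py`.

```python
# Hypothetical property names; the real schema lives in backend/neo4j/db_manager.py.

def link_gene_mutation(symbol, mutation_id):
    """Return (cypher, params) creating Gene -[:AFFECTS]-> Mutation."""
    cypher = (
        "MERGE (g:Gene {symbol: $symbol}) "
        "MERGE (m:Mutation {mutation_id: $mutation_id}) "
        "MERGE (g)-[:AFFECTS]->(m)"
    )
    return cypher, {"symbol": symbol, "mutation_id": mutation_id}


def link_patient(patient_id, mutation_id, cancer_code):
    """Return (cypher, params) linking a patient to a mutation and a cancer type."""
    cypher = (
        "MERGE (p:Patient {patient_id: $patient_id}) "
        "MERGE (m:Mutation {mutation_id: $mutation_id}) "
        "MERGE (c:CancerType {code: $cancer_code}) "
        "MERGE (p)-[:HAS_MUTATION]->(m) "
        "MERGE (p)-[:DIAGNOSED_WITH]->(c)"
    )
    params = {
        "patient_id": patient_id,
        "mutation_id": mutation_id,
        "cancer_code": cancer_code,
    }
    return cypher, params


# Example identifiers are made up for illustration.
cypher, params = link_patient("PATIENT-001", "MUT-001", "BRCA")
```

With the official `neo4j` Python driver, each pair can be passed to `session.run(cypher, params)`. Using `MERGE` rather than `CREATE` keeps repeated imports from duplicating nodes.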
3. GraphQL API
- Flexible Queries: Get genes, mutations, patients, cancer types
- Filtering: Query by gene symbol, chromosome, project ID, cancer type
- Aggregations: Mutation frequency, cancer statistics
- Playground: Interactive GraphQL explorer at /graphql
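A GraphQL query is sent as a JSON body in an HTTP POST. The sketch below builds such a request with the standard library and does not send it. The query text is hypothetical: the field and argument names (`gene`, `symbol`, `chromosome`) are assumed from the filtering options listed above, so check `GRAPHQL_EXAMPLES.md` for the real schema. The base URL is taken from the Getting Started section.

```python
import json
import urllib.request

GRAPHQL_URL = "http://localhost:5000/graphql"

# Hypothetical query: field and argument names are illustrative only.
GENE_QUERY = """
query GeneBySymbol($symbol: String!) {
  gene(symbol: $symbol) {
    symbol
    chromosome
  }
}
"""


def build_graphql_request(query, variables, url=GRAPHQL_URL):
    """Build (but do not send) a POST request carrying a GraphQL query."""
    body = json.dumps({"query": query, "variables": variables}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_graphql_request(GENE_QUERY, {"symbol": "TP53"})
# With the server running: response = urllib.request.urlopen(request)
```

Passing the gene symbol as a variable instead of formatting it into the query string avoids quoting problems.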
4. REST API Endpoints
- /api/health - System health check
- /api/neo4j/summary - Database statistics
- /api/neo4j/genes/{symbol} - Gene information
- /api/boinc/tasks - List BOINC tasks
- /api/boinc/submit - Submit new task
- /api/boinc/statistics - Task statistics
- /api/gdc/projects - Available cancer projects
- /api/gdc/files/{project_id} - Search GDC files
- /api/gdc/download - Download GDC data
- /api/pipeline/* - Bioinformatics pipeline endpoints
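The REST endpoints can be called from any HTTP client. As a rough sketch, the helpers below build URLs for the parameterized routes and fetch JSON with the standard library. The helper names are made up for this example, the base URL comes from the Getting Started section, and the assumption that every GET endpoint returns JSON is not confirmed by this summary.

```python
import json
import urllib.request
from urllib.parse import quote

BASE_URL = "http://localhost:5000"


def api_url(path, base=BASE_URL):
    """Join the base URL and an endpoint path."""
    return base.rstrip("/") + "/" + path.lstrip("/")


def gene_url(symbol):
    """URL for /api/neo4j/genes/{symbol}, with the symbol percent-encoded."""
    return api_url("/api/neo4j/genes/" + quote(symbol, safe=""))


def gdc_files_url(project_id):
    """URL for /api/gdc/files/{project_id}."""
    return api_url("/api/gdc/files/" + quote(project_id, safe=""))


def fetch_json(url, timeout=10):
    """GET a URL and decode the body as JSON (assumes a JSON response)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# With the server running: fetch_json(api_url("/api/health"))
```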
5. BOINC Integration
- Task Submission: Support for variant calling, BLAST, alignment tasks
- Status Tracking: Monitor pending, running, completed, failed tasks
- Statistics: Total tasks, completion rates, average times
- Task Manager: High-level interface for common workflows
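The status tracking and statistics described above amount to counting tasks per state and averaging the runtime of completed ones. The sketch below shows one way to model that. The `Task` fields and the `summarize` function are hypothetical and are not taken from `backend/boinc/client.py`.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class TaskStatus(Enum):
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"


@dataclass
class Task:
    task_id: str
    task_type: str  # e.g. "variant_calling", "blast", "alignment"
    status: TaskStatus
    runtime_seconds: Optional[float] = None  # set once the task has finished


def summarize(tasks: List[Task]) -> dict:
    """Compute total count, per-status counts, completion rate, average runtime."""
    total = len(tasks)
    completed = [t for t in tasks if t.status is TaskStatus.COMPLETED]
    runtimes = [t.runtime_seconds for t in completed if t.runtime_seconds is not None]
    return {
        "total": total,
        "by_status": {s.value: sum(1 for t in tasks if t.status is s) for s in TaskStatus},
        "completion_rate": len(completed) / total if total else 0.0,
        "average_runtime_seconds": sum(runtimes) / len(runtimes) if runtimes else None,
    }


example_tasks = [
    Task("t1", "blast", TaskStatus.COMPLETED, 10.0),
    Task("t2", "variant_calling", TaskStatus.COMPLETED, 30.0),
    Task("t3", "alignment", TaskStatus.FAILED),
    Task("t4", "blast", TaskStatus.PENDING),
]
stats = summarize(example_tasks)
```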
6. GDC Data Integration
- Search API: Query files by project, data type, experimental strategy
- Download: Retrieve cancer genomics datasets
- Projects Supported: TCGA-BRCA, TCGA-LUAD, TCGA-COAD, TCGA-GBM, TARGET-AML
- Parsers: MAF, VCF, and clinical data parsing utilities
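Searching by project, data type, and experimental strategy maps onto the nested `filters` JSON accepted by the public GDC API. The sketch below builds the query parameters for the GDC `files` endpoint without sending a request. Whether `backend/gdc/client.py` builds its queries this way is an assumption, and `build_file_query` is a name invented for this example.

```python
import json

GDC_FILES_ENDPOINT = "https://api.gdc.cancer.gov/files"


def _in_clause(field, value):
    """One GDC filter clause: field must be one of the given values."""
    return {"op": "in", "content": {"field": field, "value": [value]}}


def build_file_query(project_id, data_type=None, experimental_strategy=None, size=10):
    """Return query parameters for a GDC file search (nothing is sent here)."""
    clauses = [_in_clause("cases.project.project_id", project_id)]
    if data_type:
        clauses.append(_in_clause("files.data_type", data_type))
    if experimental_strategy:
        clauses.append(_in_clause("files.experimental_strategy", experimental_strategy))
    filters = {"op": "and", "content": clauses}
    return {
        "filters": json.dumps(filters),
        "fields": "file_id,file_name,data_type",
        "format": "JSON",
        "size": str(size),
    }


query_params = build_file_query("TCGA-BRCA", experimental_strategy="WXS")
# To send: urllib.parse.urlencode(query_params) appended to GDC_FILES_ENDPOINT + "?"
```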
7. Bioinformatics Pipeline
FASTQ Processing:
- Quality filtering
- Adapter trimming
- Statistics calculation
- Quality control reports
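Quality filtering usually means decoding each read's quality string into Phred scores and dropping reads whose mean falls below a threshold. The sketch below uses only the standard library and assumes Phred+33 encoding with four-line records. The threshold of 20 is an illustrative default, not the value from `config.yml`, and the function names are not those of `fastq_processor.py`.

```python
def parse_fastq(text):
    """Yield (name, sequence, quality) from four-line FASTQ records."""
    lines = [line.strip() for line in text.strip().splitlines()]
    for i in range(0, len(lines) - 3, 4):
        header, sequence, _separator, quality = lines[i:i + 4]
        yield header[1:], sequence, quality  # drop the leading '@'


def mean_phred(quality, offset=33):
    """Mean Phred score of a quality string (Phred+33 by default)."""
    return sum(ord(char) - offset for char in quality) / len(quality)


def filter_reads(text, min_mean_quality=20.0):
    """Keep reads whose mean Phred score meets the threshold."""
    return [
        (name, sequence)
        for name, sequence, quality in parse_fastq(text)
        if quality and mean_phred(quality) >= min_mean_quality
    ]


SAMPLE_FASTQ = """@read1
ACGT
+
IIII
@read2
ACGT
+
!!!!
"""

kept_reads = filter_reads(SAMPLE_FASTQ)  # read1 scores 40 per base, read2 scores 0
```

Biopython, listed in the technology stack, offers `Bio.SeqIO.parse(handle, "fastq")` for the same parsing step on real files.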
BLAST Integration:
- BLASTN and BLASTP support
- XML output parsing
- Hit filtering by identity/e-value
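Filtering hits by identity and e-value keeps alignments that are both similar enough and unlikely to occur by chance. In the sketch below, hits are plain dictionaries as they might look after XML parsing. The key names (`percent_identity`, `evalue`) and the default thresholds are assumptions, not the behavior of `blast_runner.py`.

```python
def filter_hits(hits, min_identity=90.0, max_evalue=1e-5):
    """Keep hits at or above min_identity and at or below max_evalue, best first."""
    kept = [
        hit
        for hit in hits
        if hit["percent_identity"] >= min_identity and hit["evalue"] <= max_evalue
    ]
    # A lower e-value means a more significant alignment, so sort ascending.
    return sorted(kept, key=lambda hit: hit["evalue"])


# Made-up hits for illustration.
example_hits = [
    {"subject": "seqA", "percent_identity": 99.1, "evalue": 1e-50},
    {"subject": "seqB", "percent_identity": 85.0, "evalue": 1e-40},  # identity too low
    {"subject": "seqC", "percent_identity": 95.0, "evalue": 0.01},   # e-value too high
    {"subject": "seqD", "percent_identity": 92.5, "evalue": 1e-80},
]
good_hits = filter_hits(example_hits)
```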
Variant Calling:
- VCF generation
- Quality filtering
- Variant annotation
- Cancer variant identification
- Tumor mutation burden calculation
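Tumor mutation burden is commonly reported as the number of nonsynonymous somatic mutations per megabase of sequenced region. The sketch below follows that common definition using MAF-style `Variant_Classification` values. Which classes count as nonsynonymous and the 30 Mb default region size are assumptions, and `variant_caller.py` may compute the figure differently.

```python
# MAF Variant_Classification values treated as nonsynonymous (an assumed set).
NONSYNONYMOUS = {
    "Missense_Mutation",
    "Nonsense_Mutation",
    "Nonstop_Mutation",
    "Frame_Shift_Del",
    "Frame_Shift_Ins",
    "In_Frame_Del",
    "In_Frame_Ins",
}


def tumor_mutation_burden(variant_classes, covered_megabases=30.0):
    """Nonsynonymous mutations per megabase of covered sequence."""
    if covered_megabases <= 0:
        raise ValueError("covered_megabases must be positive")
    count = sum(1 for variant in variant_classes if variant in NONSYNONYMOUS)
    return count / covered_megabases


# Three missense mutations and one silent mutation over an assumed 30 Mb region.
tmb = tumor_mutation_burden(["Missense_Mutation"] * 3 + ["Silent"])
```

The region size should match the capture kit or panel actually sequenced, because the result scales inversely with it.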
Technology Stack
- Backend: FastAPI (Python 3.8+)
- Database: Neo4j 5.13 (Graph Database)
- API: GraphQL (Strawberry), REST
- Frontend: HTML5, CSS3, JavaScript
- Visualization: D3.js, Chart.js
- Bioinformatics: Biopython
- Data Source: GDC Portal API
- Containerization: Docker, Docker Compose
- Distributed Computing: BOINC framework
Sample Data Included
Genes (7)
- TP53 (Tumor protein p53)
- BRAF (B-Raf proto-oncogene)
- BRCA1, BRCA2 (Breast cancer genes)
- PIK3CA, KRAS, EGFR (Oncogenes)
Mutations (5)
- Various missense mutations in cancer-associated genes
- Includes position, reference/alternate alleles, quality scores
Patients (5)
- Representative cases from TCGA-BRCA, TCGA-LUAD, TCGA-COAD
- Demographic data, vital status
Cancer Types (4)
- Breast Cancer (BRCA)
- Lung Adenocarcinoma (LUAD)
- Colon Adenocarcinoma (COAD)
- Glioblastoma (GBM)
Design Principles
- Simplicity: One-command setup, intuitive interface
- Speed: Fast to install and get started (< 5 minutes)
- Modularity: Clean separation of concerns
- Extensibility: Easy to add new data sources and analyses
- Visual: Rich visualizations for data exploration
- Professional: Production-quality code with error handling
Configuration Options
All configurable via config.yml:
- Neo4j connection settings
- GDC API parameters
- BOINC server configuration
- Pipeline quality thresholds
- Output directories
- Logging levels
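One common way to handle such settings is to keep defaults in code and overlay whatever the user put in `config.yml`. The sketch below shows only the overlay step. Every key name and default value here is hypothetical, since this summary does not show the real layout of `config.yml`.

```python
import copy

# Hypothetical defaults; the real keys are defined by the project's config.yml.
DEFAULTS = {
    "neo4j": {"uri": "bolt://localhost:7687", "user": "neo4j"},
    "gdc": {"api_url": "https://api.gdc.cancer.gov"},
    "pipeline": {"min_quality": 20},
    "logging": {"level": "INFO"},
}


def merge_config(defaults, overrides):
    """Return defaults overlaid with overrides, merging nested sections."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# In practice the overrides would come from parsing config.yml,
# for example with PyYAML's yaml.safe_load.
user_overrides = {"pipeline": {"min_quality": 30}, "logging": {"level": "DEBUG"}}
config = merge_config(DEFAULTS, user_overrides)
```

Merging section by section means a user who overrides one pipeline threshold does not lose the defaults for the others.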
Documentation Provided
- README.md - Complete project overview and installation
- QUICKSTART.md - Fast setup and first steps
- USER_GUIDE.md - Comprehensive usage documentation
- GRAPHQL_EXAMPLES.md - GraphQL query examples
- Inline Code Comments - Well-documented Python modules
- API Documentation - Auto-generated Swagger UI at /docs
Unique Features
- All-in-One Solution: Complete stack from data acquisition to visualization
- Graph-Based: Leverages Neo4j's power for complex relationship queries
- Real-Time: Live dashboard updates and task monitoring
- Research-Ready: Built for actual cancer genomics research workflows
- Extensible: Easy to integrate additional data sources and tools
- Educational: Great for learning cancer genomics and graph databases
Getting Started (Quick)
# Windows
.\setup.ps1
python run.py
# Linux/Mac
./setup.sh
python run.py
# Open browser
http://localhost:5000
Use Cases
- Research: Analyze cancer genomics data with distributed computing
- Education: Learn about cancer genetics and bioinformatics
- Visualization: Explore gene-mutation-patient relationships
- Data Integration: Combine multiple cancer data sources
- Pipeline Development: Test bioinformatics workflows
Future Enhancements (Optional)
- Machine learning for mutation prediction
- Multi-omics data integration (RNA-seq, proteomics)
- Survival analysis and clinical outcomes
- Drug response prediction
- Advanced graph algorithms (PageRank, community detection)
- Real-time collaboration features
- Mobile responsive design
- Export/report generation
License
MIT License - Free for academic and commercial use
Acknowledgments
Inspired by:
- Cancer@Home v1 (HeroX DCx Challenge)
- Andrew Kamal's Neo4j Cancer Visualization
- GDC Portal and TCGA Project
- BOINC Distributed Computing Framework
Cancer@Home v2 combines modern web technologies, graph databases, distributed computing, and bioinformatics tools into a cohesive platform that is both powerful and easy to use. The system is production-ready, well documented, and designed for real-world cancer genomics research.