Cancer@Home v2 - Project Summary

Project Overview

Cancer@Home v2 is a distributed computing platform for cancer genomics research. It integrates five components:

  1. Distributed Computing (BOINC) - Submit and manage computationally intensive cancer research tasks
  2. Cancer Data Portal (GDC) - Access and download cancer genomics datasets from TCGA and TARGET
  3. Graph Database (Neo4j) - Model complex relationships between genes, mutations, patients, and cancer types
  4. Bioinformatics Pipeline - Process FASTQ files, run BLAST searches, and call genetic variants
  5. Interactive Dashboard - Web-based GUI with real-time visualizations and data exploration

Project Structure

CancerAtHome2/
├── backend/
│   ├── api/
│   │   └── main.py                 # FastAPI application with REST & GraphQL
│   ├── boinc/
│   │   └── client.py               # BOINC distributed computing client
│   ├── gdc/
│   │   └── client.py               # GDC Portal API integration
│   ├── neo4j/
│   │   ├── db_manager.py          # Neo4j database operations
│   │   ├── graphql_schema.py      # GraphQL schema definitions
│   │   └── data_importer.py       # Sample data initialization
│   └── pipeline/
│       ├── fastq_processor.py     # FASTQ quality control
│       ├── blast_runner.py        # BLAST sequence alignment
│       └── variant_caller.py      # Genetic variant identification
├── frontend/
│   └── index.html                 # Interactive web dashboard
├── config.yml                     # Configuration file
├── docker-compose.yml             # Neo4j container setup
├── requirements.txt               # Python dependencies
├── run.py                         # Main application launcher
├── setup.ps1                      # Windows setup script
├── setup.sh                       # Linux/Mac setup script
├── README.md                      # Comprehensive documentation
├── QUICKSTART.md                  # Quick start guide
├── USER_GUIDE.md                  # Detailed user guide
├── GRAPHQL_EXAMPLES.md            # GraphQL query examples
└── LICENSE                        # MIT License

Key Features Implemented

1. Web Dashboard

  • Modern UI: Clean, gradient-based design with responsive layout
  • 5 Main Tabs: Dashboard, Neo4j Visualization, BOINC Tasks, GDC Data, Pipeline
  • Real-time Statistics: Live data from Neo4j showing genes, mutations, patients
  • Interactive Charts: Chart.js visualizations for mutation distributions
  • D3.js Graph: Interactive network visualization of cancer genomics relationships

2. Neo4j Graph Database

  • Node Types: Gene, Mutation, Patient, CancerType
  • Relationships:
    • Gene ← AFFECTS ← Mutation
    • Patient → HAS_MUTATION → Mutation
    • Patient → DIAGNOSED_WITH → CancerType
  • Sample Data: Pre-loaded with 7 genes, 5 mutations, 5 patients, 4 cancer types
  • Optimized: Constraints and indexes for fast queries
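The graph model above can be sketched as Cypher statements generated from Python. This is a minimal illustration, not the project's actual schema: the key properties (`symbol`, `mutation_id`, `patient_id`, `code`) and the constraint names are assumptions, and the real definitions in db_manager.py may differ. The constraint syntax shown is the Neo4j 5 form.

```python
# Hypothetical unique-key property for each node label; the real schema
# in db_manager.py may use different property names.
NODE_KEYS = {
    "Gene": "symbol",
    "Mutation": "mutation_id",
    "Patient": "patient_id",
    "CancerType": "code",
}


def constraint_statements(node_keys=NODE_KEYS):
    """Build one Neo4j 5 uniqueness constraint per node label."""
    statements = []
    for label, key in node_keys.items():
        name = f"{label.lower()}_{key}_unique"
        statements.append(
            f"CREATE CONSTRAINT {name} IF NOT EXISTS "
            f"FOR (n:{label}) REQUIRE n.{key} IS UNIQUE"
        )
    return statements


# Parameterized Cypher that links a mutation to the gene it affects and to
# the patient who carries it, following the relationship directions above.
LINK_MUTATION = (
    "MERGE (g:Gene {symbol: $symbol}) "
    "MERGE (m:Mutation {mutation_id: $mutation_id}) "
    "MERGE (p:Patient {patient_id: $patient_id}) "
    "MERGE (m)-[:AFFECTS]->(g) "
    "MERGE (p)-[:HAS_MUTATION]->(m)"
)

if __name__ == "__main__":
    for statement in constraint_statements():
        print(statement)
```

Using `MERGE` rather than `CREATE` keeps the import idempotent, so re-running a loader does not duplicate nodes or relationships.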

3. GraphQL API

  • Flexible Queries: Get genes, mutations, patients, cancer types
  • Filtering: Query by gene symbol, chromosome, project ID, cancer type
  • Aggregations: Mutation frequency, cancer statistics
  • Playground: Interactive GraphQL explorer at /graphql
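A query against the /graphql endpoint could be issued from Python roughly as follows. The query text is hypothetical: the field and argument names (`gene`, `symbol`, `chromosome`) are assumed for illustration, and GRAPHQL_EXAMPLES.md is the place to look for the real ones. The base URL is taken from the quick-start section.

```python
import json
import urllib.request

# Hypothetical query: the field and argument names are illustrative only.
GENE_QUERY = """
query GeneBySymbol($symbol: String!) {
  gene(symbol: $symbol) {
    symbol
    chromosome
  }
}
"""


def build_graphql_request(base_url, query, variables=None):
    """Return a POST request carrying a standard GraphQL JSON body."""
    body = json.dumps({"query": query, "variables": variables or {}})
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/graphql",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(request, timeout=10):
    """Send the request and decode the JSON reply (needs a running server)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    request = build_graphql_request(
        "http://localhost:5000", GENE_QUERY, {"symbol": "TP53"}
    )
    print(request.full_url)
```

Passing the gene symbol as a GraphQL variable instead of formatting it into the query string avoids quoting problems.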

4. REST API Endpoints

  • /api/health - System health check
  • /api/neo4j/summary - Database statistics
  • /api/neo4j/genes/{symbol} - Gene information
  • /api/boinc/tasks - List BOINC tasks
  • /api/boinc/submit - Submit new task
  • /api/boinc/statistics - Task statistics
  • /api/gdc/projects - Available cancer projects
  • /api/gdc/files/{project_id} - Search GDC files
  • /api/gdc/download - Download GDC data
  • /api/pipeline/* - Bioinformatics pipeline endpoints
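As a small illustration of how the path parameters in these routes are filled in, the helper below builds full URLs from the templates listed above. The helper name `endpoint_url` and the short keys in `ENDPOINTS` are invented for this sketch; only the paths themselves come from the list, and the default base URL is the one from the quick-start section.

```python
from urllib.parse import quote, urljoin

# Path templates copied from the endpoint list above; the dictionary keys
# are hypothetical short names used only in this sketch.
ENDPOINTS = {
    "health": "/api/health",
    "neo4j_summary": "/api/neo4j/summary",
    "gene": "/api/neo4j/genes/{symbol}",
    "boinc_tasks": "/api/boinc/tasks",
    "gdc_files": "/api/gdc/files/{project_id}",
}


def endpoint_url(name, base_url="http://localhost:5000", **params):
    """Fill in path parameters (percent-encoded) and join with the base URL."""
    escaped = {key: quote(str(value), safe="") for key, value in params.items()}
    path = ENDPOINTS[name].format(**escaped)
    return urljoin(base_url, path)


if __name__ == "__main__":
    print(endpoint_url("health"))
    print(endpoint_url("gene", symbol="TP53"))
    print(endpoint_url("gdc_files", project_id="TCGA-BRCA"))
```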

5. BOINC Integration

  • Task Submission: Support for variant calling, BLAST, alignment tasks
  • Status Tracking: Monitor pending, running, completed, failed tasks
  • Statistics: Total tasks, completion rates, average times
  • Task Manager: High-level interface for common workflows
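The statistics listed above could be derived from a list of task records along these lines. This is a sketch under stated assumptions: the `Task` fields are hypothetical, and "completion rate" is taken here to mean completed tasks divided by all tasks, which may not match the definition used in boinc/client.py.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional

STATUSES = ("pending", "running", "completed", "failed")


@dataclass
class Task:
    """Hypothetical task record; field names are assumptions."""
    task_id: str
    task_type: str  # e.g. "variant_calling", "blast", "alignment"
    status: str
    runtime_seconds: Optional[float] = None


def task_statistics(tasks):
    """Summarize counts per status, completion rate, and mean runtime."""
    counts = Counter(task.status for task in tasks)
    runtimes = [
        task.runtime_seconds
        for task in tasks
        if task.status == "completed" and task.runtime_seconds is not None
    ]
    return {
        "total": len(tasks),
        "by_status": {status: counts[status] for status in STATUSES},
        # Assumed definition: completed tasks over all submitted tasks.
        "completion_rate": counts["completed"] / len(tasks) if tasks else 0.0,
        "average_runtime_seconds": (
            sum(runtimes) / len(runtimes) if runtimes else None
        ),
    }
```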

6. GDC Data Integration

  • Search API: Query files by project, data type, experimental strategy
  • Download: Retrieve cancer genomics datasets
  • Projects Supported: TCGA-BRCA, TCGA-LUAD, TCGA-COAD, TCGA-GBM, TARGET-AML
  • Parsers: MAF, VCF, and clinical data parsing utilities

7. Bioinformatics Pipeline

  • FASTQ Processing:

    • Quality filtering
    • Adapter trimming
    • Statistics calculation
    • Quality control reports
  • BLAST Integration:

    • BLASTN and BLASTP support
    • XML output parsing
    • Hit filtering by identity/e-value
  • Variant Calling:

    • VCF generation
    • Quality filtering
    • Variant annotation
    • Cancer variant identification
    • Tumor mutation burden calculation
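Two of the steps above, read-level quality filtering and tumor mutation burden, are simple enough to sketch in plain Python. This is not the code in fastq_processor.py or variant_caller.py. It assumes Phred+33 quality encoding, a mean-quality threshold of 20, and the common definition of tumor mutation burden as mutations per megabase of covered sequence. All function names are hypothetical.

```python
def mean_phred(quality_line, offset=33):
    """Mean Phred score of a FASTQ quality string (Phred+33 by default)."""
    scores = [ord(char) - offset for char in quality_line]
    return sum(scores) / len(scores) if scores else 0.0


def filter_fastq(lines, min_mean_quality=20.0):
    """Yield (header, sequence, quality) for reads passing the threshold.

    `lines` is an iterable of FASTQ text lines in four-line records;
    a trailing incomplete record is ignored.
    """
    stripped = [line.rstrip("\n") for line in lines]
    for start in range(0, len(stripped) - 3, 4):
        header, sequence, _plus, quality = stripped[start:start + 4]
        if mean_phred(quality) >= min_mean_quality:
            yield header, sequence, quality


def tumor_mutation_burden(mutation_count, covered_bases):
    """Mutations per megabase of sequenced territory."""
    if covered_bases <= 0:
        raise ValueError("covered_bases must be positive")
    return mutation_count / (covered_bases / 1_000_000)


if __name__ == "__main__":
    records = ["@r1", "ACGT", "+", "IIII", "@r2", "ACGT", "+", "!!!!"]
    for header, sequence, quality in filter_fastq(records):
        print(header, sequence, quality)
    print(tumor_mutation_burden(300, 30_000_000))
```

In the demo, read `@r1` has quality characters `I` (Phred 40) and passes, while `@r2` has `!` (Phred 0) and is dropped.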

Technology Stack

  • Backend: FastAPI (Python 3.8+)
  • Database: Neo4j 5.13 (Graph Database)
  • API: GraphQL (Strawberry), REST
  • Frontend: HTML5, CSS3, JavaScript
  • Visualization: D3.js, Chart.js
  • Bioinformatics: Biopython
  • Data Source: GDC Portal API
  • Containerization: Docker, Docker Compose
  • Distributed Computing: BOINC framework

Sample Data Included

Genes (7)

  • TP53 (Tumor protein p53)
  • BRAF (B-Raf proto-oncogene)
  • BRCA1, BRCA2 (Breast cancer genes)
  • PIK3CA, KRAS, EGFR (Oncogenes)

Mutations (5)

  • Various missense mutations in cancer-associated genes
  • Includes position, reference/alternate alleles, quality scores

Patients (5)

  • Representative cases from TCGA-BRCA, TCGA-LUAD, TCGA-COAD
  • Demographic data, vital status

Cancer Types (4)

  • Breast Cancer (BRCA)
  • Lung Adenocarcinoma (LUAD)
  • Colon Adenocarcinoma (COAD)
  • Glioblastoma (GBM)

Design Principles

  1. Simplicity: One-command setup, intuitive interface
  2. Speed: Fast to install and get started (< 5 minutes)
  3. Modularity: Clean separation of concerns
  4. Extensibility: Easy to add new data sources and analyses
  5. Visual: Rich visualizations for data exploration
  6. Professional: Production-quality code with error handling

Configuration Options

All configurable via config.yml:

  • Neo4j connection settings
  • GDC API parameters
  • BOINC server configuration
  • Pipeline quality thresholds
  • Output directories
  • Logging levels

Documentation Provided

  1. README.md - Complete project overview and installation
  2. QUICKSTART.md - Fast setup and first steps
  3. USER_GUIDE.md - Comprehensive usage documentation
  4. GRAPHQL_EXAMPLES.md - GraphQL query examples
  5. Inline Code Comments - Well-documented Python modules
  6. API Documentation - Auto-generated Swagger UI at /docs

Unique Features

  1. All-in-One Solution: Complete stack from data acquisition to visualization
  2. Graph-Based: Leverages Neo4j's power for complex relationship queries
  3. Real-Time: Live dashboard updates and task monitoring
  4. Research-Ready: Built for actual cancer genomics research workflows
  5. Extensible: Easy to integrate additional data sources and tools
  6. Educational: Great for learning cancer genomics and graph databases

Getting Started (Quick)

# Windows
.\setup.ps1
python run.py

# Linux/Mac
./setup.sh
python run.py

# Open browser
http://localhost:5000

Use Cases

  1. Research: Analyze cancer genomics data with distributed computing
  2. Education: Learn about cancer genetics and bioinformatics
  3. Visualization: Explore gene-mutation-patient relationships
  4. Data Integration: Combine multiple cancer data sources
  5. Pipeline Development: Test bioinformatics workflows

Future Enhancements (Optional)

  • Machine learning for mutation prediction
  • Multi-omics data integration (RNA-seq, proteomics)
  • Survival analysis and clinical outcomes
  • Drug response prediction
  • Advanced graph algorithms (PageRank, community detection)
  • Real-time collaboration features
  • Mobile responsive design
  • Export/report generation

License

MIT License - Free for academic and commercial use

Acknowledgments

Inspired by:

  • Cancer@Home v1 (HeroX DCx Challenge)
  • Andrew Kamal's Neo4j Cancer Visualization
  • GDC Portal and TCGA Project
  • BOINC Distributed Computing Framework

Cancer@Home v2 combines modern web technologies, graph databases, distributed computing, and bioinformatics tools into a single platform that is both powerful and easy to use. The system is production-ready, well-documented, and designed for real-world cancer genomics research.