| # Cancer@Home v2 - User Guide | |
| ## Table of Contents | |
| 1. [Introduction](#introduction) | |
| 2. [System Architecture](#system-architecture) | |
| 3. [Getting Started](#getting-started) | |
| 4. [Dashboard Guide](#dashboard-guide) | |
| 5. [Working with Data](#working-with-data) | |
| 6. [Analysis Pipeline](#analysis-pipeline) | |
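| 7. [Advanced Usage](#advanced-usage) | |
| 8. [Troubleshooting](#troubleshooting) | |
| 9. [Best Practices](#best-practices) | |
| 10. [Support & Resources](#support--resources) | |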
| --- | |
| ## Introduction | |
| Cancer@Home v2 is a distributed computing platform for cancer genomics research that combines: | |
| - **BOINC**: Distributed computing for computationally intensive tasks | |
| - **GDC Portal**: Access to comprehensive cancer genomics datasets | |
| - **Neo4j**: Graph database for modeling complex relationships | |
| - **Bioinformatics Pipeline**: FASTQ processing, BLAST alignment, and variant calling | |
| ### Key Features | |
| - Interactive web dashboard | |
| - Real-time graph visualization | |
| - GraphQL API for flexible data queries | |
| - Distributed task processing | |
| - Cancer genomics data integration | |
| --- | |
| ## System Architecture | |
| ``` | |
| ┌─────────────────────────────────────────────────┐ | |
| │            Web Dashboard (Port 5000)            │ | |
| │ Dashboard | Neo4j Viz | BOINC | GDC | Pipeline  │ | |
| └────────────────────────┬────────────────────────┘ | |
|                          │ | |
| ┌────────────────────────┴────────────────────────┐ | |
| │        FastAPI Backend (REST + GraphQL)         │ | |
| └─────┬───────┬────────┬───────┬─────────┬────────┘ | |
|       │       │        │       │         │ | |
|    ┌──┴──┐ ┌──┴───┐ ┌──┴──┐ ┌──┴──┐ ┌────┴────┐ | |
|    │Neo4j│ │BOINC │ │ GDC │ │FASTQ│ │BLAST/VCF│ | |
|    │7687 │ │Client│ │ API │ │Proc │ │ Caller  │ | |
|    └─────┘ └──────┘ └─────┘ └─────┘ └─────────┘ | |
| ``` | |
| --- | |
| ## Getting Started | |
| ### Quick Installation (5 minutes) | |
| **Windows:** | |
| ```powershell | |
| .\setup.ps1 | |
| python run.py | |
| ``` | |
| **Linux/Mac:** | |
| ```bash | |
| ./setup.sh | |
| python run.py | |
| ``` | |
| ### Access Points | |
| - **Main Application**: http://localhost:5000 | |
| - **API Documentation**: http://localhost:5000/docs | |
| - **GraphQL Playground**: http://localhost:5000/graphql | |
| - **Neo4j Browser**: http://localhost:7474 (username `neo4j`, password `cancer123`) | |
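To confirm that the services came up after `python run.py`, each access point can be probed from a short script. The following is a minimal sketch using only the Python standard library; the helper names `access_points` and `is_reachable` are illustrative and not part of Cancer@Home, and the ports are the defaults listed above.

```python
import urllib.error
import urllib.request


def access_points(host="localhost", app_port=5000, neo4j_port=7474):
    """Return the URLs listed under Access Points (defaults taken from this guide)."""
    base = f"http://{host}:{app_port}"
    return {
        "Main Application": base,
        "API Documentation": f"{base}/docs",
        "GraphQL Playground": f"{base}/graphql",
        "Neo4j Browser": f"http://{host}:{neo4j_port}",
    }


def is_reachable(url, timeout=3.0):
    """Return True if a server answers at all, even with an HTTP error status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server responded (for example 404 or 405), so the service is running.
        return True
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, or timeout.
        return False
```

Looping over `access_points().items()` and calling `is_reachable(url)` on each entry gives a quick up/down report without opening a browser.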
| --- | |
| ## Dashboard Guide | |
| ### 1. Overview Tab | |
| Shows key statistics: | |
| - Total genes in database | |
| - Total mutations identified | |
| - Number of patients | |
| - Cancer types catalogued | |
| **Chart**: Mutation distribution across cancer types | |
| ### 2. Neo4j Visualization Tab | |
| Interactive graph showing: | |
| - **Blue nodes**: Genes (TP53, BRCA1, KRAS, etc.) | |
| - **Purple nodes**: Patients | |
| - **Pink nodes**: Cancer types | |
| - **Lines**: Relationships between entities | |
| **Navigation**: | |
| - Click and drag nodes to rearrange | |
| - Hover over nodes for details | |
| - Zoom in/out with mouse wheel | |
| ### 3. BOINC Tasks Tab | |
| Manage distributed computing workloads: | |
| **Submit Task**: | |
| 1. Select task type (Variant Calling, BLAST, Alignment) | |
| 2. Enter input file path | |
| 3. Click "Submit Task" | |
| **Monitor Tasks**: | |
| - View all tasks with status (Pending, Running, Completed) | |
| - See task creation time and type | |
| - Check overall statistics | |
| ### 4. GDC Data Tab | |
| Browse available cancer projects: | |
| - TCGA-BRCA: Breast Cancer (1,098 cases) | |
| - TCGA-LUAD: Lung Adenocarcinoma (585 cases) | |
| - TCGA-COAD: Colon Adenocarcinoma (461 cases) | |
| - TCGA-GBM: Glioblastoma (617 cases) | |
| - TARGET-AML: Acute Myeloid Leukemia (238 cases) | |
| Click on a project to explore available datasets. | |
| ### 5. Pipeline Tab | |
| Quick access to bioinformatics tools: | |
| - **FASTQ QC**: Quality control for sequencing data | |
| - **BLAST Search**: Sequence alignment and homology | |
| - **Variant Calling**: Identify genetic variants | |
| --- | |
| ## Working with Data | |
| ### Querying with GraphQL | |
| Access the GraphQL playground at http://localhost:5000/graphql | |
| **Example 1: Find mutations in TP53 gene** | |
| ```graphql | |
| query { | |
|   mutations(gene: "TP53") { | |
|     mutation_id | |
|     chromosome | |
|     position | |
|     consequence | |
|   } | |
| } | |
| ``` | |
| **Example 2: Get patient information** | |
| ```graphql | |
| query { | |
|   patients(project_id: "TCGA-BRCA", limit: 10) { | |
|     patient_id | |
|     age | |
|     gender | |
|     vital_status | |
|   } | |
| } | |
| ``` | |
| **Example 3: Cancer statistics** | |
| ```graphql | |
| query { | |
|   cancerStatistics(cancer_type_id: "BRCA") { | |
|     total_patients | |
|     total_mutations | |
|     avg_mutations_per_patient | |
|   } | |
| } | |
| ``` | |
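The same queries can also be sent from a script rather than the playground. GraphQL servers conventionally accept an HTTP POST whose JSON body carries a `query` field (and optionally `variables`); the sketch below assumes the `/graphql` endpoint follows that convention. The helper names `build_graphql_request` and `run_graphql` are illustrative, not part of the project.

```python
import json
import urllib.request

GRAPHQL_URL = "http://localhost:5000/graphql"  # endpoint from this guide

TP53_QUERY = """
query {
  mutations(gene: "TP53") {
    mutation_id
    chromosome
    position
    consequence
  }
}
"""


def build_graphql_request(query, variables=None, url=GRAPHQL_URL):
    """Build (but do not send) a POST request carrying a GraphQL query."""
    payload = {"query": query}
    if variables:
        payload["variables"] = variables
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_graphql(query, variables=None, url=GRAPHQL_URL, timeout=10.0):
    """Send the query and return the decoded JSON response."""
    request = build_graphql_request(query, variables, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the application running, `run_graphql(TP53_QUERY)` should return a dictionary; by GraphQL convention the results sit under its `"data"` key and any problems under `"errors"`.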
| ### Using the REST API | |
| **Get database summary:** | |
| ```bash | |
| curl http://localhost:5000/api/neo4j/summary | |
| ``` | |
| **Search GDC files:** | |
| ```bash | |
| curl "http://localhost:5000/api/gdc/files/TCGA-BRCA?limit=10" | |
| ``` | |
| **Submit BOINC task:** | |
| ```bash | |
| curl -X POST http://localhost:5000/api/boinc/submit \ | |
|   -H "Content-Type: application/json" \ | |
|   -d '{"workunit_type": "variant_calling", "input_file": "data/sample.fastq"}' | |
| ``` | |
| --- | |
| ## Analysis Pipeline | |
| ### 1. FASTQ Processing | |
| **Quality Control:** | |
| ```python | |
| from backend.pipeline import FASTQProcessor | |
| processor = FASTQProcessor() | |
| stats = processor.calculate_statistics("input.fastq") | |
| print(f"Total reads: {stats['total_reads']}") | |
| print(f"Average quality: {stats['avg_quality']}") | |
| ``` | |
| **Filter by quality:** | |
| ```python | |
| filtered = processor.quality_filter("input.fastq", "filtered.fastq") | |
| print(f"Pass rate: {filtered['pass_rate']:.2%}") | |
| ``` | |
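For readers new to the format, it may help to see what these statistics measure. A FASTQ record spans four lines (header, sequence, `+` separator, quality string), and under the common Phred+33 encoding each quality character maps to the score `ord(char) - 33`. The sketch below is not the `FASTQProcessor` implementation; the function names and the default threshold of 20 are assumptions chosen for illustration.

```python
import io


def read_fastq(handle):
    """Yield (header, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file (or a trailing blank line)
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        quality = handle.readline().rstrip("\n")
        yield header, sequence, quality


def mean_phred(quality, offset=33):
    """Mean Phred score of one read, assuming Phred+33 encoding."""
    if not quality:
        return 0.0
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def summarize(handle, min_mean_quality=20.0):
    """Count reads, average their mean quality, and compute the pass rate."""
    total = passed = 0
    quality_sum = 0.0
    for _, _, quality in read_fastq(handle):
        score = mean_phred(quality)
        total += 1
        quality_sum += score
        if score >= min_mean_quality:
            passed += 1
    return {
        "total_reads": total,
        "avg_quality": quality_sum / total if total else 0.0,
        "pass_rate": passed / total if total else 0.0,
    }


# Two toy reads: 'I' encodes Phred 40, '!' encodes Phred 0.
sample = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nACGT\n+\n!!!!\n")
print(summarize(sample))
```

The keys mirror the ones used in the examples above (`total_reads`, `avg_quality`, `pass_rate`) so the output is easy to compare, but the project's own filter may apply different criteria, such as the `min_length` setting in `config.yml`.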
| ### 2. BLAST Alignment | |
| **Run BLAST search:** | |
| ```python | |
| from backend.pipeline import BLASTRunner | |
| blast = BLASTRunner() | |
| results = blast.run_blastn("query.fasta") | |
| hits = blast.parse_results(results) | |
| print(f"Found {len(hits)} alignments") | |
| ``` | |
| **Filter high-quality hits:** | |
| ```python | |
| filtered_hits = blast.filter_hits(hits, min_identity=0.95) | |
| ``` | |
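When BLAST is run with tabular output (`-outfmt 6`), identity filtering can be reproduced in a few lines. The default columns of that format are `qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore`, and `pident` is a percentage from 0 to 100, whereas `min_identity=0.95` above looks like a fraction, so the sketch converts between the two. Whether `BLASTRunner` uses this output format is not stated in this guide; the `Hit` type and both functions below are illustrative assumptions.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    query_id: str
    subject_id: str
    percent_identity: float  # 0-100, as reported by BLAST
    alignment_length: int
    evalue: float
    bit_score: float


def parse_outfmt6(text):
    """Parse default 12-column BLAST tabular output into Hit tuples."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        f = line.split("\t")
        hits.append(Hit(f[0], f[1], float(f[2]), int(f[3]), float(f[10]), float(f[11])))
    return hits


def filter_hits(hits, min_identity=0.95, max_evalue=1e-4):
    """Keep hits at or above the identity fraction and at or below the e-value cutoff."""
    return [
        h for h in hits
        if h.percent_identity >= min_identity * 100 and h.evalue <= max_evalue
    ]


# Made-up example rows, for illustration only.
table = (
    "q1\tTP53\t99.10\t220\t2\t0\t1\t220\t101\t320\t1e-110\t398\n"
    "q1\tKRAS\t88.00\t150\t18\t0\t1\t150\t5\t154\t3e-40\t180\n"
)
for hit in filter_hits(parse_outfmt6(table)):
    print(hit.subject_id, hit.percent_identity)
```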
| ### 3. Variant Calling | |
| **Identify variants:** | |
| ```python | |
| from backend.pipeline import VariantCaller | |
| caller = VariantCaller() | |
| vcf_file = caller.call_variants("alignment.bam", "reference.fa") | |
| variants = caller.filter_variants(vcf_file, min_quality=30) | |
| print(f"Identified {len(variants)} high-quality variants") | |
| ``` | |
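As background for the `min_quality=30` filter: VCF data lines are tab-separated with the fixed leading columns `CHROM POS ID REF ALT QUAL FILTER INFO`, and a quality cutoff most plausibly applies to the `QUAL` column. That reading is an assumption, since this guide does not document what `filter_variants` checks. A minimal sketch of such a filter, with an illustrative function name:

```python
def filter_vcf_lines(lines, min_quality=30.0):
    """Keep header lines and records whose QUAL is at least min_quality."""
    kept = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            kept.append(line)  # meta-information and column header lines
            continue
        qual = line.split("\t")[5]  # QUAL is the sixth column
        # "." means the quality is missing; such records are dropped here.
        if qual != "." and float(qual) >= min_quality:
            kept.append(line)
    return kept


# Made-up records, for illustration only.
vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr17\t7674220\t.\tC\tT\t58.0\tPASS\t.",
    "chr12\t25245350\t.\tC\tA\t12.5\tPASS\t.",
    "chr13\t32340300\t.\tG\tA\t.\t.\t.",
]
for line in filter_vcf_lines(vcf):
    print(line)
```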
| **Find cancer-associated variants:** | |
| ```python | |
| from backend.pipeline import VariantAnalyzer | |
| analyzer = VariantAnalyzer() | |
| cancer_variants = analyzer.identify_cancer_variants(variants) | |
| tmb = analyzer.calculate_mutation_burden(variants) | |
| print(f"Cancer variants: {len(cancer_variants)}") | |
| print(f"Tumor Mutation Burden: {tmb:.2f} mutations/Mb") | |
| ``` | |
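Tumor mutation burden is conventionally reported as the number of somatic mutations divided by the size of the sequenced region in megabases. How `calculate_mutation_burden` determines the region size is not documented here, so the sketch below takes it as an explicit argument; the function name and the 30 Mb figure in the example are illustrative assumptions.

```python
def tumor_mutation_burden(mutation_count, covered_bases):
    """Return mutations per megabase for a region of covered_bases base pairs."""
    if covered_bases <= 0:
        raise ValueError("covered_bases must be positive")
    return mutation_count / (covered_bases / 1_000_000)


# Example: 300 mutations across a hypothetical 30 Mb capture region.
print(f"{tumor_mutation_burden(300, 30_000_000):.2f} mutations/Mb")
```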
| --- | |
| ## Advanced Usage | |
| ### Custom Neo4j Queries | |
| **Direct Cypher queries:** | |
| ```python | |
| from backend.neo4j import DatabaseManager | |
| db = DatabaseManager() | |
| # Find patients with TP53 mutations | |
| query = """ | |
| MATCH (p:Patient)-[:HAS_MUTATION]->(m:Mutation)-[:AFFECTS]->(g:Gene {symbol: 'TP53'}) | |
| RETURN p.patient_id, m.position, m.consequence | |
| """ | |
| results = db.execute_query(query) | |
| for result in results: | |
|     print(result) | |
| db.close() | |
| ``` | |
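When the gene symbol comes from user input, passing it as a Cypher parameter (`$symbol`) avoids building the query by string concatenation. This guide does not say whether `DatabaseManager.execute_query` accepts parameters, so the sketch below uses the official `neo4j` Python driver directly; the function name is illustrative, and the connection defaults are the ones listed under Access Points and in `config.yml`.

```python
PATIENTS_BY_GENE = """
MATCH (p:Patient)-[:HAS_MUTATION]->(m:Mutation)-[:AFFECTS]->(g:Gene {symbol: $symbol})
RETURN p.patient_id AS patient_id, m.position AS position, m.consequence AS consequence
"""


def patients_with_mutations_in(symbol, uri="bolt://localhost:7687",
                               user="neo4j", password="cancer123"):
    """Return one dict per matching row, with the gene symbol bound as a parameter."""
    # Imported here so this module still loads where the driver is not installed.
    from neo4j import GraphDatabase

    with GraphDatabase.driver(uri, auth=(user, password)) as driver:
        with driver.session() as session:
            result = session.run(PATIENTS_BY_GENE, symbol=symbol)
            # Consume the result before the session closes.
            return [record.data() for record in result]
```

Calling `patients_with_mutations_in("TP53")` against a running database returns the same rows as the query above, and the driver and session are closed automatically by the `with` blocks.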
| ### Batch Data Import | |
| **Import GDC data:** | |
| ```python | |
| from backend.gdc import GDCClient | |
| from backend.neo4j import DataImporter | |
| # Download mutation data | |
| gdc = GDCClient() | |
| files = gdc.get_mutation_data("TCGA-BRCA", limit=10) | |
| for file in files: | |
|     gdc.download_file(file.file_id) | |
| # Import to Neo4j | |
| importer = DataImporter() | |
| importer.import_gdc_data(files) | |
| ``` | |
| ### Custom BOINC Tasks | |
| **Submit custom analysis:** | |
| ```python | |
| from backend.boinc import BOINCClient | |
| client = BOINCClient() | |
| # Submit multiple tasks | |
| input_files = ["sample1.fastq", "sample2.fastq", "sample3.fastq"] | |
| task_ids = [] | |
| for file in input_files: | |
|     task_id = client.submit_task("variant_calling", file) | |
|     task_ids.append(task_id) | |
| # Monitor progress | |
| for task_id in task_ids: | |
|     status = client.get_task_status(task_id) | |
|     print(f"Task {task_id}: {status.status}") | |
| ``` | |
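The monitoring loop above checks each task once. For long-running work it is often more convenient to poll until every task finishes or a deadline passes. The sketch below shows one way to do that; `wait_for_tasks` is an illustrative helper, the status lookup is passed in as a callable so it does not depend on the `BOINCClient` internals, and the `"Failed"` state is an assumption (this guide only lists Pending, Running, and Completed).

```python
import time

TERMINAL_STATES = {"Completed", "Failed"}  # "Failed" is an assumed state name


def wait_for_tasks(task_ids, get_status, poll_seconds=30.0, timeout_seconds=3600.0,
                   sleep=time.sleep, clock=time.monotonic):
    """Poll get_status(task_id) until every task is terminal or the deadline passes."""
    pending = set(task_ids)
    final = {}
    deadline = clock() + timeout_seconds
    while pending and clock() < deadline:
        for task_id in sorted(pending):
            state = get_status(task_id)
            if state in TERMINAL_STATES:
                final[task_id] = state
        pending.difference_update(final)  # drop tasks that just finished
        if pending:
            sleep(poll_seconds)
    for task_id in pending:
        final[task_id] = "Timed out"
    return final
```

With the client from the example above, this could be called as `wait_for_tasks(task_ids, lambda tid: client.get_task_status(tid).status)`. The `sleep` and `clock` parameters exist only to make the loop easy to test.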
| ### Configuration Customization | |
| Edit `config.yml`: | |
| ```yaml | |
| neo4j: | |
|   uri: "bolt://localhost:7687" | |
|   password: "your_password" | |
| gdc: | |
|   download_dir: "./data/gdc" | |
|   max_retries: 3 | |
| pipeline: | |
|   fastq: | |
|     quality_threshold: 25 # Increase quality threshold | |
|     min_length: 75 # Increase minimum read length | |
|   blast: | |
|     evalue: 0.0001 # More stringent e-value | |
|     num_threads: 8 # Use more CPU cores | |
| ``` | |
| --- | |
| ## Troubleshooting | |
| ### Neo4j Connection Issues | |
| ```bash | |
| # Check Neo4j status | |
| docker ps | grep neo4j | |
| # Restart Neo4j | |
| docker-compose restart neo4j | |
| # View Neo4j logs | |
| docker-compose logs neo4j | |
| ``` | |
| ### Memory Issues | |
| Increase Docker memory allocation: | |
| 1. Open Docker Desktop Settings | |
| 2. Resources → Memory | |
| 3. Increase to at least 8 GB | |
| 4. Click "Apply & Restart" | |
| ### API Errors | |
| Check logs: | |
| ```bash | |
| # View application logs | |
| cat logs/cancer_at_home.log | |
| # Follow logs in real-time | |
| tail -f logs/cancer_at_home.log | |
| ``` | |
| --- | |
| ## Best Practices | |
| 1. **Data Management**: Regularly clean up downloaded data to free space | |
| 2. **Task Monitoring**: Check BOINC tasks periodically for failures | |
| 3. **Database Backup**: Back up the Neo4j data volume regularly | |
| 4. **Resource Limits**: Monitor system resources when running large analyses | |
| 5. **API Rate Limits**: Be mindful of GDC API rate limits for bulk downloads | |
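One common way to stay within API rate limits during bulk downloads is to retry failed calls with exponential backoff. The sketch below is a generic pattern, not the behaviour of `GDCClient`; the helper name `with_backoff` is illustrative, and the default of three retries simply mirrors `max_retries: 3` in `config.yml`.

```python
import random
import time


def with_backoff(operation, max_retries=3, base_delay=1.0, max_delay=30.0,
                 sleep=time.sleep):
    """Call operation(); on failure wait base_delay * 2**attempt plus jitter, then retry."""
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except Exception:
            # Broad on purpose for the sketch; real code should catch only
            # the errors that are worth retrying (timeouts, HTTP 429, ...).
            if attempt == max_retries:
                raise
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay + random.uniform(0, delay / 2))
```

A download from the batch import example could then be wrapped as `with_backoff(lambda: gdc.download_file(file.file_id))`. The random jitter spreads out retries so that several workers do not hit the API at the same moment.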
| --- | |
| ## Support & Resources | |
| - **Documentation**: See README.md and QUICKSTART.md | |
| - **API Reference**: http://localhost:5000/docs | |
| - **GraphQL Examples**: See GRAPHQL_EXAMPLES.md | |
| - **Logs**: Check `logs/cancer_at_home.log` | |
| ### Useful Cypher Queries | |
| **Most common mutations:** | |
| ```cypher | |
| MATCH (m:Mutation)<-[:HAS_MUTATION]-(p:Patient) | |
| WITH m, count(p) as patient_count | |
| RETURN m.mutation_id, patient_count | |
| ORDER BY patient_count DESC | |
| LIMIT 10 | |
| ``` | |
| **Genes with most mutations:** | |
| ```cypher | |
| MATCH (g:Gene)<-[:AFFECTS]-(m:Mutation) | |
| WITH g, count(m) as mutation_count | |
| RETURN g.symbol, mutation_count | |
| ORDER BY mutation_count DESC | |
| LIMIT 10 | |
| ``` | |
| **Patient mutation profile:** | |
| ```cypher | |
| MATCH (p:Patient {patient_id: 'TCGA-A1-001'})-[:HAS_MUTATION]->(m:Mutation)-[:AFFECTS]->(g:Gene) | |
| RETURN g.symbol, m.consequence, m.position | |
| ``` | |