# Cancer@Home v2 - User Guide

## Table of Contents

1. [Introduction](#introduction)
2. [System Architecture](#system-architecture)
3. [Getting Started](#getting-started)
4. [Dashboard Guide](#dashboard-guide)
5. [Working with Data](#working-with-data)
6. [Analysis Pipeline](#analysis-pipeline)
7. [Advanced Usage](#advanced-usage)
8. [Troubleshooting](#troubleshooting)
9. [Best Practices](#best-practices)
10. [Support & Resources](#support--resources)

---

## Introduction

Cancer@Home v2 is a distributed computing platform for cancer genomics research that combines:

- **BOINC**: Distributed computing for computationally intensive tasks
- **GDC Portal**: Access to comprehensive cancer genomics datasets
- **Neo4j**: Graph database for modeling complex relationships
- **Bioinformatics Pipeline**: FASTQ processing, BLAST alignment, and variant calling

### Key Features

- ✓ Interactive web dashboard
- ✓ Real-time graph visualization
- ✓ GraphQL API for flexible data queries
- ✓ Distributed task processing
- ✓ Cancer genomics data integration

---

## System Architecture

```
┌──────────────────────────────────────────────────┐
│            Web Dashboard (Port 5000)             │
│  Dashboard | Neo4j Viz | BOINC | GDC | Pipeline  │
└────────────────────────┬─────────────────────────┘
                         │
┌────────────────────────┴─────────────────────────┐
│         FastAPI Backend (REST + GraphQL)         │
└───┬─────────┬─────────┬────────┬───────────┬─────┘
    │         │         │        │           │
┌───┴───┐ ┌───┴────┐ ┌──┴──┐ ┌───┴───┐ ┌─────┴─────┐
│ Neo4j │ │ BOINC  │ │ GDC │ │ FASTQ │ │ BLAST/VCF │
│ 7687  │ │ Client │ │ API │ │ Proc  │ │ Caller    │
└───────┘ └────────┘ └─────┘ └───────┘ └───────────┘
```

---

## Getting Started

### Quick Installation (5 minutes)

**Windows:**

```powershell
.\setup.ps1
python run.py
```

**Linux/macOS:**

```bash
./setup.sh
python run.py
```

### Access Points

- **Main Application**: http://localhost:5000
- **API Documentation**: http://localhost:5000/docs
- **GraphQL Playground**: http://localhost:5000/graphql
- **Neo4j Browser**: http://localhost:7474 (login: neo4j/cancer123)

---

## Dashboard Guide

### 1. Overview Tab

Shows key statistics:

- Total genes in database
- Total mutations identified
- Number of patients
- Cancer types catalogued

**Chart**: Mutation distribution across cancer types

### 2. Neo4j Visualization Tab

Interactive graph showing:

- **Blue nodes**: Genes (TP53, BRCA1, KRAS, etc.)
- **Purple nodes**: Patients
- **Pink nodes**: Cancer types
- **Lines**: Relationships between entities

**Navigation**:

- Click and drag nodes to rearrange
- Hover over nodes for details
- Zoom in/out with mouse wheel

### 3. BOINC Tasks Tab

Manage distributed computing workloads:

**Submit Task**:

1. Select task type (Variant Calling, BLAST, Alignment)
2. Enter input file path
3. Click "Submit Task"

**Monitor Tasks**:

- View all tasks with status (Pending, Running, Completed)
- See task creation time and type
- Check overall statistics

### 4. GDC Data Tab

Browse available cancer projects:

- TCGA-BRCA: Breast Cancer (1,098 cases)
- TCGA-LUAD: Lung Adenocarcinoma (585 cases)
- TCGA-COAD: Colon Adenocarcinoma (461 cases)
- TCGA-GBM: Glioblastoma (617 cases)
- TARGET-AML: Acute Myeloid Leukemia (238 cases)

Click on a project to explore available datasets.

### 5. Pipeline Tab

Quick access to bioinformatics tools:

- **FASTQ QC**: Quality control for sequencing data
- **BLAST Search**: Sequence alignment and homology
- **Variant Calling**: Identify genetic variants

---

## Working with Data

### Querying with GraphQL

Access the GraphQL playground at http://localhost:5000/graphql

**Example 1: Find mutations in TP53 gene**

```graphql
query {
  mutations(gene: "TP53") {
    mutation_id
    chromosome
    position
    consequence
  }
}
```

**Example 2: Get patient information**

```graphql
query {
  patients(project_id: "TCGA-BRCA", limit: 10) {
    patient_id
    age
    gender
    vital_status
  }
}
```

**Example 3: Cancer statistics**

```graphql
query {
  cancerStatistics(cancer_type_id: "BRCA") {
    total_patients
    total_mutations
    avg_mutations_per_patient
  }
}
```

### Using the REST API

**Get database summary:**

```bash
curl http://localhost:5000/api/neo4j/summary
```

**Search GDC files:**

```bash
curl "http://localhost:5000/api/gdc/files/TCGA-BRCA?limit=10"
```

**Submit BOINC task:**

```bash
curl -X POST http://localhost:5000/api/boinc/submit \
  -H "Content-Type: application/json" \
  -d '{"workunit_type": "variant_calling", "input_file": "data/sample.fastq"}'
```

---

## Analysis Pipeline

### 1. FASTQ Processing

**Quality Control:**

```python
from backend.pipeline import FASTQProcessor

processor = FASTQProcessor()
stats = processor.calculate_statistics("input.fastq")

print(f"Total reads: {stats['total_reads']}")
print(f"Average quality: {stats['avg_quality']}")
```

**Filter by quality:**

```python
# Reuses the `processor` created in the Quality Control example
filtered = processor.quality_filter("input.fastq", "filtered.fastq")
print(f"Pass rate: {filtered['pass_rate']:.2%}")
```

### 2. BLAST Alignment

**Run BLAST search:**

```python
from backend.pipeline import BLASTRunner

blast = BLASTRunner()
results = blast.run_blastn("query.fasta")
hits = blast.parse_results(results)

print(f"Found {len(hits)} alignments")
```

**Filter high-quality hits:**

```python
# Reuses `blast` and `hits` from the BLAST search example
filtered_hits = blast.filter_hits(hits, min_identity=0.95)
```

### 3. Variant Calling

**Identify variants:**

```python
from backend.pipeline import VariantCaller

caller = VariantCaller()
vcf_file = caller.call_variants("alignment.bam", "reference.fa")
variants = caller.filter_variants(vcf_file, min_quality=30)

print(f"Identified {len(variants)} high-quality variants")
```

**Find cancer-associated variants:**

```python
from backend.pipeline import VariantAnalyzer

# Reuses `variants` from the variant calling example
analyzer = VariantAnalyzer()
cancer_variants = analyzer.identify_cancer_variants(variants)
tmb = analyzer.calculate_mutation_burden(variants)

print(f"Cancer variants: {len(cancer_variants)}")
print(f"Tumor Mutation Burden: {tmb:.2f} mutations/Mb")
```

---

## Advanced Usage

### Custom Neo4j Queries

**Direct Cypher queries:**

```python
from backend.neo4j import DatabaseManager

db = DatabaseManager()

# Find patients with TP53 mutations
query = """
MATCH (p:Patient)-[:HAS_MUTATION]->(m:Mutation)-[:AFFECTS]->(g:Gene {symbol: 'TP53'})
RETURN p.patient_id, m.position, m.consequence
"""

try:
    results = db.execute_query(query)
    for result in results:
        print(result)
finally:
    # Release the connection even if the query fails
    db.close()
```

### Batch Data Import

**Import GDC data:**

```python
from backend.gdc import GDCClient
from backend.neo4j import DataImporter

# Download mutation data
gdc = GDCClient()
files = gdc.get_mutation_data("TCGA-BRCA", limit=10)

for file in files:
    gdc.download_file(file.file_id)

# Import to Neo4j
importer = DataImporter()
importer.import_gdc_data(files)
```

### Custom BOINC Tasks

**Submit custom analysis:**

```python
from backend.boinc import BOINCClient

client = BOINCClient()

# Submit multiple tasks
input_files = ["sample1.fastq", "sample2.fastq", "sample3.fastq"]
task_ids = []

for file in input_files:
    task_id = client.submit_task("variant_calling", file)
    task_ids.append(task_id)

# Monitor progress
for task_id in task_ids:
    status = client.get_task_status(task_id)
    print(f"Task {task_id}: {status.status}")
```

### Configuration Customization

Edit `config.yml`:

```yaml
neo4j:
  uri: "bolt://localhost:7687"
  password: "your_password"

gdc:
  download_dir: "./data/gdc"
  max_retries: 3

pipeline:
  fastq:
    quality_threshold: 25  # Increase quality threshold
    min_length: 75         # Increase minimum read length
  blast:
    evalue: 0.0001         # More stringent e-value
    num_threads: 8         # Use more CPU cores
```

---

## Troubleshooting

### Neo4j Connection Issues

```bash
# Check Neo4j status
docker ps | grep neo4j

# Restart Neo4j
docker-compose restart neo4j

# View Neo4j logs
docker-compose logs neo4j
```

### Memory Issues

Increase Docker memory allocation:

1. Open Docker Desktop Settings
2. Resources → Memory
3. Increase to at least 8GB
4. Click "Apply & Restart"

### API Errors

Check logs:

```bash
# View application logs
cat logs/cancer_at_home.log

# Follow logs in real-time
tail -f logs/cancer_at_home.log
```

---

## Best Practices

1. **Data Management**: Regularly clean up downloaded data to free space
2. **Task Monitoring**: Check BOINC tasks periodically for failures
3. **Database Backup**: Back up the Neo4j data volume regularly
4. **Resource Limits**: Monitor system resources when running large analyses
5. **API Rate Limits**: Be mindful of GDC API rate limits for bulk downloads

---

## Support & Resources

- **Documentation**: See README.md and QUICKSTART.md
- **API Reference**: http://localhost:5000/docs
- **GraphQL Examples**: See GRAPHQL_EXAMPLES.md
- **Logs**: Check `logs/cancer_at_home.log`

### Useful Cypher Queries

**Most common mutations:**

```cypher
MATCH (m:Mutation)<-[:HAS_MUTATION]-(p:Patient)
WITH m, count(p) as patient_count
RETURN m.mutation_id, patient_count
ORDER BY patient_count DESC
LIMIT 10
```

**Genes with most mutations:**

```cypher
MATCH (g:Gene)<-[:AFFECTS]-(m:Mutation)
WITH g, count(m) as mutation_count
RETURN g.symbol, mutation_count
ORDER BY mutation_count DESC
LIMIT 10
```

**Patient mutation profile:**

```cypher
MATCH (p:Patient {patient_id: 'TCGA-A1-001'})-[:HAS_MUTATION]->(m:Mutation)-[:AFFECTS]->(g:Gene)
RETURN g.symbol, m.consequence, m.position
```
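
**Parameterized patient profile (sketch):**

The patient mutation profile query above hard-codes the patient ID in the Cypher text. A minimal sketch of a reusable variant is shown here, using Cypher's `$name` parameter syntax so the ID is passed separately from the query. The helper `build_patient_profile_query` is hypothetical and not part of Cancer@Home; whether `DatabaseManager.execute_query` accepts a parameter map is also an assumption this guide does not confirm.

```python
from typing import Any, Dict, Tuple

# Same pattern as the "Patient mutation profile" query, with the literal
# patient ID replaced by the Cypher parameter $patient_id.
PATIENT_PROFILE_QUERY = """
MATCH (p:Patient {patient_id: $patient_id})-[:HAS_MUTATION]->(m:Mutation)-[:AFFECTS]->(g:Gene)
RETURN g.symbol, m.consequence, m.position
"""


def build_patient_profile_query(patient_id: str) -> Tuple[str, Dict[str, Any]]:
    """Hypothetical helper: return the Cypher text and its parameter map."""
    if not patient_id or not patient_id.strip():
        raise ValueError("patient_id must be a non-empty string")
    return PATIENT_PROFILE_QUERY.strip(), {"patient_id": patient_id.strip()}


query, params = build_patient_profile_query("TCGA-A1-001")
print(query)
print(params)
```

Keeping the ID out of the query string avoids quoting mistakes and lets Neo4j reuse the query plan across patients. With the official Neo4j Python driver, the pair can be passed as `session.run(query, params)`; how it maps onto this project's `DatabaseManager` depends on that class's actual signature.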