Cancer@Home v2 - User Guide
Table of Contents
- Introduction
- System Architecture
- Getting Started
- Dashboard Guide
- Working with Data
- Analysis Pipeline
- Advanced Usage
Introduction
Cancer@Home v2 is a distributed computing platform for cancer genomics research that combines:
- BOINC: Distributed computing for computationally intensive tasks
- GDC Portal: Access to comprehensive cancer genomics datasets
- Neo4j: Graph database for modeling complex relationships
- Bioinformatics Pipeline: FASTQ processing, BLAST alignment, and variant calling
Key Features
- Interactive web dashboard
- Real-time graph visualization
- GraphQL API for flexible data queries
- Distributed task processing
- Cancer genomics data integration
System Architecture
+---------------------------------------------------+
|             Web Dashboard (Port 5000)             |
|  Dashboard | Neo4j Viz | BOINC | GDC | Pipeline   |
+-------------------------+-------------------------+
                          |
+-------------------------+-------------------------+
|         FastAPI Backend (REST + GraphQL)          |
+----+---------+---------+--------+-----------+-----+
     |         |         |        |           |
 +---+---+ +---+----+ +--+--+ +---+---+ +-----+-----+
 | Neo4j | | BOINC  | | GDC | | FASTQ | | BLAST/VCF |
 | 7687  | | Client | | API | | Proc  | |  Caller   |
 +-------+ +--------+ +-----+ +-------+ +-----------+
Getting Started
Quick Installation (5 minutes)
Windows:
.\setup.ps1
python run.py
Linux/Mac:
./setup.sh
python run.py
Access Points
- Main Application: http://localhost:5000
- API Documentation: http://localhost:5000/docs
- GraphQL Playground: http://localhost:5000/graphql
- Neo4j Browser: http://localhost:7474 (neo4j/cancer123)
Dashboard Guide
1. Overview Tab
Shows key statistics:
- Total genes in database
- Total mutations identified
- Number of patients
- Cancer types catalogued
Chart: Mutation distribution across cancer types
2. Neo4j Visualization Tab
Interactive graph showing:
- Blue nodes: Genes (TP53, BRCA1, KRAS, etc.)
- Purple nodes: Patients
- Pink nodes: Cancer types
- Lines: Relationships between entities
Navigation:
- Click and drag nodes to rearrange
- Hover over nodes for details
- Zoom in/out with mouse wheel
3. BOINC Tasks Tab
Manage distributed computing workloads:
Submit Task:
- Select task type (Variant Calling, BLAST, Alignment)
- Enter input file path
- Click "Submit Task"
Monitor Tasks:
- View all tasks with status (Pending, Running, Completed)
- See task creation time and type
- Check overall statistics
4. GDC Data Tab
Browse available cancer projects:
- TCGA-BRCA: Breast Cancer (1,098 cases)
- TCGA-LUAD: Lung Adenocarcinoma (585 cases)
- TCGA-COAD: Colon Adenocarcinoma (461 cases)
- TCGA-GBM: Glioblastoma (617 cases)
- TARGET-AML: Acute Myeloid Leukemia (238 cases)
Click on a project to explore available datasets.
5. Pipeline Tab
Quick access to bioinformatics tools:
- FASTQ QC: Quality control for sequencing data
- BLAST Search: Sequence alignment and homology search
- Variant Calling: Identify genetic variants
Working with Data
Querying with GraphQL
Access the GraphQL playground at http://localhost:5000/graphql
Example 1: Find mutations in TP53 gene
query {
  mutations(gene: "TP53") {
    mutation_id
    chromosome
    position
    consequence
  }
}
Example 2: Get patient information
query {
  patients(project_id: "TCGA-BRCA", limit: 10) {
    patient_id
    age
    gender
    vital_status
  }
}
Example 3: Cancer statistics
query {
  cancerStatistics(cancer_type_id: "BRCA") {
    total_patients
    total_mutations
    avg_mutations_per_patient
  }
}
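The same queries can also be sent from a script rather than the playground. The following is a minimal sketch using only the Python standard library. It assumes the endpoint accepts the usual GraphQL-over-HTTP convention (a JSON body with a `query` key sent via POST), which this guide does not confirm. The helper names `build_graphql_request` and `run_graphql` are illustrative and not part of the project.

```python
import json
import urllib.request

# Endpoint taken from the "Access Points" list in this guide.
GRAPHQL_URL = "http://localhost:5000/graphql"

TP53_QUERY = """
query {
  mutations(gene: "TP53") {
    mutation_id
    chromosome
    position
    consequence
  }
}
"""


def build_graphql_request(query, variables=None, url=GRAPHQL_URL):
    """Build (but do not send) a POST request carrying a GraphQL query."""
    payload = {"query": query}
    if variables:
        payload["variables"] = variables
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_graphql(query, variables=None, url=GRAPHQL_URL, timeout=30):
    """Send the query and return the decoded JSON response.

    Assumption: the server replies with a JSON document.
    """
    request = build_graphql_request(query, variables, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage (requires the application to be running locally):
#   result = run_graphql(TP53_QUERY)
```

Keeping request construction separate from sending makes it possible to inspect the payload without a running server.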
Using the REST API
Get database summary:
curl http://localhost:5000/api/neo4j/summary
Search GDC files:
curl "http://localhost:5000/api/gdc/files/TCGA-BRCA?limit=10"
Submit BOINC task:
curl -X POST http://localhost:5000/api/boinc/submit \
  -H "Content-Type: application/json" \
  -d '{"workunit_type": "variant_calling", "input_file": "data/sample.fastq"}'
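For scripted use, the curl calls above translate fairly directly to Python. This sketch mirrors the endpoints and JSON fields shown in the curl examples; the function names are hypothetical, and it assumes the API returns JSON, which this guide does not state.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:5000"  # main application address from this guide


def gdc_files_url(project_id, limit=10, base_url=BASE_URL):
    """Build the URL for the GDC file search endpoint."""
    project = urllib.parse.quote(project_id, safe="")
    query = urllib.parse.urlencode({"limit": limit})
    return f"{base_url}/api/gdc/files/{project}?{query}"


def boinc_submit_request(workunit_type, input_file, base_url=BASE_URL):
    """Build the POST request that submits a BOINC task."""
    body = json.dumps({"workunit_type": workunit_type, "input_file": input_file})
    return urllib.request.Request(
        f"{base_url}/api/boinc/submit",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request_or_url, timeout=30):
    """Send a request (or plain URL) and decode the response.

    Assumption: the response body is JSON.
    """
    with urllib.request.urlopen(request_or_url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage (requires the application to be running locally):
#   summary = send(f"{BASE_URL}/api/neo4j/summary")
#   files = send(gdc_files_url("TCGA-BRCA", limit=10))
#   task = send(boinc_submit_request("variant_calling", "data/sample.fastq"))
```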
Analysis Pipeline
1. FASTQ Processing
Quality Control:
from backend.pipeline import FASTQProcessor
processor = FASTQProcessor()
stats = processor.calculate_statistics("input.fastq")
print(f"Total reads: {stats['total_reads']}")
print(f"Average quality: {stats['avg_quality']}")
Filter by quality:
filtered = processor.quality_filter("input.fastq", "filtered.fastq")
print(f"Pass rate: {filtered['pass_rate']:.2%}")
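To make the reported numbers concrete, here is a standalone sketch of how read counts, average quality, and pass rate could be computed from a FASTQ file. It is not the `FASTQProcessor` implementation, which may differ. It assumes four-line records, Phred+33 quality encoding, and a per-read mean quality filter; the default thresholds are illustrative.

```python
import io


def fastq_records(handle):
    """Yield (header, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file (or a blank line)
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        quality = handle.readline().rstrip("\n")
        yield header, sequence, quality


def mean_phred(quality, offset=33):
    """Mean Phred score of one read, assuming Phred+33 encoding."""
    if not quality:
        return 0.0
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def summarize(handle, quality_threshold=20, min_length=1):
    """Return read count, mean of per-read mean qualities, and pass rate."""
    total = 0
    passed = 0
    quality_sum = 0.0
    for _, sequence, quality in fastq_records(handle):
        total += 1
        score = mean_phred(quality)
        quality_sum += score
        if len(sequence) >= min_length and score >= quality_threshold:
            passed += 1
    return {
        "total_reads": total,
        "avg_quality": quality_sum / total if total else 0.0,
        "pass_rate": passed / total if total else 0.0,
    }


# Two made-up reads: one of uniformly high quality ('I' = 40), one of zero quality ('!').
sample = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\n!!!!\n")
stats = summarize(sample)
```

Note that `avg_quality` here is the mean of per-read means; a per-base average would weight long reads more heavily.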
2. BLAST Alignment
Run BLAST search:
from backend.pipeline import BLASTRunner
blast = BLASTRunner()
results = blast.run_blastn("query.fasta")
hits = blast.parse_results(results)
print(f"Found {len(hits)} alignments")
Filter high-quality hits:
filtered_hits = blast.filter_hits(hits, min_identity=0.95)
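As an illustration of what identity filtering involves, the sketch below parses BLAST tabular output (the standard 12-column `-outfmt 6` layout) and keeps hits above an identity threshold. It is independent of `BLASTRunner`, whose internal format this guide does not describe. BLAST reports percent identity on a 0 to 100 scale, so the sketch converts it to a fraction to match the `min_identity=0.95` convention used above. The `max_evalue` default is an assumption.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    query_id: str
    subject_id: str
    identity: float  # fraction in [0, 1]
    length: int
    evalue: float
    bitscore: float


def parse_tabular(text):
    """Parse BLAST -outfmt 6 text into Hit records, skipping blanks and comments."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        f = line.split("\t")
        # Columns: qseqid sseqid pident length mismatch gapopen
        #          qstart qend sstart send evalue bitscore
        hits.append(
            Hit(f[0], f[1], float(f[2]) / 100.0, int(f[3]), float(f[10]), float(f[11]))
        )
    return hits


def filter_hits(hits, min_identity=0.95, max_evalue=1e-4):
    """Keep hits that meet both the identity and e-value cutoffs."""
    return [h for h in hits if h.identity >= min_identity and h.evalue <= max_evalue]


# Made-up example rows, for illustration only.
example = (
    "q1\ts1\t98.50\t100\t1\t0\t1\t100\t1\t100\t1e-50\t180\n"
    "q1\ts2\t80.00\t50\t10\t0\t1\t50\t1\t50\t0.5\t30\n"
)
good = filter_hits(parse_tabular(example))
```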
3. Variant Calling
Identify variants:
from backend.pipeline import VariantCaller
caller = VariantCaller()
vcf_file = caller.call_variants("alignment.bam", "reference.fa")
variants = caller.filter_variants(vcf_file, min_quality=30)
print(f"Identified {len(variants)} high-quality variants")
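A quality filter of this kind usually comes down to a threshold on the VCF `QUAL` column. The sketch below shows that idea on raw VCF lines. It is not the `VariantCaller` implementation, and it makes one explicit assumption: records with a missing `QUAL` (`.`) are dropped.

```python
def filter_vcf_lines(lines, min_quality=30.0):
    """Return simplified records for VCF data lines with QUAL >= min_quality."""
    kept = []
    for line in lines:
        if line.startswith("#"):
            continue  # meta-information and header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue  # not a complete VCF data line
        # Fixed VCF columns: CHROM POS ID REF ALT QUAL FILTER INFO
        qual = fields[5]
        if qual == ".":
            continue  # assumption: treat missing quality as failing the filter
        if float(qual) >= min_quality:
            kept.append(
                {
                    "chrom": fields[0],
                    "pos": int(fields[1]),
                    "ref": fields[3],
                    "alt": fields[4],
                    "qual": float(qual),
                }
            )
    return kept


# Made-up records, for illustration only.
example_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tC\tT\t50\tPASS\t.",
    "chr1\t200\t.\tG\tA\t12.5\tPASS\t.",
    "chr2\t300\t.\tC\tA\t.\tPASS\t.",
]
high_quality = filter_vcf_lines(example_vcf, min_quality=30)
```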
Find cancer-associated variants:
from backend.pipeline import VariantAnalyzer
analyzer = VariantAnalyzer()
cancer_variants = analyzer.identify_cancer_variants(variants)
tmb = analyzer.calculate_mutation_burden(variants)
print(f"Cancer variants: {len(cancer_variants)}")
print(f"Tumor Mutation Burden: {tmb:.2f} mutations/Mb")
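Tumor mutation burden is conventionally expressed as mutations per megabase of sequence that was assessed. The sketch below shows that arithmetic. How `calculate_mutation_burden` determines the region size is not described in this guide, so the `callable_bases` parameter is an assumption.

```python
def tumor_mutation_burden(variant_count, callable_bases):
    """Mutations per megabase over the region that was actually assessed."""
    if callable_bases <= 0:
        raise ValueError("callable_bases must be positive")
    return variant_count / (callable_bases / 1_000_000)


# Hypothetical numbers: 150 variants over a 30 Mb callable region.
tmb_example = tumor_mutation_burden(150, 30_000_000)
print(f"Tumor Mutation Burden: {tmb_example:.2f} mutations/Mb")
```

The result depends strongly on the denominator, so TMB values are only comparable when computed over the same target region.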
Advanced Usage
Custom Neo4j Queries
Direct Cypher queries:
from backend.neo4j import DatabaseManager
db = DatabaseManager()
# Find patients with TP53 mutations
query = """
MATCH (p:Patient)-[:HAS_MUTATION]->(m:Mutation)-[:AFFECTS]->(g:Gene {symbol: 'TP53'})
RETURN p.patient_id, m.position, m.consequence
"""
results = db.execute_query(query)
for result in results:
    print(result)
db.close()
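When the gene symbol comes from user input, passing it as a Cypher parameter avoids building the query by string concatenation. Whether `DatabaseManager.execute_query` accepts parameters is not documented here, so this sketch uses the official `neo4j` Python driver directly. The helper names are hypothetical, and the connection details must match your own setup.

```python
# Same pattern as the query above, with the gene symbol as a $symbol parameter.
PATIENTS_BY_GENE = """
MATCH (p:Patient)-[:HAS_MUTATION]->(m:Mutation)-[:AFFECTS]->(g:Gene {symbol: $symbol})
RETURN p.patient_id AS patient_id, m.position AS position, m.consequence AS consequence
"""


def build_gene_query(symbol):
    """Return (query, params) for patients with mutations in the given gene."""
    symbol = symbol.strip()
    if not symbol:
        raise ValueError("gene symbol must not be empty")
    return PATIENTS_BY_GENE, {"symbol": symbol}


def run_read_query(uri, user, password, query, params):
    """Run a query with the official neo4j driver and return rows as dicts."""
    # Imported lazily so the module loads even if the driver is not installed.
    from neo4j import GraphDatabase

    driver = GraphDatabase.driver(uri, auth=(user, password))
    try:
        with driver.session() as session:
            return [record.data() for record in session.run(query, **params)]
    finally:
        driver.close()


# Usage (requires a running Neo4j instance):
#   query, params = build_gene_query("TP53")
#   rows = run_read_query("bolt://localhost:7687", "neo4j", "your_password", query, params)
```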
Batch Data Import
Import GDC data:
from backend.gdc import GDCClient
from backend.neo4j import DataImporter
# Download mutation data
gdc = GDCClient()
files = gdc.get_mutation_data("TCGA-BRCA", limit=10)
for file in files:
    gdc.download_file(file.file_id)
# Import to Neo4j
importer = DataImporter()
importer.import_gdc_data(files)
Custom BOINC Tasks
Submit custom analysis:
from backend.boinc import BOINCClient
client = BOINCClient()
# Submit multiple tasks
input_files = ["sample1.fastq", "sample2.fastq", "sample3.fastq"]
task_ids = []
for file in input_files:
    task_id = client.submit_task("variant_calling", file)
    task_ids.append(task_id)
# Monitor progress
for task_id in task_ids:
    status = client.get_task_status(task_id)
    print(f"Task {task_id}: {status.status}")
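The monitoring loop above checks each task once. For long-running work, a polling loop with a timeout is a common pattern. The sketch below relies only on `get_task_status(task_id).status` as used above. The status strings follow the Pending/Running/Completed values listed in the dashboard section, compared case-insensitively, and the `failed` state is an assumption.

```python
import time

# "completed" comes from the dashboard status list; "failed" is an assumed extra state.
TERMINAL_STATES = {"completed", "failed"}


def wait_for_tasks(client, task_ids, poll_seconds=5.0, timeout_seconds=600.0,
                   sleep=time.sleep, clock=time.monotonic):
    """Poll until every task reaches a terminal state or the timeout expires.

    Returns a dict mapping task id to its final state, or "timeout".
    """
    pending = set(task_ids)
    final = {}
    deadline = clock() + timeout_seconds
    while pending and clock() < deadline:
        for task_id in list(pending):
            state = str(client.get_task_status(task_id).status).lower()
            if state in TERMINAL_STATES:
                final[task_id] = state
        pending.difference_update(final)
        if pending:
            sleep(poll_seconds)
    for task_id in pending:
        final[task_id] = "timeout"
    return final


# Usage, with the client and task_ids from the example above:
#   results = wait_for_tasks(client, task_ids, poll_seconds=10)
```

The `sleep` and `clock` parameters exist so the loop can be exercised in tests without real waiting.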
Configuration Customization
Edit config.yml:
neo4j:
  uri: "bolt://localhost:7687"
  password: "your_password"
gdc:
  download_dir: "./data/gdc"
  max_retries: 3
pipeline:
  fastq:
    quality_threshold: 25  # Increase quality threshold
    min_length: 75         # Increase minimum read length
  blast:
    evalue: 0.0001         # More stringent e-value
    num_threads: 8         # Use more CPU cores
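If you want to read these settings from your own scripts, a small helper for nested keys can be convenient. This is a sketch only. How the application itself loads `config.yml` is not described here; `load_config` assumes the third-party PyYAML package is installed, and the key layout assumes `fastq` and `blast` are nested under `pipeline`.

```python
def get_setting(config, dotted_key, default=None):
    """Look up a nested value such as "pipeline.fastq.quality_threshold"."""
    node = config
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return default
        node = node[part]
    return node


def load_config(path="config.yml"):
    """Parse the YAML file into a dict (requires PyYAML)."""
    import yaml  # imported lazily; third-party dependency

    with open(path, "r", encoding="utf-8") as handle:
        return yaml.safe_load(handle) or {}


# Usage:
#   config = load_config()
#   threshold = get_setting(config, "pipeline.fastq.quality_threshold", default=20)
```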
Troubleshooting
Neo4j Connection Issues
# Check Neo4j status
docker ps | grep neo4j
# Restart Neo4j
docker-compose restart neo4j
# View Neo4j logs
docker-compose logs neo4j
Memory Issues
Increase Docker memory allocation:
- Open Docker Desktop Settings
- Resources → Memory
- Increase to at least 8GB
- Click "Apply & Restart"
API Errors
Check logs:
# View application logs
cat logs/cancer_at_home.log
# Follow logs in real-time
tail -f logs/cancer_at_home.log
Best Practices
- Data Management: Regularly clean up downloaded data to free space
- Task Monitoring: Check BOINC tasks periodically for failures
- Database Backup: Back up the Neo4j data volume regularly
- Resource Limits: Monitor system resources when running large analyses
- API Rate Limits: Be mindful of GDC API rate limits for bulk downloads
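One common way to stay within rate limits during bulk downloads is to retry failed calls with exponential backoff. The wrapper below is a generic sketch. Whether `GDCClient` already retries internally (the `max_retries` setting suggests it may) is not stated in this guide, and the delay values are illustrative.

```python
import time


def with_retries(operation, max_retries=3, base_delay=1.0,
                 sleep=time.sleep, retry_on=(OSError,)):
    """Call operation(); on failure wait base_delay * 2**attempt and retry.

    Re-raises the last error once max_retries retries have been used.
    OSError covers urllib.error.URLError and socket timeouts.
    """
    attempt = 0
    while True:
        try:
            return operation()
        except retry_on:
            if attempt >= max_retries:
                raise
            sleep(base_delay * (2 ** attempt))
            attempt += 1


# Usage, with the gdc client and files from "Batch Data Import":
#   for file in files:
#       with_retries(lambda: gdc.download_file(file.file_id))
```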
Support & Resources
- Documentation: See README.md and QUICKSTART.md
- API Reference: http://localhost:5000/docs
- GraphQL Examples: See GRAPHQL_EXAMPLES.md
- Logs: Check logs/cancer_at_home.log
Useful Cypher Queries
Most common mutations:
MATCH (m:Mutation)<-[:HAS_MUTATION]-(p:Patient)
WITH m, count(p) as patient_count
RETURN m.mutation_id, patient_count
ORDER BY patient_count DESC
LIMIT 10
Genes with most mutations:
MATCH (g:Gene)<-[:AFFECTS]-(m:Mutation)
WITH g, count(m) as mutation_count
RETURN g.symbol, mutation_count
ORDER BY mutation_count DESC
LIMIT 10
Patient mutation profile:
MATCH (p:Patient {patient_id: 'TCGA-A1-001'})-[:HAS_MUTATION]->(m:Mutation)-[:AFFECTS]->(g:Gene)
RETURN g.symbol, m.consequence, m.position