
Cancer@Home v2 - User Guide

Table of Contents

  1. Introduction
  2. System Architecture
  3. Getting Started
  4. Dashboard Guide
  5. Working with Data
  6. Analysis Pipeline
  7. Advanced Usage
  8. Troubleshooting
  9. Best Practices
  10. Support & Resources

Introduction

Cancer@Home v2 is a distributed computing platform for cancer genomics research that combines:

  • BOINC: Distributed computing for computationally intensive tasks
  • GDC Portal: Access to comprehensive cancer genomics datasets
  • Neo4j: Graph database for modeling complex relationships
  • Bioinformatics Pipeline: FASTQ processing, BLAST alignment, and variant calling

Key Features

✓ Interactive web dashboard
✓ Real-time graph visualization
✓ GraphQL API for flexible data queries
✓ Distributed task processing
✓ Cancer genomics data integration


System Architecture

┌─────────────────────────────────────────────────┐
│              Web Dashboard (Port 5000)          │
│  Dashboard | Neo4j Viz | BOINC | GDC | Pipeline │
└────────────────────┬────────────────────────────┘
                     │
┌────────────────────┴────────────────────────────┐
│           FastAPI Backend (REST + GraphQL)      │
└─────┬───────┬───────┬───────┬─────────┬─────────┘
      │       │       │       │         │
   ┌──┴──┐ ┌──┴───┐ ┌─┴──┐ ┌──┴──┐ ┌────┴────┐
   │Neo4j│ │BOINC │ │GDC │ │FASTQ│ │BLAST/VCF│
   │7687 │ │Client│ │API │ │Proc │ │ Caller  │
   └─────┘ └──────┘ └────┘ └─────┘ └─────────┘

Getting Started

Quick Installation (5 minutes)

Windows:

.\setup.ps1
python run.py

Linux/Mac:

./setup.sh
python run.py

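Access Points

Once run.py is running, the services referenced in this guide are reachable at:

  • Dashboard: http://localhost:5000
  • GraphQL playground: http://localhost:5000/graphql
  • API reference: http://localhost:5000/docs
  • Neo4j (Bolt): bolt://localhost:7687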


Dashboard Guide

1. Overview Tab

Shows key statistics:

  • Total genes in database
  • Total mutations identified
  • Number of patients
  • Cancer types catalogued

Chart: Mutation distribution across cancer types

2. Neo4j Visualization Tab

Interactive graph showing:

  • Blue nodes: Genes (TP53, BRCA1, KRAS, etc.)
  • Purple nodes: Patients
  • Pink nodes: Cancer types
  • Lines: Relationships between entities

Navigation:

  • Click and drag nodes to rearrange
  • Hover over nodes for details
  • Zoom in/out with mouse wheel

3. BOINC Tasks Tab

Manage distributed computing workloads:

Submit Task:

  1. Select task type (Variant Calling, BLAST, Alignment)
  2. Enter input file path
  3. Click "Submit Task"

Monitor Tasks:

  • View all tasks with status (Pending, Running, Completed)
  • See task creation time and type
  • Check overall statistics

4. GDC Data Tab

Browse available cancer projects:

  • TCGA-BRCA: Breast Cancer (1,098 cases)
  • TCGA-LUAD: Lung Adenocarcinoma (585 cases)
  • TCGA-COAD: Colon Adenocarcinoma (461 cases)
  • TCGA-GBM: Glioblastoma (617 cases)
  • TARGET-AML: Acute Myeloid Leukemia (238 cases)

Click on a project to explore available datasets.
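
The project IDs listed above are the identifiers the public GDC API uses. As a rough sketch of what a per-project file lookup might look like outside the dashboard, the snippet below builds (but does not send) a URL for the public https://api.gdc.cancer.gov/files endpoint. The helper name build_gdc_files_url and the selected fields are illustrative assumptions and are not part of Cancer@Home.

```python
import json
from urllib.parse import urlencode

GDC_FILES_ENDPOINT = "https://api.gdc.cancer.gov/files"  # public GDC API

def build_gdc_files_url(project_id, size=10):
    """Return a GDC /files URL filtered to one project (hypothetical helper)."""
    filters = {
        "op": "in",
        "content": {
            "field": "cases.project.project_id",
            "value": [project_id],
        },
    }
    params = {
        "filters": json.dumps(filters),           # filters are passed as JSON text
        "fields": "file_id,file_name,data_type",  # assumed selection of fields
        "format": "JSON",
        "size": str(size),                        # maximum number of results
    }
    return f"{GDC_FILES_ENDPOINT}?{urlencode(params)}"

url = build_gdc_files_url("TCGA-BRCA", size=5)
print(url)
```

The URL can be opened with any HTTP client; keeping size small during exploration helps stay within the GDC rate limits.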

5. Pipeline Tab

Quick access to bioinformatics tools:

  • FASTQ QC: Quality control for sequencing data
  • BLAST Search: Sequence alignment and homology search
  • Variant Calling: Identify genetic variants

Working with Data

Querying with GraphQL

Access the GraphQL playground at http://localhost:5000/graphql

Example 1: Find mutations in TP53 gene

query {
  mutations(gene: "TP53") {
    mutation_id
    chromosome
    position
    consequence
  }
}

Example 2: Get patient information

query {
  patients(project_id: "TCGA-BRCA", limit: 10) {
    patient_id
    age
    gender
    vital_status
  }
}

Example 3: Cancer statistics

query {
  cancerStatistics(cancer_type_id: "BRCA") {
    total_patients
    total_mutations
    avg_mutations_per_patient
  }
}
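
The same queries can be sent from a script rather than the playground. GraphQL servers conventionally accept a JSON POST body with a "query" key, so a minimal sketch, assuming the endpoint at http://localhost:5000/graphql follows that convention, might look like this. The helper build_graphql_request is a hypothetical name, and the request is only constructed here, not sent.

```python
import json
import urllib.request

GRAPHQL_URL = "http://localhost:5000/graphql"  # endpoint named in this guide

TP53_QUERY = """
query {
  mutations(gene: "TP53") {
    mutation_id
    chromosome
    position
    consequence
  }
}
"""

def build_graphql_request(query, url=GRAPHQL_URL):
    """Wrap a GraphQL query in a JSON POST request (hypothetical helper)."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

request = build_graphql_request(TP53_QUERY)
print(request.get_method(), request.full_url)

# With the server running, the request could be sent like this:
# with urllib.request.urlopen(request) as response:
#     payload = json.load(response)
```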

Using the REST API

Get database summary:

curl http://localhost:5000/api/neo4j/summary

Search GDC files:

curl "http://localhost:5000/api/gdc/files/TCGA-BRCA?limit=10"

Submit BOINC task:

curl -X POST http://localhost:5000/api/boinc/submit \
  -H "Content-Type: application/json" \
  -d '{"workunit_type": "variant_calling", "input_file": "data/sample.fastq"}'
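
For scripted use, the curl calls above translate directly to Python's standard library. The sketch below only builds the request objects so it runs without a server. The helper names are hypothetical, and the paths and JSON keys are copied from the curl examples.

```python
import json
import urllib.request
from urllib.parse import urlencode

BASE_URL = "http://localhost:5000"  # address used throughout this guide

def gdc_files_request(project_id, limit=10):
    """GET /api/gdc/files/<project_id>?limit=N (hypothetical helper)."""
    query = urlencode({"limit": limit})
    return urllib.request.Request(f"{BASE_URL}/api/gdc/files/{project_id}?{query}")

def submit_task_request(workunit_type, input_file):
    """POST /api/boinc/submit with a JSON body (hypothetical helper)."""
    body = json.dumps({"workunit_type": workunit_type, "input_file": input_file})
    return urllib.request.Request(
        f"{BASE_URL}/api/boinc/submit",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

files_req = gdc_files_request("TCGA-BRCA", limit=10)
submit_req = submit_task_request("variant_calling", "data/sample.fastq")
print(files_req.get_method(), files_req.full_url)
print(submit_req.get_method(), submit_req.full_url)

# To send a request once the backend is running:
# with urllib.request.urlopen(submit_req) as response:
#     print(json.load(response))
```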

Analysis Pipeline

1. FASTQ Processing

Quality Control:

from backend.pipeline import FASTQProcessor

processor = FASTQProcessor()
stats = processor.calculate_statistics("input.fastq")
print(f"Total reads: {stats['total_reads']}")
print(f"Average quality: {stats['avg_quality']}")

Filter by quality:

filtered = processor.quality_filter("input.fastq", "filtered.fastq")
print(f"Pass rate: {filtered['pass_rate']:.2%}")
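
To make the reported numbers concrete, here is a standalone sketch of how read counts, average quality, and pass rate can be derived from FASTQ text. It is not the FASTQProcessor implementation. It assumes four-line records, Phred+33 quality encoding, and a per-read mean-quality cutoff, and the function names are illustrative.

```python
import io

def fastq_records(handle):
    """Yield (header, sequence, quality) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        quality = handle.readline().rstrip("\n")
        yield header, sequence, quality

def mean_phred(quality, offset=33):
    """Mean Phred score of one read, assuming Phred+33 encoding."""
    return sum(ord(ch) - offset for ch in quality) / len(quality)

def summarize(handle, min_mean_quality=20):
    """Count reads, average their mean quality, and compute the pass rate."""
    total = passed = 0
    quality_sum = 0.0
    for _, _, quality in fastq_records(handle):
        score = mean_phred(quality)
        total += 1
        quality_sum += score
        if score >= min_mean_quality:
            passed += 1
    return {
        "total_reads": total,
        "avg_quality": quality_sum / total if total else 0.0,
        "pass_rate": passed / total if total else 0.0,
    }

sample = io.StringIO(
    "@read1\nACGT\n+\nIIII\n"  # 'I' encodes Phred 40
    "@read2\nACGT\n+\n!!!!\n"  # '!' encodes Phred 0
)
stats = summarize(sample)
print(stats)
```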

2. BLAST Alignment

Run BLAST search:

from backend.pipeline import BLASTRunner

blast = BLASTRunner()
results = blast.run_blastn("query.fasta")
hits = blast.parse_results(results)

print(f"Found {len(hits)} alignments")

Filter high-quality hits:

filtered_hits = blast.filter_hits(hits, min_identity=0.95)
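
The guide does not say whether min_identity is a fraction or a percentage. The sketch below assumes a fraction (0.95 meaning 95%) and shows the idea on BLAST's standard 12-column tabular output (-outfmt 6), where percent identity is the third column. The Hit class and both function names are hypothetical and independent of BLASTRunner.

```python
from dataclasses import dataclass

@dataclass
class Hit:
    query_id: str
    subject_id: str
    percent_identity: float  # 0-100, as reported by BLAST
    alignment_length: int
    evalue: float
    bit_score: float

def parse_outfmt6(text):
    """Parse BLAST tabular output (-outfmt 6, default 12 columns)."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank and comment lines
        cols = line.split("\t")
        hits.append(Hit(cols[0], cols[1], float(cols[2]), int(cols[3]),
                        float(cols[10]), float(cols[11])))
    return hits

def filter_by_identity(hits, min_identity=0.95):
    """Keep hits whose identity, as a fraction, is at least min_identity."""
    return [h for h in hits if h.percent_identity / 100.0 >= min_identity]

# Two made-up alignments for illustration only
blast_output = (
    "q1\tsubjA\t99.10\t120\t1\t0\t1\t120\t5\t124\t1e-50\t210\n"
    "q1\tsubjB\t88.00\t100\t12\t0\t1\t100\t1\t100\t1e-20\t95.3\n"
)
hits = parse_outfmt6(blast_output)
kept = filter_by_identity(hits)
print([h.subject_id for h in kept])
```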

3. Variant Calling

Identify variants:

from backend.pipeline import VariantCaller

caller = VariantCaller()
vcf_file = caller.call_variants("alignment.bam", "reference.fa")
variants = caller.filter_variants(vcf_file, min_quality=30)

print(f"Identified {len(variants)} high-quality variants")
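
As an illustration of what a quality filter does to a VCF file, the following standalone sketch keeps records whose QUAL column (the sixth tab-separated field) is at least the threshold. It is an assumption about the general technique, not the VariantCaller code, and the records are synthetic.

```python
def filter_vcf_lines(lines, min_quality=30.0):
    """Yield (chrom, pos, ref, alt, qual) for records with QUAL >= min_quality."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip meta-information, header, and blank lines
        chrom, pos, _id, ref, alt, qual = line.rstrip("\n").split("\t")[:6]
        if qual == ".":
            continue  # missing quality cannot be compared to the threshold
        if float(qual) >= min_quality:
            yield chrom, int(pos), ref, alt, float(qual)

# Synthetic records for illustration only
vcf_text = """##fileformat=VCFv4.2
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO
chr1\t1000\t.\tC\tT\t55.0\tPASS\t.
chr1\t2000\t.\tG\tA\t12.5\tPASS\t.
chr2\t3000\t.\tC\tA\t.\t.\t.
"""
variants = list(filter_vcf_lines(vcf_text.splitlines()))
print(variants)
```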

Find cancer-associated variants:

from backend.pipeline import VariantAnalyzer

analyzer = VariantAnalyzer()
cancer_variants = analyzer.identify_cancer_variants(variants)
tmb = analyzer.calculate_mutation_burden(variants)

print(f"Cancer variants: {len(cancer_variants)}")
print(f"Tumor Mutation Burden: {tmb:.2f} mutations/Mb")
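
Tumor mutation burden is conventionally reported as the number of mutations divided by the size of the sequenced region in megabases. How calculate_mutation_burden determines that region size is not stated in this guide, so the sketch below takes it as an explicit argument. The function name and the example numbers are illustrative.

```python
def tumor_mutation_burden(mutation_count, covered_bases):
    """Mutations per megabase over the sequenced region."""
    if covered_bases <= 0:
        raise ValueError("covered_bases must be positive")
    return mutation_count / (covered_bases / 1_000_000)

# Illustrative numbers: 150 mutations over a 30 Mb target region
tmb = tumor_mutation_burden(mutation_count=150, covered_bases=30_000_000)
print(f"Tumor Mutation Burden: {tmb:.2f} mutations/Mb")
```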

Advanced Usage

Custom Neo4j Queries

Direct Cypher queries:

from backend.neo4j import DatabaseManager

db = DatabaseManager()

# Find patients with TP53 mutations
query = """
MATCH (p:Patient)-[:HAS_MUTATION]->(m:Mutation)-[:AFFECTS]->(g:Gene {symbol: 'TP53'})
RETURN p.patient_id, m.position, m.consequence
"""

try:
    results = db.execute_query(query)
    for result in results:
        print(result)
finally:
    # Release the connection even if the query raises
    db.close()
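
When the gene symbol comes from user input, Cypher parameters ($name placeholders) are safer than pasting values into the query string. Whether DatabaseManager.execute_query accepts parameters is not documented here, so this sketch only prepares the query and parameter dictionary. The commented lines show how the official neo4j Python driver would run them, with the user name "neo4j" as an assumption.

```python
PATIENTS_BY_GENE = """
MATCH (p:Patient)-[:HAS_MUTATION]->(m:Mutation)-[:AFFECTS]->(g:Gene {symbol: $symbol})
RETURN p.patient_id AS patient_id, m.position AS position, m.consequence AS consequence
LIMIT $limit
"""

def patients_by_gene(symbol, limit=25):
    """Return (query, parameters) for a parameterised gene lookup (hypothetical helper)."""
    if not symbol or not symbol.strip():
        raise ValueError("gene symbol must be non-empty")
    if limit < 1:
        raise ValueError("limit must be at least 1")
    return PATIENTS_BY_GENE, {"symbol": symbol.strip(), "limit": int(limit)}

query, params = patients_by_gene(" TP53 ", limit=10)
print(params)

# With the official neo4j driver installed, the pair could be run as:
# from neo4j import GraphDatabase
# driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "your_password"))
# with driver.session() as session:
#     for record in session.run(query, params):
#         print(record["patient_id"], record["position"], record["consequence"])
# driver.close()
```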

Batch Data Import

Import GDC data:

from backend.gdc import GDCClient
from backend.neo4j import DataImporter

# Download mutation data
gdc = GDCClient()
files = gdc.get_mutation_data("TCGA-BRCA", limit=10)

for file in files:
    gdc.download_file(file.file_id)

# Import to Neo4j
importer = DataImporter()
importer.import_gdc_data(files)

Custom BOINC Tasks

Submit custom analysis:

from backend.boinc import BOINCClient

client = BOINCClient()

# Submit multiple tasks
input_files = ["sample1.fastq", "sample2.fastq", "sample3.fastq"]
task_ids = []

for file in input_files:
    task_id = client.submit_task("variant_calling", file)
    task_ids.append(task_id)

# Monitor progress
for task_id in task_ids:
    status = client.get_task_status(task_id)
    print(f"Task {task_id}: {status.status}")
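
The loop above checks each task once. For long-running work, a polling loop with a timeout is usually more practical. The sketch below is one possible shape: it takes any callable that maps a task ID to a status string, so it can wrap client.get_task_status. The "Failed" status name, the function name, and the default intervals are assumptions.

```python
import time

TERMINAL_STATES = {"Completed", "Failed"}  # "Failed" is an assumed status name

def wait_for_tasks(get_status, task_ids, poll_seconds=30.0, timeout_seconds=3600.0,
                   sleep=time.sleep, clock=time.monotonic):
    """Poll until every task reaches a terminal state or the timeout expires.

    get_status: callable mapping a task ID to a status string.
    Returns a dict of task ID -> last observed status.
    """
    deadline = clock() + timeout_seconds
    pending = set(task_ids)
    last_seen = {}
    while pending:
        for task_id in sorted(pending):
            last_seen[task_id] = get_status(task_id)
        pending = {t for t in pending if last_seen[t] not in TERMINAL_STATES}
        if not pending or clock() >= deadline:
            break
        sleep(poll_seconds)
    return last_seen

# Stand-in for the real client: each task completes on its second poll.
polls = {}
def fake_status(task_id):
    polls[task_id] = polls.get(task_id, 0) + 1
    return "Completed" if polls[task_id] >= 2 else "Running"

final = wait_for_tasks(fake_status, ["t1", "t2"], poll_seconds=0)
print(final)
```

With the real client, get_status could be written as lambda task_id: client.get_task_status(task_id).status.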

Configuration Customization

Edit config.yml:

neo4j:
  uri: "bolt://localhost:7687"
  password: "your_password"

gdc:
  download_dir: "./data/gdc"
  max_retries: 3

pipeline:
  fastq:
    quality_threshold: 25  # Increase quality threshold
    min_length: 75         # Increase minimum read length
  
  blast:
    evalue: 0.0001         # More stringent e-value
    num_threads: 8         # Use more CPU cores
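
Before restarting with edited values, a quick sanity check can catch settings that cannot work on the current machine, such as more BLAST threads than CPU cores. The sketch below mirrors the pipeline section of config.yml as a Python dictionary. The accepted ranges and the function name are assumptions, not rules enforced by Cancer@Home.

```python
import os

config = {  # mirrors the pipeline section of config.yml
    "pipeline": {
        "fastq": {"quality_threshold": 25, "min_length": 75},
        "blast": {"evalue": 0.0001, "num_threads": 8},
    }
}

def check_pipeline_config(cfg, cpu_count=None):
    """Return a list of human-readable problems; empty means none were found."""
    cpu_count = cpu_count or os.cpu_count() or 1
    fastq = cfg["pipeline"]["fastq"]
    blast = cfg["pipeline"]["blast"]
    problems = []
    if not 0 <= fastq["quality_threshold"] <= 93:  # Phred+33 printable range
        problems.append("fastq.quality_threshold should be between 0 and 93")
    if fastq["min_length"] < 1:
        problems.append("fastq.min_length should be at least 1")
    if blast["evalue"] <= 0:
        problems.append("blast.evalue should be greater than 0")
    if not 1 <= blast["num_threads"] <= cpu_count:
        problems.append(f"blast.num_threads should be between 1 and {cpu_count}")
    return problems

# Simulate a 4-core machine: 8 BLAST threads is flagged
print(check_pipeline_config(config, cpu_count=4))
```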

Troubleshooting

Neo4j Connection Issues

# Check Neo4j status
docker ps | grep neo4j

# Restart Neo4j
docker-compose restart neo4j

# View Neo4j logs
docker-compose logs neo4j

Memory Issues

Increase Docker memory allocation:

  1. Open Docker Desktop Settings
  2. Resources → Memory
  3. Increase to at least 8 GB
  4. Click "Apply & Restart"

API Errors

Check logs:

# View application logs
cat logs/cancer_at_home.log

# Follow logs in real-time
tail -f logs/cancer_at_home.log

Best Practices

  1. Data Management: Regularly clean up downloaded data to free space
  2. Task Monitoring: Check BOINC tasks periodically for failures
  3. Database Backup: Back up the Neo4j data volume regularly
  4. Resource Limits: Monitor system resources when running large analyses
  5. API Rate Limits: Be mindful of GDC API rate limits for bulk downloads
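
For the rate-limit point in particular, retrying failed downloads with an increasing delay is a common pattern. The following is a generic sketch of exponential backoff, not the retry logic behind max_retries in config.yml. The function name, the delays, and the choice to retry only on OSError (which covers connection errors and timeouts) are assumptions.

```python
import time

def retry_with_backoff(action, max_retries=3, base_delay=1.0,
                       retry_on=(OSError,), sleep=time.sleep):
    """Call action(); on a retryable error wait base_delay * 2**attempt and retry.

    Makes up to max_retries + 1 calls in total and re-raises the last error.
    """
    for attempt in range(max_retries + 1):
        try:
            return action()
        except retry_on:
            if attempt == max_retries:
                raise
            sleep(base_delay * 2 ** attempt)

# Demonstration with a stand-in download that fails twice, then succeeds.
calls = {"count": 0}
def flaky_download():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated transient failure")
    return "file contents"

delays = []  # record the waits instead of actually sleeping
result = retry_with_backoff(flaky_download, max_retries=3, sleep=delays.append)
print(result, delays)
```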

Support & Resources

  • Documentation: See README.md and QUICKSTART.md
  • API Reference: http://localhost:5000/docs
  • GraphQL Examples: See GRAPHQL_EXAMPLES.md
  • Logs: Check logs/cancer_at_home.log

Useful Cypher Queries

Most common mutations:

MATCH (m:Mutation)<-[:HAS_MUTATION]-(p:Patient)
WITH m, count(p) as patient_count
RETURN m.mutation_id, patient_count
ORDER BY patient_count DESC
LIMIT 10

Genes with most mutations:

MATCH (g:Gene)<-[:AFFECTS]-(m:Mutation)
WITH g, count(m) as mutation_count
RETURN g.symbol, mutation_count
ORDER BY mutation_count DESC
LIMIT 10

Patient mutation profile:

MATCH (p:Patient {patient_id: 'TCGA-A1-001'})-[:HAS_MUTATION]->(m:Mutation)-[:AFFECTS]->(g:Gene)
RETURN g.symbol, m.consequence, m.position