Cancer@Home v2 - User Guide

Introduction
System Architecture
Getting Started
Dashboard Guide
Working with Data
Analysis Pipeline
Advanced Usage

Introduction

Cancer@Home v2 is a distributed computing platform for cancer genomics research that combines:

BOINC: Distributed computing for computationally intensive tasks
GDC Portal: Access to comprehensive cancer genomics datasets
Neo4j: Graph database for modeling complex relationships
Bioinformatics Pipeline: FASTQ processing, BLAST alignment, and variant calling

Key Features

✓ Interactive web dashboard
✓ Real-time graph visualization
✓ GraphQL API for flexible data queries
✓ Distributed task processing
✓ Cancer genomics data integration

System Architecture

┌─────────────────────────────────────────────────┐
│              Web Dashboard (Port 5000)          │
│  Dashboard | Neo4j Viz | BOINC | GDC | Pipeline │
└────────────────────┬────────────────────────────┘
                     │
┌────────────────────┴────────────────────────────┐
│           FastAPI Backend (REST + GraphQL)      │
└───┬────────┬────────┬────────┬──────────┬───────┘
    │        │        │        │          │
 ┌──┴───┐ ┌──┴───┐ ┌──┴───┐ ┌──┴───┐ ┌────┴────┐
 │Neo4j │ │BOINC │ │ GDC  │ │FASTQ │ │BLAST/VCF│
 │ 7687 │ │Client│ │ API  │ │ Proc │ │ Caller  │
 └──────┘ └──────┘ └──────┘ └──────┘ └─────────┘

Getting Started

Quick Installation (5 minutes)

Windows:

.\setup.ps1
python run.py

Linux/Mac:

./setup.sh
python run.py

Access Points

Main Application: http://localhost:5000
API Documentation: http://localhost:5000/docs
GraphQL Playground: http://localhost:5000/graphql
Neo4j Browser: http://localhost:7474 (login: username neo4j, password cancer123)

Dashboard Guide

1. Overview Tab

Shows key statistics:

Total genes in database
Total mutations identified
Number of patients
Cancer types catalogued

Chart: Mutation distribution across cancer types

2. Neo4j Visualization Tab

Interactive graph showing:

Blue nodes: Genes (TP53, BRCA1, KRAS, etc.)
Purple nodes: Patients
Pink nodes: Cancer types
Lines: Relationships between entities

Navigation:

Click and drag nodes to rearrange
Hover over nodes for details
Zoom in/out with mouse wheel

3. BOINC Tasks Tab

Manage distributed computing workloads:

Submit Task:

Select task type (Variant Calling, BLAST, Alignment)
Enter input file path
Click "Submit Task"

Monitor Tasks:

View all tasks with status (Pending, Running, Completed)
See task creation time and type
Check overall statistics

4. GDC Data Tab

Browse available cancer projects:

TCGA-BRCA: Breast Cancer (1,098 cases)
TCGA-LUAD: Lung Adenocarcinoma (585 cases)
TCGA-COAD: Colon Adenocarcinoma (461 cases)
TCGA-GBM: Glioblastoma (617 cases)
TARGET-AML: Acute Myeloid Leukemia (238 cases)

Click on a project to explore available datasets.

5. Pipeline Tab

Quick access to bioinformatics tools:

FASTQ QC: Quality control for sequencing data
BLAST Search: Sequence alignment and homology search
Variant Calling: Identify genetic variants

Working with Data

Querying with GraphQL

Access the GraphQL playground at http://localhost:5000/graphql

Example 1: Find mutations in TP53 gene

query {
  mutations(gene: "TP53") {
    mutation_id
    chromosome
    position
    consequence
  }
}

Example 2: Get patient information

query {
  patients(project_id: "TCGA-BRCA", limit: 10) {
    patient_id
    age
    gender
    vital_status
  }
}

Example 3: Cancer statistics

query {
  cancerStatistics(cancer_type_id: "BRCA") {
    total_patients
    total_mutations
    avg_mutations_per_patient
  }
}
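The same queries can also be sent from a script instead of the playground. The following is a minimal sketch using only the Python standard library. It assumes the /graphql endpoint accepts the conventional JSON POST body with a "query" key, which is typical for GraphQL servers but is not confirmed by this guide. The helper names build_graphql_request and run_query are illustrative and not part of the project.

```python
import json
import urllib.request

GRAPHQL_URL = "http://localhost:5000/graphql"  # endpoint listed under "Access Points"

# Example 1 from this guide, kept as a plain string.
TP53_QUERY = """
query {
  mutations(gene: "TP53") {
    mutation_id
    chromosome
    position
    consequence
  }
}
"""


def build_graphql_request(query, url=GRAPHQL_URL):
    """Wrap a GraphQL query string in a JSON POST request (hypothetical helper)."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(query, url=GRAPHQL_URL, timeout=30):
    """Send the query and return the decoded JSON response (assumed to be JSON)."""
    request = build_graphql_request(query, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage, with the application running locally:
#   result = run_query(TP53_QUERY)
#   print(json.dumps(result, indent=2))
```

Keeping request construction separate from sending makes the payload easy to inspect without a running server.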

Using the REST API

Get database summary:

curl http://localhost:5000/api/neo4j/summary

Search GDC files:

curl "http://localhost:5000/api/gdc/files/TCGA-BRCA?limit=10"

Submit BOINC task:

curl -X POST http://localhost:5000/api/boinc/submit \
  -H "Content-Type: application/json" \
  -d '{"workunit_type": "variant_calling", "input_file": "data/sample.fastq"}'
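For scripted use, the curl calls above translate directly to Python. This sketch uses only the standard library. The paths and the JSON body are taken from the curl examples, while the helper names and the assumption that every endpoint answers with JSON are illustrative.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "http://localhost:5000"  # main application address from "Access Points"


def gdc_files_url(project_id, limit=10, base_url=API_BASE):
    """Build the GDC file-search URL for one project (hypothetical helper)."""
    query = urllib.parse.urlencode({"limit": limit})
    return f"{base_url}/api/gdc/files/{urllib.parse.quote(project_id)}?{query}"


def build_submit_request(workunit_type, input_file, base_url=API_BASE):
    """Build the POST request that submits a BOINC task."""
    payload = {"workunit_type": workunit_type, "input_file": input_file}
    return urllib.request.Request(
        f"{base_url}/api/boinc/submit",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request_or_url, timeout=30):
    """Perform the call and decode the body, assuming the API returns JSON."""
    with urllib.request.urlopen(request_or_url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage, with the application running locally:
#   summary = send(f"{API_BASE}/api/neo4j/summary")
#   files = send(gdc_files_url("TCGA-BRCA", limit=10))
#   task = send(build_submit_request("variant_calling", "data/sample.fastq"))
```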

Analysis Pipeline

1. FASTQ Processing

Quality Control:

from backend.pipeline import FASTQProcessor

processor = FASTQProcessor()
stats = processor.calculate_statistics("input.fastq")
print(f"Total reads: {stats['total_reads']}")
print(f"Average quality: {stats['avg_quality']}")

Filter by quality:

filtered = processor.quality_filter("input.fastq", "filtered.fastq")
print(f"Pass rate: {filtered['pass_rate']:.2%}")
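To make the reported numbers concrete, here is a standalone sketch of how read-level quality statistics can be derived from a FASTQ file. It is not the FASTQProcessor implementation. It assumes four-line records, Phred+33 quality encoding, and a per-read mean quality cutoff, and the default threshold of 20 is an arbitrary illustration. The result keys mirror the ones used in the examples above.

```python
import io


def fastq_records(handle):
    """Yield (header, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        quality = handle.readline().rstrip("\n")
        yield header, sequence, quality


def mean_phred(quality, offset=33):
    """Mean Phred score of one read, assuming Phred+33 encoding."""
    if not quality:
        return 0.0
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def summarize(handle, min_quality=20.0):
    """Count reads, average the per-read mean quality, and compute the pass rate."""
    total = passed = 0
    quality_sum = 0.0
    for _, _, quality in fastq_records(handle):
        score = mean_phred(quality)
        total += 1
        quality_sum += score
        if score >= min_quality:
            passed += 1
    return {
        "total_reads": total,
        "avg_quality": quality_sum / total if total else 0.0,
        "pass_rate": passed / total if total else 0.0,
    }


# Two toy reads: 'I' decodes to Phred 40, '#' decodes to Phred 2.
sample = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\n####\n")
print(summarize(sample))
# → {'total_reads': 2, 'avg_quality': 21.0, 'pass_rate': 0.5}
```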

2. BLAST Alignment

Run BLAST search:

from backend.pipeline import BLASTRunner

blast = BLASTRunner()
results = blast.run_blastn("query.fasta")
hits = blast.parse_results(results)

print(f"Found {len(hits)} alignments")

Filter high-quality hits:

filtered_hits = blast.filter_hits(hits, min_identity=0.95)
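As an illustration of what identity filtering involves, the sketch below parses BLAST tabular output (the standard 12-column -outfmt 6 layout) and keeps hits above an identity cutoff. It is independent of BLASTRunner, whose internal result format is not described here. BLAST reports identity as a percentage, so the sketch converts it to a fraction to match the min_identity=0.95 convention used above. The extra e-value cutoff is an added assumption.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    query_id: str
    subject_id: str
    identity: float  # fraction in [0, 1], converted from BLAST's percentage
    length: int
    evalue: float
    bitscore: float


def parse_outfmt6(text):
    """Parse BLAST -outfmt 6 text: qseqid sseqid pident length ... evalue bitscore."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        cols = line.split("\t")
        hits.append(
            Hit(
                query_id=cols[0],
                subject_id=cols[1],
                identity=float(cols[2]) / 100.0,
                length=int(cols[3]),
                evalue=float(cols[10]),
                bitscore=float(cols[11]),
            )
        )
    return hits


def filter_by_identity(hits, min_identity=0.95, max_evalue=1e-4):
    """Keep hits that meet both the identity and the e-value cutoffs."""
    return [h for h in hits if h.identity >= min_identity and h.evalue <= max_evalue]
```

If the project's filter_hits expects percentages rather than fractions, the threshold would need to be 95 instead of 0.95, so it is worth checking which scale the parsed hits use.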

3. Variant Calling

Identify variants:

from backend.pipeline import VariantCaller

caller = VariantCaller()
vcf_file = caller.call_variants("alignment.bam", "reference.fa")
variants = caller.filter_variants(vcf_file, min_quality=30)

print(f"Identified {len(variants)} high-quality variants")
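The quality filter can be pictured as a pass over the VCF body that compares the QUAL column against a threshold. The sketch below shows that idea on plain VCF text lines. It is a simplified stand-in, not the VariantCaller code. It reads only the fixed columns defined by the VCF format (CHROM, POS, ID, REF, ALT, QUAL) and drops records whose QUAL is missing, which is one possible policy among several.

```python
def filter_vcf_lines(lines, min_quality=30.0):
    """Return variants whose QUAL column is at least min_quality."""
    kept = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip meta-information (##) and the column header (#CHROM)
        fields = line.split("\t")
        qual = fields[5]
        if qual == ".":
            continue  # QUAL missing: dropped in this sketch
        if float(qual) >= min_quality:
            kept.append(
                {
                    "chrom": fields[0],
                    "pos": int(fields[1]),
                    "ref": fields[3],
                    "alt": fields[4],
                    "qual": float(qual),
                }
            )
    return kept


# Usage with a file on disk (the file name is illustrative):
#   with open("sample.vcf") as handle:
#       variants = filter_vcf_lines(handle, min_quality=30)
```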

Find cancer-associated variants:

from backend.pipeline import VariantAnalyzer

analyzer = VariantAnalyzer()
cancer_variants = analyzer.identify_cancer_variants(variants)
tmb = analyzer.calculate_mutation_burden(variants)

print(f"Cancer variants: {len(cancer_variants)}")
print(f"Tumor Mutation Burden: {tmb:.2f} mutations/Mb")
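Tumor mutation burden is conventionally reported as the number of somatic mutations divided by the size of the sequenced region in megabases. The short sketch below works through that arithmetic. How VariantAnalyzer chooses the denominator is not stated in this guide, so the covered-region size here is an explicit input, and the 30 Mb figure in the usage line is only an illustrative exome-scale value.

```python
def tumor_mutation_burden(num_mutations, covered_bases):
    """Mutations per megabase: count divided by covered region size in Mb."""
    if covered_bases <= 0:
        raise ValueError("covered_bases must be positive")
    return num_mutations / (covered_bases / 1_000_000)


# 150 mutations over an assumed 30 Mb of covered sequence.
print(f"{tumor_mutation_burden(150, 30_000_000):.2f} mutations/Mb")
# → 5.00 mutations/Mb
```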

Advanced Usage

Custom Neo4j Queries

Direct Cypher queries:

from backend.neo4j import DatabaseManager

db = DatabaseManager()

# Find patients with TP53 mutations
query = """
MATCH (p:Patient)-[:HAS_MUTATION]->(m:Mutation)-[:AFFECTS]->(g:Gene {symbol: 'TP53'})
RETURN p.patient_id, m.position, m.consequence
"""

# Close the connection even if the query raises
try:
    results = db.execute_query(query)
    for result in results:
        print(result)
finally:
    db.close()

Batch Data Import

Import GDC data:

from backend.gdc import GDCClient
from backend.neo4j import DataImporter

# Download mutation data
gdc = GDCClient()
files = gdc.get_mutation_data("TCGA-BRCA", limit=10)

for file in files:
    gdc.download_file(file.file_id)

# Import to Neo4j
importer = DataImporter()
importer.import_gdc_data(files)

Custom BOINC Tasks

Submit custom analysis:

from backend.boinc import BOINCClient

client = BOINCClient()

# Submit multiple tasks
input_files = ["sample1.fastq", "sample2.fastq", "sample3.fastq"]
task_ids = []

for file in input_files:
    task_id = client.submit_task("variant_calling", file)
    task_ids.append(task_id)

# Monitor progress
for task_id in task_ids:
    status = client.get_task_status(task_id)
    print(f"Task {task_id}: {status.status}")
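The loop above checks each task once. For long-running work it is usually more practical to poll until every task has finished. The following sketch shows one way to do that with a generic status callback. The terminal state names ("completed", "failed"), the case-insensitive comparison, and the default intervals are assumptions, since the exact status values returned by the client are not documented here.

```python
import time

TERMINAL_STATES = {"completed", "failed"}  # assumed names, compared case-insensitively


def wait_for_tasks(task_ids, get_status, poll_interval=30.0, timeout=3600.0,
                   sleep=time.sleep):
    """Poll get_status(task_id) until all tasks reach a terminal state.

    Returns a dict mapping task id to its final status string.
    Raises TimeoutError if some tasks are still unfinished after `timeout` seconds.
    """
    pending = set(task_ids)
    final = {}
    deadline = time.monotonic() + timeout
    while pending:
        for task_id in sorted(pending):
            status = get_status(task_id)
            if str(status).lower() in TERMINAL_STATES:
                final[task_id] = status
        pending.difference_update(final)
        if not pending:
            break
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Tasks still unfinished: {sorted(pending)}")
        sleep(poll_interval)
    return final


# Usage with the client from the example above:
#   results = wait_for_tasks(
#       task_ids,
#       lambda task_id: client.get_task_status(task_id).status,
#   )
```

Passing the sleep function as a parameter keeps the loop testable without real delays.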

Configuration Customization

Edit config.yml:

neo4j:
  uri: "bolt://localhost:7687"
  password: "your_password"

gdc:
  download_dir: "./data/gdc"
  max_retries: 3

pipeline:
  fastq:
    quality_threshold: 25  # Increase quality threshold
    min_length: 75         # Increase minimum read length
  
  blast:
    evalue: 0.0001         # More stringent e-value
    num_threads: 8         # Use more CPU cores

Troubleshooting

Neo4j Connection Issues

# Check Neo4j status
docker ps | grep neo4j

# Restart Neo4j
docker-compose restart neo4j

# View Neo4j logs
docker-compose logs neo4j

Memory Issues

Increase Docker memory allocation:

Open Docker Desktop Settings
Resources → Memory
Increase to at least 8 GB
Click "Apply & Restart"

API Errors

Check logs:

# View application logs
cat logs/cancer_at_home.log

# Follow logs in real-time
tail -f logs/cancer_at_home.log

Best Practices

Data Management: Regularly clean up downloaded data to free space
Task Monitoring: Check BOINC tasks periodically for failures
Database Backup: Back up the Neo4j data volume regularly
Resource Limits: Monitor system resources when running large analyses
API Rate Limits: Be mindful of GDC API rate limits for bulk downloads
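One common way to stay within rate limits during bulk downloads is to retry failed calls with exponentially increasing delays. The sketch below is a generic helper of that kind. The default of three retries mirrors the max_retries value in the sample configuration, but whether GDCClient already retries internally is not stated in this guide, so treat this as an optional wrapper rather than a description of the project's behavior.

```python
import time


def retry_with_backoff(operation, max_retries=3, base_delay=1.0,
                       retry_on=(OSError,), sleep=time.sleep):
    """Call operation(); on failure wait base_delay * 2**attempt and try again.

    Network errors from urllib (URLError, HTTPError) are subclasses of OSError,
    so the default retry_on covers them. The last exception is re-raised once
    max_retries retries have been used up.
    """
    attempt = 0
    while True:
        try:
            return operation()
        except retry_on:
            if attempt >= max_retries:
                raise
            sleep(base_delay * (2 ** attempt))
            attempt += 1


# Usage with the client from "Batch Data Import" (names as in that example):
#   for file in files:
#       retry_with_backoff(lambda: gdc.download_file(file.file_id))
```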

Support & Resources

Documentation: See README.md and QUICKSTART.md
API Reference: http://localhost:5000/docs
GraphQL Examples: See GRAPHQL_EXAMPLES.md
Logs: Check logs/cancer_at_home.log

Useful Cypher Queries

Most common mutations:

MATCH (m:Mutation)<-[:HAS_MUTATION]-(p:Patient)
WITH m, count(p) as patient_count
RETURN m.mutation_id, patient_count
ORDER BY patient_count DESC
LIMIT 10

Genes with most mutations:

MATCH (g:Gene)<-[:AFFECTS]-(m:Mutation)
WITH g, count(m) as mutation_count
RETURN g.symbol, mutation_count
ORDER BY mutation_count DESC
LIMIT 10

Patient mutation profile:

MATCH (p:Patient {patient_id: 'TCGA-A1-001'})-[:HAS_MUTATION]->(m:Mutation)-[:AFFECTS]->(g:Gene)
RETURN g.symbol, m.consequence, m.position
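When a query like the patient profile is run from code with a varying patient ID, binding the value as a Cypher parameter ($patient_id) is safer than pasting it into the query string. The sketch below uses the official neo4j Python driver directly, because this guide does not say whether DatabaseManager.execute_query accepts parameters. The connection details are the ones listed earlier in this guide, and the function names are illustrative.

```python
PATIENT_PROFILE_QUERY = """
MATCH (p:Patient {patient_id: $patient_id})-[:HAS_MUTATION]->(m:Mutation)-[:AFFECTS]->(g:Gene)
RETURN g.symbol AS gene, m.consequence AS consequence, m.position AS position
"""


def patient_mutation_profile(session, patient_id):
    """Run the profile query with patient_id bound as a Cypher parameter."""
    result = session.run(PATIENT_PROFILE_QUERY, patient_id=patient_id)
    return [record.data() for record in result]


def main():
    # Imported here so the helper above stays usable without the driver installed.
    from neo4j import GraphDatabase

    driver = GraphDatabase.driver(
        "bolt://localhost:7687", auth=("neo4j", "cancer123")
    )
    try:
        with driver.session() as session:
            for row in patient_mutation_profile(session, "TCGA-A1-001"):
                print(row)
    finally:
        driver.close()


# Call main() with Neo4j running and the neo4j package installed.
```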
