# Cancer@Home v2 - User Guide
## Table of Contents
1. [Introduction](#introduction)
2. [System Architecture](#system-architecture)
3. [Getting Started](#getting-started)
4. [Dashboard Guide](#dashboard-guide)
5. [Working with Data](#working-with-data)
6. [Analysis Pipeline](#analysis-pipeline)
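7. [Advanced Usage](#advanced-usage)
8. [Troubleshooting](#troubleshooting)
9. [Best Practices](#best-practices)
10. [Support & Resources](#support--resources)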
---
## Introduction
Cancer@Home v2 is a distributed computing platform for cancer genomics research that combines:
- **BOINC**: Distributed computing for computationally intensive tasks
- **GDC Portal**: Access to comprehensive cancer genomics datasets
- **Neo4j**: Graph database for modeling complex relationships
- **Bioinformatics Pipeline**: FASTQ processing, BLAST alignment, and variant calling
### Key Features
✓ Interactive web dashboard
✓ Real-time graph visualization
✓ GraphQL API for flexible data queries
✓ Distributed task processing
✓ Cancer genomics data integration
---
## System Architecture
```
┌──────────────────────────────────────────────────┐
│            Web Dashboard (Port 5000)             │
│  Dashboard | Neo4j Viz | BOINC | GDC | Pipeline  │
└────────────────────────┬─────────────────────────┘
                         │
┌────────────────────────┴─────────────────────────┐
│         FastAPI Backend (REST + GraphQL)         │
└───┬─────────┬─────────┬────────┬───────────┬─────┘
    │         │         │        │           │
┌───┴───┐ ┌───┴────┐ ┌──┴──┐ ┌───┴───┐ ┌─────┴─────┐
│ Neo4j │ │ BOINC  │ │ GDC │ │ FASTQ │ │ BLAST/VCF │
│ 7687  │ │ Client │ │ API │ │ Proc  │ │ Caller    │
└───────┘ └────────┘ └─────┘ └───────┘ └───────────┘
```
---
## Getting Started
### Quick Installation (5 minutes)
**Windows:**
```powershell
.\setup.ps1
python run.py
```
**Linux/Mac:**
```bash
./setup.sh
python run.py
```
### Access Points
- **Main Application**: http://localhost:5000
- **API Documentation**: http://localhost:5000/docs
- **GraphQL Playground**: http://localhost:5000/graphql
- **Neo4j Browser**: http://localhost:7474 (username `neo4j`, password `cancer123`)
---
## Dashboard Guide
### 1. Overview Tab
Shows key statistics:
- Total genes in database
- Total mutations identified
- Number of patients
- Cancer types catalogued
**Chart**: Mutation distribution across cancer types
### 2. Neo4j Visualization Tab
Interactive graph showing:
- **Blue nodes**: Genes (TP53, BRCA1, KRAS, etc.)
- **Purple nodes**: Patients
- **Pink nodes**: Cancer types
- **Lines**: Relationships between entities
**Navigation**:
- Click and drag nodes to rearrange
- Hover over nodes for details
- Zoom in/out with mouse wheel
### 3. BOINC Tasks Tab
Manage distributed computing workloads:
**Submit Task**:
1. Select task type (Variant Calling, BLAST, Alignment)
2. Enter input file path
3. Click "Submit Task"
**Monitor Tasks**:
- View all tasks with status (Pending, Running, Completed)
- See task creation time and type
- Check overall statistics
### 4. GDC Data Tab
Browse available cancer projects:
- TCGA-BRCA: Breast Cancer (1,098 cases)
- TCGA-LUAD: Lung Adenocarcinoma (585 cases)
- TCGA-COAD: Colon Adenocarcinoma (461 cases)
- TCGA-GBM: Glioblastoma (617 cases)
- TARGET-AML: Acute Myeloid Leukemia (238 cases)
Click on a project to explore available datasets.
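The case counts listed above can change as GDC publishes new data releases. As a rough cross-check, the sketch below queries the public GDC API directly instead of the local backend. It assumes the `https://api.gdc.cancer.gov/projects` endpoint with its `filters`, `fields`, and `size` parameters; the helper names `build_projects_url` and `fetch_case_counts` are hypothetical and are not part of Cancer@Home.

```python
import json
import urllib.parse
import urllib.request

# Public GDC API endpoint (assumption: queried directly, not through the local backend).
GDC_PROJECTS_URL = "https://api.gdc.cancer.gov/projects"


def build_projects_url(project_ids):
    """Build a GDC /projects query URL for the given project IDs (hypothetical helper)."""
    ids = list(project_ids)
    filters = {"op": "in", "content": {"field": "project_id", "value": ids}}
    params = {
        "filters": json.dumps(filters),
        "fields": "project_id,name,summary.case_count",
        "size": str(len(ids)),
    }
    return f"{GDC_PROJECTS_URL}?{urllib.parse.urlencode(params)}"


def fetch_case_counts(project_ids, timeout=30):
    """Return {project_id: case_count} as reported by GDC (requires network access)."""
    with urllib.request.urlopen(build_projects_url(project_ids), timeout=timeout) as resp:
        hits = json.load(resp)["data"]["hits"]
    return {hit["project_id"]: hit["summary"]["case_count"] for hit in hits}
```

Calling `fetch_case_counts(["TCGA-BRCA", "TCGA-LUAD"])` should return one entry per project; the numbers may differ from the list above if GDC has updated its data since this guide was written.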
### 5. Pipeline Tab
Quick access to bioinformatics tools:
- **FASTQ QC**: Quality control for sequencing data
- **BLAST Search**: Sequence alignment and homology
- **Variant Calling**: Identify genetic variants
---
## Working with Data
### Querying with GraphQL
Access the GraphQL playground at http://localhost:5000/graphql
**Example 1: Find mutations in TP53 gene**
```graphql
query {
  mutations(gene: "TP53") {
    mutation_id
    chromosome
    position
    consequence
  }
}
```
**Example 2: Get patient information**
```graphql
query {
  patients(project_id: "TCGA-BRCA", limit: 10) {
    patient_id
    age
    gender
    vital_status
  }
}
```
**Example 3: Cancer statistics**
```graphql
query {
  cancerStatistics(cancer_type_id: "BRCA") {
    total_patients
    total_mutations
    avg_mutations_per_patient
  }
}
```
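The same queries can be sent from a script instead of the playground. The sketch below is a minimal example that assumes the backend accepts the usual GraphQL-over-HTTP convention: a `POST` with a JSON body containing `query` and, optionally, `variables`. The helper names `build_graphql_request` and `run_graphql` are hypothetical.

```python
import json
import urllib.request

GRAPHQL_URL = "http://localhost:5000/graphql"  # endpoint from the Access Points list

TP53_QUERY = """
query {
  mutations(gene: "TP53") {
    mutation_id
    chromosome
    position
    consequence
  }
}
"""


def build_graphql_request(query, variables=None, url=GRAPHQL_URL):
    """Wrap a GraphQL document in a JSON POST request (hypothetical helper)."""
    body = {"query": query}
    if variables:
        body["variables"] = variables
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_graphql(query, variables=None, timeout=30):
    """Send the query to a running backend and return the decoded JSON response."""
    request = build_graphql_request(query, variables)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)
```

With the application running, `run_graphql(TP53_QUERY)` should return a dictionary whose `data` key holds the query result, and an `errors` key if the server rejected the query.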
### Using the REST API
**Get database summary:**
```bash
curl http://localhost:5000/api/neo4j/summary
```
**Search GDC files:**
```bash
curl "http://localhost:5000/api/gdc/files/TCGA-BRCA?limit=10"
```
**Submit BOINC task:**
```bash
curl -X POST http://localhost:5000/api/boinc/submit \
-H "Content-Type: application/json" \
-d '{"workunit_type": "variant_calling", "input_file": "data/sample.fastq"}'
```
---
## Analysis Pipeline
### 1. FASTQ Processing
**Quality Control:**
```python
from backend.pipeline import FASTQProcessor
processor = FASTQProcessor()
stats = processor.calculate_statistics("input.fastq")
print(f"Total reads: {stats['total_reads']}")
print(f"Average quality: {stats['avg_quality']}")
```
**Filter by quality:**
```python
filtered = processor.quality_filter("input.fastq", "filtered.fastq")
print(f"Pass rate: {filtered['pass_rate']:.2%}")
```
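To make the filtering step concrete, here is a minimal standalone sketch of what a mean-quality filter does. It is not the `FASTQProcessor` implementation: it assumes Phred+33 quality encoding and 4-line FASTQ records, and the function names and default thresholds are illustrative only.

```python
def parse_fastq(lines):
    """Yield (header, sequence, quality) tuples from 4-line FASTQ records."""
    it = iter(lines)
    # zip over the same iterator four times groups the lines into records;
    # a truncated trailing record is silently dropped.
    for header, seq, _plus, qual in zip(it, it, it, it):
        yield header.strip(), seq.strip(), qual.strip()


def mean_phred(quality_line, offset=33):
    """Mean Phred score of one quality string (Phred+33 encoding assumed)."""
    if not quality_line:
        return 0.0
    return sum(ord(ch) - offset for ch in quality_line) / len(quality_line)


def filter_reads(records, min_mean_quality=20, min_length=50):
    """Keep records whose length and mean quality both meet the thresholds."""
    return [
        rec
        for rec in records
        if len(rec[1]) >= min_length and mean_phred(rec[2]) >= min_mean_quality
    ]
```

Under these assumptions, the pass rate is `len(filter_reads(records)) / len(records)` for a non-empty list of records.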
### 2. BLAST Alignment
**Run BLAST search:**
```python
from backend.pipeline import BLASTRunner
blast = BLASTRunner()
results = blast.run_blastn("query.fasta")
hits = blast.parse_results(results)
print(f"Found {len(hits)} alignments")
```
**Filter high-quality hits:**
```python
filtered_hits = blast.filter_hits(hits, min_identity=0.95)
```
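For readers who want to inspect raw BLAST output themselves, the sketch below parses the standard 12-column tabular format (`-outfmt 6`) and applies an identity cutoff. It is independent of `BLASTRunner`. Note the scale: BLAST reports percent identity on a 0-100 scale, so the sketch assumes that a `min_identity` of `0.95` means 95%. The `Hit` type and function names are hypothetical.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    query_id: str
    subject_id: str
    percent_identity: float  # 0-100, as printed by BLAST
    alignment_length: int
    evalue: float
    bitscore: float


def parse_outfmt6(lines):
    """Parse BLAST tabular output (-outfmt 6 with the default 12 columns)."""
    hits = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        f = line.rstrip("\n").split("\t")
        hits.append(Hit(f[0], f[1], float(f[2]), int(f[3]), float(f[10]), float(f[11])))
    return hits


def filter_by_identity(hits, min_identity=0.95):
    """Keep hits whose identity, expressed as a fraction, is at least min_identity."""
    return [h for h in hits if h.percent_identity / 100.0 >= min_identity]
```

An identity cutoff alone does not account for alignment length, so a short perfect match passes as easily as a long one; combining it with `alignment_length` or `evalue` is a common refinement.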
### 3. Variant Calling
**Identify variants:**
```python
from backend.pipeline import VariantCaller
caller = VariantCaller()
vcf_file = caller.call_variants("alignment.bam", "reference.fa")
variants = caller.filter_variants(vcf_file, min_quality=30)
print(f"Identified {len(variants)} high-quality variants")
```
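The `min_quality=30` threshold refers to the VCF `QUAL` column, which is Phred-scaled: a value of 30 corresponds to an estimated 1 in 1,000 chance that the call is wrong. The sketch below shows one way such a filter could work on raw VCF text. It is an illustration and not the `VariantCaller` implementation; `filter_vcf_lines` is a hypothetical name.

```python
def filter_vcf_lines(lines, min_quality=30.0):
    """Yield VCF data lines whose QUAL column (6th field) is at least min_quality."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, ## metadata, and the #CHROM header
        fields = line.rstrip("\n").split("\t")
        qual = fields[5]
        # "." marks a missing QUAL value; such records are dropped here.
        if qual != "." and float(qual) >= min_quality:
            yield line.rstrip("\n")
```

Dropping records with a missing `QUAL` is a design choice made for this sketch; a pipeline may prefer to keep them and rely on the `FILTER` column instead.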
**Find cancer-associated variants:**
```python
from backend.pipeline import VariantAnalyzer
analyzer = VariantAnalyzer()
cancer_variants = analyzer.identify_cancer_variants(variants)
tmb = analyzer.calculate_mutation_burden(variants)
print(f"Cancer variants: {len(cancer_variants)}")
print(f"Tumor Mutation Burden: {tmb:.2f} mutations/Mb")
```
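Tumor mutation burden is reported in mutations per megabase, so the underlying arithmetic is a count divided by the size of the sequenced region in Mb. The guide does not state which variants `calculate_mutation_burden` counts or which region size it uses, so the sketch below takes both as explicit inputs; the function name is hypothetical.

```python
def tumor_mutation_burden(mutation_count, covered_bases):
    """Mutations per megabase: mutation count divided by the covered region in Mb."""
    if covered_bases <= 0:
        raise ValueError("covered_bases must be positive")
    return mutation_count / (covered_bases / 1_000_000)
```

For example, 300 mutations across a 30 Mb covered region give `tumor_mutation_burden(300, 30_000_000)`, which is 10 mutations/Mb.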
---
## Advanced Usage
### Custom Neo4j Queries
**Direct Cypher queries:**
```python
from backend.neo4j import DatabaseManager
db = DatabaseManager()
# Find patients with TP53 mutations
query = """
MATCH (p:Patient)-[:HAS_MUTATION]->(m:Mutation)-[:AFFECTS]->(g:Gene {symbol: 'TP53'})
RETURN p.patient_id, m.position, m.consequence
"""
results = db.execute_query(query)
for result in results:
    print(result)
db.close()
```
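When the gene symbol comes from user input, a parameterized query is safer than building the Cypher string by hand. The guide does not say whether `execute_query` accepts parameters, so the sketch below uses the official `neo4j` Python driver directly (a third-party package, installed with `pip install neo4j`). The function name and the default credentials are placeholders.

```python
# $symbol is a Cypher parameter, filled in by the driver rather than by string formatting.
GENE_MUTATIONS_QUERY = """
MATCH (p:Patient)-[:HAS_MUTATION]->(m:Mutation)-[:AFFECTS]->(g:Gene {symbol: $symbol})
RETURN p.patient_id AS patient_id, m.position AS position, m.consequence AS consequence
"""


def patients_with_mutations(symbol, uri="bolt://localhost:7687",
                            user="neo4j", password="your_password"):
    """Return one dict per matching row (hypothetical helper, needs a running Neo4j)."""
    # Imported lazily so this module still loads when the driver is not installed.
    from neo4j import GraphDatabase

    with GraphDatabase.driver(uri, auth=(user, password)) as driver:
        with driver.session() as session:
            result = session.run(GENE_MUTATIONS_QUERY, symbol=symbol)
            # Materialize the rows before the session closes.
            return [record.data() for record in result]
```

Calling `patients_with_mutations("TP53")` should return the same rows as the hard-coded query above, provided the graph uses the labels and properties shown in this guide.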
### Batch Data Import
**Import GDC data:**
```python
from backend.gdc import GDCClient
from backend.neo4j import DataImporter
# Download mutation data
gdc = GDCClient()
files = gdc.get_mutation_data("TCGA-BRCA", limit=10)
for file in files:
    gdc.download_file(file.file_id)
# Import to Neo4j
importer = DataImporter()
importer.import_gdc_data(files)
```
### Custom BOINC Tasks
**Submit custom analysis:**
```python
from backend.boinc import BOINCClient
client = BOINCClient()
# Submit multiple tasks
input_files = ["sample1.fastq", "sample2.fastq", "sample3.fastq"]
task_ids = []
for file in input_files:
    task_id = client.submit_task("variant_calling", file)
    task_ids.append(task_id)
# Monitor progress
for task_id in task_ids:
    status = client.get_task_status(task_id)
    print(f"Task {task_id}: {status.status}")
```
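The loop above checks each task once. To wait until a batch finishes, a polling helper along the following lines may be useful. It is a sketch under stated assumptions: it relies only on `get_task_status(task_id).status` as used above, and it assumes that finished tasks report a status of `completed` or `failed` (compared case-insensitively). The guide only documents Pending, Running, and Completed, so the `failed` value is a guess.

```python
import time

# Assumed terminal status strings, compared in lowercase.
TERMINAL_STATES = {"completed", "failed"}


def wait_for_tasks(client, task_ids, poll_interval=30.0, timeout=3600.0, sleep=time.sleep):
    """Poll until every task reaches a terminal state or the timeout expires."""
    deadline = time.monotonic() + timeout
    pending = set(task_ids)
    final = {}
    while pending:
        for task_id in list(pending):
            status = str(client.get_task_status(task_id).status).lower()
            if status in TERMINAL_STATES:
                final[task_id] = status
        pending.difference_update(final)
        if not pending:
            break
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Tasks still unfinished after {timeout}s: {list(pending)}")
        sleep(poll_interval)
    return final
```

`wait_for_tasks(client, task_ids)` returns a dictionary mapping each task ID to its final status. The `sleep` parameter exists so the wait can be replaced in tests.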
### Configuration Customization
Edit `config.yml`:
```yaml
neo4j:
  uri: "bolt://localhost:7687"
  password: "your_password"
gdc:
  download_dir: "./data/gdc"
  max_retries: 3
pipeline:
  fastq:
    quality_threshold: 25  # Increase quality threshold
    min_length: 75         # Increase minimum read length
  blast:
    evalue: 0.0001         # More stringent e-value
    num_threads: 8         # Use more CPU cores
```
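To confirm that an edit was picked up, the file can be read back from a script. The sketch below assumes `config.yml` is plain YAML readable with PyYAML (`pip install pyyaml`). How the application itself loads its configuration is not documented here, and both helper names are hypothetical.

```python
def load_config(path="config.yml"):
    """Read the YAML file into a nested dict (requires the third-party PyYAML package)."""
    import yaml  # imported lazily so get_setting works without PyYAML installed

    with open(path, encoding="utf-8") as fh:
        return yaml.safe_load(fh) or {}


def get_setting(config, dotted_key, default=None):
    """Look up a nested value such as 'pipeline.blast.evalue', or return default."""
    node = config
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return default
        node = node[part]
    return node
```

With the file shown above, `get_setting(load_config(), "pipeline.blast.num_threads")` should return `8`.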
---
## Troubleshooting
### Neo4j Connection Issues
```bash
# Check Neo4j status
docker ps | grep neo4j
# Restart Neo4j
docker-compose restart neo4j
# View Neo4j logs
docker-compose logs neo4j
```
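If Docker reports the container as running but the application still cannot connect, a quick TCP check shows whether the documented ports are accepting connections at all. The sketch below uses only the standard library; the port numbers come from this guide, and the function names are hypothetical.

```python
import socket


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections, timeouts, and DNS failures
        return False


def check_services(host="localhost"):
    """Map each documented service to whether its port accepts connections."""
    ports = {"Dashboard/API": 5000, "Neo4j Browser": 7474, "Neo4j Bolt": 7687}
    return {name: port_open(host, port) for name, port in ports.items()}
```

A `False` for Neo4j Bolt while the container is up often points to a port mapping problem or to Neo4j still starting; the logs command above is the next place to look.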
### Memory Issues
Increase Docker memory allocation:
1. Open Docker Desktop Settings
2. Resources → Memory
3. Increase to at least 8GB
4. Click "Apply & Restart"
### API Errors
Check logs:
```bash
# View application logs
cat logs/cancer_at_home.log
# Follow logs in real-time
tail -f logs/cancer_at_home.log
```
---
## Best Practices
1. **Data Management**: Regularly clean up downloaded data to free space
2. **Task Monitoring**: Check BOINC tasks periodically for failures
3. **Database Backup**: Back up the Neo4j data volume regularly
4. **Resource Limits**: Monitor system resources when running large analyses
5. **API Rate Limits**: Be mindful of GDC API rate limits for bulk downloads
---
## Support & Resources
- **Documentation**: See README.md and QUICKSTART.md
- **API Reference**: http://localhost:5000/docs
- **GraphQL Examples**: See GRAPHQL_EXAMPLES.md
- **Logs**: Check `logs/cancer_at_home.log`
### Useful Cypher Queries
**Most common mutations:**
```cypher
MATCH (m:Mutation)<-[:HAS_MUTATION]-(p:Patient)
WITH m, count(p) as patient_count
RETURN m.mutation_id, patient_count
ORDER BY patient_count DESC
LIMIT 10
```
**Genes with most mutations:**
```cypher
MATCH (g:Gene)<-[:AFFECTS]-(m:Mutation)
WITH g, count(m) as mutation_count
RETURN g.symbol, mutation_count
ORDER BY mutation_count DESC
LIMIT 10
```
**Patient mutation profile:**
```cypher
MATCH (p:Patient {patient_id: 'TCGA-A1-001'})-[:HAS_MUTATION]->(m:Mutation)-[:AFFECTS]->(g:Gene)
RETURN g.symbol, m.consequence, m.position
```