# Cancer@Home v2 - User Guide

## Table of Contents
1. [Introduction](#introduction)
2. [System Architecture](#system-architecture)
3. [Getting Started](#getting-started)
4. [Dashboard Guide](#dashboard-guide)
5. [Working with Data](#working-with-data)
6. [Analysis Pipeline](#analysis-pipeline)
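7. [Advanced Usage](#advanced-usage)
8. [Troubleshooting](#troubleshooting)
9. [Best Practices](#best-practices)
10. [Support & Resources](#support--resources)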

---

## Introduction

Cancer@Home v2 is a distributed computing platform for cancer genomics research that combines:
- **BOINC**: Distributed computing for computationally intensive tasks
- **GDC Portal**: Access to comprehensive cancer genomics datasets
- **Neo4j**: Graph database for modeling complex relationships
- **Bioinformatics Pipeline**: FASTQ processing, BLAST alignment, and variant calling

### Key Features
✓ Interactive web dashboard  
✓ Real-time graph visualization  
✓ GraphQL API for flexible data queries  
✓ Distributed task processing  
✓ Cancer genomics data integration  

---

## System Architecture

```
┌─────────────────────────────────────────────────┐
│              Web Dashboard (Port 5000)          │
│  Dashboard | Neo4j Viz | BOINC | GDC | Pipeline │
└────────────────────┬────────────────────────────┘
                     │
┌────────────────────┴────────────────────────────┐
│           FastAPI Backend (REST + GraphQL)      │
└────┬────────┬────────┬────────┬────────┬────────┘
     │        │        │        │        │
  ┌──┴───┐ ┌──┴───┐ ┌──┴───┐ ┌──┴───┐ ┌──┴──────┐
  │Neo4j │ │BOINC │ │ GDC  │ │FASTQ │ │BLAST/VCF│
  │ 7687 │ │Client│ │ API  │ │ Proc │ │ Caller  │
  └──────┘ └──────┘ └──────┘ └──────┘ └─────────┘
```

---

## Getting Started

### Quick Installation (5 minutes)

**Windows:**
```powershell
.\setup.ps1
python run.py
```

**Linux/Mac:**
```bash
./setup.sh
python run.py
```

### Access Points
- **Main Application**: http://localhost:5000
- **API Documentation**: http://localhost:5000/docs
- **GraphQL Playground**: http://localhost:5000/graphql
- **Neo4j Browser**: http://localhost:7474 (username `neo4j`, password `cancer123`)

---

## Dashboard Guide

### 1. Overview Tab
Shows key statistics:
- Total genes in database
- Total mutations identified
- Number of patients
- Cancer types catalogued

**Chart**: Mutation distribution across cancer types

### 2. Neo4j Visualization Tab
Interactive graph showing:
- **Blue nodes**: Genes (TP53, BRCA1, KRAS, etc.)
- **Purple nodes**: Patients
- **Pink nodes**: Cancer types
- **Lines**: Relationships between entities

**Navigation**:
- Click and drag nodes to rearrange
- Hover over nodes for details
- Zoom in/out with mouse wheel

### 3. BOINC Tasks Tab
Manage distributed computing workloads:

**Submit Task**:
1. Select task type (Variant Calling, BLAST, Alignment)
2. Enter input file path
3. Click "Submit Task"

**Monitor Tasks**:
- View all tasks with status (Pending, Running, Completed)
- See task creation time and type
- Check overall statistics

### 4. GDC Data Tab
Browse available cancer projects:
- TCGA-BRCA: Breast Cancer (1,098 cases)
- TCGA-LUAD: Lung Adenocarcinoma (585 cases)
- TCGA-COAD: Colon Adenocarcinoma (461 cases)
- TCGA-GBM: Glioblastoma (617 cases)
- TARGET-AML: Acute Myeloid Leukemia (238 cases)

Click on a project to explore available datasets.

### 5. Pipeline Tab
Quick access to bioinformatics tools:
- **FASTQ QC**: Quality control for sequencing data
- **BLAST Search**: Sequence alignment and homology
- **Variant Calling**: Identify genetic variants

---

## Working with Data

### Querying with GraphQL

Access the GraphQL playground at http://localhost:5000/graphql

**Example 1: Find mutations in TP53 gene**
```graphql
query {
  mutations(gene: "TP53") {
    mutation_id
    chromosome
    position
    consequence
  }
}
```

**Example 2: Get patient information**
```graphql
query {
  patients(project_id: "TCGA-BRCA", limit: 10) {
    patient_id
    age
    gender
    vital_status
  }
}
```

**Example 3: Cancer statistics**
```graphql
query {
  cancerStatistics(cancer_type_id: "BRCA") {
    total_patients
    total_mutations
    avg_mutations_per_patient
  }
}
```
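The same queries can be sent from a script instead of the playground. The sketch below assumes the endpoint accepts the common GraphQL-over-HTTP convention of a JSON `POST` with a `query` field; `build_graphql_request` and `run_query` are hypothetical helper names, not part of the project.

```python
import json
import urllib.request

GRAPHQL_URL = "http://localhost:5000/graphql"  # endpoint listed under Access Points

# Example 1 from this guide, unchanged
TP53_QUERY = """
query {
  mutations(gene: "TP53") {
    mutation_id
    chromosome
    position
    consequence
  }
}
"""


def build_graphql_request(query: str, url: str = GRAPHQL_URL) -> urllib.request.Request:
    """Wrap a GraphQL query string in a JSON POST request (hypothetical helper)."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(query: str, url: str = GRAPHQL_URL) -> dict:
    """Send the query and return the decoded JSON response.

    Requires the application to be running locally.
    """
    request = build_graphql_request(query, url)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


request = build_graphql_request(TP53_QUERY)
print(request.get_method(), request.full_url)
```

With the server running, `run_query(TP53_QUERY)` should return the decoded response; the exact response shape depends on the project's schema.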

### Using the REST API

**Get database summary:**
```bash
curl http://localhost:5000/api/neo4j/summary
```

**Search GDC files:**
```bash
curl "http://localhost:5000/api/gdc/files/TCGA-BRCA?limit=10"
```

**Submit BOINC task:**
```bash
curl -X POST http://localhost:5000/api/boinc/submit \
  -H "Content-Type: application/json" \
  -d '{"workunit_type": "variant_calling", "input_file": "data/sample.fastq"}'
```

---

## Analysis Pipeline

### 1. FASTQ Processing

**Quality Control:**
```python
from backend.pipeline import FASTQProcessor

processor = FASTQProcessor()
stats = processor.calculate_statistics("input.fastq")
print(f"Total reads: {stats['total_reads']}")
print(f"Average quality: {stats['avg_quality']}")
```

**Filter by quality:**
```python
# Reuses the `processor` instance created in the previous example
filtered = processor.quality_filter("input.fastq", "filtered.fastq")
print(f"Pass rate: {filtered['pass_rate']:.2%}")
```
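To make the reported numbers concrete, here is a standard-library sketch of how per-read mean quality and a pass rate can be computed. It is not the project's `FASTQProcessor`; it assumes four-line FASTQ records with Phred+33 encoding, and the function names and the threshold of 20 are illustrative. The result keys mirror the ones used in the examples above.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (read_id, sequence, quality) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().strip()
        if not header:
            return
        sequence = handle.readline().strip()
        handle.readline()  # '+' separator line
        quality = handle.readline().strip()
        yield header[1:], sequence, quality


def mean_phred(quality: str, offset: int = 33) -> float:
    """Average Phred score of one read, assuming Phred+33 encoding."""
    if not quality:
        return 0.0
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def summarize(handle: TextIO, min_mean_quality: float = 20.0) -> dict:
    """Count reads, average their mean quality, and compute the pass rate."""
    total = 0
    passed = 0
    quality_sum = 0.0
    for _, _, quality in read_fastq(handle):
        score = mean_phred(quality)
        total += 1
        quality_sum += score
        if score >= min_mean_quality:
            passed += 1
    return {
        "total_reads": total,
        "avg_quality": quality_sum / total if total else 0.0,
        "pass_rate": passed / total if total else 0.0,
    }


# Two synthetic reads: 'I' encodes Phred 40, '!' encodes Phred 0
sample = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\n!!!!\n")
stats = summarize(sample)
print(stats)
```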

### 2. BLAST Alignment

**Run BLAST search:**
```python
from backend.pipeline import BLASTRunner

blast = BLASTRunner()
results = blast.run_blastn("query.fasta")
hits = blast.parse_results(results)

print(f"Found {len(hits)} alignments")
```

**Filter high-quality hits:**
```python
# `blast` and `hits` come from the previous example
filtered_hits = blast.filter_hits(hits, min_identity=0.95)
```
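For readers who want to see what identity filtering involves, the sketch below parses BLAST+ tabular output (`-outfmt 6`) and applies identity and e-value cutoffs. It assumes that output format and treats `min_identity` as a fraction; the project's `parse_results` and `filter_hits` may represent hits differently, and the function names here are illustrative.

```python
# Column order of BLAST+ tabular output (-outfmt 6)
COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def parse_tabular(text: str) -> list:
    """Turn tab-separated BLAST lines into dictionaries with numeric scores."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        hit = dict(zip(COLUMNS, line.split("\t")))
        hit["pident"] = float(hit["pident"])      # percent identity, 0-100
        hit["evalue"] = float(hit["evalue"])
        hit["bitscore"] = float(hit["bitscore"])
        hits.append(hit)
    return hits


def filter_by_identity(hits: list, min_identity: float = 0.95,
                       max_evalue: float = 1e-4) -> list:
    """Keep hits whose identity (as a fraction) and e-value pass both cutoffs."""
    return [
        hit for hit in hits
        if hit["pident"] / 100.0 >= min_identity and hit["evalue"] <= max_evalue
    ]


# Two made-up hits: one at 99% identity, one at 88.5%
raw = (
    "q1\tchr17\t99.0\t100\t1\t0\t1\t100\t500\t599\t1e-50\t180\n"
    "q1\tchr3\t88.5\t80\t9\t0\t1\t80\t900\t979\t2e-10\t95\n"
)
all_hits = parse_tabular(raw)
kept = filter_by_identity(all_hits)
print([hit["sseqid"] for hit in kept])
```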

### 3. Variant Calling

**Identify variants:**
```python
from backend.pipeline import VariantCaller

caller = VariantCaller()
vcf_file = caller.call_variants("alignment.bam", "reference.fa")
variants = caller.filter_variants(vcf_file, min_quality=30)

print(f"Identified {len(variants)} high-quality variants")
```
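The quality filter corresponds to a threshold on the `QUAL` column of the VCF file. A minimal sketch of that idea, assuming standard tab-separated VCF records (`CHROM POS ID REF ALT QUAL FILTER INFO`); `filter_vcf_lines` is a hypothetical helper, not the project's `filter_variants`:

```python
def filter_vcf_lines(lines, min_quality: float = 30.0) -> list:
    """Return variants whose QUAL value is at least min_quality."""
    kept = []
    for line in lines:
        if line.startswith("#"):
            continue  # skip meta-information and the column header
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue  # not a complete VCF record
        qual = fields[5]
        if qual == ".":
            continue  # QUAL is missing for this record
        if float(qual) >= min_quality:
            kept.append({
                "chrom": fields[0],
                "pos": int(fields[1]),
                "ref": fields[3],
                "alt": fields[4],
                "qual": float(qual),
            })
    return kept


# Made-up records for illustration only
vcf_lines = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr17\t1000\t.\tG\tA\t55.0\tPASS\t.",
    "chr17\t2000\t.\tC\tT\t12.5\tPASS\t.",
    "chr13\t3000\t.\tA\tG\t.\tPASS\t.",
]
high_quality = filter_vcf_lines(vcf_lines, min_quality=30)
print(high_quality)
```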

**Find cancer-associated variants:**
```python
from backend.pipeline import VariantAnalyzer

analyzer = VariantAnalyzer()
# `variants` is the filtered list from the previous example
cancer_variants = analyzer.identify_cancer_variants(variants)
tmb = analyzer.calculate_mutation_burden(variants)

print(f"Cancer variants: {len(cancer_variants)}")
print(f"Tumor Mutation Burden: {tmb:.2f} mutations/Mb")
```
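Tumor mutation burden is conventionally reported as the number of mutations per megabase of sequence that was actually covered. The sketch below shows that arithmetic only; how `calculate_mutation_burden` picks the covered region is not documented here, so `callable_bases` and the 30 Mb figure are assumptions for illustration.

```python
def tumor_mutation_burden(variant_count: int, callable_bases: int) -> float:
    """Mutations per megabase over the region where variants could be called."""
    if callable_bases <= 0:
        raise ValueError("callable_bases must be positive")
    return variant_count / (callable_bases / 1_000_000)


# Illustrative numbers: 150 variants over a 30 Mb callable region
tmb = tumor_mutation_burden(150, 30_000_000)
print(f"Tumor Mutation Burden: {tmb:.2f} mutations/Mb")
```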

---

## Advanced Usage

### Custom Neo4j Queries

**Direct Cypher queries:**
```python
from backend.neo4j import DatabaseManager

db = DatabaseManager()

# Find patients with TP53 mutations
query = """
MATCH (p:Patient)-[:HAS_MUTATION]->(m:Mutation)-[:AFFECTS]->(g:Gene {symbol: 'TP53'})
RETURN p.patient_id, m.position, m.consequence
"""

try:
    results = db.execute_query(query)
    for result in results:
        print(result)
finally:
    # Release the connection even if the query fails
    db.close()
```
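When the gene symbol comes from user input, passing it as a Cypher parameter is safer than building the query string by hand. The sketch below uses the official `neo4j` Python driver directly, because it is not documented here whether `DatabaseManager.execute_query` accepts parameters; `mutation_query` and `fetch_mutations` are hypothetical names.

```python
def mutation_query(symbol: str):
    """Return a Cypher statement and its parameter map for one gene symbol."""
    query = (
        "MATCH (p:Patient)-[:HAS_MUTATION]->(m:Mutation)"
        "-[:AFFECTS]->(g:Gene {symbol: $symbol}) "
        "RETURN p.patient_id AS patient_id, m.position AS position, "
        "m.consequence AS consequence"
    )
    return query, {"symbol": symbol}


def fetch_mutations(uri: str, user: str, password: str, symbol: str) -> list:
    """Run the parameterized query and return each record as a dictionary."""
    # Imported lazily so the query builder works without the driver installed
    from neo4j import GraphDatabase  # pip install neo4j

    query, params = mutation_query(symbol)
    with GraphDatabase.driver(uri, auth=(user, password)) as driver:
        with driver.session() as session:
            # Consume the result inside the session, before it is closed
            return [record.data() for record in session.run(query, params)]


query, params = mutation_query("TP53")
print(params)
```

A call such as `fetch_mutations("bolt://localhost:7687", "neo4j", "your_password", "TP53")` needs a running Neo4j instance.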

### Batch Data Import

**Import GDC data:**
```python
from backend.gdc import GDCClient
from backend.neo4j import DataImporter

# Download mutation data
gdc = GDCClient()
files = gdc.get_mutation_data("TCGA-BRCA", limit=10)

for file in files:
    gdc.download_file(file.file_id)

# Import to Neo4j
importer = DataImporter()
importer.import_gdc_data(files)
```

### Custom BOINC Tasks

**Submit custom analysis:**
```python
from backend.boinc import BOINCClient

client = BOINCClient()

# Submit multiple tasks
input_files = ["sample1.fastq", "sample2.fastq", "sample3.fastq"]
task_ids = []

for file in input_files:
    task_id = client.submit_task("variant_calling", file)
    task_ids.append(task_id)

# Monitor progress
for task_id in task_ids:
    status = client.get_task_status(task_id)
    print(f"Task {task_id}: {status.status}")
```
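The loop above checks each task once. For long-running work it is often more useful to poll until every task finishes or a deadline passes. The following sketch takes the status lookup as a function so it can be tried without a BOINC client; the `"Failed"` and `"Timed out"` labels and the exact status strings are assumptions, since this guide only lists Pending, Running, and Completed.

```python
import time
from typing import Callable, Dict, Iterable

# "Failed" is an assumed terminal state; adjust to the statuses the client reports
TERMINAL_STATES = {"Completed", "Failed"}


def wait_for_tasks(
    task_ids: Iterable[str],
    get_status: Callable[[str], str],
    poll_seconds: float = 30.0,
    timeout_seconds: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, str]:
    """Poll until every task reaches a terminal state or the timeout expires."""
    deadline = time.monotonic() + timeout_seconds
    pending = list(task_ids)
    final: Dict[str, str] = {}
    while pending:
        still_pending = []
        for task_id in pending:
            status = get_status(task_id)
            if status in TERMINAL_STATES:
                final[task_id] = status
            else:
                still_pending.append(task_id)
        pending = still_pending
        if not pending:
            break
        if time.monotonic() >= deadline:
            for task_id in pending:
                final[task_id] = "Timed out"
            break
        sleep(poll_seconds)
    return final


# Simulated status sequences stand in for a real client
_fake = {"a": iter(["Pending", "Running", "Completed"]), "b": iter(["Completed"])}
result = wait_for_tasks(
    ["a", "b"],
    get_status=lambda task_id: next(_fake[task_id]),
    poll_seconds=0,
    sleep=lambda seconds: None,
)
print(result)
```

With the client from the example above, the lookup could be `lambda task_id: client.get_task_status(task_id).status`.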

### Configuration Customization

Edit `config.yml`:

```yaml
neo4j:
  uri: "bolt://localhost:7687"
  password: "your_password"

gdc:
  download_dir: "./data/gdc"
  max_retries: 3

pipeline:
  fastq:
    quality_threshold: 25  # Increase quality threshold
    min_length: 75         # Increase minimum read length
  
  blast:
    evalue: 0.0001         # More stringent e-value
    num_threads: 8         # Use more CPU cores
```
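A quick sanity check can catch mistyped values before a long run starts. The sketch below validates an already-loaded configuration dictionary; it is not part of the project, and the chosen limits are assumptions apart from the 0 to 93 range that Phred+33 FASTQ can encode.

```python
def validate_pipeline_config(config: dict) -> list:
    """Return a list of problems; an empty list means the values look sane."""
    problems = []
    pipeline = config.get("pipeline", {})
    fastq = pipeline.get("fastq", {})
    blast = pipeline.get("blast", {})

    quality = fastq.get("quality_threshold")
    # Phred+33 FASTQ can encode scores from 0 to 93
    if quality is not None and not 0 <= quality <= 93:
        problems.append(f"pipeline.fastq.quality_threshold out of range: {quality}")

    min_length = fastq.get("min_length")
    if min_length is not None and min_length < 1:
        problems.append(f"pipeline.fastq.min_length must be at least 1: {min_length}")

    evalue = blast.get("evalue")
    if evalue is not None and evalue <= 0:
        problems.append(f"pipeline.blast.evalue must be positive: {evalue}")

    threads = blast.get("num_threads")
    if threads is not None and threads < 1:
        problems.append(f"pipeline.blast.num_threads must be at least 1: {threads}")

    return problems


# Same values as the YAML example above
example = {
    "pipeline": {
        "fastq": {"quality_threshold": 25, "min_length": 75},
        "blast": {"evalue": 0.0001, "num_threads": 8},
    }
}
bad = {"pipeline": {"fastq": {"quality_threshold": 120}, "blast": {"num_threads": 0}}}

print(validate_pipeline_config(example))
print(validate_pipeline_config(bad))
```

If PyYAML is installed, the dictionary can be read from the file with `yaml.safe_load`.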

---

## Troubleshooting

### Neo4j Connection Issues
```bash
# Check Neo4j status
docker ps | grep neo4j

# Restart Neo4j
docker-compose restart neo4j

# View Neo4j logs
docker-compose logs neo4j
```

### Memory Issues
Increase Docker memory allocation:
1. Open Docker Desktop Settings
2. Resources → Memory
3. Increase to at least 8 GB
4. Click "Apply & Restart"

### API Errors
Check logs:
```bash
# View application logs
cat logs/cancer_at_home.log

# Follow logs in real-time
tail -f logs/cancer_at_home.log
```

---

## Best Practices

1. **Data Management**: Regularly clean up downloaded data to free space
2. **Task Monitoring**: Check BOINC tasks periodically for failures
3. **Database Backup**: Back up the Neo4j data volume regularly
4. **Resource Limits**: Monitor system resources when running large analyses
5. **API Rate Limits**: Be mindful of GDC API rate limits for bulk downloads
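
For bulk downloads, one common way to stay within rate limits is to retry failed requests with an increasing delay. The sketch below is a generic pattern, not the behavior of `GDCClient`; `with_retries`, its parameters, and the delay schedule are assumptions, with `max_retries=3` chosen to match the `config.yml` example.

```python
import time
from typing import Callable, Tuple, Type


def with_retries(
    fetch: Callable[[], object],
    max_retries: int = 3,
    base_delay: float = 1.0,
    retry_on: Tuple[Type[BaseException], ...] = (OSError,),
    sleep: Callable[[float], None] = time.sleep,
):
    """Call fetch(); after a failure wait base_delay * 2**attempt, then retry."""
    for attempt in range(max_retries + 1):
        try:
            return fetch()
        except retry_on:
            if attempt == max_retries:
                raise  # retries exhausted, surface the last error
            sleep(base_delay * 2 ** attempt)


# Simulated download that fails twice before succeeding
attempts = {"count": 0}
delays = []


def flaky_download():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("simulated transient failure")
    return "ok"


result = with_retries(flaky_download, max_retries=3, sleep=delays.append)
print(result, delays)
```

Network errors raised by `urllib` are subclasses of `OSError`, which is why that is the default for `retry_on`.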

---

## Support & Resources

- **Documentation**: See README.md and QUICKSTART.md
- **API Reference**: http://localhost:5000/docs
- **GraphQL Examples**: See GRAPHQL_EXAMPLES.md
- **Logs**: Check `logs/cancer_at_home.log`

### Useful Cypher Queries

**Most common mutations:**
```cypher
MATCH (m:Mutation)<-[:HAS_MUTATION]-(p:Patient)
WITH m, count(p) as patient_count
RETURN m.mutation_id, patient_count
ORDER BY patient_count DESC
LIMIT 10
```

**Genes with most mutations:**
```cypher
MATCH (g:Gene)<-[:AFFECTS]-(m:Mutation)
WITH g, count(m) as mutation_count
RETURN g.symbol, mutation_count
ORDER BY mutation_count DESC
LIMIT 10
```

**Patient mutation profile:**
```cypher
MATCH (p:Patient {patient_id: 'TCGA-A1-001'})-[:HAS_MUTATION]->(m:Mutation)-[:AFFECTS]->(g:Gene)
RETURN g.symbol, m.consequence, m.position
```