USER_GUIDE.md · OpenPeerAI/CancerAtHomeV2 at main

Mentors4EDU

Upload 33 files

7a92197 verified 29 days ago

	# Cancer@Home v2 - User Guide

	## Table of Contents
	1. [Introduction](#introduction)
	2. [System Architecture](#system-architecture)
	3. [Getting Started](#getting-started)
	4. [Dashboard Guide](#dashboard-guide)
	5. [Working with Data](#working-with-data)
	6. [Analysis Pipeline](#analysis-pipeline)
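	7. [Advanced Usage](#advanced-usage)
	8. [Troubleshooting](#troubleshooting)
	9. [Best Practices](#best-practices)
	10. [Support & Resources](#support--resources)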

	---

	## Introduction

	Cancer@Home v2 is a distributed computing platform for cancer genomics research that combines:
	- BOINC: Distributed computing for computationally intensive tasks
	- GDC Portal: Access to comprehensive cancer genomics datasets
	- Neo4j: Graph database for modeling complex relationships
	- Bioinformatics Pipeline: FASTQ processing, BLAST alignment, and variant calling

	### Key Features
	- ✓ Interactive web dashboard
	- ✓ Real-time graph visualization
	- ✓ GraphQL API for flexible data queries
	- ✓ Distributed task processing
	- ✓ Cancer genomics data integration

	---

	## System Architecture

	```
	┌───────────────────────────────────────────────────┐
	│             Web Dashboard (Port 5000)             │
	│  Dashboard | Neo4j Viz | BOINC | GDC | Pipeline   │
	└─────────────────────────┬─────────────────────────┘
	                          │
	┌─────────────────────────┴─────────────────────────┐
	│         FastAPI Backend (REST + GraphQL)          │
	└────┬─────────┬─────────┬─────────┬──────────┬─────┘
	     │         │         │         │          │
	 ┌───┴───┐ ┌───┴───┐ ┌───┴───┐ ┌───┴───┐ ┌────┴────┐
	 │ Neo4j │ │ BOINC │ │  GDC  │ │ FASTQ │ │BLAST/VCF│
	 │ 7687  │ │ Client│ │  API  │ │ Proc  │ │ Caller  │
	 └───────┘ └───────┘ └───────┘ └───────┘ └─────────┘
	```

	---

	## Getting Started

	### Quick Installation (5 minutes)

	Windows:
	```powershell
	.\setup.ps1
	python run.py
	```

	Linux/Mac:
	```bash
	./setup.sh
	python run.py
	```

	### Access Points
	- Main Application: http://localhost:5000
	- API Documentation: http://localhost:5000/docs
	- GraphQL Playground: http://localhost:5000/graphql
	- Neo4j Browser: http://localhost:7474 (username `neo4j`, password `cancer123`)

	---

	## Dashboard Guide

	### 1. Overview Tab
	Shows key statistics:
	- Total genes in database
	- Total mutations identified
	- Number of patients
	- Cancer types catalogued

	Chart: Mutation distribution across cancer types

	### 2. Neo4j Visualization Tab
	Interactive graph showing:
	- Blue nodes: Genes (TP53, BRCA1, KRAS, etc.)
	- Purple nodes: Patients
	- Pink nodes: Cancer types
	- Lines: Relationships between entities

	Navigation:
	- Click and drag nodes to rearrange
	- Hover over nodes for details
	- Zoom in/out with mouse wheel

	### 3. BOINC Tasks Tab
	Manage distributed computing workloads:

	Submit Task:
	1. Select task type (Variant Calling, BLAST, Alignment)
	2. Enter input file path
	3. Click "Submit Task"

	Monitor Tasks:
	- View all tasks with status (Pending, Running, Completed)
	- See task creation time and type
	- Check overall statistics

	### 4. GDC Data Tab
	Browse available cancer projects:
	- TCGA-BRCA: Breast Cancer (1,098 cases)
	- TCGA-LUAD: Lung Adenocarcinoma (585 cases)
	- TCGA-COAD: Colon Adenocarcinoma (461 cases)
	- TCGA-GBM: Glioblastoma (617 cases)
	- TARGET-AML: Acute Myeloid Leukemia (238 cases)

	Click on a project to explore available datasets.

	### 5. Pipeline Tab
	Quick access to bioinformatics tools:
	- FASTQ QC: Quality control for sequencing data
	- BLAST Search: Sequence alignment and homology search
	- Variant Calling: Identify genetic variants

	---

	## Working with Data

	### Querying with GraphQL

	Access the GraphQL playground at http://localhost:5000/graphql

	Example 1: Find mutations in TP53 gene
	```graphql
	query {
	  mutations(gene: "TP53") {
	    mutation_id
	    chromosome
	    position
	    consequence
	  }
	}
	```

	Example 2: Get patient information
	```graphql
	query {
	  patients(project_id: "TCGA-BRCA", limit: 10) {
	    patient_id
	    age
	    gender
	    vital_status
	  }
	}
	```

	Example 3: Cancer statistics
	```graphql
	query {
	  cancerStatistics(cancer_type_id: "BRCA") {
	    total_patients
	    total_mutations
	    avg_mutations_per_patient
	  }
	}
	```
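
The queries above can also be sent from a script instead of the playground. The following is a minimal sketch, assuming the `/graphql` endpoint accepts the usual JSON POST body of the form `{"query": ...}`; the helper names `build_graphql_request` and `run_query` are illustrative and not part of the project.

```python
import json
import urllib.request

# Assumption: the endpoint listed under "Access Points" accepts JSON POST requests.
GRAPHQL_URL = "http://localhost:5000/graphql"

TP53_QUERY = """
query {
  mutations(gene: "TP53") {
    mutation_id
    chromosome
    position
    consequence
  }
}
"""


def build_graphql_request(query: str, url: str = GRAPHQL_URL) -> urllib.request.Request:
    """Wrap a GraphQL query string in a JSON POST request (nothing is sent yet)."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(query: str, url: str = GRAPHQL_URL) -> dict:
    """Send the query and return the decoded JSON response."""
    request = build_graphql_request(query, url)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

With the application running, `run_query(TP53_QUERY)` should return a dictionary; by GraphQL convention the results sit under a `data` key and any problems under `errors`.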

	### Using the REST API

	Get database summary:
	```bash
	curl http://localhost:5000/api/neo4j/summary
	```

	Search GDC files:
	```bash
	curl "http://localhost:5000/api/gdc/files/TCGA-BRCA?limit=10"
	```

	Submit BOINC task:
	```bash
	curl -X POST http://localhost:5000/api/boinc/submit \
	  -H "Content-Type: application/json" \
	  -d '{"workunit_type": "variant_calling", "input_file": "data/sample.fastq"}'
	```
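
For scripted use, the same REST calls can be made from Python with the standard library. This is a rough equivalent of the `curl` commands above, using only the paths and JSON fields shown there; the helper names and the assumption that responses are JSON are mine.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:5000"  # from the "Access Points" section


def gdc_files_url(project_id: str, limit: int = 10) -> str:
    """Build the URL used by the 'Search GDC files' example."""
    query = urllib.parse.urlencode({"limit": limit})
    return f"{BASE_URL}/api/gdc/files/{urllib.parse.quote(project_id)}?{query}"


def boinc_submit_request(workunit_type: str, input_file: str) -> urllib.request.Request:
    """Build the POST request used by the 'Submit BOINC task' example."""
    payload = {"workunit_type": workunit_type, "input_file": input_file}
    return urllib.request.Request(
        f"{BASE_URL}/api/boinc/submit",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request_or_url):
    """Perform the HTTP call and decode the body (assumes a JSON response)."""
    with urllib.request.urlopen(request_or_url, timeout=30) as response:
        return json.load(response)
```

For example, `send(gdc_files_url("TCGA-BRCA"))` mirrors the file search and `send(boinc_submit_request("variant_calling", "data/sample.fastq"))` mirrors the task submission.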

	---

	## Analysis Pipeline

	### 1. FASTQ Processing

	Quality Control:
	```python
	from backend.pipeline import FASTQProcessor

	processor = FASTQProcessor()
	stats = processor.calculate_statistics("input.fastq")
	print(f"Total reads: {stats['total_reads']}")
	print(f"Average quality: {stats['avg_quality']}")
	```

	Filter by quality:
	```python
	filtered = processor.quality_filter("input.fastq", "filtered.fastq")
	print(f"Pass rate: {filtered['pass_rate']:.2%}")
	```
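
To make the reported numbers concrete, here is a standalone sketch of how read count, average quality, and pass rate can be derived from FASTQ text. It is not the `FASTQProcessor` implementation: it assumes four-line records and Phred+33 quality encoding, and the threshold of 20 is only an illustrative default.

```python
from statistics import mean


def parse_fastq(text: str):
    """Yield (read_id, sequence, quality) from four-line FASTQ records."""
    lines = [line for line in text.splitlines() if line.strip()]
    for i in range(0, len(lines) - 3, 4):
        header, sequence, plus, quality = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"Malformed FASTQ record starting at line {i + 1}")
        yield header[1:], sequence, quality


def phred_scores(quality: str, offset: int = 33):
    """Decode a quality string, assuming Phred+33 encoding."""
    return [ord(char) - offset for char in quality]


def summarize(text: str, min_mean_quality: float = 20.0) -> dict:
    """Compute per-file statistics; a read passes if its mean quality meets the threshold."""
    reads = list(parse_fastq(text))
    read_means = [mean(phred_scores(quality)) for _, _, quality in reads]
    passed = sum(1 for value in read_means if value >= min_mean_quality)
    return {
        "total_reads": len(reads),
        "avg_quality": mean(read_means) if read_means else 0.0,
        "pass_rate": passed / len(reads) if reads else 0.0,
    }


# Two toy reads: one with uniformly high quality ('I'), one with the lowest ('!').
SAMPLE = "@read1\nACGT\n+\nIIII\n@read2\nACGT\n+\n!!!!\n"
print(summarize(SAMPLE))
```

The dictionary keys mirror those used in the examples above, which makes it easy to compare a hand calculation with the pipeline's output.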

	### 2. BLAST Alignment

	Run BLAST search:
	```python
	from backend.pipeline import BLASTRunner

	blast = BLASTRunner()
	results = blast.run_blastn("query.fasta")
	hits = blast.parse_results(results)

	print(f"Found {len(hits)} alignments")
	```

	Filter high-quality hits:
	```python
	filtered_hits = blast.filter_hits(hits, min_identity=0.95)
	```
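
As an illustration of what identity filtering involves, the sketch below parses BLAST tabular output (`-outfmt 6`) and keeps hits above an identity cutoff. It is independent of `BLASTRunner`; in particular, treating `min_identity` as a fraction between 0 and 1 is an assumption based on the `0.95` value above, whereas BLAST itself reports identity as a percentage.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    query_id: str
    subject_id: str
    percent_identity: float
    alignment_length: int
    evalue: float
    bit_score: float


def parse_outfmt6(text: str):
    """Parse the default 12-column tabular output of BLAST (-outfmt 6)."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        hits.append(Hit(fields[0], fields[1], float(fields[2]), int(fields[3]),
                        float(fields[10]), float(fields[11])))
    return hits


def filter_hits(hits, min_identity: float = 0.95, max_evalue: float = 1e-4):
    """Keep hits whose identity (as a fraction) and e-value meet the cutoffs."""
    return [hit for hit in hits
            if hit.percent_identity / 100.0 >= min_identity and hit.evalue <= max_evalue]


# Toy output with one strong and one weak alignment.
SAMPLE = (
    "q1\ts1\t98.50\t200\t3\t0\t1\t200\t5\t204\t1e-100\t370\n"
    "q1\ts2\t80.00\t150\t30\t0\t1\t150\t10\t159\t1e-20\t120\n"
)
for hit in filter_hits(parse_outfmt6(SAMPLE)):
    print(hit.subject_id, hit.percent_identity, hit.evalue)
```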

	### 3. Variant Calling

	Identify variants:
	```python
	from backend.pipeline import VariantCaller

	caller = VariantCaller()
	vcf_file = caller.call_variants("alignment.bam", "reference.fa")
	variants = caller.filter_variants(vcf_file, min_quality=30)

	print(f"Identified {len(variants)} high-quality variants")
	```

	Find cancer-associated variants:
	```python
	from backend.pipeline import VariantAnalyzer

	analyzer = VariantAnalyzer()
	cancer_variants = analyzer.identify_cancer_variants(variants)
	tmb = analyzer.calculate_mutation_burden(variants)

	print(f"Cancer variants: {len(cancer_variants)}")
	print(f"Tumor Mutation Burden: {tmb:.2f} mutations/Mb")
	```
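
The two quantities used in this section can be worked through by hand. The sketch below filters VCF records by the QUAL column and computes tumor mutation burden as mutations per megabase of covered sequence. It is a simplified illustration rather than the `VariantCaller` or `VariantAnalyzer` logic, and the covered region size is something the caller has to supply.

```python
def passing_variants(vcf_text: str, min_quality: float = 30.0):
    """Return (chrom, pos, ref, alt) for records whose QUAL is at least min_quality."""
    kept = []
    for line in vcf_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and header lines
        chrom, pos, _variant_id, ref, alt, qual = line.split("\t")[:6]
        if qual != "." and float(qual) >= min_quality:
            kept.append((chrom, int(pos), ref, alt))
    return kept


def tumor_mutation_burden(mutation_count: int, covered_bases: int) -> float:
    """Mutations per megabase: mutation_count / (covered_bases / 1,000,000)."""
    if covered_bases <= 0:
        raise ValueError("covered_bases must be positive")
    return mutation_count / (covered_bases / 1_000_000)


# Toy VCF body: QUAL values are 50, 10 and missing (".").
SAMPLE = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "chr17\t7674220\t.\tC\tT\t50\tPASS\t.\n"
    "chr17\t7675088\t.\tG\tA\t10\tPASS\t.\n"
    "chr12\t25245350\t.\tC\tA\t.\tPASS\t.\n"
)
print(passing_variants(SAMPLE))
print(tumor_mutation_burden(300, 30_000_000))
```

For instance, 300 mutations over a 30 Mb target correspond to 10 mutations/Mb.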

	---

	## Advanced Usage

	### Custom Neo4j Queries

	Direct Cypher queries:
	```python
	from backend.neo4j import DatabaseManager

	db = DatabaseManager()

	# Find patients with TP53 mutations
	query = """
	MATCH (p:Patient)-[:HAS_MUTATION]->(m:Mutation)-[:AFFECTS]->(g:Gene {symbol: 'TP53'})
	RETURN p.patient_id, m.position, m.consequence
	"""

	results = db.execute_query(query)
	for result in results:
	    print(result)

	db.close()
	```

	### Batch Data Import

	Import GDC data:
	```python
	from backend.gdc import GDCClient
	from backend.neo4j import DataImporter

	# Download mutation data
	gdc = GDCClient()
	files = gdc.get_mutation_data("TCGA-BRCA", limit=10)

	for file in files:
	    gdc.download_file(file.file_id)

	# Import to Neo4j
	importer = DataImporter()
	importer.import_gdc_data(files)
	```

	### Custom BOINC Tasks

	Submit custom analysis:
	```python
	from backend.boinc import BOINCClient

	client = BOINCClient()

	# Submit multiple tasks
	input_files = ["sample1.fastq", "sample2.fastq", "sample3.fastq"]
	task_ids = []

	for file in input_files:
	    task_id = client.submit_task("variant_calling", file)
	    task_ids.append(task_id)

	# Monitor progress
	for task_id in task_ids:
	    status = client.get_task_status(task_id)
	    print(f"Task {task_id}: {status.status}")
	```
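
The monitoring loop above checks each task once. For long-running work it is usually more convenient to poll until everything has finished. The helper below is a hypothetical sketch: it takes any callable that maps a task ID to a status string, and it assumes the status values match the dashboard labels (`Pending`, `Running`, `Completed`) plus a `Failed` state, which may differ from what `BOINCClient` actually returns.

```python
import time

# Assumption: these strings mark finished tasks; adjust to the client's real values.
TERMINAL_STATES = {"Completed", "Failed"}


def wait_for_tasks(get_status, task_ids, poll_seconds=30.0, max_polls=120, sleep=time.sleep):
    """Poll until every task is in a terminal state or max_polls is exhausted.

    get_status: callable taking a task ID and returning a status string.
    Returns a dict of the last status seen for each task.
    """
    statuses = {}
    pending = list(task_ids)
    for _ in range(max_polls):
        for task_id in pending:
            statuses[task_id] = get_status(task_id)
        pending = [task_id for task_id in pending
                   if statuses[task_id] not in TERMINAL_STATES]
        if not pending:
            break
        sleep(poll_seconds)
    return statuses
```

With the client from the example above, this could be called as `wait_for_tasks(lambda task_id: client.get_task_status(task_id).status, task_ids)`.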

	### Configuration Customization

	Edit `config.yml`:

	```yaml
	neo4j:
	  uri: "bolt://localhost:7687"
	  password: "your_password"

	gdc:
	  download_dir: "./data/gdc"
	  max_retries: 3

	pipeline:
	  fastq:
	    quality_threshold: 25  # Increase quality threshold
	    min_length: 75         # Increase minimum read length

	  blast:
	    evalue: 0.0001  # More stringent e-value
	    num_threads: 8  # Use more CPU cores
	```
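
Before restarting the application with edited settings, it can help to sanity-check the values. The function below is an illustrative sketch that works on an already-loaded configuration dictionary (for example, the result of parsing `config.yml` with a YAML library). The key layout, including `blast` nested under `pipeline`, and the accepted ranges are assumptions, not rules enforced by the project.

```python
def validate_config(cfg: dict) -> list:
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []

    def check(path, predicate, message):
        node = cfg
        for key in path:
            if not isinstance(node, dict) or key not in node:
                return  # setting absent: nothing to validate
            node = node[key]
        if not predicate(node):
            problems.append(f"{'.'.join(path)}: {message} (got {node!r})")

    def positive_int(value):
        return isinstance(value, int) and not isinstance(value, bool) and value > 0

    def number(value):
        return isinstance(value, (int, float)) and not isinstance(value, bool)

    check(("neo4j", "uri"),
          lambda v: isinstance(v, str) and v.startswith(("bolt", "neo4j")),
          "expected a bolt:// or neo4j:// style URI")
    check(("gdc", "max_retries"),
          lambda v: isinstance(v, int) and v >= 0,
          "must be a non-negative integer")
    check(("pipeline", "fastq", "quality_threshold"),
          lambda v: number(v) and v >= 0,
          "must be a non-negative number")
    check(("pipeline", "fastq", "min_length"), positive_int, "must be a positive integer")
    check(("pipeline", "blast", "evalue"),
          lambda v: number(v) and v > 0,
          "must be a positive number")
    check(("pipeline", "blast", "num_threads"), positive_int, "must be a positive integer")
    return problems
```

Missing keys are skipped on purpose, so a partial override file passes as long as the values it does contain are sensible.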

	---

	## Troubleshooting

	### Neo4j Connection Issues
	```bash
	# Check Neo4j status
	docker ps | grep neo4j

	# Restart Neo4j
	docker-compose restart neo4j

	# View Neo4j logs
	docker-compose logs neo4j
	```
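
If the container looks healthy but the application still cannot connect, a quick TCP check shows whether the expected ports are accepting connections at all. This is a small diagnostic sketch using only the standard library; the port numbers come from the architecture diagram and access points in this guide, and the host is assumed to be the local machine.

```python
import socket

# Ports as documented in this guide.
SERVICES = {
    "Web dashboard / API": 5000,
    "Neo4j Browser (HTTP)": 7474,
    "Neo4j Bolt": 7687,
}


def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_services(host: str = "localhost") -> dict:
    """Map each service name to whether its port is currently reachable."""
    return {name: port_open(host, port) for name, port in SERVICES.items()}


for name, reachable in check_services().items():
    print(f"{name}: {'reachable' if reachable else 'not reachable'}")
```

A reachable port only shows that something is listening; authentication or configuration errors still need to be diagnosed from the logs.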

	### Memory Issues
	Increase Docker memory allocation:
	1. Open Docker Desktop Settings
	2. Resources → Memory
	3. Increase to at least 8 GB
	4. Click "Apply & Restart"

	### API Errors
	Check logs:
	```bash
	# View application logs
	cat logs/cancer_at_home.log

	# Follow logs in real-time
	tail -f logs/cancer_at_home.log
	```

	---

	## Best Practices

	1. Data Management: Regularly clean up downloaded data to free space
	2. Task Monitoring: Check BOINC tasks periodically for failures
	3. Database Backup: Back up the Neo4j data volume regularly
	4. Resource Limits: Monitor system resources when running large analyses
	5. API Rate Limits: Be mindful of GDC API rate limits for bulk downloads
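
One common way to stay within rate limits during bulk downloads is to retry failed calls with exponentially growing pauses. The helper below is a generic sketch of that pattern and not something the project is known to ship; the default of three retries simply mirrors the `max_retries: 3` value in the configuration example, and `retry_with_backoff` is a hypothetical name.

```python
import random
import time


def retry_with_backoff(func, max_retries: int = 3, base_delay: float = 1.0, sleep=time.sleep):
    """Call func(); on failure wait base_delay * 2**attempt plus jitter, then retry.

    Re-raises the last exception once max_retries retries have been used.
    Catching Exception is deliberately broad for this sketch; real code should
    narrow it to network or HTTP errors.
    """
    for attempt in range(max_retries + 1):
        try:
            return func()
        except Exception:
            if attempt == max_retries:
                raise
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)
```

A download from the batch import example could then be wrapped as `retry_with_backoff(lambda: gdc.download_file(file.file_id))`.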

	---

	## Support & Resources

	- Documentation: See README.md and QUICKSTART.md
	- API Reference: http://localhost:5000/docs
	- GraphQL Examples: See GRAPHQL_EXAMPLES.md
	- Logs: Check `logs/cancer_at_home.log`

	### Useful Cypher Queries

	Most common mutations:
	```cypher
	MATCH (m:Mutation)<-[:HAS_MUTATION]-(p:Patient)
	WITH m, count(p) as patient_count
	RETURN m.mutation_id, patient_count
	ORDER BY patient_count DESC
	LIMIT 10
	```

	Genes with most mutations:
	```cypher
	MATCH (g:Gene)<-[:AFFECTS]-(m:Mutation)
	WITH g, count(m) as mutation_count
	RETURN g.symbol, mutation_count
	ORDER BY mutation_count DESC
	LIMIT 10
	```

	Patient mutation profile:
	```cypher
	MATCH (p:Patient {patient_id: 'TCGA-A1-001'})-[:HAS_MUTATION]->(m:Mutation)-[:AFFECTS]->(g:Gene)
	RETURN g.symbol, m.consequence, m.position
	```