# Cancer@Home v2 - Project Summary

## 🎯 Project Overview

Cancer@Home v2 is a distributed computing platform for cancer genomics research. It integrates five components:

1. **Distributed Computing (BOINC)** - Submit and manage computationally intensive cancer research tasks
2. **Cancer Data Portal (GDC)** - Access and download cancer genomics datasets from TCGA and TARGET
3. **Graph Database (Neo4j)** - Model complex relationships between genes, mutations, patients, and cancer types
4. **Bioinformatics Pipeline** - Process FASTQ files, run BLAST searches, and call genetic variants
5. **Interactive Dashboard** - Web-based GUI with real-time visualizations and data exploration

## 📁 Project Structure

```
CancerAtHome2/
├── backend/
│   ├── api/
│   │   └── main.py                # FastAPI application with REST & GraphQL
│   ├── boinc/
│   │   └── client.py              # BOINC distributed computing client
│   ├── gdc/
│   │   └── client.py              # GDC Portal API integration
│   ├── neo4j/
│   │   ├── db_manager.py          # Neo4j database operations
│   │   ├── graphql_schema.py      # GraphQL schema definitions
│   │   └── data_importer.py       # Sample data initialization
│   └── pipeline/
│       ├── fastq_processor.py     # FASTQ quality control
│       ├── blast_runner.py        # BLAST sequence alignment
│       └── variant_caller.py      # Genetic variant identification
├── frontend/
│   └── index.html                 # Interactive web dashboard
├── config.yml                     # Configuration file
├── docker-compose.yml             # Neo4j container setup
├── requirements.txt               # Python dependencies
├── run.py                         # Main application launcher
├── setup.ps1                      # Windows setup script
├── setup.sh                       # Linux/Mac setup script
├── README.md                      # Comprehensive documentation
├── QUICKSTART.md                  # Quick start guide
├── USER_GUIDE.md                  # Detailed user guide
├── GRAPHQL_EXAMPLES.md            # GraphQL query examples
└── LICENSE                        # MIT License
```

## 🚀 Key Features Implemented

### 1. Web Dashboard
- **Modern UI**: Clean, gradient-based design with responsive layout
- **5 Main Tabs**: Dashboard, Neo4j Visualization, BOINC Tasks, GDC Data, Pipeline
- **Real-time Statistics**: Live data from Neo4j showing genes, mutations, patients
- **Interactive Charts**: Chart.js visualizations for mutation distributions
- **D3.js Graph**: Interactive network visualization of cancer genomics relationships

### 2. Neo4j Graph Database
- **Node Types**: Gene, Mutation, Patient, CancerType
- **Relationships**:
  - Gene ← AFFECTS ← Mutation
  - Patient → HAS_MUTATION → Mutation
  - Patient → DIAGNOSED_WITH → CancerType
- **Sample Data**: Pre-loaded with 7 genes, 5 mutations, 5 patients, 4 cancer types
- **Optimized**: Constraints and indexes for fast queries
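
The graph model above lends itself to Cypher queries that walk from a gene to the patients carrying mutations in it. The following is a minimal sketch, not the project's own query: the labels and relationship types are the ones listed above, but the property names `symbol`, `patient_id`, and `name` are assumptions. It only builds the query text and its parameters.

```python
from typing import Dict, Tuple

# Labels and relationship types come from the model described above.
# The property names `symbol`, `patient_id`, and `name` are assumptions.
PATIENTS_BY_GENE = (
    "MATCH (p:Patient)-[:HAS_MUTATION]->(m:Mutation)"
    "-[:AFFECTS]->(g:Gene {symbol: $symbol}) "
    "OPTIONAL MATCH (p)-[:DIAGNOSED_WITH]->(c:CancerType) "
    "RETURN p.patient_id AS patient, c.name AS cancer_type, "
    "count(m) AS mutations "
    "ORDER BY mutations DESC"
)


def patients_by_gene_query(symbol: str) -> Tuple[str, Dict[str, str]]:
    """Return a parameterized Cypher query and its parameters for one gene."""
    cleaned = symbol.strip()
    if not cleaned:
        raise ValueError("gene symbol must not be empty")
    # Gene symbols are conventionally upper case (TP53, BRCA1, ...).
    return PATIENTS_BY_GENE, {"symbol": cleaned.upper()}


query, params = patients_by_gene_query("tp53")
print(query)
print(params)
```

With the official Neo4j Python driver, the pair could be passed to `session.run(query, params)`. Passing the symbol as a parameter rather than formatting it into the string avoids Cypher injection and lets the server reuse the query plan.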

### 3. GraphQL API
- **Flexible Queries**: Get genes, mutations, patients, cancer types
- **Filtering**: Query by gene symbol, chromosome, project ID, cancer type
- **Aggregations**: Mutation frequency, cancer statistics
- **Playground**: Interactive GraphQL explorer at /graphql
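
A GraphQL query is normally sent to the `/graphql` endpoint as a JSON POST body with `query` and `variables` keys. The sketch below builds such a request with the standard library. The field names (`genes`, `symbol`, `chromosome`) and the base URL are assumptions for illustration; the actual schema is documented in GRAPHQL_EXAMPLES.md.

```python
import json
import urllib.request

# Hypothetical schema: a `genes` field with a `symbol` filter.
# Check GRAPHQL_EXAMPLES.md for the real field and argument names.
GENE_QUERY = """
query GeneBySymbol($symbol: String!) {
  genes(symbol: $symbol) {
    symbol
    chromosome
  }
}
"""


def build_graphql_request(
    symbol: str, base_url: str = "http://localhost:5000"
) -> urllib.request.Request:
    """Build (but do not send) a POST request for the /graphql endpoint."""
    body = json.dumps({"query": GENE_QUERY, "variables": {"symbol": symbol}})
    return urllib.request.Request(
        url=f"{base_url}/graphql",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_graphql_request("TP53")
print(request.get_method(), request.full_url)
```

Calling `urllib.request.urlopen(request)` against a running instance would send the query; the response is expected to be JSON with a top-level `data` key, as in any GraphQL server.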

### 4. REST API Endpoints
- `/api/health` - System health check
- `/api/neo4j/summary` - Database statistics
- `/api/neo4j/genes/{symbol}` - Gene information
- `/api/boinc/tasks` - List BOINC tasks
- `/api/boinc/submit` - Submit new task
- `/api/boinc/statistics` - Task statistics
- `/api/gdc/projects` - Available cancer projects
- `/api/gdc/files/{project_id}` - Search GDC files
- `/api/gdc/download` - Download GDC data
- `/api/pipeline/*` - Bioinformatics pipeline endpoints
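
The endpoints above can be called from any HTTP client. As a rough sketch, the helper below builds URLs for the documented paths and fetches JSON with the standard library. It assumes the server listens on `http://localhost:5000` and that responses are JSON; neither the response shapes nor the helper names (`endpoint`, `get_json`) come from the project.

```python
import json
import urllib.error
import urllib.request
from urllib.parse import quote

BASE_URL = "http://localhost:5000"  # assumed default address


def endpoint(path: str, *parts: str) -> str:
    """Join a documented path with URL-escaped path parameters."""
    escaped = "/".join(quote(part, safe="") for part in parts)
    url = f"{BASE_URL}{path}"
    return f"{url}/{escaped}" if escaped else url


def get_json(url: str, timeout: float = 5.0):
    """GET a URL and decode its JSON body; return None if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return json.load(response)
    except (urllib.error.URLError, OSError):
        return None


print(endpoint("/api/health"))
print(endpoint("/api/neo4j/genes", "TP53"))
print(endpoint("/api/gdc/files", "TCGA-BRCA"))
```

For example, `get_json(endpoint("/api/neo4j/summary"))` would return the database statistics when the application is running, and `None` otherwise.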

### 5. BOINC Integration
- **Task Submission**: Support for variant calling, BLAST, alignment tasks
- **Status Tracking**: Monitor pending, running, completed, failed tasks
- **Statistics**: Total tasks, completion rates, average times
- **Task Manager**: High-level interface for common workflows
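
To illustrate the kind of statistics listed above, here is a small sketch that summarizes a list of tasks by status. The `Task` record and its field names are hypothetical and do not reflect the actual client in `backend/boinc/client.py`; the four statuses are the ones named above.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Optional

STATUSES = ("pending", "running", "completed", "failed")


@dataclass
class Task:
    """Hypothetical task record; the real client's fields may differ."""
    task_id: str
    task_type: str  # e.g. "variant_calling", "blast", "alignment"
    status: str
    runtime_seconds: Optional[float] = None  # known once a task finishes


def task_statistics(tasks: List[Task]) -> Dict[str, object]:
    """Summarize counts, completion rate, and mean runtime of completed tasks."""
    counts = Counter(task.status for task in tasks)
    runtimes = [
        task.runtime_seconds
        for task in tasks
        if task.status == "completed" and task.runtime_seconds is not None
    ]
    finished = counts["completed"] + counts["failed"]
    return {
        "total": len(tasks),
        "by_status": {status: counts[status] for status in STATUSES},
        # Rate over finished tasks only, so queued work does not lower it.
        "completion_rate": counts["completed"] / finished if finished else None,
        "average_runtime_seconds": (
            sum(runtimes) / len(runtimes) if runtimes else None
        ),
    }


tasks = [
    Task("t1", "blast", "completed", 120.0),
    Task("t2", "variant_calling", "completed", 300.0),
    Task("t3", "alignment", "failed"),
    Task("t4", "blast", "running"),
]
stats = task_statistics(tasks)
print(stats)
```

Computing the completion rate over finished tasks is one possible definition; dividing by the total number of submitted tasks is equally defensible and depends on what the dashboard is meant to convey.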

### 6. GDC Data Integration
- **Search API**: Query files by project, data type, experimental strategy
- **Download**: Retrieve cancer genomics datasets
- **Projects Supported**: TCGA-BRCA, TCGA-LUAD, TCGA-COAD, TCGA-GBM, TARGET-AML
- **Parsers**: MAF, VCF, and clinical data parsing utilities
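
File searches against the public GDC API are expressed as a JSON `filters` document sent to the `files` endpoint. The sketch below builds such a search URL for a project, optionally narrowed by data type and experimental strategy. It shows the public API's filter format, not necessarily what `backend/gdc/client.py` does, and the function name and selected `fields` are illustrative.

```python
import json
from typing import Optional
from urllib.parse import parse_qs, urlencode, urlparse

GDC_FILES_ENDPOINT = "https://api.gdc.cancer.gov/files"


def build_file_search_url(
    project_id: str,
    data_type: Optional[str] = None,
    experimental_strategy: Optional[str] = None,
    size: int = 10,
) -> str:
    """Build a GDC file-search URL; criteria left as None are skipped."""
    criteria = {
        "cases.project.project_id": project_id,
        "data_type": data_type,
        "experimental_strategy": experimental_strategy,
    }
    clauses = [
        {"op": "in", "content": {"field": field, "value": [value]}}
        for field, value in criteria.items()
        if value
    ]
    params = {
        "filters": json.dumps({"op": "and", "content": clauses}),
        "fields": "file_id,file_name,data_type,file_size",
        "format": "JSON",
        "size": str(size),
    }
    return f"{GDC_FILES_ENDPOINT}?{urlencode(params)}"


url = build_file_search_url(
    "TCGA-BRCA", data_type="Masked Somatic Mutation", size=5
)
# Decode the query string again to show the filter document that is sent.
decoded = json.loads(parse_qs(urlparse(url).query)["filters"][0])
print(json.dumps(decoded, indent=2))
```

Open-access files can be listed this way without credentials; downloading controlled-access data additionally requires a GDC authentication token.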

### 7. Bioinformatics Pipeline
- **FASTQ Processing**:
  - Quality filtering
  - Adapter trimming
  - Statistics calculation
  - Quality control reports
  
- **BLAST Integration**:
  - BLASTN and BLASTP support
  - XML output parsing
  - Hit filtering by identity/e-value
  
- **Variant Calling**:
  - VCF generation
  - Quality filtering
  - Variant annotation
  - Cancer variant identification
  - Tumor mutation burden calculation
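
Two of the steps above are simple enough to sketch in plain Python: filtering reads by mean base quality, and computing tumor mutation burden as mutations per megabase. This is an illustration under stated assumptions (four-line FASTQ records, Phred+33 quality encoding, a mean-quality threshold of 20), not the logic of `fastq_processor.py` or `variant_caller.py`; all function names are hypothetical.

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str, str]  # (read id, sequence, quality string)


def parse_fastq(lines: Iterable[str]) -> Iterator[Record]:
    """Yield records from four-line FASTQ text (sequences not wrapped)."""
    rows = [line.strip() for line in lines if line.strip()]
    if len(rows) % 4 != 0:
        raise ValueError("FASTQ input must contain four lines per record")
    for start in range(0, len(rows), 4):
        header, sequence, separator, quality = rows[start:start + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed record: {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch: {header!r}")
        yield header[1:], sequence, quality


def mean_phred(quality: str, offset: int = 33) -> float:
    """Mean Phred score of a quality string (Phred+33 by default)."""
    if not quality:
        return 0.0
    return sum(ord(char) - offset for char in quality) / len(quality)


def filter_by_quality(
    records: Iterable[Record], min_mean_quality: float = 20.0
) -> List[Record]:
    """Keep reads whose mean Phred score reaches the threshold."""
    return [r for r in records if mean_phred(r[2]) >= min_mean_quality]


def tumor_mutation_burden(somatic_variants: int, callable_bases: int) -> float:
    """Somatic mutations per megabase of sequenced (callable) region."""
    if callable_bases <= 0:
        raise ValueError("callable_bases must be positive")
    return somatic_variants / (callable_bases / 1_000_000)


fastq_text = [
    "@read1", "ACGT", "+", "IIII",  # 'I' encodes Phred 40
    "@read2", "ACGT", "+", "!!!!",  # '!' encodes Phred 0
]
kept = filter_by_quality(parse_fastq(fastq_text))
print([read_id for read_id, _, _ in kept])
print(tumor_mutation_burden(150, 30_000_000))
```

In practice Biopython's `SeqIO.parse(handle, "fastq")` handles parsing and exposes per-base scores, so a hand-written parser like this is mainly useful for seeing what the quality filter computes.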

## 🛠️ Technology Stack

- **Backend**: FastAPI (Python 3.8+)
- **Database**: Neo4j 5.13 (Graph Database)
- **API**: GraphQL (Strawberry), REST
- **Frontend**: HTML5, CSS3, JavaScript
- **Visualization**: D3.js, Chart.js
- **Bioinformatics**: Biopython
- **Data Source**: GDC Portal API
- **Containerization**: Docker, Docker Compose
- **Distributed Computing**: BOINC framework

## 📊 Sample Data Included

### Genes (7)
- TP53 (Tumor protein p53)
- BRAF (B-Raf proto-oncogene)
- BRCA1, BRCA2 (Breast cancer genes)
- PIK3CA, KRAS, EGFR (Oncogenes)

### Mutations (5)
- Various missense mutations in cancer-associated genes
- Includes position, reference/alternate alleles, quality scores

### Patients (5)
- Representative cases from TCGA-BRCA, TCGA-LUAD, TCGA-COAD
- Demographic data, vital status

### Cancer Types (4)
- Breast Cancer (BRCA)
- Lung Adenocarcinoma (LUAD)
- Colon Adenocarcinoma (COAD)
- Glioblastoma (GBM)

## 🎨 Design Principles

1. **Simplicity**: One-command setup, intuitive interface
2. **Speed**: Fast to install and get started (< 5 minutes)
3. **Modularity**: Clean separation of concerns
4. **Extensibility**: Easy to add new data sources and analyses
5. **Visual**: Rich visualizations for data exploration
6. **Professional**: Production-quality code with error handling

## 🔧 Configuration Options

All configurable via `config.yml`:
- Neo4j connection settings
- GDC API parameters
- BOINC server configuration
- Pipeline quality thresholds
- Output directories
- Logging levels
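
One common way to consume such a file is to overlay the user's settings on built-in defaults, so that `config.yml` only needs to list what differs. The sketch below shows that merge. Every key name in it is hypothetical and only mirrors the categories above; the real schema is whatever the project's `config.yml` defines.

```python
import copy
from typing import Any, Dict

# Hypothetical defaults: key names are illustrative only and are not
# taken from the project's actual config.yml.
DEFAULTS: Dict[str, Any] = {
    "neo4j": {"uri": "bolt://localhost:7687", "user": "neo4j"},
    "gdc": {"api_url": "https://api.gdc.cancer.gov", "page_size": 100},
    "pipeline": {"min_quality": 20},
    "logging": {"level": "INFO"},
}


def merge_config(defaults: Dict[str, Any], overrides: Dict[str, Any]) -> Dict[str, Any]:
    """Return defaults overlaid with overrides, merging nested sections."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Stand-in for the dictionary a YAML loader would return for config.yml.
user_config = {"neo4j": {"uri": "bolt://db:7687"}, "logging": {"level": "DEBUG"}}
config = merge_config(DEFAULTS, user_config)
print(config["neo4j"])
print(config["logging"])
```

With PyYAML installed, `user_config` would typically come from `yaml.safe_load` applied to the open file. Merging section by section means that overriding a single Neo4j setting does not discard the other defaults in that section.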

## 📖 Documentation Provided

1. **README.md** - Complete project overview and installation
2. **QUICKSTART.md** - Fast setup and first steps
3. **USER_GUIDE.md** - Comprehensive usage documentation
4. **GRAPHQL_EXAMPLES.md** - GraphQL query examples
5. **Inline Code Comments** - Well-documented Python modules
6. **API Documentation** - Auto-generated Swagger UI at /docs

## 🌟 Unique Features

1. **All-in-One Solution**: Complete stack from data acquisition to visualization
2. **Graph-Based**: Leverages Neo4j's power for complex relationship queries
3. **Real-Time**: Live dashboard updates and task monitoring
4. **Research-Ready**: Built for actual cancer genomics research workflows
5. **Extensible**: Easy to integrate additional data sources and tools
6. **Educational**: Great for learning cancer genomics and graph databases

## 🚦 Getting Started (Quick)

```bash
# Windows
.\setup.ps1
python run.py

# Linux/Mac
./setup.sh
python run.py

# Then open the dashboard in a browser:
# http://localhost:5000
```

## 🎯 Use Cases

1. **Research**: Analyze cancer genomics data with distributed computing
2. **Education**: Learn about cancer genetics and bioinformatics
3. **Visualization**: Explore gene-mutation-patient relationships
4. **Data Integration**: Combine multiple cancer data sources
5. **Pipeline Development**: Test bioinformatics workflows

## 🔮 Future Enhancements (Optional)

- Machine learning for mutation prediction
- Multi-omics data integration (RNA-seq, proteomics)
- Survival analysis and clinical outcomes
- Drug response prediction
- Advanced graph algorithms (PageRank, community detection)
- Real-time collaboration features
- Mobile responsive design
- Export/report generation

## 📝 License

MIT License - Free for academic and commercial use

## 🙏 Acknowledgments

Inspired by:
- Cancer@Home v1 (HeroX DCx Challenge)
- Andrew Kamal's Neo4j Cancer Visualization
- GDC Portal and TCGA Project
- BOINC Distributed Computing Framework

---

**Cancer@Home v2** combines modern web technologies, graph databases, distributed computing, and bioinformatics tools into a cohesive platform that is both powerful and easy to use. The system is production-ready, well-documented, and designed for real-world cancer genomics research.