Summary
Problem: Diagnostic variant interpretation workflows are proprietary, fragmented, and opaque.
Approach: Prototype engine that ingests FASTQ/VCF, annotates variants via ClinVar, gnomAD, CADD, PolyPhen, SIFT, applies ACMG classification, and emits clinician-ready reports in HTML/JSON.
Impact: Demonstrates feasibility of an open-source, vertical “copilot” for clinical genomics that could serve small labs and startups and accelerate adoption of integrated, transparent variant interpretation.
Why This Exists
Genetic testing has grown exponentially in clinical settings and variant calling has now become an industry in its own right. Tools like Fabric Genomics, VarSeq, DRAGEN, and Sentieon’s tools exist, but they’re closed and expensive.
So I built a working prototype of a VCF → ACMG variant classifier, designed to eventually serve as a copilot for diagnostic interpretation.
It’s a copilot-style tool that takes in patient sequencing data (FASTQ/VCF) and outputs a structured, clinically-interpretable variant report using ACMG guidelines. It supports diagnostic WES/WGS, prenatal screens, CNVs, and integrates phenotype-based gene prioritization.
That being said, there are tons of variant-calling algorithms. The landscape is fragmented and could use one vertically integrated neural network that serves every human. Still, I learned the core concepts of how the algorithms work, which open-source packages to use, and how to parallelize. See the table below.
Tool | Tech | Notes |
GATK HaplotypeCaller | Bayesian + local assembly | Industry standard, slower |
FreeBayes | Haplotype-based | Faster, flexible |
DeepVariant | Deep learning (CNN) | Google, accurate, slowish |
Clair3 | Deep learning (pileup + full-alignment models) | Fast + accurate w/ long reads |
Sentieon DNAseq | Optimized GATK reimplementation | ~10x faster, matching results |
DRAGEN | FPGA, deterministic | Fastest overall, costly setup |
Parabricks | GPU GATK | 30–50x faster on A100s |
Mutect2 | Bayesian + assembly | Gold standard for somatic variants |
Strelka2 | Statistical models | Fast, accurate for somatic/germline |
VarDict | Heuristic | Very sensitive, customizable |
DRAGEN Somatic | FPGA | Includes CNV/SV calling |
PEPPER-Margin-DeepVariant | Hybrid Deep + Assembly | High-accuracy long-read pipeline |
Sniffles | Signature-based | Focused on structural variants |
CuteSV | SV caller | Efficient SV detection |
GLnexus | Joint genotyping | Used with DeepVariant |
GATK GenotypeGVCFs | Genotype merging | Standard for GATK pipelines |
Sentieon JointCaller | Optimized joint caller | Fast + deterministic |
NanoCaller | ONT-based SNP/indel | Designed for nanopore reads |
VarScan | Heuristic + stats | SNP + CNV, legacy-friendly |
Octopus | Haplotype-aware w/ ML | Probabilistic, very flexible |
GraphTyper | Graph-based alignment | Emerging graph-genome caller |
Sparrow | pure transformer | Multi-sample variant calling |
⚠️ Note: the API integrations are functionally correct but require specific variant formatting for production use. I kept hitting formatting issues when calling them and ran out of time and patience. The test variants weren't found in the databases, but the infrastructure should work once those formatting issues are fixed. I will return to them at a later date, but for now I need to move on to the next project. I have a 50 GB WGS genome I want to run through this, and that's on the project list, so it will have to be fixed and performant by then.
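One plausible cause of lookup misses like these is a mismatch between how a VCF writes a variant and how a database keys it (a "chr" prefix, or padded REF/ALT alleles). This is a guess, not a diagnosis of the actual bug. The sketch below shows a minimal normalization step under that assumption. The names `VariantKey`, `normalize_key`, and `gnomad_style_id` are hypothetical, and the `chrom-pos-ref-alt` identifier shape is what the gnomAD browser uses as far as I know. It only trims shared bases and does not left-align indels against a reference.

```python
from typing import NamedTuple


class VariantKey(NamedTuple):
    chrom: str
    pos: int  # 1-based position, as in VCF
    ref: str
    alt: str


def normalize_key(chrom: str, pos: int, ref: str, alt: str) -> VariantKey:
    """Strip a 'chr' prefix and trim bases shared by REF and ALT."""
    if chrom.lower().startswith("chr"):
        chrom = chrom[3:]
    ref, alt = ref.upper(), alt.upper()
    # Trim shared trailing bases, keeping at least one base per allele.
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # Trim shared leading bases, shifting the position to match.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return VariantKey(chrom, pos, ref, alt)


def gnomad_style_id(key: VariantKey) -> str:
    """Format a key as chrom-pos-ref-alt (assumed identifier shape)."""
    return f"{key.chrom}-{key.pos}-{key.ref}-{key.alt}"
```

Running every variant through one function like this before any API call keeps the formatting rules in a single place, which makes it easier to find out which database expects which shape.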
Project
Again, this is a prototype genetic variant interpreter, which is basically an engine that:
- Takes a patient's genetic data from a VCF file, which is like a spreadsheet of genetic differences
- Compares it to reference databases to understand:
  - "Is this genetic change common in healthy people?" (gnomAD)
  - "What diseases is this change associated with?" (ClinVar)
  - "What gene is this in and how might it affect a specific protein?" (Ensembl)
  - Basically looking for deviations from the “average” genome
- Classifies the variant using medical guidelines (ACMG) as:
  - Pathogenic or Likely Pathogenic (disease-causing)
  - Benign or Likely Benign (harmless)
  - VUS (Variant of Uncertain Significance)
- Generates a clinical report for doctors and AI tools to use in diagnosis
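To make the "spreadsheet" idea concrete, here is a minimal sketch of the first two steps using only the standard library. It is not the project's code, which uses cyvcf2. The `Variant` class, the function names, and the two lookup tables are hypothetical, and the table contents are invented toy values standing in for real gnomAD and ClinVar queries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    chrom: str
    pos: int
    ref: str
    alt: str


def parse_vcf_line(line: str) -> list[Variant]:
    """Turn one tab-separated VCF data line into one Variant per ALT allele."""
    chrom, pos, _id, ref, alts = line.rstrip("\n").split("\t")[:5]
    return [Variant(chrom, int(pos), ref, a) for a in alts.split(",") if a != "."]


def read_variants(lines):
    """Skip header lines (starting with '#') and yield variants."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        yield from parse_vcf_line(line)


# Toy stand-ins for database lookups; values are invented for illustration.
TOY_GNOMAD_AF = {Variant("1", 1000, "A", "G"): 0.31}
TOY_CLINVAR = {Variant("1", 2000, "C", "T"): "Pathogenic"}


def annotate(variant: Variant) -> dict:
    """Attach population frequency and clinical significance, if known."""
    return {
        "variant": variant,
        "gnomad_af": TOY_GNOMAD_AF.get(variant),  # None means not observed
        "clinvar": TOY_CLINVAR.get(variant),      # None means no record
    }
```

A multi-allelic row such as `A` to `G,T` becomes two separate variants here, since each alternate allele is looked up and classified on its own.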
How this Engine Works
Feed in a .vcf file with human variants, and the copilot will:
- Parse and iterate through variants (cyvcf2)
- Match against ClinVar classifications
- Query allele frequencies from gnomAD
- Annotate with CADD, PolyPhen, and SIFT predictions
- Assign ACMG-style tags (PVS1, PM2, PP3)
- Generate a readable .html or .json report
→ Flagging variants as Pathogenic, Likely Pathogenic, VUS, etc.
You can run it locally on any VCF as a CLI tool.
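The tagging and flagging steps can be sketched roughly as follows. This is a simplified illustration, not the engine's actual logic. The numeric cutoffs and the consequence list are assumptions chosen for the example, and only the pathogenic side of the 2015 ACMG/AMP combining rules is covered, so anything that fails those rules falls through to VUS. Real PVS1 assessment also requires that loss of function is a known disease mechanism for the gene, which is skipped here.

```python
from collections import Counter

# Evidence strength of the three tags named above (2015 ACMG/AMP guidelines).
STRENGTH = {"PVS1": "very_strong", "PM2": "moderate", "PP3": "supporting"}

# Assumed set of consequence terms treated as loss-of-function.
LOF_CONSEQUENCES = {
    "stop_gained",
    "frameshift_variant",
    "splice_donor_variant",
    "splice_acceptor_variant",
}


def assign_tags(consequence, gnomad_af, cadd_phred):
    """Map annotations to evidence tags using illustrative cutoffs."""
    tags = []
    if consequence in LOF_CONSEQUENCES:
        tags.append("PVS1")
    if gnomad_af is None or gnomad_af < 1e-4:  # assumed rarity cutoff
        tags.append("PM2")
    if cadd_phred is not None and cadd_phred >= 20:  # assumed CADD cutoff
        tags.append("PP3")
    return tags


def classify(tags):
    """Apply the pathogenic-side combining rules; benign rules are omitted."""
    n = Counter(STRENGTH[t] for t in tags)
    vs, s, m, p = n["very_strong"], n["strong"], n["moderate"], n["supporting"]
    if (
        (vs >= 1 and (s >= 1 or m >= 2 or (m >= 1 and p >= 1) or p >= 2))
        or s >= 2
        or (s >= 1 and (m >= 3 or (m >= 2 and p >= 2) or (m >= 1 and p >= 4)))
    ):
        return "Pathogenic"
    if (
        (vs >= 1 and m >= 1)
        or (s >= 1 and (m >= 1 or p >= 2))
        or m >= 3
        or (m >= 2 and p >= 2)
        or (m >= 1 and p >= 4)
    ):
        return "Likely Pathogenic"
    return "VUS"
```

For example, a rare stop-gained variant with a high CADD score collects all three tags and lands in Pathogenic, while a rare missense variant with only PM2 and PP3 stays a VUS.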
Code
⚠️ The code does not follow DRY principles very well. It could use a refactoring pass (which will probably have to come from me) and tighter data structures.
Anyways, here is the main annotation engine (vcf_copilot/annotators/engine.py). Annotation adds detail to the raw genetic coordinates, turning them into clinically actionable information that doctors and AI tools can use for diagnosis and treatment decisions.
What the UI looks like to display results.
Tech Stack
- Python backend + future Rust parser (via rust-htslib)
- cyvcf2, requests, jinja2 templating
- REST APIs (gnomAD, ClinVar, Ensembl; in progress)
- Terminal interface with typer and rich
Performance Improvements
There is a lot of room for performance improvement, especially with regard to threading. Each layer of the pipeline can be parallelized differently:
Layer | Parallelism |
Variant annotation | Per-variant (thread, process, async) |
Region calling (BAM) | Scatter across genome |
API calls | Batch/async, stream |
Reporting | Independent |
Core file I/O (VCF parsing) | ⚠️ Bounded by disk IO/thread-safe lib (also on a Mac) |
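As an illustration of the per-variant row, here is a minimal sketch of thread-based annotation using the standard library. `annotate_one` is a hypothetical stand-in for a network-bound lookup and `annotate_all` is an assumed wrapper name. Neither is taken from the project.

```python
from concurrent.futures import ThreadPoolExecutor


def annotate_one(variant):
    """Stand-in for a network-bound annotation call (hypothetical)."""
    chrom, pos, ref, alt = variant
    return {"variant": f"{chrom}-{pos}-{ref}-{alt}", "annotated": True}


def annotate_all(variants, max_workers=8):
    """Annotate variants concurrently, returning results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in the order of the inputs,
        # even though the underlying calls overlap in time.
        return list(pool.map(annotate_one, variants))
```

Threads fit here because API calls spend most of their time waiting on the network. CPU-bound steps would need processes instead, and a real client would also have to respect each API's rate limits.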
Conclusion
It’s just a project in its current state. I would almost go as far as saying it’s a work in progress. But there is a lot that can be expanded upon to turn it into an engine:
- Full API wiring (gnomAD, ClinVar, Ensembl)
- Parallel VCF parsing in Rust
- NLP summary of gene-disease relationships
- Plugin support for CNVs, pharmacogenomics, mitochondrial calls
Eventually: a Nextflow-based clinical interpretation engine usable by small labs, researchers, and startup diagnostics companies.
I’m pretty irritated by this project. I’ll probably return to this eventually, but there are so many options for variant calling, and I really think the industry could benefit from one massively vertically integrated neural network.