Summary

Problem: Diagnostic variant interpretation workflows are proprietary, fragmented, and opaque.

Approach: Prototype engine that ingests FASTQ/VCF, annotates variants via ClinVar, gnomAD, CADD, PolyPhen, SIFT, applies ACMG classification, and emits clinician-ready reports in HTML/JSON.

Impact: Demonstrates the feasibility of an open-source, vertical “copilot” for clinical genomics that could serve small labs and startups and accelerate adoption of integrated, transparent variant interpretation.


Why This Exists

Genetic testing has grown exponentially in clinical settings and variant calling has now become an industry in its own right. Tools like Fabric Genomics, VarSeq, DRAGEN, and Sentieon’s tools exist, but they’re closed and expensive.

So I built a working prototype of a VCF → ACMG variant classifier, designed to eventually serve as a copilot for diagnostic interpretation.

It takes in patient sequencing data (FASTQ/VCF) and outputs a structured, clinically interpretable variant report using ACMG guidelines. It supports diagnostic WES/WGS, prenatal screens, and CNVs, and integrates phenotype-based gene prioritization.

That being said, there are TONS of variant-calling algorithms. The space is fragmented and could use one vertically integrated neural network that serves every human. Still, I learned the core concepts of how the algorithms work, which open-source packages to use, and how to parallelize. See below.

Tool	Tech	Notes
GATK HaplotypeCaller	Bayesian + local assembly	Industry standard, slower
FreeBayes	Haplotype-based	Faster, flexible
DeepVariant	Deep learning (CNN)	Google, accurate, slowish
Clair3	Pileup + full-alignment neural networks	Fast + accurate w/ long reads
Sentieon DNAscope	Optimized haplotype caller + ML model	~10x faster than GATK; the GATK-matching pipeline is Sentieon DNAseq
DRAGEN	FPGA, deterministic	Fastest overall, costly setup
Parabricks	GPU GATK	30–50x faster on A100s
Mutect2	Bayesian + assembly	Gold standard for somatic variants
Strelka2	Statistical models	Fast, accurate for somatic/germline
VarDict	Heuristic	Very sensitive, customizable
DRAGEN Somatic	FPGA	Includes CNV/SV calling
PEPPER-Margin-DeepVariant	Hybrid Deep + Assembly	High-accuracy long-read pipeline
Sniffles	Signature-based	Focused on structural variants
CuteSV	SV caller	Efficient SV detection
GLnexus	Joint genotyping	Used with DeepVariant
GATK GenotypeGVCFs	Genotype merging	Standard for GATK pipelines
Sentieon JointCaller	Optimized joint caller	Fast + deterministic
NanoCaller	ONT-based SNP/indel	Designed for nanopore reads
VarScan	Heuristic + stats	SNP + CNV, legacy-friendly
Octopus	Haplotype-aware w/ ML	Probabilistic, very flexible
GraphTyper	Graph-based alignment	Emerging graph-genome caller
Sparrow	pure transformer	Multi-sample variant calling

⚠️ Note: the API integrations are functionally correct but require specific variant formatting for production use. I kept hitting formatting issues when calling them and have run out of time and patience. The test variants weren't found in the databases, but the infrastructure should work once those formatting issues are fixed. I will return to them at a later date, but for now, I need to move on to the next project. I have a 50GB WGS genome I want to run through this, and that’s on the project list, so it’ll have to be fixed and performant by then.
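I have not pinned down which formatting problem caused the misses, so the following is only a sketch of one common culprit: the same variant written two different ways. It strips the "chr" prefix, trims bases shared by REF and ALT, and builds a "chrom-pos-ref-alt" key, which is the style gnomAD uses for variant IDs as far as I understand it. The function names are hypothetical and are not taken from the repo. Note that this is plain trimming, not full left-alignment (which needs the reference sequence), and it does nothing about genome build mismatches (GRCh37 vs GRCh38), which can also make lookups miss.

```python
def normalize_chrom(chrom):
    """Drop a leading 'chr' so 'chr1' and '1' produce the same key."""
    return chrom[3:] if chrom.lower().startswith("chr") else chrom


def trim_alleles(pos, ref, alt):
    """Trim bases shared by REF and ALT, keeping at least one base in each."""
    # Trim the shared suffix first; the position does not change.
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # Then trim the shared prefix; each trimmed base shifts the position by one.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


def variant_key(chrom, pos, ref, alt):
    """Build a 'chrom-pos-ref-alt' lookup key (assumed ID style, see above)."""
    pos, ref, alt = trim_alleles(pos, ref.upper(), alt.upper())
    return f"{normalize_chrom(chrom)}-{pos}-{ref}-{alt}"


# Made-up coordinates, purely for illustration.
deletion_key = variant_key("chr1", 100, "GTT", "GT")
snv_key = variant_key("chr2", 50, "CAG", "CTG")
```

With a key like this, a multi-base representation of a SNV and its single-base representation collapse to the same string before any database is queried.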

Project

Again, this is a prototype genetic variant interpreter, which is basically an engine that:

Takes a patient's genetic data from a VCF file, which is like a spreadsheet of genetic differences
Compares it against reference databases (population, clinical, and gene-annotation resources) to understand:

"Is this genetic change common in healthy people?" (gnomAD)
"What diseases is this change associated with?" (ClinVar)
"What gene is this in and how might it affect a specific protein?" (Ensembl)
Basically looking for deviations from the “average” genome

Classifies the variant using medical guidelines (ACMG) as:

Pathogenic or Likely Pathogenic (disease-causing, or probably so)
Benign or Likely Benign (harmless, or probably so)
VUS (Variant of Uncertain Significance)

Generates a clinical report for doctors and AI tools to use in diagnosis
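To make the "spreadsheet" analogy concrete, here is a minimal sketch of reading VCF rows using only the standard library. It is a stand-in for illustration, not the parser the project uses (that is cyvcf2). The `Variant` class, the `parse_vcf_lines` function, and the example coordinates are all made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    chrom: str
    pos: int
    ref: str
    alt: str


def parse_vcf_lines(lines):
    """Yield one Variant per ALT allele, skipping header and blank lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        # The first five tab-separated columns are CHROM, POS, ID, REF, ALT.
        chrom, pos, _id, ref, alts = line.split("\t")[:5]
        for alt in alts.split(","):
            if alt in (".", "*"):  # no alternate allele / spanning deletion
                continue
            yield Variant(chrom, int(pos), ref, alt)


# Illustrative rows only; these are not real patient variants.
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "17\t43045712\t.\tG\tA\t50\tPASS\t.",
    "13\t32340300\t.\tGT\tG,GTT\t60\tPASS\t.",
]
variants = list(parse_vcf_lines(example))
```

Each row is one site, and a site with several ALT alleles is split so that every downstream lookup deals with exactly one REF/ALT pair.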

How this Engine Works

Feed in a .vcf file with human variants, and the copilot will:

Parse and iterate through variants (cyvcf2)
Match against ClinVar classifications
Query allele frequencies from gnomAD
Annotate with CADD, PolyPhen, and SIFT predictions
Assign ACMG-style tags (PVS1, PM2, PP3)
Generate a readable .html or .json report

→ Flagging variants as Pathogenic, Likely Pathogenic, VUS, etc.
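The tagging and flagging steps can be sketched roughly as follows. This is a simplified illustration, not the code in the repo: the annotation dictionary keys, the frequency and CADD cutoffs, and the "two of three predictors" rule for PP3 are assumptions I picked for the example. The `classify` function encodes the pathogenic-side combining rules from the 2015 ACMG/AMP guidelines by counting evidence strengths; benign criteria are not modeled, so everything else falls through to VUS.

```python
# Consequence terms treated as loss-of-function (assumed list for the sketch).
LOF_CONSEQUENCES = {
    "stop_gained", "frameshift_variant",
    "splice_donor_variant", "splice_acceptor_variant",
}


def assign_tags(ann, rare_af=1e-4, cadd_cutoff=20.0):
    """Return ACMG-style evidence tags for one annotated variant."""
    tags = []
    # PVS1: predicted loss of function. A real pipeline must also check that
    # loss of function is a known disease mechanism for the gene.
    if ann.get("consequence") in LOF_CONSEQUENCES:
        tags.append("PVS1")
    # PM2: absent or very rare in population data. Caution: a missing
    # frequency can also mean the lookup failed, not that the variant is rare.
    af = ann.get("gnomad_af")
    if af is None or af < rare_af:
        tags.append("PM2")
    # PP3: several in-silico predictors agree the variant is damaging.
    votes = [
        ann.get("cadd_phred") is not None and ann["cadd_phred"] >= cadd_cutoff,
        ann.get("polyphen") == "probably_damaging",
        ann.get("sift") == "deleterious",
    ]
    if sum(votes) >= 2:
        tags.append("PP3")
    return tags


def classify(tags):
    """Combine evidence tags (pathogenic side only) into a classification."""
    vs = sum(t.startswith("PVS") for t in tags)  # very strong
    s = sum(t.startswith("PS") for t in tags)    # strong
    m = sum(t.startswith("PM") for t in tags)    # moderate
    p = sum(t.startswith("PP") for t in tags)    # supporting
    if (vs >= 1 and (s >= 1 or m >= 2 or (m >= 1 and p >= 1) or p >= 2)) \
            or s >= 2 \
            or (s >= 1 and (m >= 3 or (m >= 2 and p >= 2) or (m >= 1 and p >= 4))):
        return "Pathogenic"
    if (vs >= 1 and m >= 1) or (s >= 1 and m >= 1) or (s >= 1 and p >= 2) \
            or m >= 3 or (m >= 2 and p >= 2) or (m >= 1 and p >= 4):
        return "Likely pathogenic"
    return "VUS"


nonsense = {"consequence": "stop_gained", "gnomad_af": None, "cadd_phred": 38.0}
missense = {"consequence": "missense_variant", "gnomad_af": 1e-5,
            "cadd_phred": 28.0, "polyphen": "probably_damaging",
            "sift": "deleterious"}
nonsense_call = classify(assign_tags(nonsense))
missense_call = classify(assign_tags(missense))
```

With only three criteria implemented, most variants will land in VUS, which is expected: the combining rules need several independent lines of evidence before a variant moves out of the uncertain bucket.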

You can run it locally on any VCF as a CLI tool.

Code

⚠️ The code does not follow DRY principles very well. It could use a pass by an optimizer (which will probably have to be me), and it could also use tighter data structures.

Anyways, here is the main annotation engine (vcf_copilot/annotators/engine.py). Annotation adds detail to the raw genetic coordinates, turning them into clinically actionable information that doctors and AI can use for diagnosis and treatment decisions.

What the UI looks like to display results.

Tech Stack

Python backend + future Rust parser (via rust-htslib)
cyvcf2, requests, jinja2 templating
REST APIs (gnomAD, ClinVar, Ensembl — in progress)
Terminal interface with typer and rich

Performance Improvements

There is a lot of room for improvement, especially with regard to threading.

Layer	Parallelism
Variant annotation	Per-variant (thread, process, async)
Region calling (BAM)	Scatter across genome
API calls	Batch/async, stream
Reporting	Independent
Core file I/O (VCF parsing)	⚠️ Bounded by disk IO/thread-safe lib (also on a Mac)
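As an illustration of the per-variant row in the table, here is a minimal sketch of fanning annotation out over a thread pool. Threads suit this layer because the work is mostly waiting on network calls. The `annotate` function is a hypothetical stand-in that sleeps instead of calling a real API, and the worker count is arbitrary.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def annotate(variant):
    """Hypothetical stand-in for a network-bound annotation lookup."""
    time.sleep(0.01)  # simulate waiting on a remote API
    chrom, pos, ref, alt = variant
    return {"variant": f"{chrom}-{pos}-{ref}-{alt}", "gnomad_af": None}


def annotate_all(variants, max_workers=8):
    """Annotate variants concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(annotate, variants))


# Made-up variants, purely for illustration.
variants = [("1", 1000 + i, "A", "G") for i in range(20)]
results = annotate_all(variants)
```

Because `pool.map` preserves input order, the report stage can stay sequential and simple. Against real APIs, the worker count would need to respect each service's rate limits.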

Conclusion

It’s just a project in its current state. I would almost go as far as saying it’s a work in progress. But there is a lot that can be expanded upon to turn it into an engine:

Full API wiring (gnomAD, ClinVar, Ensembl)
Parallel VCF parsing in Rust
NLP summary of gene-disease relationships
Plugin support for CNVs, pharmacogenomics, mitochondrial calls

Eventually: a Nextflow-based clinical interpretation engine usable by small labs, researchers, and startup diagnostics companies.

I’m pretty irritated by this project. I’ll probably return to this eventually, but there are so many options for variant calling, and I really think the industry could benefit from one massively vertically integrated neural network.

Automated Variant Interpretation with a Copilot (FASTQ → Clinic)