Automated Variant Interpretation with a Copilot (FASTQ → Clinic)

Summary

Problem: Diagnostic variant interpretation workflows are proprietary, fragmented, and opaque.

Approach: Prototype engine that ingests FASTQ/VCF, annotates variants via ClinVar, gnomAD, CADD, PolyPhen, SIFT, applies ACMG classification, and emits clinician-ready reports in HTML/JSON.

Impact: Demonstrates the feasibility of an open-source, vertical “copilot” for clinical genomics that could serve small labs and startups and accelerate adoption of integrated, transparent variant interpretation.

Why This Exists

Genetic testing has grown rapidly in clinical settings, and variant calling has become an industry in its own right. Tools like Fabric Genomics, VarSeq, DRAGEN, and Sentieon exist, but they’re closed and expensive.

So I built a working prototype of a VCF → ACMG variant classifier, designed to eventually serve as a copilot for diagnostic interpretation.

It’s a copilot-style tool that takes in patient sequencing data (FASTQ/VCF) and outputs a structured, clinically-interpretable variant report using ACMG guidelines. It supports diagnostic WES/WGS, prenatal screens, CNVs, and integrates phenotype-based gene prioritization.

That being said, there are TONS of variant-calling algorithms (the landscape is fragmented and could use one vertically integrated neural network that serves every human). Still, I learned the core concepts behind how these algorithms work, which open-source packages to use, and how to parallelize them. See below.

| Tool | Tech | Notes |
|---|---|---|
| GATK HaplotypeCaller | Bayesian + local assembly | Industry standard, slower |
| FreeBayes | Haplotype-based | Faster, flexible |
| DeepVariant | Deep learning (CNN) | Google, accurate, slowish |
| Clair3 | Deep learning (pileup + full-alignment) | Fast + accurate w/ long reads |
| Sentieon DNAscope | Optimized GATK clone | ~10x faster, identical results |
| DRAGEN | FPGA, deterministic | Fastest overall, costly setup |
| Parabricks | GPU GATK | 30–50x faster on A100s |
| Mutect2 | Bayesian + assembly | Gold standard for somatic variants |
| Strelka2 | Statistical models | Fast, accurate for somatic/germline |
| VarDict | Heuristic | Very sensitive, customizable |
| DRAGEN Somatic | FPGA | Includes CNV/SV calling |
| PEPPER-Margin-DeepVariant | Hybrid Deep + Assembly | High-accuracy long-read pipeline |
| Sniffles | Signature-based | Focused on structural variants |
| CuteSV | SV caller | Efficient SV detection |
| GLnexus | Joint genotyping | Used with DeepVariant |
| GATK GenotypeGVCFs | Genotype merging | Standard for GATK pipelines |
| Sentieon JointCaller | Optimized joint caller | Fast + deterministic |
| NanoCaller | ONT-based SNP/SV | Designed for nanopore reads |
| VarScan | Heuristic + stats | SNP + CNV, legacy-friendly |
| Octopus | Haplotype-aware w/ ML | Probabilistic, very flexible |
| GraphTyper | Graph-based alignment | Emerging graph-genome caller |
| Sparrow | Pure transformer | Multi-sample variant calling |

⚠️ Note: the API calls are functionally correct, but the services require specific variant formatting before this is usable in production. I kept hitting formatting issues when calling them and have run out of time and patience. The test variants weren't found in the databases, but the infrastructure should work once those formatting issues are fixed. I will return to them at a later date, but for now I need to move on to the next project. I have a 50 GB WGS genome I want to run through this, and that’s on the project list, so it will have to be fixed and performant by then.
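One common cause of "variant not found" lookups is inconsistent formatting between the VCF and the service, for example `chr1` versus `1` or lowercase alleles. Here is a minimal sketch of a normalizer that builds a `chrom-pos-ref-alt` identifier in the style gnomAD uses for variant IDs. The function name `to_variant_id` is hypothetical and not from the project. The sketch does not left-align or trim indels and does not handle genome-build differences, both of which could also explain missed lookups.

```python
VALID_BASES = set("ACGTN")


def to_variant_id(chrom: str, pos: int, ref: str, alt: str) -> str:
    """Build a 'chrom-pos-ref-alt' identifier from VCF-style fields.

    Hypothetical helper: strips a leading 'chr', upper-cases alleles,
    and rejects symbolic alleles such as '<DEL>'.
    """
    chrom = chrom.strip()
    if chrom.lower().startswith("chr"):
        chrom = chrom[3:]
    chrom = chrom.upper()  # so 'x' and 'X' map to the same contig name

    ref = ref.strip().upper()
    alt = alt.strip().upper()

    if pos < 1:
        raise ValueError(f"VCF positions are 1-based, got {pos}")
    for allele in (ref, alt):
        # Anything outside plain bases (empty, '.', '*', '<DEL>') is rejected.
        if not allele or set(allele) - VALID_BASES:
            raise ValueError(f"unsupported allele: {allele!r}")

    return f"{chrom}-{pos}-{ref}-{alt}"
```

Running every variant through one function like this before any API call at least guarantees that all services see the same representation, which makes formatting bugs easier to isolate.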

Project

Again, this is a prototype genetic variant interpreter, which is basically an engine that:

  1. Takes a patient's genetic data from a VCF file, which is like a spreadsheet of genetic differences
  2. Compares it against reference databases to understand:
    • "Is this genetic change common in healthy people?" (gnomAD)
    • "What diseases is this change associated with?" (ClinVar)
    • "What gene is this in and how might it affect a specific protein?" (Ensembl)
    • Basically looking for deviations from the “average” genome
  3. Classifies the variant using medical guidelines (ACMG) as:
    • Pathogenic (likely disease-causing)
    • Benign (likely harmless)
    • VUS (Variant of Uncertain Significance)
  4. Generates a clinical report for doctors and AI tools to use in diagnosis
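To make the "spreadsheet of genetic differences" in step 1 concrete: each data line of a VCF is a tab-separated row whose first columns are CHROM, POS, ID, REF, and ALT. The sketch below reads those columns using only the standard library. The project itself uses cyvcf2 for parsing, so this hand-rolled reader is only an illustration, and the names `Variant` and `parse_vcf_lines` are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class Variant:
    chrom: str
    pos: int
    ref: str
    alt: str


def parse_vcf_lines(lines: Iterable[str]) -> Iterator[Variant]:
    """Yield one Variant per alternate allele, skipping header lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # '##' meta lines and the '#CHROM' column header
        fields = line.split("\t")
        if len(fields) < 5:
            raise ValueError(f"malformed VCF line: {line!r}")
        chrom, pos, _id, ref, alts = fields[:5]
        # A multi-allelic site lists several ALT alleles separated by commas.
        for alt in alts.split(","):
            if alt in (".", "*"):
                continue  # no alternate allele / spanning deletion
            yield Variant(chrom, int(pos), ref, alt)


example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t1000\t.\tA\tG\t50\tPASS\t.",
    "2\t2000\t.\tC\tT,CA\t60\tPASS\t.",
]
variants = list(parse_vcf_lines(example))
```

Splitting multi-allelic rows into one record per alternate allele keeps the downstream lookups simple, since the databases are queried one allele at a time.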

How this Engine Works

Feed in a .vcf file with human variants, and the copilot will:

  • Parse and iterate through variants (cyvcf2)
  • Match against ClinVar classifications
  • Query allele frequencies from gnomAD
  • Annotate with CADD, PolyPhen, and SIFT predictions
  • Assign ACMG-style tags (PVS1, PM2, PP3)
  • Generate a readable .html or .json report, flagging variants as Pathogenic, Likely Pathogenic, VUS, etc.
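The tagging and flagging steps above can be sketched roughly as follows. This is not the project's engine, only a reduced illustration: it covers the three tags named above plus BA1 (the ACMG stand-alone benign criterion for allele frequency above 5%), and it implements only the slice of the ACMG combining rules those tags can trigger. The PM2 frequency cutoff, the CADD cutoff, and the function names `assign_tags` and `classify` are all assumptions chosen for illustration.

```python
from typing import List, Optional

# Sequence Ontology consequence terms treated here as loss-of-function.
LOF_CONSEQUENCES = {
    "stop_gained",
    "frameshift_variant",
    "splice_donor_variant",
    "splice_acceptor_variant",
}


def assign_tags(consequence: str, gnomad_af: Optional[float],
                cadd_phred: Optional[float], sift: str, polyphen: str) -> List[str]:
    tags = []
    if gnomad_af is not None and gnomad_af > 0.05:
        tags.append("BA1")  # common in the population: stand-alone benign
    if consequence in LOF_CONSEQUENCES:
        # Simplification: real PVS1 also requires that loss of function
        # is a known disease mechanism for the gene.
        tags.append("PVS1")
    if gnomad_af is None or gnomad_af < 1e-4:  # assumed cutoff
        tags.append("PM2")
    if (cadd_phred is not None and cadd_phred >= 20  # assumed cutoff
            and sift == "deleterious" and polyphen == "probably_damaging"):
        tags.append("PP3")
    return tags


def classify(tags: List[str]) -> str:
    """Apply a reduced subset of the ACMG combining rules."""
    if "BA1" in tags:
        return "Benign"
    very_strong = sum(t.startswith("PVS") for t in tags)
    moderate = sum(t.startswith("PM") for t in tags)
    supporting = sum(t.startswith("PP") for t in tags)
    if very_strong and (moderate >= 2
                        or (moderate >= 1 and supporting >= 1)
                        or supporting >= 2):
        return "Pathogenic"
    if very_strong and moderate >= 1:
        return "Likely Pathogenic"
    return "VUS"
```

Keeping tag assignment separate from classification means the evidence behind each call stays visible in the report, which matters more for transparency than the final label.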

You can run it locally on any VCF as a CLI tool.

Code

⚠️  The code does not follow DRY principles very well. It could use a pass by an optimizer (which will probably have to be me), and it could also use tighter data structures.

Anyways, here is the main annotation engine (vcf_copilot/annotators/engine.py). Annotation adds detail to the raw genomic coordinates, turning them into clinically actionable information that doctors and AI can use for diagnosis and treatment decisions.

[Image: what the UI looks like when displaying results.]

Tech Stack

  • Python backend + future Rust parser (via rust-htslib)
  • cyvcf2, requests, jinja2 templating
  • REST APIs (gnomAD, ClinVar, Ensembl — in progress)
  • Terminal interface with typer and rich
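As a rough picture of the reporting layer in this stack, here is a sketch that renders the same records to JSON and to a small HTML table. The project uses jinja2 for templating; this sketch swaps in the standard library's `string.Template` so it runs without dependencies. The record fields, the placeholder values, and the function names are assumptions for illustration only.

```python
import html
import json
from string import Template

ROW = Template("<tr><td>$variant</td><td>$gene</td><td>$classification</td></tr>")
PAGE = Template(
    "<html><body><h1>Variant report</h1>\n"
    "<table>\n"
    "<tr><th>Variant</th><th>Gene</th><th>Classification</th></tr>\n"
    "$rows\n"
    "</table></body></html>"
)
FIELDS = ("variant", "gene", "classification")


def render_json(records):
    """Serialize the records as an indented JSON document."""
    return json.dumps({"variants": records}, indent=2)


def render_html(records):
    """Render one table row per record, escaping every value."""
    rows = "\n".join(
        ROW.substitute({k: html.escape(str(r[k])) for k in FIELDS})
        for r in records
    )
    return PAGE.substitute(rows=rows)


# Placeholder record, not real patient or database data.
records = [{"variant": "1-1000-A-G", "gene": "GENE1", "classification": "VUS"}]
json_report = render_json(records)
html_report = render_html(records)
```

Rendering both formats from one list of plain dictionaries keeps the machine-readable and human-readable reports in sync.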

Performance Improvements

There is a lot of room for improvement, especially with regard to threading and parallelism.

| Layer | Parallelism |
|---|---|
| Variant annotation | Per-variant (thread, process, async) |
| Region calling (BAM) | Scatter across genome |
| API calls | Batch/async, stream |
| Reporting | Independent |
| Core file I/O (VCF parsing) | ⚠️ Bounded by disk IO/thread-safe lib (also on a Mac) |
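Per-variant annotation is the easiest layer to parallelize because each lookup is independent and mostly waits on the network. A minimal sketch with a thread pool follows. Here `annotate_variant` is a hypothetical stub that sleeps instead of calling a real service, and the worker count is an arbitrary assumption.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def annotate_variant(variant_id: str) -> Dict[str, object]:
    """Stub annotator: the sleep stands in for network latency."""
    time.sleep(0.01)
    return {"variant": variant_id, "annotated": True}


def annotate_all(variant_ids: List[str], max_workers: int = 8) -> List[Dict[str, object]]:
    """Annotate variants concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order, regardless of which
        # worker finishes first.
        return list(pool.map(annotate_variant, variant_ids))


ids = [f"1-{1000 + i}-A-G" for i in range(20)]
results = annotate_all(ids)
```

Threads suit this case because the work is I/O-bound. CPU-heavy steps would more likely need processes or a scatter across genomic regions, and a real client would also need rate limiting and retries for the remote services.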

Conclusion

It’s just a project in its current state. I would almost go as far as saying it’s a work in progress. But there is a lot that can be expanded upon to turn it into an engine:

  • Full API wiring (gnomAD, ClinVar, Ensembl)
  • Parallel VCF parsing in Rust
  • NLP summary of gene-disease relationships
  • Plugin support for CNVs, pharmacogenomics, mitochondrial calls

Eventually: a Nextflow-based clinical interpretation engine usable by small labs, researchers, and startup diagnostics companies.

I’m pretty irritated by this project. I’ll probably return to this eventually, but there are so many options for variant calling, and I really think the industry could benefit from one massively vertically integrated neural network.