cfDNA Fragmentomics with Python

Summary

Problem: Early-stage cancer detection via cfDNA lacks simple, open approaches for fragment-based screening.

Approach: A Python pipeline (pysam, pandas, numpy) that parses BAM files, computes fragment-length distributions, and generates histograms and summary statistics that separate normal from cancer samples (~20 bp shift in mean fragment length).

Impact: Offers a reproducible fragmentomics benchmark, highlights the early cancer signal in cfDNA fragmentation, and connects it to diagnostics and scalable liquid-biopsy innovation.

Intro

This project is essentially a parsing script wrapped in a Python pipeline for BAM files that helps distinguish cancer from normal samples based on fragment length alone. Many other dimensions of the data are not covered in this project but definitely should be added*.

The fact that we are at the start of catching cancer via cell-free DNA (cfDNA) circulating in the bloodstream, using nothing more than smart filtering and well-placed statistics, is pretty remarkable, and we need hardware that lives in the body to catch it early.

I have included the code, key findings, some additional background, and a look at M&A and where opportunity still exists.

Project

A clean Python pipeline to analyze paired-end BAM files and output fragment length distributions.

  • Built using: pysam, numpy, pandas, matplotlib, seaborn.
  • Handles real patient BAMs or mock test datasets.
  • Generates:
    • CSVs of fragment lengths
    • Summary statistics (mean, median, std, range, percentiles)
    • Histograms and batch comparison plots

Key Findings

Normal Sample (normal_sample_01)

  • Mean fragment length: ~167 bp
  • Range: 50–330 bp
  • Distribution: Symmetric, centered around the mean

Cancer Sample (cancer_sample_01)

  • Mean fragment length: ~146 bp (~20 bp shorter than normal)
  • Range: 50–330 bp
  • Distribution: Shifted toward shorter fragments
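The summary statistics behind these numbers can be sketched with the standard library alone. The project itself uses numpy and pandas; this version avoids them only to stay dependency-free. The function names and the demo values are made up for illustration and are not the project's data.

```python
import statistics


def summarize(lengths):
    """Summary statistics for a list of fragment lengths (bp)."""
    # quantiles(n=100) returns 99 cut points; index k-1 is the k-th percentile
    pcts = statistics.quantiles(lengths, n=100, method="inclusive")
    return {
        "n": len(lengths),
        "mean": statistics.fmean(lengths),
        "median": statistics.median(lengths),
        "std": statistics.stdev(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "p5": pcts[4],
        "p25": pcts[24],
        "p75": pcts[74],
        "p95": pcts[94],
    }


def mean_shift(normal, cancer):
    """Difference in mean fragment length, normal minus cancer (bp)."""
    return statistics.fmean(normal) - statistics.fmean(cancer)


# Tiny made-up inputs, chosen so the means are 167 and 146
normal_demo = [160, 165, 167, 169, 174]
cancer_demo = [139, 144, 146, 148, 153]

normal_stats = summarize(normal_demo)
cancer_stats = summarize(cancer_demo)
shift = mean_shift(normal_demo, cancer_demo)
```

A positive `shift` means the cancer sample is shorter on average, which is the direction reported above.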

Clinical Significance

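  • The ~20 bp difference in mean fragment length between the normal and cancer samples matches the direction of the shift reported in published cfDNA studies; because this dataset is synthetic, it validates the pipeline rather than demonstrating a clinical result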
  • Tumor-derived cfDNA tends to be shorter than cfDNA from healthy cells, which is usually attributed to differences in nucleosome organization and nuclease cleavage in addition to increased cell turnover
  • The observed filtering rate (~50%) indicates robust quality control and realistic data retention

What the Results Mean

  • The updated pipeline reliably distinguishes between normal and cancer cfDNA fragmentation patterns on this specific dataset (which is synthetic)
  • Fragment length analysis can serve as a biomarker for cancer detection and monitoring
  • The shorter cfDNA fragments in cancer samples can inform amplicon design for qPCR + other molecular assays
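Beyond the mean, a few simple length-derived features make these points concrete. The sketch below is an illustration under assumptions: the 150 bp cutoff and the short (100–150 bp) and long (151–220 bp) bins are commonly used fragmentomics settings, not values taken from this project, and all function names are hypothetical.

```python
def short_fraction(lengths, cutoff=150):
    """Fraction of fragments shorter than `cutoff` bp."""
    if not lengths:
        raise ValueError("no fragments")
    return sum(1 for x in lengths if x < cutoff) / len(lengths)


def short_long_ratio(lengths, short=(100, 150), long=(151, 220)):
    """Ratio of short to long fragments; bin edges are inclusive."""
    n_short = sum(short[0] <= x <= short[1] for x in lengths)
    n_long = sum(long[0] <= x <= long[1] for x in lengths)
    return n_short / n_long if n_long else float("inf")


def fraction_covering(lengths, amplicon_len):
    """Fraction of fragments long enough to contain an amplicon.

    Length is necessary but not sufficient: the fragment must also
    span the target locus, which this sketch does not model.
    """
    if not lengths:
        raise ValueError("no fragments")
    return sum(1 for x in lengths if x >= amplicon_len) / len(lengths)


demo = [120, 140, 150, 160, 170, 200, 230]   # made-up lengths in bp
demo_short = short_fraction(demo)
demo_ratio = short_long_ratio(demo)
demo_cover = fraction_covering(demo, 150)
```

For amplicon design, `fraction_covering` gives an upper bound on how many fragments could template a given amplicon length, which is why shorter amplicons suit shorter cfDNA.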

Other Data Sources to Test

  • SRA (NCBI): search "cfDNA AND cancer AND paired-end"
  • TCGA: for cancer WGS data
  • EGA: for European liquid biopsy studies
  • ICGC: for global cancer genomics
  • CCGA (GRAIL): large-scale, multi-cancer liquid biopsy

I also built a mock generator to simulate both normal and cancer BAMs.
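A simplified version of such a generator might look like the sketch below. It simulates fragment lengths only and leaves out writing BAM records. The means come from Key Findings; the Gaussian shape, the 20 bp standard deviation, and every name here are assumptions for illustration, not the project's actual generator.

```python
import random

# Mean fragment lengths (bp) from Key Findings
PROFILES = {"normal": 167, "cancer": 146}


def simulate_lengths(kind, n=10_000, sd=20.0, lo=50, hi=330, seed=0):
    """Draw `n` fragment lengths from a truncated Gaussian.

    Values outside [lo, hi] are rejected and redrawn, so the output
    always respects the 50-330 bp range reported for both samples.
    """
    rng = random.Random(seed)        # seeded for reproducibility
    mean = PROFILES[kind]
    out = []
    while len(out) < n:
        x = round(rng.gauss(mean, sd))
        if lo <= x <= hi:
            out.append(x)
    return out


sim_normal = simulate_lengths("normal")
sim_cancer = simulate_lengths("cancer", seed=1)
sim_normal_mean = sum(sim_normal) / len(sim_normal)
sim_cancer_mean = sum(sim_cancer) / len(sim_cancer)
```

Real cfDNA length distributions are not Gaussian (they show a roughly 10 bp periodicity below the main peak and a second, dinucleosomal peak), so a generator like this is useful for testing the pipeline's plumbing, not for judging classifier performance.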

cfDNA M&A Landscape

Company | Acquirer | Price | Metrics
Verinata Health | Illumina (2013) | $350M + $100M earn-outs | 99% sensitivity (T21), 40M reads/sample
Ariosa Diagnostics | Roche (2014) | $411M + $154M | Harmony test: 100% T21, 0.06% FPR
Sequenom | LabCorp (2016) | $371M | MaterniT21 PLUS: 99.1% T21
CAPP Medical | Roche (2015) | $96M | CAPP-Seq: detects 0.02% mutant-allele-fraction
Sysmex Inostics | Sysmex Corp. (2013) | ~$122M | BEAMing dPCR: 0.01% VAF
Foundation Medicine | Roche (2018) | $2.4B (42.5%) | FoundationOne Liquid CDx: 324 genes, >99.9% NPA
Thrive Earlier Detection | Exact Sciences (2021) | Up to $2.15B | CancerSEEK: 8-cancer panel, 99% specificity

Acquirers paid roughly 1–4x annual sales for prenatal screens, but more than $2B (≈10–20x revenue) for pan-cancer liquid-biopsy platforms once detection at sub-0.5% VAF and broad panels (>300 genes or multi-omic) were proven. Precision, breadth, and clinical-utility data were rewarded, not mere assay throughput. Acquirers also want minimal legal downside, which means results that can be proven.

Innovation in cfDNA

Already Been Done

  • Trisomy (T21/T18/T13) prenatal tests: essentially solved (>99% sensitivity/specificity) and commoditized
  • High-VAF cancer detection (late-stage): solved by multiple players (Guardant, Foundation Medicine, etc.)
  • Fragmentomics signal extraction (like this project): proven viable for tissue-of-origin, cancer stage, etc.
  • Single-mutation ctDNA assays (e.g., EGFR in lung cancer): standard of care

Conclusion: Most high-frequency, single-gene, late-stage detection markets are mature or saturated.

Not Solved Yet

Early-stage, pan-cancer detection

  • Detecting low VAF (<0.1%) from plasma for early-stage cancers is still extremely difficult
  • CancerSEEK (Thrive) and Galleri (GRAIL) have promising data, but reported sensitivity for early-stage cancers is still roughly 50% or lower
  • Opportunity: improve precision, or segment by organ/tissue, or multi-modal fusion (cfDNA + protein, etc.)

Longitudinal tracking and relapse prediction

  • Opportunity in tracking treatment response, MRD (minimal residual disease), or relapse
  • Needs ultra-sensitive, patient-specific custom assays (e.g. TARDIS, CAPP-Seq)
    • Huge opportunity in personalized pipelines or software for this

Real-time, low-cost cfDNA platforms

  • Existing platforms are expensive and slow (centralized NGS, weeks of turnaround)
  • Opportunity: local, fast, cheap versions — especially for global markets or low-resource hospitals
  • Example: Oxford Nanopore-based cfDNA detection

Fragmentomics + LLMs, protein-aware models

  • Most models today are linear (coverage histograms, FFT)
  • Opportunity: learn latent structure (deep models, embeddings) of cfDNA fragmentation + integrate with epigenetics
  • Could unlock non-cancer signals too: transplant rejection, inflammation, autoimmune disease

*Additional cfDNA dimensions

  • Fragment start position (genomic coordinates)
  • Fragment end position
  • Strand orientation (forward/reverse)
  • Read pair orientation (FR, RF, etc.)
  • GC content of the fragment
  • Nucleosome positioning
  • Repeat region annotation
  • Distance to nearest transcription start site (TSS)
  • Overlap with exons/introns/intergenic
  • Fragment coverage depth (per region)
  • Fragmentation entropy (per region)
  • Fragment end motifs (preferred trinucleotides)
  • DNA methylation levels (if available)
  • UMI counts (for deduplication)
  • Fragment jaggedness (non-canonical ends)
  • Sample-level metadata (age, cancer type, stage)
  • Fragment phasing (linked alleles)
  • Fragment insert size distribution slope
  • Copy number variation signals
  • Variant allele frequency (if aligned to SNVs)
  • Fragment density vs. GC bias correlation
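A few of these dimensions are cheap to compute once the fragment sequence or the length list is in hand. The sketch below shows three of them. It rests on stated assumptions: the fragment sequence is given 5' to 3', the end motif is the first three bases (matching the trinucleotide mention above), and "fragmentation entropy" is read as the Shannon entropy of the fragment-length distribution within a region, which is one possible interpretation. All function names are hypothetical.

```python
import math
from collections import Counter


def gc_content(seq):
    """Fraction of G/C bases in a fragment sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def end_motif(seq, k=3):
    """First k bases at the 5' end of the fragment (k=3: trinucleotide)."""
    return seq[:k].upper()


def length_entropy(lengths):
    """Shannon entropy (bits) of the fragment-length distribution."""
    n = len(lengths)
    counts = Counter(lengths)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())


def motif_frequencies(seqs, k=3):
    """Relative frequency of each 5' end motif across fragments."""
    counts = Counter(end_motif(s, k) for s in seqs if len(s) >= k)
    total = sum(counts.values())
    return {m: c / total for m, c in counts.items()}


# Made-up sequences and lengths for demonstration
demo_gc = gc_content("ACGTGGCC")
demo_motif = end_motif("ccaGTT")
demo_entropy = length_entropy([167, 167, 146, 146])
demo_freqs = motif_frequencies(["CCAGT", "CCATT", "TGACC", "AC"])
```

Features like these can be added as extra columns next to fragment length in the per-fragment CSV, so the same summary and plotting code applies to them.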