Summary

Problem: Early-stage cancer detection via cfDNA lacks simple, open approaches for fragment-based screening.

Approach: Python pipeline (using pysam, pandas, numpy) to parse BAM files, compute fragment-length distributions, generate histograms and summary statistics distinguishing normal vs cancer samples (~20 bp mean fragment shift).

Impact: Offers a reproducible fragmentomics benchmark, highlights the early-cancer signal in cfDNA fragmentation, and connects that signal to diagnostics and scalable liquid-biopsy work.

GitHub repo: Embed GitHub

Intro

This project is essentially a parsing script wrapped in a Python pipeline for BAM files. It helps distinguish cancer from normal samples based on fragment length alone. The data has many more dimensions that this project does not use yet, and they should be added*.

We are at the start of catching cancer via cell-free DNA (cfDNA) circulating in the bloodstream, using nothing more than smart filtering and well-placed statistics. That is pretty remarkable, and we need hardware that lives in the body to catch it early.

I have included the code, the key findings, some additional background, and notes on cfDNA M&A and where opportunity still exists.

Project

A clean Python pipeline to analyze paired-end BAM files and output fragment length distributions.

Built using: pysam, numpy, pandas, matplotlib, seaborn.
Handles real patient BAMs or mock test datasets.
Generates:

CSVs of fragment lengths
Summary statistics (mean, median, std, range, percentiles)
Histograms and batch comparison plots
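A minimal sketch of what the extraction step could look like, not the repo's actual code. It assumes one fragment length per properly paired read pair, taken from the BAM template length (TLEN). The function names, the MAPQ threshold of 30, and the 50–330 bp window (borrowed from the ranges reported under Key Findings) are illustrative assumptions.

```python
from typing import Iterator, Optional


def fragment_length(read, min_mapq: int = 30,
                    min_len: int = 50, max_len: int = 330) -> Optional[int]:
    """Return the fragment length for a usable read, or None if it is filtered out.

    `read` is expected to behave like a pysam.AlignedSegment. All thresholds
    are illustrative assumptions, not values confirmed by the project.
    """
    if read.is_unmapped or read.is_secondary or read.is_supplementary:
        return None
    if read.is_duplicate or not read.is_proper_pair:
        return None
    # Count each pair once by keeping only the first mate.
    if not read.is_read1:
        return None
    if read.mapping_quality < min_mapq:
        return None
    length = abs(read.template_length)  # TLEN is signed by mate order
    return length if min_len <= length <= max_len else None


def iter_fragment_lengths(bam_path: str, **filters) -> Iterator[int]:
    """Yield one fragment length per retained read pair in a BAM file."""
    import pysam  # imported lazily so the filter above has no hard dependency

    with pysam.AlignmentFile(bam_path, "rb") as bam:
        # until_eof=True reads the file sequentially and does not need an index.
        for read in bam.fetch(until_eof=True):
            length = fragment_length(read, **filters)
            if length is not None:
                yield length
```

Keeping the filter separate from the file reader means the filtering rules can be checked on plain objects without a BAM file on disk.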

Key Findings

Normal Sample (normal_sample_01)

Mean fragment length: ~167 bp
Range: 50–330 bp
Distribution: Symmetric, centered around the mean

Cancer Sample (cancer_sample_01)

Mean fragment length: ~146 bp (~20 bp shorter than normal)
Range: 50–330 bp
Distribution: Shifted toward shorter fragments
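The numbers above come down to a handful of summary statistics per sample plus a difference of means. Here is a hedged sketch of that calculation using only the standard library (the project itself lists numpy and pandas). The function names and the choice of the 5th and 95th percentiles are assumptions for illustration.

```python
import statistics


def summarize(lengths):
    """Summary statistics for a list of fragment lengths (in bp)."""
    ordered = sorted(lengths)
    # quantiles(n=100) returns the 99 cut points P1..P99.
    pct = statistics.quantiles(ordered, n=100, method="inclusive")
    return {
        "n": len(ordered),
        "mean": statistics.fmean(ordered),
        "median": statistics.median(ordered),
        "std": statistics.stdev(ordered),
        "min": ordered[0],
        "max": ordered[-1],
        "p5": pct[4],
        "p95": pct[94],
    }


def mean_shift(normal_lengths, cancer_lengths):
    """Difference in mean fragment length (normal minus cancer), in bp."""
    return statistics.fmean(normal_lengths) - statistics.fmean(cancer_lengths)
```

With sample means of ~167 bp and ~146 bp, `mean_shift` returns about 21 bp, which is the "~20 bp" shift quoted above.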

Clinical Significance

A ~20 bp difference in mean fragment length between normal and cancer samples would be clinically meaningful, and its direction is consistent with published cfDNA studies
Tumor-derived cfDNA tends to be shorter, which is usually attributed to altered cell death (apoptosis/necrosis) and different fragmentation patterns
About 50% of reads were removed by the quality filters, which suggests strict quality control while still retaining a realistic share of the data

What the Results Mean

The updated pipeline reliably distinguishes between normal and cancer cfDNA fragmentation patterns on this specific dataset (which is synthetic)
Fragment length analysis can serve as a biomarker for cancer detection and monitoring
The shorter cfDNA fragments in cancer samples can inform amplicon design for qPCR + other molecular assays
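One simple way to turn a fragment length distribution into a single number is the fraction of fragments below a length cutoff. The sketch below is an illustration, not a method taken from this project, and the 150 bp cutoff is an assumption chosen to sit between the two sample means.

```python
def short_fragment_fraction(lengths, cutoff=150):
    """Fraction of fragments shorter than `cutoff` bp (cutoff is an assumed value)."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    return sum(1 for length in lengths if length < cutoff) / len(lengths)
```

A sample whose distribution is shifted toward shorter fragments gets a higher fraction, so the value can be compared across samples or tracked over time.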

Other Data Sources to Test

SRA (NCBI): search "cfDNA AND cancer AND paired-end"
TCGA: for cancer WGS data
EGA: for European liquid biopsy studies
ICGC: for global cancer genomics
CCGA (GRAIL): large-scale, multi-cancer liquid biopsy

I also built a mock generator to simulate both normal and cancer BAMs.
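For a sense of how such mock data can be produced, here is a rough sketch that simulates fragment lengths only (the actual generator writes BAMs). It assumes a Gaussian around each sample mean with a 20 bp standard deviation, truncated to 50–330 bp. The function name, the standard deviation, and the sample size are assumptions. Real cfDNA length profiles are not Gaussian, so this only reproduces the mean shift.

```python
import random


def simulate_lengths(mean, sd=20.0, n=5000, lo=50, hi=330, seed=0):
    """Draw `n` integer fragment lengths from a Gaussian truncated to [lo, hi]."""
    rng = random.Random(seed)  # seeded for reproducible mock samples
    lengths = []
    while len(lengths) < n:
        value = round(rng.gauss(mean, sd))
        if lo <= value <= hi:  # rejection sampling enforces the size window
            lengths.append(value)
    return lengths


normal_mock = simulate_lengths(167, seed=1)
cancer_mock = simulate_lengths(146, seed=2)
```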

cfDNA M&A Landscape

Company	Acquirer (year)	Price	Key reported metric
Verinata Health	Illumina (2013)	$350M + $100M earn-outs	99% sensitivity (T21), 40M reads/sample
Ariosa Diagnostics	Roche (2014)	$411M + $154M	Harmony test: 100% T21, 0.06% FPR
Sequenom	LabCorp (2016)	$371M	MaterniT21 PLUS: 99.1% T21
CAPP Medical	Roche (2015)	$96M	CAPP-Seq: detects 0.02% mutant-allele-fraction
Sysmex Inostics	Sysmex Corp. (2013)	~$122M	BEAMing dPCR: 0.01% VAF
Foundation Medicine	Roche (2018)	$2.4B (remaining 42.5% stake)	FoundationOne Liquid CDx: 324 genes, >99.9% NPA
Thrive Earlier Detection	Exact Sciences (2021)	Up to $2.15B	CancerSEEK: 8-cancer panel, 99% specificity

Acquirers paid 1–4× annual sales for prenatal screens, but more than $2B (≈10–20× revenue) for pan-cancer liquid-biopsy platforms once detection at sub-0.5% VAF and broad panels (>300 genes or multi-omic) were proven. Precision, breadth, and clinical-utility data were rewarded, not mere assay throughput. Acquirers also want as little legal downside as possible, which means the results have to be provable.

Innovation in cfDNA

Already Been Done

Trisomy (T21/T18/T13) prenatal tests: essentially solved (>99% sensitivity/specificity) and commoditized
High-VAF cancer detection (late-stage): solved by multiple players (Guardant, Foundation Medicine, etc.)
Fragmentomics signal extraction (like this project): proven viable for tissue-of-origin, cancer stage, etc.
Single-mutation ctDNA assays (e.g. EGFR in lung cancer): standard of care

Conclusion: Most high-frequency, single-gene, late-stage detection markets are mature or saturated.

Not Solved Yet

Early-stage, pan-cancer detection

Detecting low VAF (<0.1%) from plasma for early-stage cancers is still extremely difficult
CancerSEEK (Thrive) and Galleri (GRAIL) have promising data, but sensitivity for early-stage disease is still only ~50%
Opportunity: improve precision, or segment by organ/tissue, or multi-modal fusion (cfDNA + protein, etc.)
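A back-of-the-envelope calculation shows why low VAF is hard. If each fragment covering a locus independently carries the mutation with probability equal to the VAF, the number of mutant fragments among N unique fragments is binomial, and the chance of seeing at least one is 1 - (1 - VAF)^N. The sketch below assumes independent sampling and ignores sequencing error. The function and parameter names are made up for illustration.

```python
from math import comb


def detection_probability(vaf: float, depth: int, min_mutant: int = 1) -> float:
    """P(at least `min_mutant` mutant fragments) among `depth` independent fragments."""
    # Sum the binomial probabilities of seeing fewer than `min_mutant` mutants.
    p_fewer = sum(
        comb(depth, i) * vaf**i * (1.0 - vaf) ** (depth - i)
        for i in range(min_mutant)
    )
    return 1.0 - p_fewer
```

Under these assumptions, a 0.1% VAF sampled at 1,000 unique fragments gives only about a 63% chance of observing a single mutant fragment, and requiring two or more mutant fragments to guard against errors lowers that further.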

Longitudinal tracking and relapse prediction

Opportunity in tracking treatment response, MRD (minimal residual disease), or relapse
Needs ultra-sensitive, patient-specific custom assays (e.g. TARDIS, CAPP-Seq)

Huge opportunity in personalized pipelines or software for this

Real-time, low-cost cfDNA platforms

Existing platforms are expensive and slow (centralized NGS, weeks of turnaround)
Opportunity: local, fast, cheap versions — especially for global markets or low-resource hospitals
Example: Oxford Nanopore-based cfDNA detection

Fragmentomics + LLMs, protein-aware models

Most approaches today rely on linear models over simple features (coverage histograms, FFT)
Opportunity: learn latent structure (deep models, embeddings) of cfDNA fragmentation + integrate with epigenetics
Could unlock non-cancer signals too: transplant rejection, inflammation, autoimmune disease

*Additional cfDNA dimensions

Fragment start position (genomic coordinates)
Fragment end position
Strand orientation (forward/reverse)
Read pair orientation (FR, RF, etc)
GC content of the fragment
Nucleosome positioning
Repeat region annotation
Distance to nearest transcription start site (TSS)
Overlap with exons/introns/intergenic
Fragment coverage depth (per region)
Fragmentation entropy (per region)
Fragment end motifs (preferred trinucleotides)
DNA methylation levels (if available)
UMI counts (for deduplication)
Fragment jaggedness (non-canonical ends)
Sample-level metadata (age, cancer type, stage)
Fragment phasing (linked alleles)
Fragment insert size distribution slope
Copy number variation signals
Variant allele frequency (if aligned to SNVs)
Fragment density vs. GC bias correlation
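Two of these dimensions, GC content and end motifs, can be computed directly from a fragment's sequence. The sketch below assumes the fragment is given as a forward-strand string and reports the first k bases at each 5' end (k=3 for trinucleotides). The function names are hypothetical and not part of the project.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def gc_content(seq: str) -> float:
    """Fraction of called bases (A/C/G/T) that are G or C."""
    called = [base for base in seq.upper() if base in "ACGT"]
    if not called:
        return 0.0
    return sum(base in "GC" for base in called) / len(called)


def end_motifs(seq: str, k: int = 3):
    """Return the 5' k-mers of both fragment ends.

    The first motif is read from the forward strand; the second is read from
    the reverse complement, i.e. the 5' end of the opposite strand.
    """
    seq = seq.upper()
    reverse_complement = seq.translate(_COMPLEMENT)[::-1]
    return seq[:k], reverse_complement[:k]
```

For the fragment `CCAGTTTGA`, `end_motifs` returns `("CCA", "TCA")`. Counting these motifs across all fragments in a sample gives an end-motif frequency profile.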

cfDNA Fragmentomics with Python