“We need better machine learning that can make use of the growing data scale and better matches the inductive biases of the data generating process.” - John Jumper
(i.e. a fully vertically integrated, end-to-end capture → modeling → product system)
Gene Therapy Problems
This is a brief list of problems where deep learning can solve at least part of each. Ultimately, they are bounded by data, preprocessing, algorithms, and inductive biases. The systems that solve them will need to be fully connected end to end, which is downstream of accuracy, which in turn derives from data and algorithmic efficiency.
Below are quick, back-of-the-envelope architecture examples for each. I am not claiming they are accurate or even high quality, just that they came out of brief thought.
1. Accurate RNA + DNA edit delivery prediction
Goal: predict whether RNA and DNA edits will reach the intended cells
How: optimize delivery vectors (AAV serotypes, LNPs) for tissue specificity
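As a minimal sketch of the sequence-to-tropism idea: featurize a capsid peptide insert as overlapping k-mer counts, then score tissue tropism with a linear model per tissue. Everything here is an assumption for illustration, in particular the `TISSUE_WEIGHTS` values, which stand in for weights a model would learn offline from biodistribution data.

```python
from collections import Counter

def kmer_counts(seq: str, k: int = 2) -> Counter:
    """Count overlapping k-mers in an amino-acid sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

# Illustrative placeholder weights, NOT fitted values.
TISSUE_WEIGHTS = {
    "liver": {"LA": 1.0, "QK": -0.5, "GR": 0.2},
    "muscle": {"LA": -0.3, "QK": 0.8, "GR": 0.4},
}

def tropism_scores(capsid_insert: str) -> dict:
    """Score each tissue as a weighted sum over the insert's k-mer counts."""
    feats = kmer_counts(capsid_insert)
    return {
        tissue: sum(w * feats.get(kmer, 0) for kmer, w in weights.items())
        for tissue, weights in TISSUE_WEIGHTS.items()
    }

scores = tropism_scores("LAQKGRLA")
```

In a real pipeline the k-mer features would be replaced by protein-language-model embeddings (see the Models section) and the per-tissue linear heads by an MLP or GBDT.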
2. Off-target effect prediction
Goal: predict unintended edits, immune responses, genomic instability
How: use sequence + structure-based models to assess + quantify risk
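A sequence-only toy version of the risk-quantification step: rank candidate genomic sites by similarity to the guide, penalizing PAM-proximal mismatches more heavily. The position weighting below is a common heuristic shape but the exact values are invented, not CFD or MIT scores.

```python
def offtarget_risk(guide: str, site: str) -> float:
    """Higher score = higher predicted off-target cutting risk (0..1)."""
    assert len(guide) == len(site)
    n = len(guide)
    risk = 1.0
    for i, (g, s) in enumerate(zip(guide, site)):
        if g != s:
            # Mismatches near the 3' (PAM-proximal) end penalize more.
            risk *= 1.0 - (i + 1) / (n + 1)
    return risk

guide = "GACGTTACGTACGGTTAACG"
sites = [
    "GACGTTACGTACGGTTAACG",  # perfect match
    "GACGTTACGTACGGTTAACC",  # PAM-proximal mismatch
    "AACGTTACGTACGGTTAACG",  # PAM-distal mismatch
]
ranked = sorted(sites, key=lambda s: offtarget_risk(guide, s), reverse=True)
```

A structure-aware model would replace this scalar heuristic, but the interface (guide + candidate site → risk) stays the same.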
3. Long-term safety and efficacy effects modeling
Goal: predict systemic and time-dependent effects of edits
How: capture feedback loops in immunity, metabolism, and cell signaling
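The feedback-loop idea can be sketched as a tiny forward-Euler simulation: transgene product P drives an immune response I, which slowly clears transduced cells C. All rate constants are invented placeholders; a real model would be fit to longitudinal RNA-seq / immune-assay data.

```python
def simulate(days=365, dt=1.0, k_expr=1.0, k_imm=0.01,
             k_clear=0.005, k_decay=0.05):
    """Euler-integrate a toy expression/immunity feedback loop."""
    C, I = 1.0, 0.0          # transduced-cell fraction, immune activity
    traj = []
    for _ in range(int(days / dt)):
        P = k_expr * C        # protein output tracks surviving cells
        dC = -k_clear * I * C # immune clearance of transduced cells
        dI = k_imm * P - k_decay * I
        C += dC * dt
        I += dI * dt
        traj.append((C, I, P))
    return traj

traj = simulate()
```

Even this toy version reproduces the qualitative behavior of interest: expression is stable early, then declines as immune activity equilibrates, which is exactly the kind of time-dependent effect the problem statement targets.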
4. Capsid engineering + durability
Goal: optimize viral capsid design for durability and immune evasion
How: use DL for sequence-to-function prediction
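One way the sequence-to-function loop closes is in-silico directed evolution: greedily mutate a seed peptide and keep changes that improve a surrogate fitness. The fitness function below is a stand-in (counting a hypothetical integrin-binding "RGD" motif plus arginine content); in practice it would be a learned sequence-to-function model.

```python
import random

AA = "ACDEFGHIKLMNPQRSTVWY"

def fitness(seq: str) -> float:
    """Placeholder surrogate for a learned capsid-fitness model."""
    return 5.0 * seq.count("RGD") + 0.1 * seq.count("R")

def hill_climb(seed: str, steps: int = 200, rng=random.Random(0)) -> str:
    """Accept single-residue mutations that do not reduce fitness."""
    best = seed
    for _ in range(steps):
        i = rng.randrange(len(best))
        cand = best[:i] + rng.choice(AA) + best[i + 1:]
        if fitness(cand) >= fitness(best):
            best = cand
    return best

variant = hill_climb("AAAAAAA")
```

Swapping the surrogate for an ESM-style predictor turns this into the standard model-guided capsid-engineering loop.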
5. Personalized gene edit recommendation
Goal: personalize therapy design based on genotype, phenotype, and disease context
How: fine-tune AlphaGenome on my personal WGS, use patient-specific omics data to guide edit selection
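A minimal sketch of the recommendation step, assuming a curated ClinVar-style table keyed by variant ID: intersect a patient's variants with the table, drop benign entries, and rank by an actionability score. The variant IDs and scores are invented placeholders.

```python
# Hypothetical curated table (ClinVar-style significance + made-up scores).
CURATED = {
    "rs0001": {"significance": "pathogenic", "actionability": 0.9},
    "rs0002": {"significance": "benign", "actionability": 0.0},
    "rs0003": {"significance": "likely_pathogenic", "actionability": 0.6},
}

def rank_edits(patient_variants):
    """Return (variant, score) pairs for actionable variants, best first."""
    scored = [
        (v, CURATED[v]["actionability"])
        for v in patient_variants
        if v in CURATED and CURATED[v]["significance"] != "benign"
    ]
    return sorted(scored, key=lambda t: t[1], reverse=True)

ranking = rank_edits(["rs0002", "rs0003", "rs0001", "rs9999"])
```

Fine-tuned sequence models would replace the static actionability column; the lookup-and-rank skeleton stays the same.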
6. Human body + safety & efficacy prediction
Goal: full system simulation combining therapy design + body model
How: couple the therapy-design model with a whole-body model and simulate safety and efficacy outcomes holistically
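The crudest possible body model is a compartment split: divide an IV vector dose across organs in proportion to uptake weights, then estimate mean vector genomes per cell. The uptake weights are placeholders and the cell counts are rough published orders of magnitude, so treat the outputs as illustrative only.

```python
ORGANS = {            # organ: (uptake weight, approx. cell count)
    "liver": (0.60, 2.4e11),
    "spleen": (0.15, 1.0e10),
    "muscle": (0.10, 2.5e10),
    "heart": (0.05, 3.0e9),
}

def vg_per_cell(total_dose_vg: float) -> dict:
    """Split the dose by normalized uptake weight, then divide by cells."""
    total_w = sum(w for w, _ in ORGANS.values())
    return {
        organ: (w / total_w) * total_dose_vg / n_cells
        for organ, (w, n_cells) in ORGANS.items()
    }

dist = vg_per_cell(1e14)  # order of magnitude of a high systemic AAV dose
```

A real whole-body simulation would make the uptake weights outputs of the tropism model in problem #1, which is what "fully connected end to end" means here.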
7. Gene therapy for common conditions
Goal: extend ML-enabled gene therapy tools beyond rare diseases to common diseases
How: target high-impact conditions such as diabetes, cardiovascular disease, and autoimmune disorders
Datasets
The selected sources give us a solid foundation based on:
- raw experimental data (for delivery/off-target/omics)
- structural data (for capsid design)
- vector designs (for reconstruction or simulation)
- curated functional and clinical datasets (so we’re not reinventing preprocessing)
| What It Covers | Problem Areas Covered | Why It's Included |
| --- | --- | --- |
| Raw sequencing datasets from published studies: DNA barcoding for biodistribution, GUIDE-seq off-target detection, scRNA-seq delivery mapping, RNA-seq time series, WGS post-edit | 1, 2, 3, 6, 7 | One-stop shop for raw molecular data; we can query "AAV tropism," "capsid evolution," "gene therapy trial RNA-seq," etc. |
| 3D structures of AAV capsids, antibody-bound complexes, receptor-binding domains | 4 | Structural ML, capsid engineering, immune escape modeling |
| Plasmid maps/sequences for AAV backbones, promoters, capsid variants, reporter constructs | 1, 4, 5, 7 | Central source for vector architecture; lets us reproduce or simulate vector designs |
| Curated database of variants with clinical significance annotations | 5 | Lets us train edit-target recommendation models without digging through unstructured literature |
| NIH multimodal datasets for genome editing safety & delivery: sequencing, expression, immune assays, vector info | 1, 2, 3, 6 | Processed and structured datasets covering multiple organs & time points |
| Single-cell perturbation datasets | 1, 3 | Multiple such datasets are required to achieve much of the other functionality |
| 300M+ scRNA-seq profiles | 1, 3 | Same as above, but more data and a far more modern, programmatic user experience |
| 2016 result showing that mixing barcoded cell lines into pools and adding drugs to the pools reproduced the drug response seen when screening the cell lines individually | 1, 2, 5, 6 | Add a genome and scRNA sequencing for an individual |
| 8M cells targeting all human protein-coding genes, with deep sequencing of 16,000+ unique molecular identifiers (UMIs) per cell | 3, 5, 6 | Trained on CZ CellxGene data from healthy human donors; 521 GB |
| How many targetable cells exist in each tissue, by type and size | 1, 2, 6 | A tropism model needs weights: how does the therapy distribute across all ~36T cells; the inverse size–count law implies delivery vectors face a conserved tradeoff; provides the prior distribution of target cells |
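The "~36T cells" row hides a sobering piece of arithmetic worth making explicit: even a large systemic dose spreads thin across the whole body. The numbers below are order-of-magnitude illustrations only.

```python
TOTAL_CELLS = 3.6e13       # rough human cell count
dose_vg = 1e14             # vector genomes in a high systemic dose

# Mean vector genomes available per cell if delivery were uniform.
mean_vg_per_cell = dose_vg / TOTAL_CELLS
```

Roughly three vector genomes per cell under uniform spread, which is why tissue-specific tropism (the non-uniform prior) carries nearly all of the therapeutic leverage.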
Technology Used to Capture AAV Data
DNA barcoding + NGS: tissue or cell-type resolution, high throughput; a tropism atlas covering hundreds of AAVs can be built up
- destructive readout: cells must be killed to collect this data
Fluorescence: whole-animal/tissue resolution, track the GFP moving through the body to the desired locations
qPCR/ddPCR: tissue resolution, absolute quantification of vg/cell, can get regulatory/submission-level quality data
scRNA-seq: single-cell resolution, resolves which cell-types were hit
Spatial transcriptomics: sub-tissue resolution, shows where in the tissue the expression occurred, can use it to show delivery to the hippocampus subregions
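To show how the DNA-barcoding readout above becomes a tropism atlas: raw barcode counts per tissue are normalized within each capsid so rows reflect tissue preference rather than sequencing depth. The capsid names and counts below are invented.

```python
RAW_COUNTS = {                     # capsid -> {tissue: barcode reads}
    "AAV9-var1": {"liver": 9000, "muscle": 500, "brain": 500},
    "AAV9-var2": {"liver": 1000, "muscle": 1000, "brain": 8000},
}

def tropism_atlas(raw):
    """Normalize per-capsid barcode counts into tissue-preference fractions."""
    atlas = {}
    for capsid, counts in raw.items():
        total = sum(counts.values())
        atlas[capsid] = {t: c / total for t, c in counts.items()}
    return atlas

atlas = tropism_atlas(RAW_COUNTS)
```

A production version would also correct for tissue-specific sequencing depth and input library abundance before normalizing, but the matrix shape (capsid × tissue) is the artifact the models in the next section train on.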
Models
There are a lot, but for solving problem #1, I encountered many models that could be combined into something useful.
1. Protein Language Models
- ESM-2 (Evolutionary Scale Modeling): for capsid protein sequence embeddings
- Evo2: for nucleotide-level embeddings of capsid coding sequences (Evo2 is a DNA foundation model rather than a protein one)
2. Single-Cell Analysis Models
- scVI (Single-Cell Variational Inference): for cell type embeddings and synthetic data generation
- scDesign3: for synthetic scRNA-seq data generation
- scDiffusion: alternative for synthetic data generation
- Arc State: for cell state modeling and perturbation analysis
3. Machine Learning Models
- MLP (Multi-Layer Perceptron): for tropism prediction
- XGBoost: for tropism prediction (ensemble method)
- LightGBM: alternative gradient boosting method
- GBDT (Gradient Boosted Decision Trees): for tropism prediction
4. Statistical Models
- scVI: for batch correction and latent space modeling
- BBKNN: for batch correction
- Scanorama: for batch correction
Notes (for myself)
There needs to be a consumer market big and general enough to act as a forcing function on the quality, cost, efficiency, and accuracy of gene therapy delivery. This will require an entrepreneurial shift in which gene therapy becomes a general technology accessible to all people. For instance, gene therapy for common conditions rather than rare diseases:
- soreness
- gene where you only need 4 hours of sleep
- come up/come down quicker from caffeine
- better sex life
- faster metabolism