“We need better machine learning that can make use of the growing data scale and better matches the inductive biases of the data generating process.” - John Jumper
(i.e. a fully vertically integrated, end-to-end capture → modeling → product system)
Gene Therapy Problems
This is a brief list of problems where deep learning can solve at least part of each. Ultimately, they are bounded by data, preprocessing, algorithms, and inductive biases. The systems that solve them will need to be fully connected end to end, which is downstream of accuracy, which in turn derives from data and algorithmic efficiency.
Below are quick, back-of-the-envelope architecture examples for each. I am not claiming they are accurate or even high quality, just that they came out of brief thought.
1. Accurate RNA + DNA edit delivery prediction
Goal: predict whether RNA and DNA edits will reach the intended cells
How: optimize delivery vectors (AAV serotypes, LNPs) for tissue specificity
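As a minimal sketch of the sequence-to-tropism idea: featurize a capsid peptide insert as overlapping k-mer counts, then score tissue tropism with a linear model per tissue. Everything here is an assumption for illustration, in particular the `TISSUE_WEIGHTS` values, which stand in for weights a model would learn offline from biodistribution data.

```python
from collections import Counter

def kmer_counts(seq: str, k: int = 2) -> Counter:
    """Count overlapping k-mers in an amino-acid sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

# Illustrative placeholder weights, NOT fitted values.
TISSUE_WEIGHTS = {
    "liver": {"LA": 1.0, "QK": -0.5, "GR": 0.2},
    "muscle": {"LA": -0.3, "QK": 0.8, "GR": 0.4},
}

def tropism_scores(capsid_insert: str) -> dict:
    """Score each tissue as a weighted sum over the insert's k-mer counts."""
    feats = kmer_counts(capsid_insert)
    return {
        tissue: sum(w * feats.get(kmer, 0) for kmer, w in weights.items())
        for tissue, weights in TISSUE_WEIGHTS.items()
    }

scores = tropism_scores("LAQKGRLA")
```

In a real pipeline the k-mer features would be replaced by protein-language-model embeddings (see the Models section) and the per-tissue linear heads by an MLP or GBDT.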
2. Off-target effect prediction
Goal: predict unintended edits, immune responses, genomic instability
How: use sequence + structure-based models to assess + quantify risk
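A sequence-only toy version of the risk-quantification step: rank candidate genomic sites by similarity to the guide, penalizing PAM-proximal mismatches more heavily. The position weighting below is a common heuristic shape but the exact values are invented, not CFD or MIT scores.

```python
def offtarget_risk(guide: str, site: str) -> float:
    """Higher score = higher predicted off-target cutting risk (0..1)."""
    assert len(guide) == len(site)
    n = len(guide)
    risk = 1.0
    for i, (g, s) in enumerate(zip(guide, site)):
        if g != s:
            # Mismatches near the 3' (PAM-proximal) end penalize more.
            risk *= 1.0 - (i + 1) / (n + 1)
    return risk

guide = "GACGTTACGTACGGTTAACG"
sites = [
    "GACGTTACGTACGGTTAACG",  # perfect match
    "GACGTTACGTACGGTTAACC",  # PAM-proximal mismatch
    "AACGTTACGTACGGTTAACG",  # PAM-distal mismatch
]
ranked = sorted(sites, key=lambda s: offtarget_risk(guide, s), reverse=True)
```

A structure-aware model would replace this scalar heuristic, but the interface (guide + candidate site → risk) stays the same.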
3. Long-term safety and efficacy effects modeling
Goal: predict systemic and time-dependent effects of edits
How: capture feedback loops in immunity, metabolism, and cell signaling
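The feedback-loop idea can be sketched as a tiny forward-Euler simulation: transgene product P drives an immune response I, which slowly clears transduced cells C. All rate constants are invented placeholders; a real model would be fit to longitudinal RNA-seq / immune-assay data.

```python
def simulate(days=365, dt=1.0, k_expr=1.0, k_imm=0.01,
             k_clear=0.005, k_decay=0.05):
    """Euler-integrate a toy expression/immunity feedback loop."""
    C, I = 1.0, 0.0          # transduced-cell fraction, immune activity
    traj = []
    for _ in range(int(days / dt)):
        P = k_expr * C        # protein output tracks surviving cells
        dC = -k_clear * I * C # immune clearance of transduced cells
        dI = k_imm * P - k_decay * I
        C += dC * dt
        I += dI * dt
        traj.append((C, I, P))
    return traj

traj = simulate()
```

Even this toy version reproduces the qualitative behavior of interest: expression is stable early, then declines as immune activity equilibrates, which is exactly the kind of time-dependent effect the problem statement targets.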
4. Capsid engineering + durability
Goal: optimize viral capsid design for durability and immune evasion
How: use DL for sequence-to-function prediction
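One way the sequence-to-function loop closes is in-silico directed evolution: greedily mutate a seed peptide and keep changes that improve a surrogate fitness. The fitness function below is a stand-in (counting a hypothetical integrin-binding "RGD" motif plus arginine content); in practice it would be a learned sequence-to-function model.

```python
import random

AA = "ACDEFGHIKLMNPQRSTVWY"

def fitness(seq: str) -> float:
    """Placeholder surrogate for a learned capsid-fitness model."""
    return 5.0 * seq.count("RGD") + 0.1 * seq.count("R")

def hill_climb(seed: str, steps: int = 200, rng=random.Random(0)) -> str:
    """Accept single-residue mutations that do not reduce fitness."""
    best = seed
    for _ in range(steps):
        i = rng.randrange(len(best))
        cand = best[:i] + rng.choice(AA) + best[i + 1:]
        if fitness(cand) >= fitness(best):
            best = cand
    return best

variant = hill_climb("AAAAAAA")
```

Swapping the surrogate for an ESM-style predictor turns this into the standard model-guided capsid-engineering loop.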
5. Personalized gene edit recommendation
Goal: personalize therapy design based on genotype, phenotype, and disease context
How: fine-tune AlphaGenome on my personal WGS, use patient-specific omics data to guide edit selection
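A minimal sketch of the recommendation step, assuming a curated ClinVar-style table keyed by variant ID: intersect a patient's variants with the table, drop benign entries, and rank by an actionability score. The variant IDs and scores are invented placeholders.

```python
# Hypothetical curated table (ClinVar-style significance + made-up scores).
CURATED = {
    "rs0001": {"significance": "pathogenic", "actionability": 0.9},
    "rs0002": {"significance": "benign", "actionability": 0.0},
    "rs0003": {"significance": "likely_pathogenic", "actionability": 0.6},
}

def rank_edits(patient_variants):
    """Return (variant, score) pairs for actionable variants, best first."""
    scored = [
        (v, CURATED[v]["actionability"])
        for v in patient_variants
        if v in CURATED and CURATED[v]["significance"] != "benign"
    ]
    return sorted(scored, key=lambda t: t[1], reverse=True)

ranking = rank_edits(["rs0002", "rs0003", "rs0001", "rs9999"])
```

Fine-tuned sequence models would replace the static actionability column; the lookup-and-rank skeleton stays the same.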
6. Human body + safety & efficacy prediction
Goal: full system simulation combining therapy design + body model
How: couple the therapy-design model with a whole-body model and simulate safety and efficacy outcomes holistically
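The crudest possible body model is a compartment split: divide an IV vector dose across organs in proportion to uptake weights, then estimate mean vector genomes per cell. The uptake weights are placeholders and the cell counts are rough published orders of magnitude, so treat the outputs as illustrative only.

```python
ORGANS = {            # organ: (uptake weight, approx. cell count)
    "liver": (0.60, 2.4e11),
    "spleen": (0.15, 1.0e10),
    "muscle": (0.10, 2.5e10),
    "heart": (0.05, 3.0e9),
}

def vg_per_cell(total_dose_vg: float) -> dict:
    """Split the dose by normalized uptake weight, then divide by cells."""
    total_w = sum(w for w, _ in ORGANS.values())
    return {
        organ: (w / total_w) * total_dose_vg / n_cells
        for organ, (w, n_cells) in ORGANS.items()
    }

dist = vg_per_cell(1e14)  # order of magnitude of a high systemic AAV dose
```

A real whole-body simulation would make the uptake weights outputs of the tropism model in problem #1, which is what "fully connected end to end" means here.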
7. Gene therapy for common conditions
Goal: extend ML-enabled gene therapy tools beyond rare diseases to common diseases
How: target high-impact conditions such as diabetes, cardiovascular disease, and autoimmune disorders
Datasets
The selected sources give us a solid foundation based on:
- raw experimental data (for delivery/off-target/omics)
- structural data (for capsid design)
- vector designs (for reconstruction or simulation)
- curated functional and clinical datasets (so we’re not reinventing preprocessing)
| What It Covers | Problem Areas Covered | Why It's Included |
| --- | --- | --- |
| Raw sequencing datasets from published studies: DNA barcoding for biodistribution, GUIDE-seq off-target detection, scRNA-seq delivery mapping, RNA-seq time series, WGS post-edit | 1, 2, 3, 6, 7 | One-stop shop for raw molecular data; we can query "AAV tropism," "capsid evolution," "gene therapy trial RNA-seq," etc. |
| 3D structures of AAV capsids, antibody-bound complexes, receptor-binding domains | 4 | Structural ML, capsid engineering, immune escape modeling |
| Plasmid maps/sequences for AAV backbones, promoters, capsid variants, reporter constructs | 1, 4, 5, 7 | Central source for vector architecture; lets us reproduce or simulate vector designs |
| Curated database of variants with clinical significance annotations | 5 | Lets us train edit-target recommendation models without digging through unstructured literature |
| NIH multimodal datasets for genome editing safety & delivery: sequencing, expression, immune assays, vector info | 1, 2, 3, 6 | Processed and structured datasets covering multiple organs & time points |
| Single-cell perturbation datasets | 1, 3 | Multiple such datasets are required to achieve much of the other functionality |
| 300M+ scRNA-seq profiles | 1, 3 | Same as above, but more data and a far more modern, programmatic user experience |
| 2016 result showing that mixing barcoded cell lines into pools and adding drugs to the pools reproduced the drug response seen when screening the cell lines individually | 1, 2, 5, 6 | Add a genome and scRNA sequencing for an individual |
| 8M cells targeting all human protein-coding genes, with deep sequencing of 16,000+ unique molecular identifiers (UMIs) per cell | 3, 5, 6 | Trained on CZ CellxGene data from healthy human donors; 521 GB |
| How many targetable cells exist in each tissue, by type and size | 1, 2, 6 | A tropism model needs weights: how does the therapy distribute across all ~36T cells; the inverse size–count law implies delivery vectors face a conserved tradeoff; provides the prior distribution of target cells |
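The "~36T cells" row hides a sobering piece of arithmetic worth making explicit: even a large systemic dose spreads thin across the whole body. The numbers below are order-of-magnitude illustrations only.

```python
TOTAL_CELLS = 3.6e13       # rough human cell count
dose_vg = 1e14             # vector genomes in a high systemic dose

# Mean vector genomes available per cell if delivery were uniform.
mean_vg_per_cell = dose_vg / TOTAL_CELLS
```

Roughly three vector genomes per cell under uniform spread, which is why tissue-specific tropism (the non-uniform prior) carries nearly all of the therapeutic leverage.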
Technology Used to Capture AAV Data
DNA barcoding + NGS: tissue or cell-type resolution, high throughput; a tropism atlas covering hundreds of AAVs can be built up
- destructive readout: cells must be killed to collect this data
Fluorescence: whole-animal/tissue resolution, track the GFP moving through the body to the desired locations
qPCR/ddPCR: tissue resolution, absolute quantification of vg/cell, can get regulatory/submission-level quality data
scRNA-seq: single-cell resolution, resolves which cell-types were hit
Spatial transcriptomics: sub-tissue resolution, shows where in the tissue the expression occurred, can use it to show delivery to the hippocampus subregions
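To show how the DNA-barcoding readout above becomes a tropism atlas: raw barcode counts per tissue are normalized within each capsid so rows reflect tissue preference rather than sequencing depth. The capsid names and counts below are invented.

```python
RAW_COUNTS = {                     # capsid -> {tissue: barcode reads}
    "AAV9-var1": {"liver": 9000, "muscle": 500, "brain": 500},
    "AAV9-var2": {"liver": 1000, "muscle": 1000, "brain": 8000},
}

def tropism_atlas(raw):
    """Normalize per-capsid barcode counts into tissue-preference fractions."""
    atlas = {}
    for capsid, counts in raw.items():
        total = sum(counts.values())
        atlas[capsid] = {t: c / total for t, c in counts.items()}
    return atlas

atlas = tropism_atlas(RAW_COUNTS)
```

A production version would also correct for tissue-specific sequencing depth and input library abundance before normalizing, but the matrix shape (capsid × tissue) is the artifact the models in the next section train on.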
Models
There are a lot, but for solving problem #1, I encountered many models that could be combined into something useful.
1. Protein Language Models
- ESM-2 (Evolutionary Scale Modeling): for capsid protein sequence embeddings
- Evo2: for nucleotide-level embeddings of capsid coding sequences (Evo2 is a DNA foundation model rather than a protein one)
2. Single-Cell Analysis Models
- scVI (Single-Cell Variational Inference): for cell type embeddings and synthetic data generation
- scDesign3: for synthetic scRNA-seq data generation
- scDiffusion: alternative for synthetic data generation
- Arc State: for cell state modeling and perturbation analysis
3. Machine Learning Models
- MLP (Multi-Layer Perceptron): for tropism prediction
- XGBoost: for tropism prediction (ensemble method)
- LightGBM: alternative gradient boosting method
- GBDT (Gradient Boosted Decision Trees): for tropism prediction
4. Statistical Models
- scVI: for batch correction and latent space modeling
- BBKNN: for batch correction
- Scanorama: for batch correction
Notes (for myself)
There needs to be a consumer market big and general enough to act as a forcing function on the quality, cost, efficiency, and accuracy of gene therapy delivery. This will require an entrepreneurial shift in which gene therapy becomes a general technology accessible to all people. For instance, gene therapy for common conditions rather than rare diseases:
- soreness
- gene where you only need 4 hours of sleep
- come up/come down quicker from caffeine
- better sex life
- faster metabolism