A foundation model is generally described as a model with
- massive, self-supervised pretraining on diverse data (the math does the work)
- transferable representations that downstream tasks can reuse via transfer learning or fine-tuning (see the sketch below)
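A minimal sketch of that pattern in plain PyTorch, with a stand-in encoder (every name here is an illustrative placeholder, not any specific model's API): freeze the pretrained representation, train only a small task head.

```python
import torch
import torch.nn as nn

EMB_DIM, N_CLASSES = 256, 2

# Stand-in for a pretrained encoder (ESM, DNABERT-2, Geneformer, ...).
encoder = nn.Sequential(nn.Linear(1024, EMB_DIM), nn.ReLU())
for p in encoder.parameters():
    p.requires_grad = False  # freeze: reuse the pretrained representation as-is

head = nn.Linear(EMB_DIM, N_CLASSES)  # the only part we train
opt = torch.optim.Adam(head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(32, 1024)               # a batch of raw inputs
y = torch.randint(0, N_CLASSES, (32,))  # downstream labels
with torch.no_grad():
    emb = encoder(x)                    # transferable representation
opt.zero_grad()
loss = loss_fn(head(emb), y)
loss.backward()
opt.step()
```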
AlphaFold3: protein folding prediction based on amino acid sequences
AlphaGenome: predicts how single variants or mutations in human DNA sequences impact biological processes regulating genes
BigRNA: models RNA regulation; predicts tissue-specific splicing, stability, protein binding, and expression
Bioptimus Histology: vision transformer that embeds histopathology images, enabling biomarker prediction and tissue segmentation for treatment-response assessment
Blank.bio: models mRNA therapeutic design and prediction, like expression, stability, and manufacturability
Boltz2: protein folding prediction based on amino acid sequences, like AlphaFold3 but open source
Chai-1: models 3D conformations of proteins, small molecules, DNA, RNA; MSA-free competitor to AlphaFold3 (MSAs take up a lot of compute); enables interaction modeling, docking, and multi-molecule complex design
Clair3: deep learning variant caller (SNPs and indels) for both long-read and short-read sequencing data
DiffDock: diffusion model for small molecule-protein docking; useful for drug screening
Distributed Bio: antibody discovery and design (not technically a foundation model, but they will be forced to embed their data for others to access)
DNA-BERT2: transformer over DNA sequence (BPE tokenization, replacing the original DNABERT's k-mer tokens); enables motif finding, variant effect prediction, sequence annotation
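A usage sketch for pulling embeddings out of DNA-BERT2 through Hugging Face transformers, following its public README; the hub id zhihan1996/DNABERT-2-117M and the trust_remote_code requirement are assumptions about that release.

```python
import torch
from transformers import AutoTokenizer, AutoModel

model_id = "zhihan1996/DNABERT-2-117M"  # assumed hub id
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)

dna = "ACGTAGCATCGGATCTATCTATCGACACTTGGTTATCGATCTACGAGCATCTCGTTAGC"
inputs = tok(dna, return_tensors="pt")["input_ids"]
with torch.no_grad():
    hidden = model(inputs)[0]       # (1, seq_len, hidden_dim)
embedding = hidden.mean(dim=1)      # mean-pool to one vector per sequence
```

The pooled vector then feeds motif, variant-effect, or annotation heads as in the frozen-encoder sketch above.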
ESM-3: protein LLM with embeddings for structure/function prediction, design, foldability analysis
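An embedding sketch using the fair-esm package and ESM-2 (the openly documented predecessor; ESM-3 itself ships through EvolutionaryScale's own SDK with a different, gated API, so treat this as the adjacent open workflow rather than ESM-3's):

```python
import torch
import esm

model, alphabet = esm.pretrained.esm2_t12_35M_UR50D()
batch_converter = alphabet.get_batch_converter()
model.eval()

data = [("protein1", "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVAT")]
_, _, tokens = batch_converter(data)
with torch.no_grad():
    out = model(tokens, repr_layers=[12])
per_residue = out["representations"][12]     # (1, L+2, dim) incl. BOS/EOS
per_sequence = per_residue[0, 1:-1].mean(0)  # drop BOS/EOS, mean-pool
```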
Enformer: transformer on long DNA to predict gene expression
Evo2: genomic foundation model trained on DNA across all domains of life; predicts variant effects and function from embeddings
GEARS: a method that can predict transcriptional response to single + multi-gene perturbations using scRNA-seq data from perturbational screens
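A training/prediction sketch following the gears package README as I recall it; the dataset name, split arguments, and hyperparameters are assumptions, and the Norman screen downloads on first use.

```python
from gears import PertData, GEARS

pert_data = PertData("./data")                 # local cache dir (assumed)
pert_data.load(data_name="norman")             # perturbational scRNA-seq screen
pert_data.prepare_split(split="simulation", seed=1)
pert_data.get_dataloader(batch_size=32, test_batch_size=128)

model = GEARS(pert_data, device="cuda")
model.model_initialize(hidden_size=64)
model.train(epochs=20)

# Transcriptional response to a single- and a two-gene perturbation.
model.predict([["FEV"], ["CBL", "CNN1"]])
```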
Geneformer: transformer over millions of single cell profiles for cell-state modeling, perturbation predictions
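Geneformer's input encoding is worth seeing concretely: each cell becomes a sequence of gene tokens ordered by normalized expression (rank-value encoding), which the transformer attends over. A toy numpy version, leaving out the real pipeline's median-normalization factors and tokenizer:

```python
import numpy as np

genes = np.array(["CD3D", "MS4A1", "NKG7", "LYZ", "GNLY"])
counts = np.array([9.0, 0.0, 3.5, 12.0, 1.2])   # one cell's expression

order = np.argsort(-counts)                      # highest expression first
rank_encoded = genes[order][counts[order] > 0]   # drop unexpressed genes
print(rank_encoded)  # ['LYZ' 'CD3D' 'NKG7' 'GNLY'] -> the cell's token sequence
```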
GenSLM: LLM over genomic sequences; genome-scale modeling and variant generation
GraphBAN: graph attention for compound-protein interactions, useful for binding affinity prediction
HyenaDNA: long-context DNA model built on Hyena convolution operators rather than attention, allowing ~1M-token contexts
Nucleotide Transformer: multi-species DNA/RNA foundation model, cross-species genomics
Nucleus: DNA interpretation company
Pleiades: human whole epigenome modeling
ProteinMPNN: structure-conditioned sequence design
ProtGPT2: GPT-2 decoder-only transformer model pre-trained on proteins, enables generative protein design
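A generation sketch following ProtGPT2's public model card (hub id nferruz/ProtGPT2 and sampling settings taken from there; treat them as assumptions):

```python
from transformers import pipeline

protgpt2 = pipeline("text-generation", model="nferruz/ProtGPT2")
seqs = protgpt2(
    "<|endoftext|>",          # the card seeds de novo generation this way
    max_length=100,
    do_sample=True,
    top_k=950,
    repetition_penalty=1.2,
    num_return_sequences=3,
    eos_token_id=0,
)
for s in seqs:
    print(s["generated_text"])  # candidate de novo protein sequences
```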
Prov-Gigapath: whole-slide foundation model for digital pathology
RFDiffusion: diffusion model for protein design for binder generation, foldable protein scaffolds
RoseTTAFold: 3-track folding model, used in structure prediction + design
scFoundation: large-scale foundation model on single-cell transcriptomics
scGPT: generative pretrained transformer for single-cell multi-omics (annotation, integration, perturbation prediction)
scDesign3: single cell RNA-seq simulator, enables synthetic data generation, benchmarking, power analysis
scVI: VAE trained unsupervised on raw scRNA-seq counts; learns a batch-corrected latent space for integration and imputation (scvi-tools workflow sketched below)
scANVI: semi-supervised extension of scVI for cell-type annotation
TOTALVI: RNA + protein integration
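The scvi-tools workflow is shared across scVI, scANVI, and TOTALVI: register an AnnData, train on raw counts, pull out a batch-corrected latent space. A self-contained sketch with synthetic counts (TOTALVI additionally registers a protein-expression layer):

```python
import numpy as np
import anndata as ad
import scvi

adata = ad.AnnData(np.random.poisson(1.0, size=(200, 100)).astype(np.float32))
adata.obs["batch"] = np.random.choice(["b1", "b2"], size=adata.n_obs)

scvi.model.SCVI.setup_anndata(adata, batch_key="batch")  # expects raw counts
model = scvi.model.SCVI(adata)
model.train(max_epochs=10)

latent = model.get_latent_representation()    # batch-corrected embedding
denoised = model.get_normalized_expression()  # imputed expression
```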
PEACOCK: multi-modal assay integration