Foundation Models in Biology (1)

A foundation model is a model with:

  1. massive, self-supervised pretraining on diverse data (scale does the work)
  2. transferable representations reused by downstream models (transfer learning, fine-tuning)
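
The pretrain-then-reuse pattern can be sketched in a few lines. This is a toy illustration, not any real model: the frozen "pretrained" embedding is just an amino-acid composition vector, and the downstream "fine-tuned" head is a nearest-centroid classifier over those frozen embeddings.

```python
import numpy as np

# Toy illustration of the foundation-model pattern: a frozen "pretrained"
# embedding function reused by a lightweight downstream model. embed() is a
# stand-in (amino-acid composition), not a real pretrained network.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def embed(seq):
    """Frozen 'pretrained' representation: amino-acid composition vector."""
    counts = np.array([seq.count(a) for a in AMINO_ACIDS], dtype=float)
    return counts / max(len(seq), 1)

def fit_centroids(seqs, labels):
    """Downstream 'fine-tuning': a nearest-centroid head on frozen embeddings."""
    X = np.stack([embed(s) for s in seqs])
    y = np.array(labels)
    return {lab: X[y == lab].mean(axis=0) for lab in set(labels)}

def predict(centroids, seq):
    e = embed(seq)
    return min(centroids, key=lambda lab: np.linalg.norm(e - centroids[lab]))

centroids = fit_centroids(
    ["KKKRKKRK", "KRKRKRKR", "LLLVVLLV", "VVLLVLVL"],
    ["basic", "basic", "hydrophobic", "hydrophobic"],
)
print(predict(centroids, "KKKKRRRR"))  # → basic
```

The point is the division of labor: the embedding stays fixed, and only the small task-specific head is trained per downstream problem.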

AlphaFold3: predicts 3D structures of proteins and their complexes with DNA, RNA, ligands, and ions from sequence

AlphaGenome: predicts how single variants or mutations in human DNA sequences impact biological processes regulating genes

BigRNA: models RNA regulation, predicts tissue-specific RNA regulation like splicing, stability, binding, expression

Bioptimus Histology: vision transformer that transforms histopathology images into embeddings, enables biomarker prediction and tissue segmentation for treatment response

Blank.bio: models mRNA therapeutic design and prediction, like expression, stability, and manufacturability

Boltz2: biomolecular structure prediction from sequence, like AlphaFold3 but open source

Chai-1: models 3D conformations of proteins, small molecules, DNA, RNA; MSA-free competitor to AlphaFold3 (MSA takes up a lot of compute)

Clair3: deep learning variant caller for long- and short-read sequencing data

DiffDock: diffusion-based model for small molecule-protein docking; useful for virtual drug screening

Distributed Bio: antibody discovery and design (not technically a foundation model, but they will be forced to embed their data for others to access)

DNA-BERT2: transformer model trained on DNA k-mers which allows motif finding, variant effect prediction, sequence annotation
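
The k-mer tokenization referenced above can be sketched directly: a DNA sequence becomes overlapping fixed-length tokens with stride 1 (the scheme used by the original DNABERT; this is a toy version, not the actual tokenizer).

```python
def kmer_tokenize(seq, k=6):
    """Split a DNA sequence into overlapping k-mers (stride 1),
    a toy version of DNABERT-style tokenization."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

print(kmer_tokenize("ACGTACGT", k=6))  # → ['ACGTAC', 'CGTACG', 'GTACGT']
```

A sequence of length n yields n - k + 1 tokens, so the vocabulary is the set of 4^k possible k-mers.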

ESM-3: protein LLM with embeddings for structure/function prediction, design, foldability analysis

Enformer: transformer on long DNA to predict gene expression
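
Sequence-to-expression models like Enformer take one-hot-encoded DNA as input. A minimal sketch of that encoding (real inputs span roughly 200 kb; this is illustrative only):

```python
import numpy as np

# One-hot encoding of DNA: each position becomes a length-4 indicator vector.
# Unknown bases (e.g. N) are left all-zero, a common convention.

BASES = "ACGT"

def one_hot(seq):
    seq = seq.upper()
    out = np.zeros((len(seq), 4))
    for i, b in enumerate(seq):
        if b in BASES:
            out[i, BASES.index(b)] = 1.0
    return out

x = one_hot("ACGTN")
print(x.shape)  # → (5, 4)
```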

Evo2: genomic language model trained on DNA across the domains of life; function and variant-effect prediction from embeddings

GEARS: a method that can predict transcriptional response to single + multi-gene perturbations using scRNA-seq data from perturbational screens

Geneformer: transformer over millions of single cell profiles for cell-state modeling, perturbation predictions

GenSLM: LLM for genomic sequences; genome-scale modeling, variant generation

GraphBAN: graph attention for compound-protein interactions, useful for binding affinity prediction
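
One graph-attention aggregation step, the mechanism behind models like GraphBAN, can be sketched on a toy molecular graph. Names, dimensions, and the random parameters below are illustrative, not GraphBAN's actual architecture.

```python
import numpy as np

# One graph-attention step: a node's new feature is an attention-weighted
# sum of its (linearly transformed) neighbors, with weights from a softmax
# over learned pairwise scores. All parameters here are random stand-ins.

rng = np.random.default_rng(0)
H = rng.normal(size=(3, 4))   # node features: 3 atoms, 4 dims
W = rng.normal(size=(4, 4))   # shared linear transform
a = rng.normal(size=(8,))     # attention vector over concatenated pairs

def attend(i, neighbors):
    """New feature for node i: attention-weighted sum over its neighbors."""
    hi = H[i] @ W
    scores = np.array([a @ np.concatenate([hi, H[j] @ W]) for j in neighbors])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()          # softmax over the neighborhood
    return weights @ np.stack([H[j] @ W for j in neighbors])

print(attend(0, [1, 2]).shape)  # → (4,)
```

With a single neighbor the softmax weight is 1, so the output is just that neighbor's transformed feature; with more neighbors the scores determine how much each contributes.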

HyenaDNA: long-context DNA language model built on the Hyena operator (a subquadratic alternative to attention)

Nucleotide Transformer: multi-species DNA/RNA foundation model, cross-species genomics

Nucleus: DNA interpretation company

Pleiades: human whole epigenome modeling

ProteinMPNN: structure-conditioned sequence design

ProtGPT2: GPT-2 decoder-only transformer model pre-trained on proteins, enables generative protein design
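
Decoder-only generation, the loop behind generative protein LMs like ProtGPT2, can be shown in miniature. The "model" below is a random bigram logit table, not a transformer; only the autoregressive sampling loop itself is the point.

```python
import numpy as np

# Autoregressive sampling in miniature: at each step, take the logits for
# the next residue given the previous one, softmax them, and sample.
# The logit table is random; a real model would compute logits from the
# full prefix with a transformer.

AA = "ACDEFGHIKLMNPQRSTVWY"
rng = np.random.default_rng(1)
logits_table = rng.normal(size=(len(AA), len(AA)))  # toy next-token logits

def sample_next(prev, temperature=1.0):
    logits = logits_table[AA.index(prev)] / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return str(rng.choice(list(AA), p=probs))

seq = "M"
for _ in range(19):
    seq += sample_next(seq[-1])
print(seq)  # a 20-residue sampled sequence
```

Lower temperature sharpens the softmax toward the highest-logit residue; higher temperature flattens it toward uniform sampling.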

Prov-Gigapath: whole-slide foundation model for digital pathology

RFDiffusion: diffusion model for protein design for binder generation, foldable protein scaffolds

RoseTTAFold: 3-track folding model, used in structure prediction + design

scFoundation: large-scale foundation model trained on single-cell transcriptomics

scGPT: single-cell multi-omics

scDesign3: single cell RNA-seq simulator, enables synthetic data generation, benchmarking, power analysis
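
A toy synthetic count generator in the spirit of such simulators: counts drawn from a per-gene negative binomial. This is not scDesign3's actual model (which fits flexible regressions and copulas to real data); the means and dispersion below are made up for illustration.

```python
import numpy as np

# Toy synthetic scRNA-seq count matrix: negative binomial counts per gene.
# numpy's negative_binomial(n, p) has mean n*(1-p)/p, so we solve for p
# from the desired per-gene mean and a shared dispersion n.

rng = np.random.default_rng(42)
n_cells = 100
gene_means = np.array([0.5, 2.0, 5.0, 10.0, 50.0])  # target mean per gene
dispersion = 2.0                                    # NB size parameter

p = dispersion / (dispersion + gene_means)
counts = rng.negative_binomial(dispersion, p, size=(n_cells, len(gene_means)))
print(counts.shape)  # → (100, 5)
```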

scVI: VAE trained unsupervised on raw scRNA-seq counts, meaning no cell-type labels are needed; yields latent embeddings for batch correction and integration

scANVI: semi-supervised extension of scVI for cell-type annotation
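
The core of a VAE like scVI, in miniature: an encoder maps counts to a mean and log-variance, and the reparameterization trick samples a latent z that gradients can flow through. The "encoder" below is a fixed random projection; shapes and values are illustrative only.

```python
import numpy as np

# VAE encoder + reparameterization trick in miniature. A real scVI encoder
# is a neural network with a count likelihood decoder; here we only show
# how mu, log_var, and the sampled latent z relate.

rng = np.random.default_rng(0)
counts = rng.poisson(5.0, size=(8, 200)).astype(float)  # 8 cells x 200 genes

# Stand-in "encoder": fixed random projections to mu and log-variance.
W_mu = rng.normal(scale=0.01, size=(200, 10))
W_lv = rng.normal(scale=0.01, size=(200, 10))
x = np.log1p(counts)
mu, log_var = x @ W_mu, x @ W_lv

# Reparameterization: z = mu + sigma * eps, with eps ~ N(0, 1), so the
# sampling step is differentiable with respect to mu and log_var.
eps = rng.standard_normal(mu.shape)
z = mu + np.exp(0.5 * log_var) * eps
print(z.shape)  # → (8, 10)
```

The latent z (here 10-dimensional per cell) is what downstream analyses use as the batch-corrected embedding.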

TOTALVI: RNA + protein integration

PEACOCK: multi-modal assay integration