Cost modeling DL vs linear regression paper

In this article, I work out the approximate costs behind this paper. In the paper, they benchmark the performance of scGPT, scFoundation, scBERT, Geneformer, and UCE against GEARS, CPA, and deliberately simple baselines. Because scBERT, Geneformer, and UCE were not designed explicitly for this purpose, they are repurposed for it by combining them with a linear decoder that maps the cell embeddings to the gene expression space.

Due to training costs, I did not actually run this validation, so it’s more of an informatics and opinion piece.

Begin

I believe the paper is actually pretty well done. You can nitpick about formulas, missing information, model intention, etc., but the main thing the paper misses is potential: the datasets they used are so small that they are almost useless. For reference, their dataset size was ~22GB across:

  1. CRISPRa Perturb‑seq (Norman et al. 2019): 3.9GB
  2. CRISPRi Perturb‑seq (Adamson et al. 2016): 0.9GB
  3. K562 essential gene screen (Replogle et al. 2022): 9.9GB
  4. RPE1 essential gene screen (Replogle et al. 2022): 8.1 GB

I include an equivalent training cost breakdown for GPT-4o (arguably their best model from a consumer standpoint), whose training data, as we know with text and video, is so much smaller than biological data. This does not mean it is less information-dense (as we are finding from genetic models), but it will require different data structuring and modeling techniques.

The size of bio data is huge

I can’t even imagine how much data will be needed to actually train a commercial foundation model at any biological scale, let alone for all humans: RNA, DNA, scRNA, imaging, neuron tracking, etc.

It’ll be at least petabytes to zettabytes of data even with quantization, billions of dollars, and way more compute than we expect (storage isn’t currently emerging as the bottleneck, but it may). Regardless, I break down the paper’s experiment in Part 1 and a commercial bio foundation model in Part 2.

1. Paper summary

Fine-tuning compute costs (4 A100s) from the paper

The authors fine-tuned several models on four perturbation datasets. All training was done on NVIDIA A100 80GB GPUs, probably on some combination of the Broad Institute’s, MIT’s, and Harvard’s machines.

Fine-tuning the transformer-based scFoundation and scGPT was intensive, so their training was capped at ~3 days on the A100 GPUs, which corresponds to roughly ~5 epochs for scFoundation and ~15 epochs for scGPT. Each dataset was run with 5 independent train/test splits, meaning 20 separate fine-tuning runs per model.

GEARS, CPA, Geneformer, scBERT, and UCE were cheaper and smaller. I have no idea how much compute was used for these.

The linear and mean-expression predictors, PCA feature extraction, random forest, and kNN regressors most likely ran on CPUs (with SIMD) and minimal GPU use. I also have no idea how much these cost, but it could be estimated more accurately by starting from the data and working backwards.
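For reference, here is a minimal sketch of what those CPU baselines look like with scikit-learn. The matrices here (P_feat for perturbation features, X_demo for pseudobulk expression) are random placeholders of my own, purely to show the shape of the computation:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor

# random placeholders: P_feat = perturbation features, X_demo = pseudobulk expression
rng = np.random.default_rng(0)
P_feat = rng.normal(size=(200, 64)).astype(np.float32)    # [n_cond, n_features]
X_demo = rng.normal(size=(200, 1000)).astype(np.float32)  # [n_cond, n_genes]
tr, te = np.arange(160), np.arange(160, 200)

# mean-expression baseline: predict the training mean for every test condition
mean_pred = np.tile(X_demo[tr].mean(axis=0), (len(te), 1))

# ridge regression on PCA-compressed perturbation features
P_pca = PCA(n_components=32).fit(P_feat[tr]).transform(P_feat)
ridge_pred = Ridge(alpha=1.0).fit(P_pca[tr], X_demo[tr]).predict(P_pca[te])

# random forest and kNN regressors (both support multi-output targets)
rf_pred = RandomForestRegressor(n_estimators=100, n_jobs=-1).fit(P_feat[tr], X_demo[tr]).predict(P_feat[te])
knn_pred = KNeighborsRegressor(n_neighbors=5).fit(P_feat[tr], X_demo[tr]).predict(P_feat[te])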

This is a rough calculation, but ~1,500-2,000 GPU-hours were consumed for the benchmarks, which accounts for ~4 datasets × 2 foundation models × 5 runs each, plus additional overhead for other models and repeated evaluations. On the Norman dataset (81k cells, 19k genes), fine-tuning + inference for each split probably took many hours (plotted on a log-scale) and consumed tens of GB of GPU memory.

To run these benchmarks on a major cloud provider, it would take roughly ~1,500-2,000 GPU-hours on A100 80GB GPUs and would cost:

~$2.50/hour × ~1,500-2,000 hours of A100 time = ~$3,750-$5,000
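A quick sanity check of that arithmetic, plus the per-provider on-demand A100 rates used in the table below (the ~$2.50/hr figure is just a rough blended rate):

# back-of-the-envelope check of the benchmark cost range
gpu_hours = (1_500, 2_000)
blended_rate = 2.50                                  # $/GPU-hr, rough blended A100 rate
print([h * blended_rate for h in gpu_hours])         # [3750.0, 5000.0]

# on-demand A100 rates from the provider table below ($/GPU-hr)
rates = {"Lambda Labs": 1.79, "AWS": 2.74, "Azure": 3.00, "GCP": 2.93}
for provider, rate in rates.items():
    print(provider, round(1_500 * rate, 2))          # e.g. AWS -> 4110.0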

Altogether, not that bad! The question is what new experience or efficiency improvement would let them earn a return on that amount and profitably scale it…

Training + fine-tuning cost with 4 A100 GPUs:

| Provider | Contract | $/GPU-hr | GPU-hours | Total Cost |
| --- | --- | --- | --- | --- |
| Lambda Labs | On-Demand | $1.79 | 1,500 | $2,685 |
| | Reserved | $1.40 | 1,500 | $2,100 |
| | Spot | $0.90 | 1,500 | $1,350 |
| AWS | On-Demand | $2.74 | 1,500 | $4,110 |
| | Reserved | $1.92 | 1,500 | $2,880 |
| | Spot | $1.60 | 1,500 | $2,400 |
| Azure | On-Demand | $3.00 | 1,500 | $4,500 |
| | Reserved | $2.10 | 1,500 | $3,150 |
| | Spot | $1.20 | 1,500 | $1,800 |
| GCP | On-Demand | $2.93 | 1,500 | $4,395 |
| | Reserved | $2.00 | 1,500 | $3,000 |
| | Spot | $0.73 | 1,500 | $1,095 |

Fine-tuning the Models

Below are summarized steps for fine-tuning each model. Before fine-tuning, however, the authors had to prep the data. The code that follows in this section is not complete, nor has it been tested in a code environment (yet).

  1. Common prep: pseudobulk, splits, top-genes
import scanpy as sc
import numpy as np
from sklearn.model_selection import KFold

adata = sc.read_h5ad("dataset.h5ad")

# pseudobulk per condition (guide/drug)
pb = adata.to_df().groupby(adata.obs["condition"]).mean()

# keep the 1,000 most variable genes across conditions
genes_use = pb.var(axis=0).sort_values(ascending=False).index[:1000]
X = pb[genes_use].values.astype(np.float32)           # [n_cond, n_genes]
conds = pb.index.values

# five folds for robust estimates (from paper)
kf = KFold(n_splits=5, shuffle=True, random_state=1)
  2. Create a bilinear ridge decoder (the simple model used as baseline)
import torch, torch.nn as nn

# their simple model that is used as a linear baseline against DL
class BilinearRidge(nn.Module):
    def __init__(self, d_gene, d_cond, lam=1e-1):
        super().__init__()
        self.W = nn.Parameter(torch.zeros(d_gene, d_cond))
        self.b = nn.Parameter(torch.zeros(d_gene))
        self.lam = lam
    def forward(self, G, P):            # G:[n_genes,dg], P:[n_cond,dp]; use dg==d_gene, dp==d_cond
        return (G @ self.W @ P.T).T + self.b    # -> [n_cond, n_genes]

def fit_ridge(G, P, Y, lam=1e-1, iters=2000, lr=1e-2):
    model = BilinearRidge(G.shape[1], P.shape[1], lam)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    Yt = torch.tensor(Y); Gt = torch.tensor(G); Pt = torch.tensor(P)
    for _ in range(iters):
        pred = model(Gt, Pt)
        loss = ((pred - Yt)**2).mean() + model.lam*(model.W**2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    return model
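
As a quick synthetic sanity check of that decoder (random matrices with arbitrary embedding sizes of my own choosing, just to confirm the shapes line up):

# toy shape check: 1,000 genes, 40 conditions, 32-dim embeddings (arbitrary numbers)
G_demo = np.random.randn(1000, 32).astype(np.float32)   # gene embeddings         [n_genes, d_g]
P_demo = np.random.randn(40, 32).astype(np.float32)     # perturbation embeddings [n_cond, d_p]
Y_demo = np.random.randn(40, 1000).astype(np.float32)   # pseudobulk expression   [n_cond, n_genes]
m = fit_ridge(G_demo, P_demo, Y_demo, iters=200)
print(m(torch.tensor(G_demo), torch.tensor(P_demo)).shape)   # torch.Size([40, 1000])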

I. scGPT

  1. Load the pretrained scGPT checkpoint
  2. Tokenize your single-cell expression matrix as input for the model
  3. Fine-tune the model on your perturbation dataset for ~15 epochs (via default tutorial pipeline)
  4. Use the generative head to predict gene expression for held-out perturbations
from scgpt import SCGPTModel, tokenize_counts  # schematic
ckpt = "scgpt_pretrained.ckpt"
model = SCGPTModel.load_from_checkpoint(ckpt)
tok = tokenize_counts(adata[:, genes_use])      # returns token IDs per cell

# fine-tune ~15 epochs, then aggregate to condition predictions
trainer = torch.optim.Adam(model.parameters(), lr=1e-4)
for epoch in range(15):
    for batch in dataloader(tok):               # your DataLoader
        loss = model.loss(batch)
        trainer.zero_grad(); loss.backward(); trainer.step()

# derive gene embeddings G_scgpt (ie last-layer weights or learned token embeddings)
G_scgpt = model.gene_embeddings(weight_only=True)[genes_use]   # [n_genes, d_g]

II. scFoundation

  1. Download the released scFoundation checkpoint
  2. Expose the position embedding layer (pos_emb.weight) for updating
  3. Fine-tune that embedding layer for ~5 epochs per dataset (capped at ~3 days total on an A100)
  4. Use the model to predict perturbation responses on test conditions
from scfoundation import load_model   # schematic API
sf = load_model("scfoundation.ckpt")
for p in sf.parameters(): p.requires_grad_(False)
sf.pos_emb.weight.requires_grad_(True)

opt = torch.optim.Adam([sf.pos_emb.weight], lr=3e-4)
for epoch in range(5):                 # or stop at 72h wall-time
    for batch in dataloader_sf(adata):
        loss = sf.loss(batch)
        opt.zero_grad(); loss.backward(); opt.step()

G_scf = sf.export_gene_embeddings(genes_use)    # [n_genes, d_g]

III. GEARS

  1. Compute pathway membership or perturbation graphs from your dataset
  2. Generate perturbation embeddings via spectral embedding
  3. Feed these embeddings into the simple bilinear regression decoder
  4. Train only the decoder (GEARS features remain fixed)
import networkx as nx
from sklearn.manifold import SpectralEmbedding

# build perturbation graph (nodes = perturbations, edges weighted by shared pathway membership)
G = nx.Graph()
G.add_nodes_from(conds)
# add edges with weights (e.g. Jaccard similarity of pathway sets; see the sketch just below)
# G.add_edge(p1, p2, weight=w)

# to_scipy_sparse_matrix was removed in NetworkX 3.x; to_scipy_sparse_array is the current call
A = nx.to_scipy_sparse_array(G, nodelist=list(conds), weight="weight")
P_gears = SpectralEmbedding(n_components=64, affinity='precomputed').fit_transform(A)  # [n_cond, 64]
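
The commented-out edge weights could be filled in roughly like this, assuming a dict called pathways that maps each perturbation to its set of pathway IDs (a hypothetical placeholder, not something the paper provides). It would slot in where the G.add_edge comment sits above, before the spectral embedding:

from itertools import combinations

# hypothetical: pathways = {"GENE_A": {"p53 signaling", "apoptosis"}, "GENE_B": {"apoptosis"}, ...}
def jaccard(a, b):
    return len(a & b) / len(a | b) if (a | b) else 0.0

for p1, p2 in combinations(conds, 2):
    w = jaccard(pathways[p1], pathways[p2])
    if w > 0:
        G.add_edge(p1, p2, weight=w)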

IV. CPA (Compositional Perturbation Autoencoder)

  1. Initialize CPA architecture (autoencoder with latent condition/perturbation embeddings)
  2. Train end-to-end on your dataset to learn perturbation and condition embeddings jointly
  3. Decode embeddings into gene expression predictions for unseen perturbations
  4. Evaluate against ground truth test sets
from cpa import CPA  # schematic

cpa = CPA(n_genes=len(genes_use), d_latent=128)
opt = torch.optim.Adam(cpa.parameters(), lr=1e-3)
train_data = make_cpa_dataset(adata[:, genes_use], adata.obs["condition"])

for epoch in range(50):
    for xb, cb in train_data:                  # xb: counts, cb: condition ids
        recon, z = cpa(xb, cb)
        loss = cpa.loss(recon, xb) + cpa.reg(z)
        opt.zero_grad(); loss.backward(); opt.step()

# predict held-out conditions:
Y_pred = cpa.predict(conditions_test, genes_use)   # pseudobulk per condition

V. Geneformer / scBERT / UCE

  1. Load the pretrained transformer model
  2. Extract fixed gene embeddings from its layers (no fine-tuning)
  3. Insert these embeddings into the bilinear regression decoder
  4. Train only the decoder weights (ridge regression) to map to expression output
# frozen feature extractors (schematic API, like the scGPT/scFoundation examples above)
from geneformer import Geneformer
gf = Geneformer.from_pretrained("geneformer.ckpt").eval()
with torch.no_grad():
    G_gf = gf.embed_genes(genes_use)  # [n_genes, d_g]

# or scBERT/UCE analogs: extract per-gene vectors; keep frozen

Post fine-tuning: bringing it all together

# assemble training: embeddings -> bilinear ridge -> predict
def run_split(train_idx, test_idx, G_embed, P_embed):
    Y = X                                         # [n_cond, n_genes]
    Y_tr, Y_te = Y[train_idx], Y[test_idx]
    P_tr, P_te = P_embed[train_idx], P_embed[test_idx]
    model = fit_ridge(G_embed, P_tr, Y_tr, lam=1e-1, iters=1500, lr=5e-3)
    Y_hat = model(torch.tensor(G_embed), torch.tensor(P_te)).detach().numpy()
    return Y_te, Y_hat

# example: scFoundation genes + GEARS perts, on one of the 5 folds defined earlier
tr_idx, te_idx = next(kf.split(X))
Y_true, Y_pred = run_split(tr_idx, te_idx, G_scf, P_gears)


# calculate quality metrics
from scipy.stats import pearsonr
def rmse(a,b): return np.sqrt(((a-b)**2).mean())

rmse_score = rmse(Y_true, Y_pred)
pearson_delta = np.mean([pearsonr(t, p)[0] for t, p in zip(Y_true, Y_pred)])

# calculate mean baseline
Y_mean = X[tr_idx].mean(axis=0, keepdims=True).repeat(len(te_idx), axis=0)
rmse_mean = rmse(X[te_idx], Y_mean)
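
To mirror the paper’s five independent splits, the same call can be looped over every fold and the metrics averaged; a minimal sketch reusing the objects defined above:

# run every fold from the KFold splitter and average the metrics (mirrors the paper's 5 splits)
fold_rmse, fold_pearson = [], []
for tr_idx, te_idx in kf.split(X):
    Y_true, Y_pred = run_split(tr_idx, te_idx, G_scf, P_gears)
    fold_rmse.append(rmse(Y_true, Y_pred))
    fold_pearson.append(np.mean([pearsonr(t, p)[0] for t, p in zip(Y_true, Y_pred)]))

print(f"RMSE:    {np.mean(fold_rmse):.4f} +/- {np.std(fold_rmse):.4f}")
print(f"Pearson: {np.mean(fold_pearson):.4f} +/- {np.std(fold_pearson):.4f}")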

This is the basic flow of the paper: curate pseudobulk, craft G/P embeddings (PCA or pretrained), fit a bilinear ridge decoder, then score on held-out perturbations. Swap the embedding blocks (scGPT, scFoundation, GEARS, CPA, Geneformer/scBERT/UCE) without changing the decoder scaffolding.

Wet Lab Costs

Finally, we can approximate the perturb-seq/CROP-seq single-cell CRISPR screening datasets (from Norman et al. 2019, Adamson et al. 2016, Replogle et al. 2022).

These experiments involve performing pooled CRISPR perturbations and single-cell RNA sequencing on tens of thousands of cells. Large-scale perturb-seq screens can cost $100M+. There are attempts to cut that by 20x, but replicating these datasets would still cost $10k+.

This is a rough guess based on some research, but by using 10x Genomics, we can theoretically achieve costs on the order of $0.25–$0.70 per cell by multiplexing samples and super-loading droplet channels. Recovering ~20,000 single cells runs about $5k–$15k in sequencing and library prep fees. The Norman et al. perturb-seq dataset had ~81,000 cells, implying on the order of $10k+ in sequencing consumables alone (assuming standard depths and techniques).

I am not going to spend the money on verifying this, but it’s nice to know. We could break that down into $/byte, which would be an interesting metric to track over time for digital biology. It doesn’t account for meaning, though, which is the most valuable metric but needs contextualization in the market it serves.
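
For what it’s worth, here is that arithmetic written out, using the assumed per-cell range above and the Norman et al. file size (my assumptions, not verified quotes):

# rough wet-lab arithmetic for the Norman et al. dataset, using the assumed $/cell range above
cells = 81_000
cost_per_cell = (0.25, 0.70)                   # $/cell, assumed 10x Genomics range
low, high = (cells * c for c in cost_per_cell)
print(f"sequencing consumables: ${low:,.0f} - ${high:,.0f}")        # ~$20,250 - ~$56,700

# $/GB as a crude digital-biology metric, using the 3.9 GB dataset size
dataset_gb = 3.9
print(f"$/GB: {low / dataset_gb:,.0f} - {high / dataset_gb:,.0f}")  # ~5,192 - ~14,538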

2. Training an actual foundation model

In this section, we will model the costs of training an actual foundation model (based only on cloud rates and data volume; this is very simplistic). Below is roughly what it took to train GPT-4o.

~20–40 TB of raw text (≈5–10T tokens, 200B parameters)

  • ~6×10²⁴ FLOPs → ~3 weeks on 10k H100s

~100TB+ → ~1PB range with vision/audio

  • 1 PB = ~3×10²⁶ FLOPs → ~3 years on 10k H100s

= trillions of tokens × training passes = billions of dollars in GPU-hour spend

For the exercise below, assume that we are using 10,000 H100s with ~200B active params.
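
Here’s a minimal sketch of the arithmetic behind the tables that follow: compute ≈ 6 × params × tokens, divided across the cluster. The effective per-GPU throughput is my own assumption, picked so the 40 TB baseline lands near the ~504 h (~21 d) figure above; the $3.29 rate is the Lambda on-demand H100 price from the tables.

# back-of-the-envelope: FLOPs ~= 6 * params * tokens, then wall-clock time and cost on 10k H100s
N_PARAMS = 200e9
N_GPUS = 10_000
EFF_FLOPS_PER_GPU = 330e12        # assumed effective FLOP/s per H100, not the spec-sheet peak

def train_time_and_cost(tokens, dollars_per_gpu_hr):
    flops = 6 * N_PARAMS * tokens                         # standard 6ND approximation
    hours = flops / (N_GPUS * EFF_FLOPS_PER_GPU * 3600)   # wall-clock hours across the cluster
    return hours, hours * N_GPUS * dollars_per_gpu_hr

for label, tokens in [("22 GB (~5.5B tokens)", 5.5e9),
                      ("40 TB (~5T tokens)", 5e12),
                      ("1 PB (~250T tokens)", 250e12)]:
    hours, cost = train_time_and_cost(tokens, 3.29)       # Lambda on-demand rate from the tables
    print(f"{label}: {hours:,.2f} h ({hours / 24:,.2f} d), ${cost:,.0f}")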

We have 3 levels:

  1. 22GB (~5.5B tokens)
  2. 40TB text (~5T tokens) *used as baseline
  3. 1PB (~250T tokens)
  1. 22 GB corpus (dataset size from paper)

| Provider | Contract | $/GPU-hr | Wall-clock hours (days) | Total (10,000 H100s) |
| --- | --- | --- | --- | --- |
| Lambda Labs | On-Demand | $3.29 | 0.55 h (0.02 d) | $18,095 |
| | Reserved | $2.50 | 0.55 h (0.02 d) | $13,750 |
| | Spot | $1.50 | 0.66 h (0.03 d) | $9,900 |
| AWS | On-Demand | $7.57 | 0.55 h (0.02 d) | $41,635 |
| | Reserved | $5.30 | 0.55 h (0.02 d) | $29,150 |
| | Spot | $5.00 | 0.74 h (0.03 d) | $37,000 |
| Azure | On-Demand | $12.29 | 0.55 h (0.02 d) | $67,595 |
| | Reserved | $8.50 | 0.55 h (0.02 d) | $46,750 |
| | Spot | $3.60 | 0.83 h (0.03 d) | $29,880 |
| GCP | On-Demand | $11.06 | 0.55 h (0.02 d) | $60,830 |
| | Reserved | $7.50 | 0.55 h (0.02 d) | $41,250 |
| | Spot | $2.25 | 0.74 h (0.03 d) | $16,650 |
  2. 40 TB corpus (GPT-4o text only)

| Provider | Contract | $/GPU-hr | Wall-clock hours (days) | Total (10,000 H100s) |
| --- | --- | --- | --- | --- |
| Lambda Labs | On-Demand | $3.29 | 504 h (21 d) | $16,581,600 |
| | Reserved | $2.50 | 504 h (21 d) | $12,600,000 |
| | Spot | $1.50 | 604.8 h (25.2 d) | $9,072,000 |
| AWS | On-Demand | $7.57 | 504 h (21 d) | $38,152,800 |
| | Reserved | $5.30 | 504 h (21 d) | $26,712,000 |
| | Spot | $5.00 | 672 h (28 d) | $33,600,000 |
| Azure | On-Demand | $12.29 | 504 h (21 d) | $61,941,600 |
| | Reserved | $8.50 | 504 h (21 d) | $42,840,000 |
| | Spot | $3.60 | 756 h (31.5 d) | $27,216,000 |
| GCP | On-Demand | $11.06 | 504 h (21 d) | $55,742,400 |
| | Reserved | $7.50 | 504 h (21 d) | $37,800,000 |
| | Spot | $2.25 | 672 h (28 d) | $15,120,000 |
  3. 1 PB corpus (full multi-modal foundation model)

| Provider | Contract | $/GPU-hr | Wall-clock hours (days / years) | Total (10,000 H100s) |
| --- | --- | --- | --- | --- |
| Lambda Labs | On-Demand | $3.29 | 25,228.8 h (1,051.2 d / 2.88 y) | $830,027,520 |
| | Reserved | $2.50 | 25,228.8 h (1,051.2 d / 2.88 y) | $630,720,000 |
| | Spot | $1.50 | 30,274.6 h (1,261.4 d / 3.46 y) | $454,118,400 |
| AWS | On-Demand | $7.57 | 25,228.8 h (1,051.2 d / 2.88 y) | $1,909,820,160 |
| | Reserved | $5.30 | 25,228.8 h (1,051.2 d / 2.88 y) | $1,337,126,400 |
| | Spot | $5.00 | 33,638.4 h (1,401.6 d / 3.84 y) | $1,681,920,000 |
| Azure | On-Demand | $12.29 | 25,228.8 h (1,051.2 d / 2.88 y) | $3,100,619,520 |
| | Reserved | $8.50 | 25,228.8 h (1,051.2 d / 2.88 y) | $2,144,448,000 |
| | Spot | $3.60 | 37,843.2 h (1,576.8 d / 4.32 y) | $1,362,355,200 |
| GCP | On-Demand | $11.06 | 25,228.8 h (1,051.2 d / 2.88 y) | $2,790,305,280 |
| | Reserved | $7.50 | 25,228.8 h (1,051.2 d / 2.88 y) | $1,892,160,000 |
| | Spot | $2.25 | 33,638.4 h (1,401.6 d / 3.84 y) | $756,864,000 |

…yeah, that’s billions of dollars and years of wall-clock time. Obviously nobody is going to train for years (they’ll make up for it with more GPUs and far more optimization techniques), but it could take months, like the original LLMs. So it’s no wonder DL didn’t outperform a simple linear regression here.

Conclusion

I could sit here and write about all the inconsistencies and flaws in both their approach and my own approach/calculation, but at the end of the day, the future is going to be built by builders, and that is just how it happens.

If there is a market for these DL bio models (and there definitely is, considering how insane biotech costs are and how inefficient the industry has become), anything that cuts costs and increases leverage is going to get built.

Like everything in history, there is nothing, then suddenly, there is everything. The same will happen here. And for good reason! The amount of inventions we can build on top of these is theoretically infinite!