Cost modeling DL vs linear regression paper

In this article, I work out the approximate costs behind this paper. In the paper, they benchmark the performance of scGPT, scFoundation, scBERT, Geneformer, and UCE against GEARS, CPA, and deliberately simple baselines. Because scBERT, Geneformer, and UCE were not designed explicitly for this purpose, they are repurposed for it by combining them with a linear decoder that maps the cell embeddings to the gene expression space.

Due to training costs, I did not actually run this validation, so it’s more of an informatics and opinion piece.

Begin

I believe the paper is actually pretty well done. You can nitpick about formulas, missing information, model intention, etc., but the main thing the paper misses is potential: the datasets they used are so small that they are almost useless. For reference, their dataset size was ~22GB across:

  1. CRISPRa Perturb‑seq (Norman et al. 2019): 3.9GB
  2. CRISPRi Perturb‑seq (Adamson et al. 2016): 0.9GB
  3. K562 essential gene screen (Replogle et al. 2022): 9.9GB
  4. RPE1 essential gene screen (Replogle et al. 2022): 8.1 GB

I include an equivalent training cost breakdown for GPT-4o (arguably their best model from a consumer standpoint), whose training data, as we know with text and video, is so much smaller than biological data. This does not mean it is less information-dense (as we are finding from genetic models), but it will require different data structuring and modeling techniques.

The size of bio data is huge

I can’t even imagine how much data will be needed to actually train a commercial foundation model at any biological scale, let alone for all humans: RNA, DNA, scRNA, imaging, neuron tracking, etc.

It’ll be at least petabytes to zettabytes of data even with quantization, billions of dollars, and way more compute than we expect (storage isn’t currently emerging as the bottleneck, but it may). Regardless, I break down the paper’s experiment in Part 1 and a commercial bio foundation model in Part 2.

1. Paper summary

Fine-tuning compute costs (4 A100s) from the paper

The authors fine-tuned several models on four perturbation datasets. All training was done on NVIDIA A100 80GB GPUs, probably on some combination of the Broad Institute’s, MIT’s, and Harvard’s machines.

Fine-tuning the transformer-based scFoundation and scGPT was intensive, so their training was capped at ~3 days on the A100 GPUs, which corresponds to roughly ~5 epochs for scFoundation and ~15 epochs for scGPT. Each dataset was run with 5 independent train/test splits, meaning 20 separate fine-tuning runs per model.

GEARS, CPA, Geneformer, scBERT, and UCE were cheaper and smaller. I have no idea how much compute was used for these.

The linear and mean-expression predictors, PCA feature extraction, random forest, and kNN regressors most likely ran on CPUs (with SIMD) and minimal GPU use. I also have no idea how much these cost, but it could be estimated more accurately by starting from the data and working backwards.
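For reference, here is a minimal sketch of what those CPU baselines look like with scikit-learn. The matrices here (P_feat for perturbation features, X_demo for pseudobulk expression) are random placeholders of my own, purely to show the shape of the computation:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor

# random placeholders: P_feat = perturbation features, X_demo = pseudobulk expression
rng = np.random.default_rng(0)
P_feat = rng.normal(size=(200, 64)).astype(np.float32)    # [n_cond, n_features]
X_demo = rng.normal(size=(200, 1000)).astype(np.float32)  # [n_cond, n_genes]
tr, te = np.arange(160), np.arange(160, 200)

# mean-expression baseline: predict the training mean for every test condition
mean_pred = np.tile(X_demo[tr].mean(axis=0), (len(te), 1))

# ridge regression on PCA-compressed perturbation features
P_pca = PCA(n_components=32).fit(P_feat[tr]).transform(P_feat)
ridge_pred = Ridge(alpha=1.0).fit(P_pca[tr], X_demo[tr]).predict(P_pca[te])

# random forest and kNN regressors (both support multi-output targets)
rf_pred = RandomForestRegressor(n_estimators=100, n_jobs=-1).fit(P_feat[tr], X_demo[tr]).predict(P_feat[te])
knn_pred = KNeighborsRegressor(n_neighbors=5).fit(P_feat[tr], X_demo[tr]).predict(P_feat[te])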

This is a rough calculation, but ~1,500-2,000 GPU-hours were consumed for the benchmarks, which accounts for ~4 datasets × 2 foundation models × 5 runs each, plus additional overhead for other models and repeated evaluations. On the Norman dataset (81k cells, 19k genes), fine-tuning + inference for each split probably took many hours (plotted on a log-scale) and consumed tens of GB of GPU memory.

To run these benchmarks on a major cloud provider, it would take roughly ~1,500-2,000 GPU-hours on A100 80GB GPUs and would cost:

~$2.50/hour × ~1,500-2,000 hours of A100 time = ~$3,750-$5,000
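A quick sanity check of that arithmetic, plus the per-provider on-demand A100 rates used in the table below (the ~$2.50/hr figure is just a rough blended rate):

# back-of-the-envelope check of the benchmark cost range
gpu_hours = (1_500, 2_000)
blended_rate = 2.50                                  # $/GPU-hr, rough blended A100 rate
print([h * blended_rate for h in gpu_hours])         # [3750.0, 5000.0]

# on-demand A100 rates from the provider table below ($/GPU-hr)
rates = {"Lambda Labs": 1.79, "AWS": 2.74, "Azure": 3.00, "GCP": 2.93}
for provider, rate in rates.items():
    print(provider, round(1_500 * rate, 2))          # e.g. AWS -> 4110.0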

Altogether, not that bad! The question is what new experience or efficiency improvement would let them earn a return on that amount and profitably scale it…

Training + fine-tuning cost with 4 A100 GPUs:

| Provider | Contract | $/GPU-hr | GPU-hours | Total Cost |
| --- | --- | --- | --- | --- |
| Lambda Labs | On-Demand | $1.79 | 1,500 | $2,685 |
| | Reserved | $1.40 | 1,500 | $2,100 |
| | Spot | $0.90 | 1,500 | $1,350 |
| AWS | On-Demand | $2.74 | 1,500 | $4,110 |
| | Reserved | $1.92 | 1,500 | $2,880 |
| | Spot | $1.60 | 1,500 | $2,400 |
| Azure | On-Demand | $3.00 | 1,500 | $4,500 |
| | Reserved | $2.10 | 1,500 | $3,150 |
| | Spot | $1.20 | 1,500 | $1,800 |
| GCP | On-Demand | $2.93 | 1,500 | $4,395 |
| | Reserved | $2.00 | 1,500 | $3,000 |
| | Spot | $0.73 | 1,500 | $1,095 |

Fine-tuning the Models

Below are summarized steps for fine-tuning each model. Before fine-tuning, however, the authors had to prep the data. The code that follows in this section is not complete, nor has it been tested in a code environment (yet).

  1. Common prep: pseudobulk, splits, top-genes
import scanpy as sc
import numpy as np
from sklearn.model_selection import KFold

adata = sc.read_h5ad("dataset.h5ad")

# pseudobulk per condition (guide/drug)
pb = adata.to_df().groupby(adata.obs["condition"]).mean()

# keep the 1,000 most variable genes across conditions
genes_use = pb.var(axis=0).sort_values(ascending=False).index[:1000]
X = pb[genes_use].values.astype(np.float32)           # [n_cond, n_genes]
conds = pb.index.values

# five folds for robust estimates (from paper)
kf = KFold(n_splits=5, shuffle=True, random_state=1)
  2. Create a bilinear ridge decoder (the simple model used as baseline)
import torch, torch.nn as nn

# their simple model that is used as a linear baseline against DL
class BilinearRidge(nn.Module):
    def __init__(self, d_gene, d_cond, lam=1e-1):
        super().__init__()
        self.W = nn.Parameter(torch.zeros(d_gene, d_cond))
        self.b = nn.Parameter(torch.zeros(d_gene))
        self.lam = lam
    def forward(self, G, P):            # G:[n_genes,dg], P:[n_cond,dp]; use dg==d_gene, dp==d_cond
        return (G @ self.W @ P.T).T + self.b    # -> [n_cond, n_genes]

def fit_ridge(G, P, Y, lam=1e-1, iters=2000, lr=1e-2):
    model = BilinearRidge(G.shape[1], P.shape[1], lam)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    Yt = torch.tensor(Y); Gt = torch.tensor(G); Pt = torch.tensor(P)
    for _ in range(iters):
        pred = model(Gt, Pt)
        loss = ((pred - Yt)**2).mean() + model.lam*(model.W**2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    return model
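
As a quick synthetic sanity check of that decoder (random matrices with arbitrary embedding sizes of my own choosing, just to confirm the shapes line up):

# toy shape check: 1,000 genes, 40 conditions, 32-dim embeddings (arbitrary numbers)
G_demo = np.random.randn(1000, 32).astype(np.float32)   # gene embeddings         [n_genes, d_g]
P_demo = np.random.randn(40, 32).astype(np.float32)     # perturbation embeddings [n_cond, d_p]
Y_demo = np.random.randn(40, 1000).astype(np.float32)   # pseudobulk expression   [n_cond, n_genes]
m = fit_ridge(G_demo, P_demo, Y_demo, iters=200)
print(m(torch.tensor(G_demo), torch.tensor(P_demo)).shape)   # torch.Size([40, 1000])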

I. scGPT

  1. Load the pretrained scGPT checkpoint
  2. Tokenize your single-cell expression matrix as input for the model
  3. Fine-tune the model on your perturbation dataset for ~15 epochs (via default tutorial pipeline)
  4. Use the generative head to predict gene expression for held-out perturbations
from scgpt import SCGPTModel, tokenize_counts  # schematic
ckpt = "scgpt_pretrained.ckpt"
model = SCGPTModel.load_from_checkpoint(ckpt)
tok = tokenize_counts(adata[:, genes_use])      # returns token IDs per cell

# fine-tune ~15 epochs, then aggregate to condition predictions
trainer = torch.optim.Adam(model.parameters(), lr=1e-4)
for epoch in range(15):
    for batch in dataloader(tok):               # your DataLoader
        loss = model.loss(batch)
        trainer.zero_grad(); loss.backward(); trainer.step()

# derive gene embeddings G_scgpt (ie last-layer weights or learned token embeddings)
G_scgpt = model.gene_embeddings(weight_only=True)[genes_use]   # [n_genes, d_g]

II. scFoundation

  1. Download the released scFoundation checkpoint
  2. Expose the position embedding layer (pos_emb.weight) for updating
  3. Fine-tune that embedding layer for ~5 epochs per dataset (capped at ~3 days total on an A100)
  4. Use the model to predict perturbation responses on test conditions
from scfoundation import load_model   # schematic API
sf = load_model("scfoundation.ckpt")
for p in sf.parameters(): p.requires_grad_(False)
sf.pos_emb.weight.requires_grad_(True)

opt = torch.optim.Adam([sf.pos_emb.weight], lr=3e-4)
for epoch in range(5):                 # or stop at 72h wall-time
    for batch in dataloader_sf(adata):
        loss = sf.loss(batch)
        opt.zero_grad(); loss.backward(); opt.step()

G_scf = sf.export_gene_embeddings(genes_use)    # [n_genes, d_g]

III. GEARS

  1. Compute pathway membership or perturbation graphs from your dataset
  2. Generate perturbation embeddings via spectral embedding
  3. Feed these embeddings into the simple bilinear regression decoder
  4. Train only the decoder (GEARS features remain fixed)
import networkx as nx
from sklearn.manifold import SpectralEmbedding

# build perturbation graph (nodes = perturbations, edges weighted by shared pathway membership)
G = nx.Graph()
G.add_nodes_from(conds)
# add edges with weights (e.g. Jaccard similarity of pathway sets; see the sketch just below)
# G.add_edge(p1, p2, weight=w)

# to_scipy_sparse_matrix was removed in NetworkX 3.x; to_scipy_sparse_array is the current call
A = nx.to_scipy_sparse_array(G, nodelist=list(conds), weight="weight")
P_gears = SpectralEmbedding(n_components=64, affinity='precomputed').fit_transform(A)  # [n_cond, 64]
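
The commented-out edge weights could be filled in roughly like this, assuming a dict called pathways that maps each perturbation to its set of pathway IDs (a hypothetical placeholder, not something the paper provides). It would slot in where the G.add_edge comment sits above, before the spectral embedding:

from itertools import combinations

# hypothetical: pathways = {"GENE_A": {"p53 signaling", "apoptosis"}, "GENE_B": {"apoptosis"}, ...}
def jaccard(a, b):
    return len(a & b) / len(a | b) if (a | b) else 0.0

for p1, p2 in combinations(conds, 2):
    w = jaccard(pathways[p1], pathways[p2])
    if w > 0:
        G.add_edge(p1, p2, weight=w)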

IV. CPA (Compositional Perturbation Autoencoder)

  1. Initialize CPA architecture (autoencoder with latent condition/perturbation embeddings)
  2. Train end-to-end on your dataset to learn perturbation and condition embeddings jointly
  3. Decode embeddings into gene expression predictions for unseen perturbations
  4. Evaluate against ground truth test sets
from cpa import CPA  # schematic

cpa = CPA(n_genes=len(genes_use), d_latent=128)
opt = torch.optim.Adam(cpa.parameters(), lr=1e-3)
train_data = make_cpa_dataset(adata[:, genes_use], adata.obs["condition"])

for epoch in range(50):
    for xb, cb in train_data:                  # xb: counts, cb: condition ids
        recon, z = cpa(xb, cb)
        loss = cpa.loss(recon, xb) + cpa.reg(z)
        opt.zero_grad(); loss.backward(); opt.step()

# predict held-out conditions:
Y_pred = cpa.predict(conditions_test, genes_use)   # pseudobulk per condition

V. Geneformer / scBERT / UCE

  1. Load the pretrained transformer model
  2. Extract fixed gene embeddings from its layers (no fine-tuning)
  3. Insert these embeddings into the bilinear regression decoder
  4. Train only the decoder weights (ridge regression) to map to expression output
# frozen feature extractors (schematic API, like the scGPT/scFoundation examples above)
from geneformer import Geneformer
gf = Geneformer.from_pretrained("geneformer.ckpt").eval()
with torch.no_grad():
    G_gf = gf.embed_genes(genes_use)  # [n_genes, d_g]

# or scBERT/UCE analogs: extract per-gene vectors; keep frozen

Post fine-tuning: bringing it all together

# assemble training: embeddings -> bilinear ridge -> predict
def run_split(train_idx, test_idx, G_embed, P_embed):
    Y = X                                         # [n_cond, n_genes]
    Y_tr, Y_te = Y[train_idx], Y[test_idx]
    P_tr, P_te = P_embed[train_idx], P_embed[test_idx]
    model = fit_ridge(G_embed, P_tr, Y_tr, lam=1e-1, iters=1500, lr=5e-3)
    Y_hat = model(torch.tensor(G_embed), torch.tensor(P_te)).detach().numpy()
    return Y_te, Y_hat

# example: scFoundation genes + GEARS perts, on one of the 5 folds defined earlier
tr_idx, te_idx = next(kf.split(X))
Y_true, Y_pred = run_split(tr_idx, te_idx, G_scf, P_gears)


# calculate quality metrics
from scipy.stats import pearsonr
def rmse(a,b): return np.sqrt(((a-b)**2).mean())

rmse_score = rmse(Y_true, Y_pred)
pearson_delta = np.mean([pearsonr(t, p)[0] for t, p in zip(Y_true, Y_pred)])

# calculate mean baseline
Y_mean = X[tr_idx].mean(axis=0, keepdims=True).repeat(len(te_idx), axis=0)
rmse_mean = rmse(X[te_idx], Y_mean)
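
To mirror the paper’s five independent splits, the same call can be looped over every fold and the metrics averaged; a minimal sketch reusing the objects defined above:

# run every fold from the KFold splitter and average the metrics (mirrors the paper's 5 splits)
fold_rmse, fold_pearson = [], []
for tr_idx, te_idx in kf.split(X):
    Y_true, Y_pred = run_split(tr_idx, te_idx, G_scf, P_gears)
    fold_rmse.append(rmse(Y_true, Y_pred))
    fold_pearson.append(np.mean([pearsonr(t, p)[0] for t, p in zip(Y_true, Y_pred)]))

print(f"RMSE:    {np.mean(fold_rmse):.4f} +/- {np.std(fold_rmse):.4f}")
print(f"Pearson: {np.mean(fold_pearson):.4f} +/- {np.std(fold_pearson):.4f}")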

This is the basic flow of the paper: curate pseudobulk, craft G/P embeddings (PCA or pretrained), fit a bilinear ridge decoder, then score on held-out perturbations. Swap the embedding blocks (scGPT, scFoundation, GEARS, CPA, Geneformer/scBERT/UCE) without changing the decoder scaffolding.

Wet Lab Costs

Finally, we can approximate the perturb-seq/CROP-seq single-cell CRISPR screening datasets (from Norman et al. 2019, Adamson et al. 2016, Replogle et al. 2022).

These experiments involve performing pooled CRISPR perturbations and single-cell RNA sequencing on tens of thousands of cells. Large-scale perturb-seq screens can cost $100M+. There are attempts to cut that by 20x, but replicating these datasets would still cost $10k+.

This is a rough guess based on some research, but by using 10x Genomics, we can theoretically achieve costs on the order of $0.25–$0.70 per cell by multiplexing samples and super-loading droplet channels. Recovering ~20,000 single cells runs about $5k–$15k in sequencing and library prep fees. The Norman et al. perturb-seq dataset had ~81,000 cells, implying on the order of $10k+ in sequencing consumables alone (assuming standard depths and techniques).

I am not going to spend the money on verifying this, but it’s nice to know. We could break that down into $/byte, which would be an interesting metric to track over time for digital biology. It doesn’t account for meaning, though, which is the most valuable metric but needs contextualization in the market it serves.
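
For what it’s worth, here is that arithmetic written out, using the assumed per-cell range above and the Norman et al. file size (my assumptions, not verified quotes):

# rough wet-lab arithmetic for the Norman et al. dataset, using the assumed $/cell range above
cells = 81_000
cost_per_cell = (0.25, 0.70)                   # $/cell, assumed 10x Genomics range
low, high = (cells * c for c in cost_per_cell)
print(f"sequencing consumables: ${low:,.0f} - ${high:,.0f}")        # ~$20,250 - ~$56,700

# $/GB as a crude digital-biology metric, using the 3.9 GB dataset size
dataset_gb = 3.9
print(f"$/GB: {low / dataset_gb:,.0f} - {high / dataset_gb:,.0f}")  # ~5,192 - ~14,538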

2. Training an actual foundation model

In this section, we will model the costs of training an actual foundation model (based only on cloud rates and data volume; this is very simplistic). Below is roughly what it took to train GPT-4o.

~20–40 TB of raw text (≈5–10T tokens, 200B parameters)

  • ~6×10²⁴ FLOPs → ~3 weeks on 10k H100s

~100TB+ → ~1PB range with vision/audio

  • 1 PB = ~3×10²⁶ FLOPs → ~3 years on 10k H100s

= trillions of tokens × training passes = billions of dollars in GPU-hour spend

For the exercise below, assume that we are using 10,000 H100s with ~200B active params.
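
Here’s a minimal sketch of the arithmetic behind the tables that follow: compute ≈ 6 × params × tokens, divided across the cluster. The effective per-GPU throughput is my own assumption, picked so the 40 TB baseline lands near the ~504 h (~21 d) figure above; the $3.29 rate is the Lambda on-demand H100 price from the tables.

# back-of-the-envelope: FLOPs ~= 6 * params * tokens, then wall-clock time and cost on 10k H100s
N_PARAMS = 200e9
N_GPUS = 10_000
EFF_FLOPS_PER_GPU = 330e12        # assumed effective FLOP/s per H100, not the spec-sheet peak

def train_time_and_cost(tokens, dollars_per_gpu_hr):
    flops = 6 * N_PARAMS * tokens                         # standard 6ND approximation
    hours = flops / (N_GPUS * EFF_FLOPS_PER_GPU * 3600)   # wall-clock hours across the cluster
    return hours, hours * N_GPUS * dollars_per_gpu_hr

for label, tokens in [("22 GB (~5.5B tokens)", 5.5e9),
                      ("40 TB (~5T tokens)", 5e12),
                      ("1 PB (~250T tokens)", 250e12)]:
    hours, cost = train_time_and_cost(tokens, 3.29)       # Lambda on-demand rate from the tables
    print(f"{label}: {hours:,.2f} h ({hours / 24:,.2f} d), ${cost:,.0f}")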

We have 3 levels:

  1. 22GB (~5.5B tokens)
  2. 40TB text (~5T tokens) *used as baseline
  3. 1PB (~250T tokens)
  1. 22 GB corpus (dataset size from paper)

| Provider | Contract | $/GPU-hr | Wall-clock hours (days) | Total (10,000 H100s) |
| --- | --- | --- | --- | --- |
| Lambda Labs | On-Demand | $3.29 | 0.55 h (0.02 d) | $18,095 |
| | Reserved | $2.50 | 0.55 h (0.02 d) | $13,750 |
| | Spot | $1.50 | 0.66 h (0.03 d) | $9,900 |
| AWS | On-Demand | $7.57 | 0.55 h (0.02 d) | $41,635 |
| | Reserved | $5.30 | 0.55 h (0.02 d) | $29,150 |
| | Spot | $5.00 | 0.74 h (0.03 d) | $37,000 |
| Azure | On-Demand | $12.29 | 0.55 h (0.02 d) | $67,595 |
| | Reserved | $8.50 | 0.55 h (0.02 d) | $46,750 |
| | Spot | $3.60 | 0.83 h (0.03 d) | $29,880 |
| GCP | On-Demand | $11.06 | 0.55 h (0.02 d) | $60,830 |
| | Reserved | $7.50 | 0.55 h (0.02 d) | $41,250 |
| | Spot | $2.25 | 0.74 h (0.03 d) | $16,650 |
  2. 40 TB corpus (GPT-4o text only)

| Provider | Contract | $/GPU-hr | Wall-clock hours (days) | Total (10,000 H100s) |
| --- | --- | --- | --- | --- |
| Lambda Labs | On-Demand | $3.29 | 504 h (21 d) | $16,581,600 |
| | Reserved | $2.50 | 504 h (21 d) | $12,600,000 |
| | Spot | $1.50 | 604.8 h (25.2 d) | $9,072,000 |
| AWS | On-Demand | $7.57 | 504 h (21 d) | $38,152,800 |
| | Reserved | $5.30 | 504 h (21 d) | $26,712,000 |
| | Spot | $5.00 | 672 h (28 d) | $33,600,000 |
| Azure | On-Demand | $12.29 | 504 h (21 d) | $61,941,600 |
| | Reserved | $8.50 | 504 h (21 d) | $42,840,000 |
| | Spot | $3.60 | 756 h (31.5 d) | $27,216,000 |
| GCP | On-Demand | $11.06 | 504 h (21 d) | $55,742,400 |
| | Reserved | $7.50 | 504 h (21 d) | $37,800,000 |
| | Spot | $2.25 | 672 h (28 d) | $15,120,000 |
  3. 1 PB corpus (full multi-modal foundation model)

| Provider | Contract | $/GPU-hr | Wall-clock hours (days / years) | Total (10,000 H100s) |
| --- | --- | --- | --- | --- |
| Lambda Labs | On-Demand | $3.29 | 25,228.8 h (1,051.2 d / 2.88 y) | $830,027,520 |
| | Reserved | $2.50 | 25,228.8 h (1,051.2 d / 2.88 y) | $630,720,000 |
| | Spot | $1.50 | 30,274.6 h (1,261.4 d / 3.46 y) | $454,118,400 |
| AWS | On-Demand | $7.57 | 25,228.8 h (1,051.2 d / 2.88 y) | $1,909,820,160 |
| | Reserved | $5.30 | 25,228.8 h (1,051.2 d / 2.88 y) | $1,337,126,400 |
| | Spot | $5.00 | 33,638.4 h (1,401.6 d / 3.84 y) | $1,681,920,000 |
| Azure | On-Demand | $12.29 | 25,228.8 h (1,051.2 d / 2.88 y) | $3,100,619,520 |
| | Reserved | $8.50 | 25,228.8 h (1,051.2 d / 2.88 y) | $2,144,448,000 |
| | Spot | $3.60 | 37,843.2 h (1,576.8 d / 4.32 y) | $1,362,355,200 |
| GCP | On-Demand | $11.06 | 25,228.8 h (1,051.2 d / 2.88 y) | $2,790,305,280 |
| | Reserved | $7.50 | 25,228.8 h (1,051.2 d / 2.88 y) | $1,892,160,000 |
| | Spot | $2.25 | 33,638.4 h (1,401.6 d / 3.84 y) | $756,864,000 |

…yeah, that’s billions of dollars and years of wall-clock time. Obviously nobody is going to train for years (they’ll make up for it with more GPUs and far more optimization techniques), but it could take months, like the original LLMs. So it’s no wonder DL didn’t outperform a simple linear regression here.

Conclusion

I could sit here and write about all the inconsistencies and flaws in both their approach and my own approach/calculation, but at the end of the day, the future is going to be built by builders, and that is just how it happens.

If there is a market for these DL bio models (and there definitely is, considering how insane biotech costs are and how inefficient the industry has become), anything that cuts costs and increases leverage is going to get built.

Like everything in history, there is nothing, then suddenly, there is everything. The same will happen here. And for good reason! The amount of inventions we can build on top of these is theoretically infinite!