9/23/25

ΦX174 (cryoEM renderings from the paper)

Today, we are recreating the paper “Generative design of novel bacteriophages with genome language models” from the Arc Institute. I am assuming the reader has an advanced understanding of molecular biology and programming. Only some terms will be explained.

Evo2 was trained on millions of phage genomes. The authors fine-tuned it (along with Evo1) on ~15,000 Microviridae genomes (the family that includes ΦX174). They then sampled genomes from the fine-tuned models, narrowed them down to 302 novel candidates, and tested those in the wet lab. Of these, 16 proved viable and infectious.

Method steps

get the Microviridae genome dataset from NCBI’s Viral Genomes database, filter out sequences with non-ACGT characters and other garbage, cluster at 99% identity with MMseqs2 to deduplicate, end up with 14,466 genomes (data)
align each genome to ΦX174, assign an identity bin, and prepend soft-prompt tokens: “+” (Microviridae) plus one token for the bin, which encodes how close that genome is to ΦX174 by percent identity ({∼: 95-100%, ˆ: 80-95%, #: 70-80%, $: 50-70%, !: <50%})
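The filtering and tagging steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function names are made up, the ASCII "~" and "^" stand in for the tilde and caret tokens listed above, and which bin a boundary value (exactly 95%, 80%, and so on) falls into is my assumption.

```python
# Illustrative sketch only; names and boundary handling are assumptions.
VALID_BASES = set("ACGT")
FAMILY_TOKEN = "+"  # Microviridae

# (lower bound of the bin in % identity, token), checked from highest to lowest.
# Assumption: each lower bound is inclusive.
IDENTITY_BINS = [
    (95.0, "~"),
    (80.0, "^"),
    (70.0, "#"),
    (50.0, "$"),
    (0.0, "!"),
]


def is_clean(genome: str) -> bool:
    """Return True if the genome is non-empty and contains only A, C, G, T."""
    return bool(genome) and set(genome.upper()) <= VALID_BASES


def identity_token(percent_identity: float) -> str:
    """Map percent identity to ΦX174 (0-100) onto its soft-prompt token."""
    if not 0.0 <= percent_identity <= 100.0:
        raise ValueError("percent identity must be between 0 and 100")
    for lower_bound, token in IDENTITY_BINS:
        if percent_identity >= lower_bound:
            return token
    raise AssertionError("unreachable: the last bin starts at 0")


def tag_genome(genome: str, percent_identity: float) -> str:
    """Prepend the family token and the identity-bin token to a genome."""
    return FAMILY_TOKEN + identity_token(percent_identity) + genome.upper()
```

For example, `tag_genome("gagttt", 97.2)` gives a string that starts with the two prompt tokens followed by the uppercase sequence. The percent identity itself would come from the alignment step, which is not shown here.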

Evo1 code, Evo2 code

split dataset into 14,266 train/100 test/100 validate
build dataset + data loader (one sample = 1 whole genome + tokens, truncate/pad to 10,240 tokens, mask pads in loss)
train
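Here is a small sketch of the split and the pad/truncate step, in plain Python so it runs without a deep learning framework. The random shuffle, the seed, the byte-level encoding, and the pad id of 0 are all assumptions for illustration; the source only gives the split sizes and the 10,240-token length.

```python
import random

CONTEXT_LEN = 10_240
PAD_ID = 0  # hypothetical pad token id


def split_dataset(records, n_test=100, n_val=100, seed=0):
    """Shuffle and split into train/validation/test (random split is an assumption)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


def encode(sequence: str):
    """Encode one character per token as its ASCII byte value (assumed tokenizer)."""
    return list(sequence.encode("ascii"))


def pad_or_truncate(token_ids, length=CONTEXT_LEN, pad_id=PAD_ID):
    """Return (ids, loss_mask), both of size `length`.

    loss_mask is 1 for real tokens and 0 for padding, so padded positions
    can be excluded from the loss.
    """
    ids = list(token_ids[:length])
    mask = [1] * len(ids)
    n_pad = length - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad
```

In a training loop, the per-token loss would be multiplied by `loss_mask` and averaged over the number of unmasked tokens.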

Evo1 7B 131K

5,000 steps on 16 H100s
64 batch size
10,240 token context length
~655k tokens processed per step
LR 9.698e-5 with 5% warmup then cosine → 3e-5 (start at 0, ramp to 0.00009698 over first 5% of steps, then cosine-decay to 0.00003 by the end)

Evo2 7B 8K

12,000 steps on 32 H100s
32 batch size
10,240 token context length
~328k tokens processed per step
LR 1e-5 with 5% warmup then cosine → 1e-6 (start at 0, ramp to 0.00001 over first 5% of steps, then cosine-decay to 0.000001 by the end)

Why these particular settings? No idea; I'm assuming they tried a bunch of configurations.
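To make the two schedules concrete, here is a minimal sketch of warmup followed by cosine decay, plus the tokens-per-step arithmetic. The linear shape of the warmup is my assumption (the description only says the rate ramps up from 0), and the function names are made up.

```python
import math


def lr_at(step, total_steps, peak_lr, final_lr, warmup_frac=0.05):
    """Learning rate at `step`: linear warmup from 0, then cosine decay to final_lr."""
    warmup_steps = int(total_steps * warmup_frac)
    if warmup_steps > 0 and step < warmup_steps:
        return peak_lr * step / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 down to 0
    return final_lr + (peak_lr - final_lr) * cosine


def tokens_per_step(batch_size, context_len=10_240):
    """Tokens processed per optimizer step (padding included)."""
    return batch_size * context_len


# Evo1 run: peak 9.698e-5, final 3e-5, 5,000 steps, batch size 64
evo1_lr_mid = lr_at(2_500, 5_000, 9.698e-5, 3e-5)
# Evo2 run: peak 1e-5, final 1e-6, 12,000 steps, batch size 32
evo2_lr_mid = lr_at(6_000, 12_000, 1e-5, 1e-6)
```

`tokens_per_step(64)` is 655,360 and `tokens_per_step(32)` is 327,680, which matches the ~655k and ~328k figures above.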

generate ~1,000 sampled genomes per temperature x prompt combo, with 5 temperatures ({0.3, 0.5, 0.7, 0.9, 1.1}) and 11 prompt lengths (1–11 nucleotides of the ΦX174 consensus start) ⇒ 55 combos (5 x 11) ⇒ ~55,000 genomes per model, or ~110,000 genomes across the two models
filter those 110,000 generated genomes based on QC (length 4–6 kb, GC%), tropism (spike identity ≥60%), gene/architecture checks, maximized diversity (Shannon diversity), and end up with 302 candidate genomes
send 302 fragments to Twist, who then assembles + synthesizes them into circular double-stranded DNA genomes with NEBuilder/Gibson at ~50 °C for ~1 h (ΦX174 is naturally single-stranded, but replication goes through a double-stranded form)
boot the phage by transforming the circular DNA into E. coli C via electroporation or chemically competent cells, then let the cells recover (grow back their walls) in rich media at the right temperature and for enough time
check results by plating via plaque assay and OD600 growth assay

plaque: if the cells release working phages, clear circles (plaques) will appear where bacteria were killed (each plaque traces back to one infectious phage particle)
OD600: grow E. coli in liquid culture with the phage in question and measure with a spectrophotometer; if the phage is viable, the culture's optical density will drop (cells die/lyse)
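Going back to the sampling and QC steps in the list above, the grid arithmetic and the simplest filters can be sketched like this. The GC window is a placeholder I picked so the code runs, not a value from the paper, and the tropism, gene-architecture, and diversity filters are left out entirely.

```python
from itertools import product

TEMPERATURES = (0.3, 0.5, 0.7, 0.9, 1.1)
PROMPT_LENGTHS = range(1, 12)  # 1 to 11 nucleotides of the ΦX174 consensus start
SAMPLES_PER_COMBO = 1_000
N_MODELS = 2  # Evo1 and Evo2


def sampling_grid():
    """All (temperature, prompt length) combinations."""
    return list(product(TEMPERATURES, PROMPT_LENGTHS))


def gc_fraction(genome: str) -> float:
    """Fraction of G and C bases in the genome."""
    g = genome.upper()
    return (g.count("G") + g.count("C")) / len(g) if g else 0.0


def passes_basic_qc(genome, min_len=4_000, max_len=6_000, gc_range=(0.35, 0.55)):
    """Length and GC checks only; gc_range is a hypothetical placeholder."""
    if not min_len <= len(genome) <= max_len:
        return False
    low, high = gc_range
    return low <= gc_fraction(genome) <= high


n_combos = len(sampling_grid())
total_generated = n_combos * SAMPLES_PER_COMBO * N_MODELS
```

With these constants `n_combos` is 55 and `total_generated` is 110,000, the same counts as in the sampling step.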

pick winners and amplify them: infect fresh E. coli C, grow the culture, then spin and filter the supernatant to get phage stock
quantify the amount of phage with serial dilutions + plating to get PFU/mL (titer)
verify the sequence of the phage genome in question (Plasmidsaurus long-read sequencing)
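The titer step is the standard plaque-count arithmetic: PFU/mL = plaques counted / (dilution x volume plated in mL). A minimal sketch, with a made-up function name and made-up example numbers:

```python
def titer_pfu_per_ml(plaque_count, dilution, volume_plated_ml):
    """Compute phage titer from one countable plate.

    plaque_count:     plaques counted on the plate
    dilution:         total dilution of the plated sample, e.g. 1e-6
    volume_plated_ml: volume of diluted phage plated, in mL
    """
    if plaque_count < 0:
        raise ValueError("plaque_count cannot be negative")
    if not 0 < dilution <= 1:
        raise ValueError("dilution must be in (0, 1], e.g. 1e-6")
    if volume_plated_ml <= 0:
        raise ValueError("volume_plated_ml must be positive")
    return plaque_count / (dilution * volume_plated_ml)


# Hypothetical plate: 42 plaques from 0.1 mL of a 10^-6 dilution
example_titer = titer_pfu_per_ml(42, 1e-6, 0.1)
```

That hypothetical plate works out to about 4.2 x 10^8 PFU/mL.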

Sections B.2.9, B.5.1, B.5.6, and B.5.7 of the paper describe the techniques used

characterize

host range: test on E. coli C, W, and K-12; record which strains are lysed
lysis kinetics: OD600 vs. time; report minimum optical density, time-to-min, and drop rate (dOD/dt)
head-to-head fitness: equal MOI mix with ΦX174, passage once; quantify relative abundance by sequencing (section B.x.x)
resistance assay: use ΦX174-resistant strains, serially passage with single phage or cocktail, note passages to clearance/failure
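For the lysis-kinetics readout, the three numbers can be pulled out of an OD600 time series like this. It is a sketch under my own assumptions: the minimum is taken after the culture's peak (so a low starting OD does not count), and the drop rate is the steepest decline between consecutive readings. The paper may define these differently.

```python
def lysis_metrics(times_h, od600):
    """Summarize a lysis curve from paired time (hours) and OD600 readings."""
    if len(times_h) != len(od600) or len(od600) < 2:
        raise ValueError("need two or more paired time/OD600 readings")

    # Assumption: look for the minimum only after the culture peaks.
    peak_idx = max(range(len(od600)), key=lambda i: od600[i])
    min_idx = min(range(peak_idx, len(od600)), key=lambda i: od600[i])

    # Steepest (most negative) slope between consecutive readings.
    steepest = 0.0
    for i in range(1, len(od600)):
        dt = times_h[i] - times_h[i - 1]
        if dt <= 0:
            raise ValueError("time points must be strictly increasing")
        steepest = min(steepest, (od600[i] - od600[i - 1]) / dt)

    return {
        "min_od": od600[min_idx],
        "time_to_min_h": times_h[min_idx],
        "max_drop_rate_per_h": abs(steepest),  # reported as a positive rate
    }
```

A curve that never declines returns a drop rate of 0, which is one simple way to flag a phage that did not lyse the culture.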

Result: 285/302 assembled, 16 truly viable and infectious

What’s the cost of compute?

Assumptions

Evo1 SFT: 5,000 steps x 16 H100s

Evo2 SFT: 12,000 steps x 32 H100s

Lambda Labs on-demand price of $2.49 per GPU-hour

cost = (# GPUs x steps x sec/step / 3600) x $/GPUh

0.25 sec/step

Evo1: 5.6 GPUh = $14
Evo2: 26.7 GPUh = $66
total: 32.2 GPUh = $80

0.50 sec/step

Evo1: 11.1 GPUh = $28
Evo2: 53.3 GPUh = $133
total: 64.4 GPUh = $160

1.00 sec/step

Evo1: 22.2 GPUh = $55
Evo2: 106.7 GPUh = $266
total: 128.9 GPUh = $321
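The three scenarios above come straight from the cost formula, so they are easy to reproduce. A short sketch (names are mine; the GPU counts, step counts, and price are the assumptions already listed):

```python
# (number of GPUs, training steps) for each fine-tuning run
RUNS = {"Evo1": (16, 5_000), "Evo2": (32, 12_000)}
PRICE_PER_GPU_HOUR = 2.49  # USD


def gpu_hours(n_gpus, steps, sec_per_step):
    """GPU-hours = GPUs x steps x seconds per step / 3600."""
    return n_gpus * steps * sec_per_step / 3600


def cost_table(sec_per_step, price=PRICE_PER_GPU_HOUR):
    """Return {run: (GPU-hours rounded to 0.1, cost rounded to whole dollars)}."""
    hours = {name: gpu_hours(g, s, sec_per_step) for name, (g, s) in RUNS.items()}
    hours["total"] = sum(hours.values())
    return {name: (round(h, 1), round(h * price)) for name, h in hours.items()}


for sec in (0.25, 0.50, 1.00):
    print(sec, cost_table(sec))
```

Since the cost is linear in sec/step, any other step time just scales these numbers proportionally.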
