9/23/25
ΦX174 (cryoEM renderings from the paper)
Today, we are recreating the paper “Generative design of novel bacteriophages with genome language models” from the Arc Institute. I am assuming the reader has an advanced understanding of molecular biology and programming. Only some terms will be explained.
Evo2 was trained on millions of phage genomes. The authors fine-tuned this model on ~15,000 Microviridae genomes (the family that includes ΦX174). They then used the model to generate candidates and selected 302 novel genomes to test and validate in the wet lab. Of these, 16 proved viable and infectious.
Method steps
- get the Microviridae genome dataset from NCBI’s Viral Genomes database, filter out non-ACGT and garbage sequences, cluster at 99% with MMseqs2 to deduplicate, and end up with 14,466 genomes (data)
- align each genome to ΦX174, assign identity bins, and prepend soft-prompt tokens: “+” (Microviridae) plus one token for the identity bin, i.e. how close that genome is to ΦX174 ({∼: 95-100%, ˆ: 80-95%, #: 70-80%, $: 50-70%, !: <50%})
- split dataset into 14,266 train / 100 test / 100 validation
- build dataset + data loader (one sample = 1 whole genome + tokens, truncate/pad to 10,240 tokens, mask pads in loss)
- train
  - Evo1 7B 131K
    - 5,000 steps on 16 H100s
    - batch size 64
    - 10,240 token context length
    - ~655k tokens processed per step (64 x 10,240)
    - LR 9.698e-5 with 5% warmup then cosine → 3e-5 (start at 0, ramp to 0.00009698 over first 5% of steps, then cosine-decay to 0.00003 by the end)
  - Evo2 7B 8K
    - 12,000 steps on 32 H100s
    - batch size 32
    - 10,240 token context length
    - ~328k tokens processed per step (32 x 10,240)
    - LR 1e-5 with 5% warmup then cosine → 1e-6 (start at 0, ramp to 0.00001 over first 5% of steps, then cosine-decay to 0.000001 by the end)
  - why these exact settings? no idea, assuming they A/B tested a bunch
- generate ~1,000 sampled genomes per temperature x prompt combo, with 5 temps ({0.3, 0.5, 0.7, 0.9, 1.1}) and 11 prompt lengths (1–11 nucleotides of the ΦX174 consensus start) ⇒ ~55 combos (5 x 11) ⇒ ~55,000 genomes per model, meaning 110,000 genomes
- filter those 110,000 generated genomes based on QC (length 4–6 kb, GC%), tropism (spike identity ≥60%), gene/architecture checks, maximized diversity (Shannon diversity), and end up with 302 candidate genomes
- send 302 fragments to Twist, who then assembles + synthesizes them into circular double-stranded DNA genomes with NEBuilder/Gibson at ~50 °C for ~1 h (ΦX174 is naturally single-stranded, but replication needs double-stranded DNA)
- boot the phage by transforming the circular DNA into E. coli C via electroporation/chem-comp, then let the cells recover (grow back their walls) in rich media with the right temperature and time
- check results by plaque assay (plating) and OD600 growth assay
  - plaque: if the cells release working phages, clear circles (plaques) appear where bacteria were killed (each plaque = one infectious phage particle)
  - OD600: grow E. coli in liquid culture with the phage in question and measure with a spectrophotometer; if the phage is viable, the culture's optical density drops (cells die/lyse)
- pick winners and amplify by infecting fresh E. coli C, grow it, spin/filter supernatant to get phage stock
- quantify the amount of phage with serial dilutions + plating to get PFU/mL (titer)
- verify the sequence of the phage genome in question (Plasmidsaurus long-read sequencing)
  - B.2.9, B.5.1, B.5.6, B.5.7 → sections that describe the technology used
- characterize
  - host range: test on E. coli C, W, and K-12; record which strains are lysed
  - lysis kinetics: OD600 vs time; report minimum optical density, time-to-min, and drop rate (d(OD600)/dt)
  - head-to-head fitness: equal MOI mix with ΦX174, passage once; quantify relative abundance by sequencing (section B.x.x)
  - resistance assay: use ΦX174-resistant strains, serially passage with single phage or cocktail, note passages to clearance/failure
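The lysis-kinetics readout in the characterize step can be computed straight from a plate-reader time series. Here is a minimal sketch, assuming OD600 readings at known time points. The function name `lysis_metrics` and the sample curve are made up for illustration, and the drop rate is taken as the steepest slope between consecutive readings, which may not be exactly how the paper defines it.

```python
def lysis_metrics(times_min, od600):
    """Summarize an OD600 curve for a phage-infected culture.

    Returns (min_od, time_to_min, max_drop_rate), where max_drop_rate is the
    most negative slope between consecutive readings, in OD units per minute.
    """
    if len(times_min) != len(od600) or len(od600) < 2:
        raise ValueError("need at least two paired readings")

    # Global minimum of the curve and the time at which it occurs.
    # Assumption: readings start after the culture has begun growing, so the
    # global minimum reflects lysis rather than the initial inoculum.
    i_min = min(range(len(od600)), key=lambda i: od600[i])
    min_od = od600[i_min]
    time_to_min = times_min[i_min]

    # Finite-difference slopes d(OD600)/dt between consecutive readings.
    slopes = [
        (od600[i + 1] - od600[i]) / (times_min[i + 1] - times_min[i])
        for i in range(len(od600) - 1)
    ]
    max_drop_rate = min(slopes)  # most negative slope = fastest lysis
    return min_od, time_to_min, max_drop_rate


# Hypothetical curve: growth, then lysis, then regrowth of survivors.
times = [0, 30, 60, 90, 120, 150]
ods = [0.10, 0.25, 0.50, 0.20, 0.05, 0.10]
min_od, t_min, rate = lysis_metrics(times, ods)
print(f"min OD {min_od}, at {t_min} min, steepest drop {rate:.4f} OD/min")
```

Real plate-reader data is noisy, so smoothing the curve before taking slopes is probably worth it; that is left out here to keep the sketch short.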
Result: 285/302 assembled, 16 truly viable and infectious
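The learning-rate schedules listed under the train step both follow the same warmup-then-cosine shape. Here is a minimal sketch in plain Python, assuming linear warmup from 0 and a cosine decay that lands on the floor value exactly at the last step. `lr_at_step` is a hypothetical helper, not the authors' training code.

```python
import math


def lr_at_step(step, total_steps, peak_lr, min_lr, warmup_frac=0.05):
    """Linear warmup from 0 to peak_lr, then cosine decay down to min_lr."""
    warmup_steps = round(total_steps * warmup_frac)
    if step < warmup_steps:
        # Warmup: ramp linearly from 0 at step 0 to peak_lr at warmup_steps.
        return peak_lr * step / warmup_steps
    # Decay: progress runs from 0 (end of warmup) to 1 (final step).
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine


# Hyperparameters as listed in the method steps.
EVO1 = dict(total_steps=5_000, peak_lr=9.698e-5, min_lr=3e-5)
EVO2 = dict(total_steps=12_000, peak_lr=1e-5, min_lr=1e-6)

for name, cfg in (("Evo1", EVO1), ("Evo2", EVO2)):
    warmup_end = round(cfg["total_steps"] * 0.05)
    for step in (0, warmup_end, cfg["total_steps"]):
        print(name, step, lr_at_step(step, **cfg))
```

For each model this prints the LR at the start, at the end of warmup (the peak), and at the final step (the floor).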
What’s the cost of compute?
Assumptions
Evo1 SFT: 5,000 steps x 16 H100s
Evo2 SFT: 12,000 steps x 32 H100s
Lambda Labs on-demand price: $2.49 per H100 GPU-hour
cost = (# GPUs x steps x sec/step / 3600) x $/GPUh
0.25 sec/step
- Evo1: 5.6 GPUh = $14
- Evo2: 26.7 GPUh = $66
- total: 32.2 GPUh = $80
0.50 sec/step
- Evo1: 11.1 GPUh = $28
- Evo2: 53.3 GPUh = $133
- total: 64.4 GPUh = $160
1.00 sec/step
- Evo1: 22.2 GPUh = $55
- Evo2: 106.7 GPUh = $266
- total: 128.9 GPUh = $321
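The figures above can be reproduced with a few lines of Python. This sketch uses only the GPU counts, step counts, and $2.49/GPUh price already assumed in this section. The seconds-per-step values are the same guesses as above, not measured throughput, and `gpu_hours` and `cost_table` are hypothetical helper names.

```python
PRICE_PER_GPU_HOUR = 2.49  # assumed on-demand H100 price from the text

RUNS = {
    "Evo1": {"gpus": 16, "steps": 5_000},
    "Evo2": {"gpus": 32, "steps": 12_000},
}


def gpu_hours(gpus, steps, sec_per_step):
    """GPU-hours = number of GPUs x wall-clock hours of training."""
    return gpus * steps * sec_per_step / 3600


def cost_table(sec_per_step):
    """Map each run (plus 'total') to (GPU-hours, dollars)."""
    hours = {
        name: gpu_hours(sec_per_step=sec_per_step, **run)
        for name, run in RUNS.items()
    }
    hours["total"] = sum(hours.values())
    return {name: (h, h * PRICE_PER_GPU_HOUR) for name, h in hours.items()}


for sec in (0.25, 0.50, 1.00):
    print(f"{sec:.2f} sec/step")
    for name, (hours, dollars) in cost_table(sec).items():
        print(f"- {name}: {hours:.1f} GPUh = ${dollars:.0f}")
```

Swapping in a measured sec/step, or a different price, is a one-line change, which is the main reason to script this rather than do it by hand.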