GitHub repo:
Cells as a Chemical Factory
There are two primary focuses in cell engineering:
- designing cells to output a specific molecule
- designing cells to respond in specific ways to specific environments
The first market is much larger ($40-80B), and includes designing microbes to produce chemicals biologically instead of via a petrochemical/chemistry process, producing antibiotics by fermentation, and creating alternative food proteins. The second is emerging but much smaller ($5-10B): PFAS/oil-spill cleanup, microbiome therapeutics, environmental sensors; basically using cells to do things beyond outputting a desired chemical.
This project primarily focuses on the first application, though I have projects in the pipeline for the second application.
Ginkgo Bioworks
This project is designed specifically around Ginkgo’s business model + system. They are one of the first pioneers of systematizing the engineering of cells for desired behavior, and their business resembles a service shop/dev shop/CRO for doing this.
Their system, high level, is as follows:
- define the goal: talk to the customer about their desired output molecule, budget, timeline, legal requirements, and environmental ambitions (if applicable)
- the Ginkgo team then decides the host chassis cell line, the types of edits required, the environment to grow it in, and the titer target that makes the project economically feasible
- design edits: based on the organism, molecule, and environmental conditions (temperature, bioreactor, feedstock, antibiotics to keep the production cells from dying), they’ll discuss and experiment with the different edits that need to be made
- synthesize strain: after designing + simulating, they synthesize the genetic constructs their cells will take up and start producing the desired output
- test: they test the edits in bioreactors that measure product titer (g/L), growth (OD), and volumetric productivity (g/L/h)
- analyze: they will most likely run replicates in parallel (n=3-8), normalize to controls (depending on the customer’s demands), and probably use an analysis of variance (ANOVA) or linear mixed models
- iterate: based on the analysis, they iterate and improve
- delivery: Ginkgo then delivers the optimized microbe plus a full tech package, including the strain genome, pathway annotations, growth protocols, production titers, safety data, the full DBTL history, metabolic reports, and predicted vs. actual yield curves
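The test metrics above are related by a simple ratio: volumetric productivity is final titer divided by run time. A toy calculation (the fermentation numbers here are invented for illustration):

```python
def volumetric_productivity(titer_g_per_l: float, hours: float) -> float:
    """Volumetric productivity in g/L/h: final titer divided by run time."""
    return titer_g_per_l / hours

# Hypothetical fermentation run: 45 g/L of product after 72 h.
rate = volumetric_productivity(45.0, 72.0)
print(round(rate, 3))  # 0.625 g/L/h
```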
A commercial example of this whole process is Motif FoodWorks, which ChatGPT told me took about 6 months. I would assume the bulk of that was the iteration phase, waiting for cells to show results.
Our Project
We will do everything Ginkgo does, but inside the computer. Ideally there would be a real-world interface (through Opentrons or others) that could handle the wet-lab part automatically. Our models will be missing loads of unaccounted-for information, but this is just a proof of concept for how the industry is going to change.
- define target molecule and desired yield
- select host strain
- design pathway
- design genetic circuit
- construct strain
- experimental DBTL loop: test in a simulated environment
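The pipeline above can be sketched as a minimal in-silico DBTL loop. Every function name here is a hypothetical placeholder, not the project’s actual API:

```python
# Minimal sketch of the in-silico DBTL loop. Every function is a hypothetical
# stub standing in for a real design/build/test/learn step.
def design_edits(target, host):
    # Design: propose candidate knockouts (KO) / overexpressions (OE).
    return [("KO", "rxn_A"), ("OE", "rxn_B")]

def simulate_strain(host, edits):
    # Build + Test (in silico): stand-in for an FBA/bioreactor simulation.
    return 0.42  # fake predicted yield (g/L)

def learn(history):
    # Learn: in the real project, this is where the GNN + RL agent updates.
    return max(history, key=lambda h: h[1])

def dbtl_loop(target="molecule_X", host="iML1515", rounds=3):
    history = []
    best = None
    for _ in range(rounds):
        edits = design_edits(target, host)
        yield_pred = simulate_strain(host, edits)
        history.append((edits, yield_pred))
        best = learn(history)
    return best

best_edits, best_yield = dbtl_loop()
print(best_yield)  # 0.42
```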
Reproduce
- Download LASER dataset
- contains real data of gene edits → yield
- Train EditScorer (supervised) on LASER data
- GNN encoder built on the BiGG model graph (iML1515) to compute node embeddings
- input: `gene_edits`
- output: `predicted_yield`
- supervision: `final_yield` from LASER
- MLP scorer maps edit embeddings to predicted yield
- loss is calculated as MSE between the GNN’s predicted yield and the LASER yield
- Load BiGG model of organism
- `iML1515.json` for E. coli
- used for graph structure and COBRA simulation
- Train RL agent
- environment: applies edits → runs `cobra_model_cp.optimize()` to simulate yields on top of the BiGG iML1515 E. coli model
- policy: GNN → softmax(logits)
- reward: `0.7 * cobra_yield + 0.3 * scorer_yield`
- loss: `-reward * log π(action)`
- updates: gradients flow through `loss.backward()` into the GNN + scorer
- Eval edits
- sample through random KO/OE combinations
Notes
- we use BiGG’s graph model of E. coli as the input topology for the GNN (that’s what ‘built on’ means: the graph is passed as input to the GNN)
- we use the COBRA simulation of the BiGG model as the reward environment for RL
- the edit → yield logic came from the LASER pre-training
- we use it as a comparison to the predicted output of the model we trained
- solving the linear programs for COBRA is sequential, thus CPU-bound, and takes far longer than the GPU training (we went from 70% COBRA / 30% model in the reward to 10% COBRA / 90% model)
- we could parallelize the COBRA simulations but honestly this is just a practice project so no, we’re done
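The COBRA runs are independent per edit candidate, so a process pool would be the obvious parallelization. A minimal sketch, where `simulate_knockouts` is a deterministic stub standing in for a real "copy the model, apply knockouts, run FBA" call (the fork context assumes a Unix host):

```python
from multiprocessing import get_context

def simulate_knockouts(knockout_set):
    # Placeholder for: copy the COBRA model, knock out the reactions,
    # run model.optimize(), return the objective value. Here we fake a
    # deterministic "yield" from the reaction-id lengths.
    return sum(len(r) for r in knockout_set) / 10.0

def parallel_yields(candidate_sets, workers=2):
    # Each candidate knockout set is an independent LP, so they can be
    # farmed out to separate processes with no shared state.
    ctx = get_context("fork")  # fork context: fine on Linux/macOS servers
    with ctx.Pool(processes=workers) as pool:
        return pool.map(simulate_knockouts, candidate_sets)

if __name__ == "__main__":
    candidates = [("rxn_A",), ("rxn_B",), ("rxn_A", "rxn_B")]
    print(parallel_yields(candidates))  # [0.5, 0.5, 1.0]
```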
Math
How Frances learns to design better strains: one full forward + backward training step in the reinforcement learning loop
- GNN embedding: We start with a graph of the cell's metabolism (E. coli from BiGG). The GNN computes an embedding for each reaction, capturing how it's connected to the rest of the network: $h_r = \mathrm{GNN}(G)_r$
- MLP scorer: Each reaction’s embedding is passed to an MLP, which outputs a logit score, showing how promising it is to knock out that reaction: $z_r = \mathrm{MLP}(h_r)$
- Action sampling: We apply a softmax over the logits to get a probability distribution over all possible edits, $\pi(r) = e^{z_r} / \sum_j e^{z_j}$. In Frances, we sample 3 reaction knockouts from this distribution. These become the actions it takes.
- COBRA simulation: The selected knockouts are applied to a copy of the metabolic model, and COBRA simulates the result via FBA: $\max_v \, c^\top v$ subject to $Sv = 0$ and flux bounds, with knocked-out reactions constrained to $v_r = 0$. This returns a predicted yield (product output).
- Loss: We use the reward (yield), $R = 0.7\,y_{\text{COBRA}} + 0.3\,y_{\text{scorer}}$, to compute a REINFORCE loss, $\mathcal{L} = -R \sum_{i=1}^{3} \log \pi(a_i)$, which increases the probability of selecting edits that led to high yield.
- Backprop: The loss is backpropagated through the MLP scorer and GNN to improve the parameters: $\theta \leftarrow \theta - \eta \nabla_\theta \mathcal{L}$
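To make the loss step concrete, here is a self-contained numeric sketch of the REINFORCE gradient over toy logits (plain Python, no torch; the analytic gradient for $\mathcal{L} = -R \sum_i \log \pi(a_i)$ with a softmax policy is derived by hand here, not taken from the project code, and the logits/actions are invented):

```python
import math

def softmax(z):
    m = max(z)  # subtract max for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def reinforce_grad(logits, actions, reward):
    # dL/dz_k for L = -R * sum_i log pi(a_i), pi = softmax(logits).
    # Since d log pi(a)/dz_k = 1[k == a] - pi_k, summing over actions gives
    # dL/dz_k = -R * (count of k in actions - len(actions) * pi_k)
    pi = softmax(logits)
    n = len(actions)
    return [-reward * (actions.count(k) - n * pi[k]) for k in range(len(logits))]

logits = [2.0, 0.5, -1.0, 0.0]   # toy scores for 4 candidate knockouts
actions = [0, 1, 3]              # the 3 sampled knockouts
grad = reinforce_grad(logits, actions, reward=0.8)
print([round(g, 3) for g in grad])
# Sanity check: the gradient components sum to zero for any softmax policy.
assert abs(sum(grad)) < 1e-9
```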
Results
Training the RL agent with COBRA simulations takes too long on the CPU: we originally started with 70% COBRA and 30% model, then went to 10% COBRA and 90% model for the reward function. It was estimated at 3 days to train on a 24-thread 13th-gen Intel, vs. 3 hours on a $6/hour, 128-thread AWS CPU (about $18 total), but I’m not willing to spend money on this yet. I’m not knowledgeable enough yet to gauge what a good deployment setup is, but will be over the next few models. I increased the step count to 2000 and we are targeting training the whole agent in 1 hour (400 steps took 2.5 hours; training stalled due to a memory leak, fixed with gc.collect() and cache cleanup).
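The memory-leak fix amounts to periodic cleanup inside the training loop. A minimal sketch of that pattern (the loop body, step count, and cleanup interval are all hypothetical; with a GPU you would also clear the framework's cache, e.g. torch.cuda.empty_cache()):

```python
import gc

CLEANUP_INTERVAL = 50  # hypothetical; tune to how fast memory grows

def train_step(step):
    # Placeholder for one RL step (copy COBRA model, forward/backward pass).
    return {"step": step, "reward": -0.1}

def train(num_steps=200):
    history = []
    for step in range(num_steps):
        history.append(train_step(step))
        if step % CLEANUP_INTERVAL == 0:
            # Drop dangling references (e.g. copied COBRA models) and force
            # a collection pass to keep long runs from stalling.
            gc.collect()
    return history

print(len(train()))  # 200
```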
What’s Happening:
- early success: RL agent found perfect strategies early (reward = 1.000)
- exploration decay: As training progressed, the agent explored more random strategies
- negative rewards: Most recent strategies are getting very negative rewards (bad)
- comparison: random baseline (probably ~0.1) vs. RL average (−0.94), roughly −125% relative (very bad)
The Fix:
A lot: our RL agent needs a better exploration strategy, intensive reward-function tuning, learning-rate adjustment, and more stable training. Overall the whole architecture needs to be redone. But this is a project to learn from, and we’ll apply the lessons in the future.
Commercial Potential
Though this specific project is crude and poorly constructed, an ensemble of these models + capsid design would greatly increase the leverage of, or entirely replace, the jobs below:
Also, why aren’t there any libraries for parallelized COBRA simulations? It’s approaching 3 days for 10,000 simulations (which isn’t even that many). There is probably an opportunity in parallelizing these model simulations for generative synthetic-biology gene editing + yield prediction. It’ll be interesting to see fully vertically integrated organism + hardware feedback loops, with a model in the loop handling all the computation, like Tesla Autopilot.