GitHub repo:
Cells as a Chemical Factory
There are two primary focuses in cell engineering:
- designing cells to output a specific molecule
- designing cells to respond in specific ways to specific environments
The first market is much larger ($40-80B), and includes designing microbes to produce chemicals biologically instead of via a petrochemical/chemistry process, producing antibiotics by fermentation, and creating alternative food proteins. The second is emerging but much smaller ($5-10B): PFAS/oil-spill cleanup, microbiome therapeutics, environmental sensors; basically using cells to do things beyond outputting a desired chemical.
This project primarily focuses on the first application, though I have projects in the pipeline for the second application.
Ginkgo Bioworks
This project is designed specifically around Ginkgo’s business model + system. They are one of the first pioneers of systematizing the engineering of cells for desired behavior, and their business resembles a service shop/dev shop/CRO for doing this.
Their system, high level, is as follows:
- define the goal: talk to the customer about their desired output molecule, budget, timeline, legal requirements, and environmental ambitions (if applicable)
- the Ginkgo team then decides the host chassis cell line, the types of edits required, the environment to grow it in, and the titer target that makes the project economically feasible
- design edits: based on the organism, molecule, and environmental conditions (temperature, bioreactor, feedstock, antibiotics to keep the production cells from dying), they’ll discuss and experiment with the different edits that need to be made
- synthesize strain: after designing + simulating, they synthesize the genetic constructs their cells will take up and start producing the desired output
- test: they test the edits in bioreactors that measure product titer (g/L), growth (OD), and volumetric productivity (g/L/h)
- analyze: they will most likely run replicates in parallel (n=3-8), normalize to controls (depending on the customer’s demands), and probably use an analysis of variance (ANOVA) or linear mixed models
- iterate: based on the analysis, they iterate and improve
- delivery: Ginkgo then delivers the optimized microbe plus a full tech package, including the strain genome, pathway annotations, growth protocols, production titers, safety data, the full DBTL history, metabolic reports, and predicted vs. actual yield curves
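The test metrics above are related by a simple ratio: volumetric productivity is final titer divided by run time. A toy calculation (the fermentation numbers here are invented for illustration):

```python
def volumetric_productivity(titer_g_per_l: float, hours: float) -> float:
    """Volumetric productivity in g/L/h: final titer divided by run time."""
    return titer_g_per_l / hours

# Hypothetical fermentation run: 45 g/L of product after 72 h.
rate = volumetric_productivity(45.0, 72.0)
print(round(rate, 3))  # 0.625 g/L/h
```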
A commercial example of this whole process is Motif FoodWorks, which ChatGPT told me took about 6 months. I would assume the bulk of that was the iteration phase, waiting for cells to show results.
Our Project
We will do everything Ginkgo does, but inside the computer. Ideally there would be a real-world interface (through Opentrons or others) that could handle the wet-lab part automatically. Our models will be missing loads of unaccounted-for information, but this is just a proof of concept for how the industry is going to change.
- define target molecule and desired yield
- select host strain
- design pathway
- design genetic circuit
- construct strain
- experimental DBTL loop: test in a simulated environment
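The pipeline above can be sketched as a minimal in-silico DBTL loop. Every function name here is a hypothetical placeholder, not the project’s actual API:

```python
# Minimal sketch of the in-silico DBTL loop. Every function is a hypothetical
# stub standing in for a real design/build/test/learn step.
def design_edits(target, host):
    # Design: propose candidate knockouts (KO) / overexpressions (OE).
    return [("KO", "rxn_A"), ("OE", "rxn_B")]

def simulate_strain(host, edits):
    # Build + Test (in silico): stand-in for an FBA/bioreactor simulation.
    return 0.42  # fake predicted yield (g/L)

def learn(history):
    # Learn: in the real project, this is where the GNN + RL agent updates.
    return max(history, key=lambda h: h[1])

def dbtl_loop(target="molecule_X", host="iML1515", rounds=3):
    history = []
    best = None
    for _ in range(rounds):
        edits = design_edits(target, host)
        yield_pred = simulate_strain(host, edits)
        history.append((edits, yield_pred))
        best = learn(history)
    return best

best_edits, best_yield = dbtl_loop()
print(best_yield)  # 0.42
```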
Reproduce
- Download LASER dataset
- contains real data of gene edits → yield
- Train EditScorer (supervised) on LASER data
- GNN encoder built on the BiGG model graph (iML1515) to compute node embeddings
- input: `gene_edits`
- output: `predicted_yield`
- supervision: `final_yield` from LASER
- MLP scorer maps edit embeddings to predicted yield
- loss is calculated as MSE between the GNN’s predicted yield and the LASER yield
- Load BiGG model of organism
- `iML1515.json` for E. coli
- used for graph structure and COBRA simulation
- Train RL agent
- environment: applies edits → runs `cobra_model_cp.optimize()` to simulate yields on top of the BiGG iML1515 E. coli model
- policy: GNN → softmax(logits)
- reward: `0.7 * cobra_yield + 0.3 * scorer_yield`
- loss: `-reward * log π(action)`
- updates: gradients flow through `loss.backward()` into the GNN + scorer
- Eval edits
- sample through random KO/OE combinations
Notes
- we use BiGG’s graph model of E. coli as the input topology for the GNN (that’s what ‘built on’ means: the graph is passed as input to the GNN)
- we use the COBRA simulation of the BiGG model as the reward environment for RL
- the edit → yield logic came from the LASER pre-training
- we use it as a comparison to the predicted output of the model we trained
- solving the linear programs for COBRA is sequential, thus CPU-bound, and takes far longer than the GPU training (we went from 70% COBRA / 30% model in the reward to 10% COBRA / 90% model)
- we could parallelize the COBRA simulations but honestly this is just a practice project so no, we’re done
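The COBRA runs are independent per edit candidate, so a process pool would be the obvious parallelization. A minimal sketch, where `simulate_knockouts` is a deterministic stub standing in for a real "copy the model, apply knockouts, run FBA" call (the fork context assumes a Unix host):

```python
from multiprocessing import get_context

def simulate_knockouts(knockout_set):
    # Placeholder for: copy the COBRA model, knock out the reactions,
    # run model.optimize(), return the objective value. Here we fake a
    # deterministic "yield" from the reaction-id lengths.
    return sum(len(r) for r in knockout_set) / 10.0

def parallel_yields(candidate_sets, workers=2):
    # Each candidate knockout set is an independent LP, so they can be
    # farmed out to separate processes with no shared state.
    ctx = get_context("fork")  # fork context: fine on Linux/macOS servers
    with ctx.Pool(processes=workers) as pool:
        return pool.map(simulate_knockouts, candidate_sets)

if __name__ == "__main__":
    candidates = [("rxn_A",), ("rxn_B",), ("rxn_A", "rxn_B")]
    print(parallel_yields(candidates))  # [0.5, 0.5, 1.0]
```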
Math
How Frances learns to design better strains: one full forward + backward training step in the reinforcement learning loop
- GNN embedding: We start with a graph of the cell's metabolism (E. coli from BiGG). The GNN computes an embedding for each reaction, capturing how it's connected to the rest of the network: $h_r = \mathrm{GNN}(G)_r$
- MLP scorer: Each reaction’s embedding is passed to an MLP, which outputs a logit score, showing how promising it is to knock out that reaction: $z_r = \mathrm{MLP}(h_r)$
- Action sampling: We apply a softmax over the logits to get a probability distribution over all possible edits, $\pi(r) = e^{z_r} / \sum_j e^{z_j}$. In Frances, we sample 3 reaction knockouts from this distribution. These become the actions it takes.
- COBRA simulation: The selected knockouts are applied to a copy of the metabolic model, and COBRA simulates the result via FBA: $\max_v \, c^\top v$ subject to $Sv = 0$ and flux bounds, with knocked-out reactions constrained to $v_r = 0$. This returns a predicted yield (product output).
- Loss: We use the reward (yield), $R = 0.7\,y_{\text{COBRA}} + 0.3\,y_{\text{scorer}}$, to compute a REINFORCE loss, $\mathcal{L} = -R \sum_{i=1}^{3} \log \pi(a_i)$, which increases the probability of selecting edits that led to high yield.
- Backprop: The loss is backpropagated through the MLP scorer and GNN to improve the parameters: $\theta \leftarrow \theta - \eta \nabla_\theta \mathcal{L}$
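To make the loss step concrete, here is a self-contained numeric sketch of the REINFORCE gradient over toy logits (plain Python, no torch; the analytic gradient for $\mathcal{L} = -R \sum_i \log \pi(a_i)$ with a softmax policy is derived by hand here, not taken from the project code, and the logits/actions are invented):

```python
import math

def softmax(z):
    m = max(z)  # subtract max for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def reinforce_grad(logits, actions, reward):
    # dL/dz_k for L = -R * sum_i log pi(a_i), pi = softmax(logits).
    # Since d log pi(a)/dz_k = 1[k == a] - pi_k, summing over actions gives
    # dL/dz_k = -R * (count of k in actions - len(actions) * pi_k)
    pi = softmax(logits)
    n = len(actions)
    return [-reward * (actions.count(k) - n * pi[k]) for k in range(len(logits))]

logits = [2.0, 0.5, -1.0, 0.0]   # toy scores for 4 candidate knockouts
actions = [0, 1, 3]              # the 3 sampled knockouts
grad = reinforce_grad(logits, actions, reward=0.8)
print([round(g, 3) for g in grad])
# Sanity check: the gradient components sum to zero for any softmax policy.
assert abs(sum(grad)) < 1e-9
```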
Results
Training the RL agent with COBRA simulations takes too long on the CPU: we originally started with 70% COBRA and 30% model, then went to 10% COBRA and 90% model for the reward function. It was estimated at 3 days to train on a 24-thread 13th-gen Intel, vs. 3 hours on a $6/hour, 128-thread AWS CPU (about $18 total), but I’m not willing to spend money on this yet. I’m not knowledgeable enough yet to gauge what a good deployment setup is, but will be over the next few models. I increased the step count to 2000 and we are targeting training the whole agent in 1 hour (400 steps took 2.5 hours; training stalled due to a memory leak, fixed with gc.collect() and cache cleanup).
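The memory-leak fix amounts to periodic cleanup inside the training loop. A minimal sketch of that pattern (the loop body, step count, and cleanup interval are all hypothetical; with a GPU you would also clear the framework's cache, e.g. torch.cuda.empty_cache()):

```python
import gc

CLEANUP_INTERVAL = 50  # hypothetical; tune to how fast memory grows

def train_step(step):
    # Placeholder for one RL step (copy COBRA model, forward/backward pass).
    return {"step": step, "reward": -0.1}

def train(num_steps=200):
    history = []
    for step in range(num_steps):
        history.append(train_step(step))
        if step % CLEANUP_INTERVAL == 0:
            # Drop dangling references (e.g. copied COBRA models) and force
            # a collection pass to keep long runs from stalling.
            gc.collect()
    return history

print(len(train()))  # 200
```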
What’s Happening:
- early success: RL agent found perfect strategies early (reward = 1.000)
- exploration decay: As training progressed, the agent explored more random strategies
- negative rewards: Most recent strategies are getting very negative rewards (bad)
- comparison: random baseline (probably ~0.1) vs. RL average (−0.94), roughly −125% relative (very bad)
The Fix:
A lot: our RL agent needs a better exploration strategy, intensive reward-function tuning, learning-rate adjustment, and more stable training. Overall the whole architecture needs to be redone. But this is a project to learn from, and we’ll apply the lessons in the future.
Commercial Potential
Though this specific project is crude and poorly constructed, an ensemble of these models + capsid design would greatly increase the leverage of, or entirely replace, the jobs below:
Also, why aren’t there any libraries for parallelized COBRA simulations? It’s approaching 3 days for 10,000 simulations (which isn’t even that many). There is probably an opportunity in parallelizing these model simulations for generative synthetic-biology gene editing + yield prediction. It’ll be interesting to see fully vertically integrated organism + hardware feedback loops, with a model in the loop handling all the computation, like Tesla Autopilot.