Week 2 Homework

DNA Read, Write, and Edit

Part 1: Benchling & In-silico Gel Art

This week, we used Benchling to simulate restriction enzyme digests of E. coli phage Lambda DNA to make a design in a gel.

First, I went to the NCBI website to search for the Phage Lambda’s DNA.

image

After downloading it locally, I then uploaded it to Benchling in a new project. I highlighted the whole genome with Cmd + A, and selected the restriction enzymes (EcoRI, HindIII, BamHI, KpnI, EcoRV, SacI, SalI).

image
image

After doing that, I was pretty stuck on how to actually change them to make a smiley face.

image
image

So, I used ChatGPT to help. After running it twice, I realized that we are building the design over multiple runs. I think it would be ideal to have a neural network that does this.

image

So, I reduced the number of restriction enzymes to 3 just to see if there were fewer bands, ideally few enough to leave space for a smiley face. Obviously not. The next goal is to figure out how to get one band.

image
image

Let’s try one at a time.

image
image

Nice, it is working. Slot 2 is SwaI, slot 3 is PvuI, and slot 4 is NotI.

So what I am thinking is we can use SwaI and NotI as the eyes, then find a restriction enzyme whose bands sit lower. Let’s find one for the center of the mouth, which will sit between the eyes.

We will try:

PstI: often cuts around 4-5 kb

SpeI: can create a mid-range band

XhoI: sometimes makes a good middle cut

NdeI: can be an alternative

image

PstI is not good: it cuts in way too many spots.

image

SpeI is also not good: we would just use it for the eyes. A wider eye, I guess.

image

Ugh, same problem with XhoI.

image

NdeI is a step in the right direction, as there are fewer cuts, but still not one solely in the middle.

After trying 3 more (MluI, BsrGI, ApaLI), I needed a new strategy, as opposed to brute force.

We need one clean cut in the 3-5 kb range.

DraI (Common 4-5 kb cutter)

BglII (Mid-range cut)

BspEI (Cuts around 3-5 kb)
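Why a lone 3-5 kb band is hard to get can be sketched in a few lines of Python. Phage Lambda DNA is linear, so n cuts give n + 1 fragments: even a single clean cut produces two bands, the small one and its large partner. The sketch below is a toy model under stated assumptions: the function names and the 26-letter test sequence are made up for illustration, and it only searches the top strand (enough for palindromic sites like EcoRI and BamHI). It is not how Benchling computes its virtual digest.

```python
def cut_positions(seq, site, cut_offset):
    """Return positions where an enzyme cuts the top strand of a linear sequence.

    site       -- recognition sequence, e.g. "GAATTC" for EcoRI
    cut_offset -- bases from the start of the site to the cut (1 for G^AATTC)
    """
    seq, site = seq.upper(), site.upper()
    positions = []
    start = seq.find(site)
    while start != -1:
        positions.append(start + cut_offset)
        start = seq.find(site, start + 1)  # keep scanning for further sites
    return positions


def fragment_lengths(seq_len, cuts):
    """Lengths of the fragments a linear molecule falls into after the given cuts."""
    edges = [0] + sorted(set(cuts)) + [seq_len]
    return [right - left for left, right in zip(edges, edges[1:])]


# Hypothetical toy sequence with one EcoRI site and one BamHI site.
toy = "AAAA" + "GAATTC" + "CCCCCCCC" + "GGATCC" + "TT"

ecori_cuts = cut_positions(toy, "GAATTC", 1)   # EcoRI: G^AATTC
bamhi_cuts = cut_positions(toy, "GGATCC", 1)   # BamHI: G^GATCC

print(fragment_lengths(len(toy), ecori_cuts))               # [5, 21]
print(fragment_lengths(len(toy), ecori_cuts + bamhi_cuts))  # [5, 14, 7]
```

One cut gives two fragments and two cuts give three, so the number of bands in a lane is always one more than the number of sites the enzyme finds.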

image

Not working. Let’s start over with our new knowledge and at least get the eyes.

image

Simple. They work. Now, we’re going to do some genetic engineering + manipulation to make the biology adhere to our human image. We want a smiley face, so we will cut the DNA to make our desired image.

image

We’ll start with BamHI, then remove sites from the bottom up, until we have the spectrum of depth for one side of the smiley face. We will simplify and limit the restriction enzymes to just NotI for the eyes, and BamHI + sequence modification for the smile.

We want to remove the EcoRV recognition sequence, so we will mutate the site so EcoRV no longer recognizes it. EcoRV recognizes GAT^ATC and cuts between the T and the A.

Mutation Strategy: Change GATATC → GATTTC

image
image

I noticed that there were 2 sequences, which I learned are the sub-sequences that EcoRV recognizes.

image

After editing, the EcoRV tag disappears.
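The GATATC → GATTTC edit can be checked with a small sketch. EcoRV's site is palindromic (it reads the same on both strands), so one base change removes it from both strands at once. The helper names and the 13-letter toy sequence are assumptions for illustration only. The sketch also does not check whether the edit changes an encoded amino acid or creates a site for some other enzyme.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def has_site(seq, site):
    """True if the recognition site occurs on either strand of seq."""
    seq, site = seq.upper(), site.upper()
    return site in seq or reverse_complement(site) in seq


ECORV = "GATATC"

# Palindromic: the site equals its own reverse complement.
print(reverse_complement(ECORV) == ECORV)   # True

toy = "CCGGATATCGGAA"                        # hypothetical fragment with one EcoRV site
mutated = toy.replace("GATATC", "GATTTC")    # the single-base edit from the text

print(has_site(toy, ECORV))      # True
print(has_site(mutated, ECORV))  # False
```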

image

We rerun with BamHI, with significantly shorter

After realizing that BamHI and EcoRV are different enzymes, and that my EcoRV edits in the genome were at different sites from the BamHI ones, I just kept going to see how it worked. After getting the hang of it, I saw how editing the genome changed the digests. Apparently, Benchling does not track state between runs: it sets up each enzyme per run and runs them in parallel on the newly modified genome.

image

…Due to time constraints, I needed to move on to the 3rd part of the homework. I will research better strategies for making them on YouTube.

image

…to be continued

Part 3: DNA Design Challenge

3.1 Choose your protein.

  1. Botulinum Neurotoxin A (Botox)
    1. Why: it is the most toxic known protein to humans, yet we use it in cosmetics, facial injections, and migraine treatment. Humans are wild.
    2. Function: cleaves SNARE proteins to block neurotransmitter release.
    3. Size: 1,296 amino acids, 3,888 letters
    4. Source: UniProt (see below for protein sequence)
  2. GLP-1 (peptide hormone)
    1. Why: I chose this to understand its regulatory elements. It is clearly a popular drug, but it doesn’t solve the root problem (like most drugs). I want to understand how it works to suppress appetite.
    2. Function: increases insulin secretion and slows gastric emptying.
    3. Size: ~31 amino acids, 90-93 letters (the UniProt entry below is the full pro-glucagon precursor, which reverse translates to 540 letters)
    4. Source: UniProt (see below for protein sequence)

Botulinum Neurotoxin A (Botox) protein sequence

>sp|P0DPI0|BXA1_CLOBO Botulinum neurotoxin type A OS=Clostridium botulinum OX=1491 GN=botA PE=1 SV=1
MPFVNKQFNYKDPVNGVDIAYIKIPNVGQMQPVKAFKIHNKIWVIPERDTFTNPEEGDLNPPPEAKQVPVSYYDSTYLSTDNEKDNYLKGVTKLFERIYSTDLGRMLLTSIVRGIPFWGGSTIDTELKVIDTNCINVIQPDGSYRSEELNLVIIGPSADIIQFECKSFGHEVLNLTRNGYGSTQYIRFSPDFTFGFEESLEVDTNPLLGAGKFATDPAVTLAHELIHAGHRLYGIAINPNRVFKVNTNAYYEMSGLEVSFEELRTFGGHDAKFIDSLQENEFRLYYYNKFKDIASTLNKAKSIVGTTASLQYMKNVFKEKYLLSEDTSGKFSVDKLKFDKLYKMLTEIYTEDNFVKFFKVLNRKTYLNFDKAVFKINIVPKVNYTIYDGFNLRNTNLAANFNGQNTEINNMNFTKLKNFTGLFEFYKLLCVRGIITSKTKSLDKGYNKALNDLCIKVNNWDLFFSPSEDNFTNDLNKGEEITSDTNIEAAEENISLDLIQQYYLTFNFDNEPENISIENLSSDIIGQLELMPNIERFPNGKKYELDKYTMFHYLRAQEFEHGKSRIALTNSVNEALLNPSRVYTFFSSDYVKKVNKATEAAMFLGWVEQLVYDFTDETSEVSTTDKIADITIIIPYIGPALNIGNMLYKDDFVGALIFSGAVILLEFIPEIAIPVLGTFALVSYIANKVLTVQTIDNALSKRNEKWDEVYKYIVTNWLAKVNTQIDLIRKKMKEALENQAEATKAIINYQYNQYTEEEKNNINFNIDDLSSKLNESINKAMININKFLNQCSVSYLMNSMIPYGVKRLEDFDASLKDALLKYIYDNRGTLIGQVDRLKDKVNNTLSTDIPFQLSKYVDNQRLLSTFTEYIKNIINTSILNLRYESNHLIDLSRYASKINIGSKVNFDPIDKNQIQLFNLESSKIEVILKNAIVYNSMYENFSTSFWIRIPKYFNSISLNNEYTIINCMENNSGWKVSLNYGEIIWTLQDTQEIKQRVVFKYSQMINISDYINRWIFVTITNNRLNNSKIYINGRLIDQKPISNLGNIHASNNIMFKLDGCRDTHRYIWIKYFNLFDKELNEKEIKDLYDNQSNSGILKDFWGDYLQYDKPYYMLNLYDPNKYVDVNNVGIRGYMYLKGPRGSVMTTNIYLNSSLYRGTKFIIKKYASGNKDNIVRNNDRVYINVVVKNKEYRLATNASQAGVEKILSALEIPDVGNLSQVVVMKSKNDQGITNKCKMNLQDNNGNDIGFIGFHQFNNIAKLVASNWYNRQIERSSRTLGCSWEFIPVDDGWGERPL

GLP-1 protein sequence

>sp|P01275|GLUC_HUMAN Pro-glucagon OS=Homo sapiens OX=9606 GN=GCG PE=1 SV=3
MKSIYFVAGLFVMLVQGSWQRSLQDTEEKSRSFSASQADPLSDPDQMNEDKRHSQGTFTSDYSKYLDSRRAQDFVQWLMNTKRNRNNIAKRHDEFERHAEGTFTSDVSSYLEGQAAKEFIAWLVKGRGRRDFPEEVAIVEELGRRHADGSFSDEMNTILDNLAARDFINWLIQTKITDRK

3.2. Reverse Translate: Protein (amino acid) sequence to DNA (nucleotide) sequence.

Botulinum Neurotoxin A (Botox) reverse translate

GLP-1 reverse translate

reverse translation of sp|P0DPI0|BXA1_CLOBO Botulinum neurotoxin type A OS=Clostridium botulinum OX=1491 GN=botA PE=1 SV=1 to a 3888 base sequence of most likely codons.
atgccgtttgtgaacaaacagtttaactataaagatccggtgaacggcgtggatattgcgtatattaaaattccgaacgtgggccagatgcagccggtgaaagcgtttaaaattcataacaaaatttgggtgattccggaacgcgatacctttaccaacccggaagaaggcgatctgaacccgccgccggaagcgaaacaggtgccggtgagctattatgatagcacctatctgagcaccgataacgaaaaagataactatctgaaaggcgtgaccaaactgtttgaacgcatttatagcaccgatctgggccgcatgctgctgaccagcattgtgcgcggcattccgttttggggcggcagcaccattgataccgaactgaaagtgattgataccaactgcattaacgtgattcagccggatggcagctatcgcagcgaagaactgaacctggtgattattggcccgagcgcggatattattcagtttgaatgcaaaagctttggccatgaagtgctgaacctgacccgcaacggctatggcagcacccagtatattcgctttagcccggattttacctttggctttgaagaaagcctggaagtggataccaacccgctgctgggcgcgggcaaatttgcgaccgatccggcggtgaccctggcgcatgaactgattcatgcgggccatcgcctgtatggcattgcgattaacccgaaccgcgtgtttaaagtgaacaccaacgcgtattatgaaatgagcggcctggaagtgagctttgaagaactgcgcacctttggcggccatgatgcgaaatttattgatagcctgcaggaaaacgaatttcgcctgtattattataacaaatttaaagatattgcgagcaccctgaacaaagcgaaaagcattgtgggcaccaccgcgagcctgcagtatatgaaaaacgtgtttaaagaaaaatatctgctgagcgaagataccagcggcaaatttagcgtggataaactgaaatttgataaactgtataaaatgctgaccgaaatttataccgaagataactttgtgaaattttttaaagtgctgaaccgcaaaacctatctgaactttgataaagcggtgtttaaaattaacattgtgccgaaagtgaactataccatttatgatggctttaacctgcgcaacaccaacctggcggcgaactttaacggccagaacaccgaaattaacaacatgaactttaccaaactgaaaaactttaccggcctgtttgaattttataaactgctgtgcgtgcgcggcattattaccagcaaaaccaaaagcctggataaaggctataacaaagcgctgaacgatctgtgcattaaagtgaacaactgggatctgttttttagcccgagcgaagataactttaccaacgatctgaacaaaggcgaagaaattaccagcgataccaacattgaagcggcggaagaaaacattagcctggatctgattcagcagtattatctgacctttaactttgataacgaaccggaaaacattagcattgaaaacctgagcagcgatattattggccagctggaactgatgccgaacattgaacgctttccgaacggcaaaaaatatgaactggataaatataccatgtttcattatctgcgcgcgcaggaatttgaacatggcaaaagccgcattgcgctgaccaacagcgtgaacgaagcgctgctgaacccgagccgcgtgtataccttttttagcagcgattatgtgaaaaaagtgaacaaagcgaccgaagcggcgatgtttctgggctgggtggaacagctggtgtatgattttaccgatgaaaccagcgaagtgagcaccaccgataaaattgcggatattaccattattattccgtatattggcccggcgctgaacattggcaacatgctgtataaagatgattttgtgggcgcgctgatttttagcggcgcggtgattctgctggaatt
tattccggaaattgcgattccggtgctgggcacctttgcgctggtgagctatattgcgaacaaagtgctgaccgtgcagaccattgataacgcgctgagcaaacgcaacgaaaaatgggatgaagtgtataaatatattgtgaccaactggctggcgaaagtgaacacccagattgatctgattcgcaaaaaaatgaaagaagcgctggaaaaccaggcggaagcgaccaaagcgattattaactatcagtataaccagtataccgaagaagaaaaaaacaacattaactttaacattgatgatctgagcagcaaactgaacgaaagcattaacaaagcgatgattaacattaacaaatttctgaaccagtgcagcgtgagctatctgatgaacagcatgattccgtatggcgtgaaacgcctggaagattttgatgcgagcctgaaagatgcgctgctgaaatatatttatgataaccgcggcaccctgattggccaggtggatcgcctgaaagataaagtgaacaacaccctgagcaccgatattccgtttcagctgagcaaatatgtggataaccagcgcctgctgagcacctttaccgaatatattaaaaacattattaacaccagcattctgaacctgcgctatgaaagcaaccatctgattgatctgagccgctatgcgagcaaaattaacattggcagcaaagtgaactttgatccgattgataaaaaccagattcagctgtttaacctggaaagcagcaaaattgaagtgattctgaaaaacgcgattgtgtataacagcatgtatgaaaactttagcaccagcttttggattcgcattccgaaatattttaacagcattagcctgaacaacgaatataccattattaactgcatggaaaacaacagcggctggaaagtgagcctgaactatggcgaaattatttggaccctgcaggatacccaggaaattaaacagcgcgtggtgtttaaatatagccagatgattaacattagcgattatattaaccgctggatttttgtgaccattaccaacaaccgcctgaacaacagcaaaatttatattaacggccgcctgattgatcagaaaccgattagcaacctgggcaacattcatgcgagcaacaacattatgtttaaactggatggctgccgcgatacccatcgctatatttggattaaatattttaacctgtttgataaagaactgaacgaaaaagaaattaaagatctgtatgataaccagagcaacagcggcattctgaaagatttttggggcgattatctgcagtatgataaaccgtattatatgctgaacctgtatgatccgaacaaatatgtggatgtgaacaacgtgggcattcgcggctatatgtatctgaaaggcccgcgcggcagcgtgatgaccaccaacatttatctgaacagcagcctgtatcgcggcaccaaatttattattaaaaaatatgcgagcggcaacaaagataacattgtgcgcaacaacgatcgcgtgtatattaacgtggtggtgaaaaacaaagaatatcgcctggcgaccaacgcgagccaggcgggcgtggaaaaaattctgagcgcgctggaaattccggatgtgggcaacctgagccaggtggtggtgatgaaaagcaaaaacgatcagggcattaccaacaaatgcaaaatgaacctgcaggataacaacggcaacgatattggctttattggctttcatcagtttaacaacattgcgaaactggtggcgagcaactggtataaccgccagattgaacgcagcagccgcaccctgggctgcagctgggaatttattccggtggatgatggctggggcgaacgcccgctg
reverse translation of sp|P01275|GLUC_HUMAN Pro-glucagon OS=Homo sapiens OX=9606 GN=GCG PE=1 SV=3 to a 540 base sequence of most likely codons.
atgaaaagcatttattttgtggcgggcctgtttgtgatgctggtgcagggcagctggcagcgcagcctgcaggataccgaagaaaaaagccgcagctttagcgcgagccaggcggatccgctgagcgatccggatcagatgaacgaagataaacgccatagccagggcacctttaccagcgattatagcaaatatctggatagccgccgcgcgcaggattttgtgcagtggctgatgaacaccaaacgcaaccgcaacaacattgcgaaacgccatgatgaatttgaacgccatgcggaaggcacctttaccagcgatgtgagcagctatctggaaggccaggcggcgaaagaatttattgcgtggctggtgaaaggccgcggccgccgcgattttccggaagaagtggcgattgtggaagaactgggccgccgccatgcggatggcagctttagcgatgaaatgaacaccattctggataacctggcggcgcgcgattttattaactggctgattcagaccaaaattaccgatcgcaaa
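Reverse translation to "most likely codons" amounts to a lookup table: each amino acid maps to exactly one codon. Here is a minimal sketch. The table is read off the tool output above (for example M → atg, P → ccg, F → ttt), and the function name is made up for illustration. How the tool itself picks codons is an assumption, not something shown in this write-up.

```python
# One codon per amino acid, as observed in the reverse-translation output above.
MOST_LIKELY_CODON = {
    "A": "gcg", "C": "tgc", "D": "gat", "E": "gaa", "F": "ttt",
    "G": "ggc", "H": "cat", "I": "att", "K": "aaa", "L": "ctg",
    "M": "atg", "N": "aac", "P": "ccg", "Q": "cag", "R": "cgc",
    "S": "agc", "T": "acc", "V": "gtg", "W": "tgg", "Y": "tat",
}


def reverse_translate(protein):
    """Map each amino acid to its single 'most likely' codon."""
    return "".join(MOST_LIKELY_CODON[aa] for aa in protein.upper())


# First six residues of the Botox sequence above.
print(reverse_translate("MPFVNK"))  # atgccgtttgtgaacaaa
```

Every amino acid becomes three letters, which is why the 1,296-residue protein gives a 3,888-base sequence. Because most amino acids have several synonymous codons, this is only one of many DNA sequences that encode the same protein.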

3.3 Codon optimization.

Codon optimization swaps in the codons the host cell prefers, which lets you tune how much of the desired protein the cell produces. You are essentially optimizing the manufacturing part of the cell. I chose E. coli because that is what the Peptide 2.0 website defaults to, although there is definitely opportunity to choose something better. I also think there is opportunity to create a neural network for optimizing the cell <> protein output yield. //TODO: optimize cell choice based on output protein, input cost, and quality

Cell → DNA sequence→ corresponding mRNA → specific production site → protein

Optimizations are done with VectorBuilder.

Botulinum Neurotoxin A (Botox)

Improved DNA[1]: GC=78.31%, CAI=0.87

>BOTULINUM_NEUROTOXIN_A
GCGACCGGTTGCTGCGGCACCACCACCGGTACCGGTGCGGCCTGTGCAGCGGCCTGCGCGGGTACCACCACCGCAGCCTGTACGGCCACCGCCGCGGCAGGTGCCACGTGCTGCGGCGGCACCGGTGCCGCCTGCGGCGGCTGCGGCACCGGCGGCGCCACCGCGACCACTGGCTGCGGCACCGCGACCGCCACCACCGCGGCCGCCGCGACCACCTGCTGCGGCGCGGCGTGTGGCACCGGCGGCGGTTGCTGTGCCGGCGCGACAGGCTGTGCGGGCTGCTGCGGCGGCACCGGCGCAGCCGCGGGCTGCGGCACCACTACCGCCGCAGCGGCGACGACCTGTGCGACCGCGGCCTGCGCGGCGGCGGCGACCACCACCGGTGGCGGCACCGGTGCGACCACCTGCTGTGGCGGCGCGGCGTGTGGCTGCGGCGCCACCGCCTGCTGCACCACCACCGCGTGCTGCGCCGCCTGTTGTTGCGGCGGCGCCGCGGGCGCGGCGGGCGGTTGCGGCGCGACCTGTACCGGCGCGGCCTGCTGTTGCGGTTGTTGCGGCTGCTGCGGCGGTGCAGCGGGTTGCGGCGCGGCCGCGTGTGCCGGCGGCACCGGTTGCTGCGGCGGTACCGGCGCGGGTTGTACTGCGACCACCGCGACCGGCGCGACCGCCGGCTGCGCGTGCTGTACCGCAACCTGCACCGGTGCGGGCTGCGCGTGTTGCGGCGCGACCGCGGCGTGCGGCGCGGCGGCGGCAGCGGGCGCGACCGCGGCGTGCACGGCCACCTGCACCGGTGCGGCGGCGGGCGGCTGCGGCACCGGTGCATGCTGTGCGGCCGCATGTGCGGGTACCACCACTGGTGCGGCCTGCGGCTGTGCCACGACGACCGCGACCGCGGGCTGTGCCTGCTGCGGCGCGACGTGCACCGGCGGTGGTTGCTGTGGCTGCGCGACCGGCTGTACCGGATGCACCGGCGCGTGCTGTGCGGGCTGCGCCACCACAGGCACCGGCTGCGGCTGCGGTGGCTGTGCCACCACCTGCTGCGGCACCACCACCACCGGCGGCGGTGGCTGCGGTGGTTGTGCGGGCTGTGCGTGTTGCGCGACCACCGGCGCTACCGCGTGCTGTGGTGCGGCCGGCACCGGCGCCGCGGCGGGCACCGGCGCGACCACCGGTGCGACCGCCTGCTGCGCGGCGTGCACCGGCTGCGCGACGACCGCGGCCTGCGGTACCGGCGCCACCACCTGCGCAGGCTGTTGTGGTGGCGCCACCGGTGGCTGCGCGGGCTGCACCGCGACCTGTGGCTGCGCCGGCTGCGGCGCCGCGGGCGCGGCGTGCACCGGCGCGGCGTGCTGCACCGGCGGCACCGGCGCCACAACCGCGACCACCGGCGGCTGCTGCTGCGGCGCAGGCTGCGGCTGCGGCGGAGCCACGGCGACCACCGCGACCACCTGCGCAGGTACCACTACCGGCGCGGCGACCGGCTGTGCGGCGGCCGCTGGTTGCACCACCACCGGAGGCTGTTGCGCAACCGGTGCCGCGGGCACCGGCTGTACGGGTGCCGCCTGCTGCACCGGCGCCTGCTGTTGTGGTTGTGCCGCCTGTGGTGGCTGCACTGCGACCGGCGGCTGCGCGGGCTGCGCGTGCTGCTGCGCGGGTACCGCCACCGCGACCACTTGCGGCTGTACCACCACCGCAGGCTGCTGCTGTGGCGGCGCAACCACCACCACCGCCTGTTGCACCACCACGGGCGGTTGCACCACCACCGGCGCAGCCGGCGCCGCCGCCGGCTGCTGCACCGGCGGTGCCGCAGGCACCGGCGGCGCAACTGCCTGTTGCGCGGCCTGCTGCTGCGGCTGCACCGGCTGCACGGGTGGCGGCTGCGGCTGTGGCGGCGGCTGTGCAGCGGCAACCACTACCGGCTGCGGCGCATGCTGTGGCGCCACGTGCTGTGGTGGTTGTGGCGGCACCGGCGCGTGCTGTTGCACGGGTGGTTGCGGCTG
CGCCACCGGTGCGGCGTGCACCGGAGCGACCACCTGTGCGACCGGCTGCGGCGGCGGCTGCTGCGCGACCTGCGGCTGCTGCACCGGTACCGCGACCGGTGGTTGCGCCACCACCGGTTGCGGCGCCACCACCGCGGCGTGCTGCTGCGGTGCCGCGTGCTGCGGGTGCGGCACCGGCACCACCACCGCGGCAGCGGGTACCGGCGCGGCGTGCGCGTGCTGCGCGGCCTGTGGCTGCGGCACCGCCACCACGGCGACCGGCGCGGCGGCCACGGGTGCGGGTTGCGGCGGCTGCTGCACCGGCGGTGCCGCGGGTACCGGCGCGGGCTGTACCACCACCGGCGCAGCCGGTGCGGCGTGCACCGGCTGCGGTTGTGCGTGTTGTACCACCACCGGCGGCTGCGGCGGTTGCTGCGCGACCGGCGCGACCGGCTGCGGCGCGGCGGCGACCACCACCGCCACAACCGGCGCGACCGCGGGTTGCTGTACCGGCTGCGCAGGTGGCGCGGCGGCAGCGTGCGGCGCGGCCACCACCACCTGCGGCTGCTGTACCGGCACCGCCACCACCGCCACTACGGCCACCGCTGCCTGCGCGGCAGCGACCACCACCGCCGCGGCGGGCGCGACCGCCACCACCGGTTGCGGTGCGGGCTGTGCCTGTTGCTGCACCGGCGCAGCCTGTGCCGCAGCGGGTTGTGGCGCGGCCGCCGCCGGCTGCGCAACTACGGGCACGGGCGGCGGCTGCGCGTGTTGCGCCTGCTGCGGTTGCGGAGCCGGCTGCTGCACCGGCTGCGCGGGCACCGCCACCGCGACTGGCGCGGCCGCAGCCGCCTGCGGCACCGGCACCACCACCGCGGCGGCGGGCGCGGCGGCCGCAGCCACGGCGACCTGTACTGGCTGCACCGGTGCGGGTTGTGGTGCGGCGGGCGCAACCGCGTGCTGCGCAGGCTGCGGCGGTTGCGCGGCGGCGACAACCACCGCGGGCTGCGGCACCGGCGGCGCGACCGCCGCCGCGTGTACCGGTGCGGCGGCCACCACCACTGGCGCGACCGCGGCGGCCTGTACGGGCACCGCCACCGCGGCGGCCGCCACCGGCTGCACCGGCGCATGCTGCGGCGCCGCGACGACCACCACCGCGACCGCCTGCTGCGGCGCGGCCGGCGCGACGGCGGCTTGTACCACCACTGGCACGGGCGCCGCGGCGACTACCACCACCACCACCGCGGCGGCGGGTACCGGCTGCACCGGCGCCGCGTGTTGTGGCTGTGCGGCGGCGGCATGTTGTACCGCCACCTGTACCGGCGCCGCGACCTGCACTGGTGCGGCGTGCGCGGCGGCTGGTGGTTGCGGCGCGGCGGCCTGCGGTGGTGCGGCCTGTGCCACCACCACCGCCACCACCGGCTGTGGCACCGGCGGCTGCACCGGCGGCGGCTGTGGTGCGGCGGCCGGTGGTTGCTGTGGCTGCGGCGGCTGTTGCGGCTGTTGTGGCTGCGGTGCCACCACCACCACCTGCTGTGGTGGCGCCGCCGGCGCGGCTGGCACCGGTGGCTGCGGCGCGACCACGGGCACCGGCGGCGCAGCGGGCGCCGCGTGCACGGGCGGTGGCTGTTGCGGCTGCTGCGGTTGCTGTGCCACCGGCTGCGGTGGTGCGACCGGGGGTTGCGCGGGTTGCACCACTACCGCGGGCTGTGGTGCAACCGGTGCGGCCGCCACCGGCGCGGCCTGTGCCTGTTGCGCGACGACTTGTACCGGTGGCGCAACCGCGGCGTGCTGTACCGGCGGCTGCGGCGGTTGCGGCTGCGGCTGCGGGGCCACCACCACCACCGCAACCACCGCGGCGTGCGGTGGCTGTACCGGCGCGACGACCTGCGCGGGCGCCTGCTGCGCAGCGGCGGCGACCACCGCATGCTGCGGCGCCACGTGCGGCTGCGCGGCAGCCTAA

GLP-1

>GLP1
GCAACTGGTGCGGCAGCCGCGGGCTGCGCGACCACAACCGCCACCACTACTACGGGCACCGGCGGCTGTGGCGGCGGTTGCTGTACCGGCACCACCACCGGCACTGGCGCCACCGGTTGTACCGGCGGCACCGGCTGTGCGGGCGGCGGTTGCGCGGGTTGCACTGGCGGCTGCGCCGGCTGTGGCTGCGCCGGCTGTTGTACCGGCTGTGCGGGTGGCGCGACCGCGTGCTGCGGTGCCGCCGGCGCGGCGGCGGCGGCGGCCGGCTGCTGCGGCTGCGCGGGCTGCACCACCACCGCGGGCTGTGGCTGCGGTGCGGGTTGTTGTGCAGGCGGCTGTGGCGGCGCAACGTGCTGTGGTTGCACCGGCGCCGGTTGTGGAGCGACTTGTTGTGGCGGCGCGACGTGTGCGGGTGCCACCGGCGCGGCCTGCGGCGCGGCGGGCGCGACCGCGGCGGCGTGCGGCTGCTGTGCGACTGCCGGTTGCTGTGCGGGCGGCGGTTGTGCGTGCTGTACCACCACCGCGTGTTGCGCCGGCTGCGGCGCGACCACCGCGACTGCCGGCTGTGCAGCGGCGACCGCGACCTGCACCGGCGGCGCCACCGCGGGCTGCTGTGGTTGTTGCGGCTGCGGTTGCGGCTGCGCGGGCGGCGCGACCACCACCACCGGCACGGGTTGCGCGGGTACCGGCGGTTGTACCGGTGCGACTGGTGCGGCGTGCGCGTGCTGTGCAGCGGCGTGCGGCTGCGCCGCCTGCTGCGGCTGCGCCGCGTGCGCGGCGTGCGCGACCACAGGCTGTGGCGCGGCGGCGTGTGGTTGCTGCGCCACCGGCGCCACCGGCGCCGCGACCACGACCGGCGCCGCGTGTGGCTGCTGTGCGACCGGCTGCGGTGGCGCGGCCGGCGGCTGCGCATGCTGCACCACCACCGCGTGCTGCGCGGGCTGTGGCGCGACCGGCACCGGTGCGGGTTGTGCCGGTTGCACCGCCACCTGCACCGGTGGCGCGGCGGGCGGCTGCTGCGCGGGCGGCTGCGGCGGCTGTGGCGCGGCGGCGGGTGCCGCCACCACCACCGCCACCACCGGCTGCGGTACCGGCGGCTGCACCGGCGGCACCGGTGCGGCCGCAGGCGGCTGCTGCGGCTGTGGCGGCTGCTGTGGTTGTTGCGGCTGCGGTGCGACGACCACCACCTGCTGCGGCGGCGCGGCCGGCGCCGCGGGCACCGGTGGCTGCGGCGCGACCACGGGCACCGGTGGCGCGGCCGGCGCGGCGTGTACCGGTGGCGGCTGCTGCGGCTGTTGCGGCTGTTGCGCGACGGGCTGTGGTGGCGCGACCGGCGGCTGTGCGGGCTGCACCACTACCGCGGGTTGCGGCGCAACCGGTGCCGCGGCGACCGGTGCGGCGTGCGCGTGCTGCGCGACCACCTGTACCGGCGGCGCGACTGCGGCGTGCTGCACCGGTGGCTGTGGCGGCTGTGGTTGCGGATGCGGCGCGACCACGACCACCGCCACCACGGCGGCGTGTGGCGGCTGCACCGGTGCGACCACCTGCGCGGGCGCCTGTTGCGCCGCGGCTGCGACCACTGCGTGTTGCGGCGCGACCTGTGGCTGTGCGGCGGCGTAA
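A sanity check worth running on any codon-optimized output is to translate it back and compare it with the starting protein, since optimization should change codons but never the amino acids. The sketch below uses the standard genetic code in plain Python. The three short strings are the first 6 residues / 18 letters copied from the Botox sequences above, and the function names are made up for illustration.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate DNA to protein, ignoring any trailing partial codon."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def gc_content(dna):
    """Fraction of G and C letters in a DNA string."""
    dna = dna.upper()
    return (dna.count("G") + dna.count("C")) / len(dna)


protein_start = "MPFVNK"
reverse_translated_start = "atgccgtttgtgaacaaa"
optimized_start = "GCGACCGGTTGCTGCGGC"

print(translate(reverse_translated_start))  # MPFVNK
print(translate(optimized_start))           # ATGCCG
```

The reverse-translated DNA round-trips to MPFVNK as expected, but the start of the optimized sequence translates to ATGCCG, which matches the first six DNA letters (atgccg) rather than the protein. One possible explanation, worth double-checking before using these sequences, is that the optimizer was given the reverse-translated DNA and read the letters A, T, G, and C as amino acids (Ala, Thr, Gly, Cys). That would also be consistent with the unusually high GC content reported above.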

3.4 You have a sequence! Now what?

Cell-dependent

This method uses living cells to transcribe DNA into mRNA and translate it into a functional protein. Botulinum Neurotoxin A (Botox) is typically produced in Clostridium botulinum, but apparently, recombinant versions can be expressed in safer bacterial or mammalian systems. GLP-1, on the other hand, is often produced in E. coli or yeast for industrial-scale peptide drug production.

The process for doing this is as follows:

  1. The codon-optimized gene is introduced into an expression vector (a plasmid) under the control of a strong promoter
  2. The vector is introduced into a host system (like E. coli, Saccharomyces cerevisiae, or mammalian cells like HEK293 or CHO)
  3. The host cell machinery transcribes the gene into mRNA and translates it into protein using its ribosomes
  4. Proteins fold and, in eukaryotic systems, undergo post-translational modifications
  5. The protein is extracted using affinity chromatography (His-tag or antibody-based purification).
    1. are there more ways?
  6. Mass spectrometry and/or Western blotting verifies the structure, purity, and function as a QA process

Cell-free

This method uses ribosomes and enzymes outside of living cells to directly produce protein.

The process is as follows:

  1. Lysates from E. coli, rabbit reticulocytes, or wheat germ provide ribosomes, tRNAs, and enzymes
  2. Linear or plasmid DNA or mRNA containing the gene of interest is added
  3. A coupled (enzyme-catalyzed) transcription-translation reaction converts DNA into mRNA and translates it into protein using ribosomes
  4. Proteins fold

*this is way more efficient and controllable, cell-free manufacturing + personalized cell QA testing seems like the best way forward

3.5 How does it work in nature/biological systems?

  1. A single gene can code for multiple proteins through mechanisms like alternative splicing, RNA editing (post-transcriptional), and alternative translation initiation.
  2. DNA sequence containing exons and introns (in eukaryotes) → introns are spliced out and U replaces T in mRNA → protein (3 nucleotides for one amino acid)
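The DNA → mRNA → codon steps above can be sketched as follows. This is a toy model under stated assumptions: it starts from the coding strand, skips splicing entirely (so it only fits intron-free sequences), and uses made-up function names.

```python
def transcribe(coding_dna):
    """Coding-strand DNA to mRNA: same letters, with U in place of T."""
    return coding_dna.upper().replace("T", "U")


def split_codons(mrna):
    """Group an mRNA into 3-letter codons, dropping any trailing partial codon."""
    usable = len(mrna) - len(mrna) % 3
    return [mrna[i:i + 3] for i in range(0, usable, 3)]


mrna = transcribe("atgccgttt")
print(mrna)                # AUGCCGUUU
print(split_codons(mrna))  # ['AUG', 'CCG', 'UUU']
```

Each codon in the list corresponds to one amino acid, which is the "3 nucleotides for one amino acid" rule.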

//TODO: actually align the 3 in Benchling

Part 4: DNA Read/Write/Edit

4.1 DNA Read

  1. What DNA would you want to sequence (e.g., read) and why? 
  2. I would sequence the full human microbiome and metagenome from diverse environments, including the human gut, every organ, wastewater, soil, and extreme environments. Ideally we could also digitize all input genomes from the food we consume, air inside + outside the home, animals in the home, all surfaces + their bacteria, and everyone that individual interacts with. The goal would be to build a humanity-level immunity network, to detect disease at the source and stop it. The aim is to shorten the gap between a disease emerging and humanity being exposed to it. Obviously, to do this, the details below are insufficient.

  3. In lecture, a variety of sequencing technologies were mentioned. What technology or technologies would you use to perform sequencing on your DNA and why?  Also answer the following questions:
    1. Oxford Nanopore (ONT) + PacBio HiFi hybrid sequencing because it combines ultra-long reads for genome assembly & structural variants (ONT) and high accuracy for single-molecule resolution (PacBio HiFi)
    2. Both are 3rd-generation technologies, so there is no need to do PCR amplification.

    3. What is your input? How do you prepare your input (e.g. fragmentation, adapter ligation, PCR)? List the essential steps.
      1. Sample collection + DNA/RNA extraction using Qiagen or Nanobind
      2. Fragmentation for shorter HiFi reads OR left intact for ONT ultra-long reads, depends on the sequence length from what source
      3. Adapter ligation to enzymatically attach sequencing adapters
      4. Prepare the library by inserting barcodes for multiplexing samples
      5. Load into the ONT/PacBio machines
    4. What are the essential steps of your chosen sequencing technology, how does it decode the bases of your DNA sample (base calling)?
      1. ONT
        1. DNA is fed through a biological nanopore
        2. Changes in electrical current detect nucleotide identity
        3. Neural-network basecallers (Bonito, Guppy) convert current fluctuations into sequence data
      2. PacBio HiFi
        1. Single-molecule real-time sequencing (ultra fine-grained), where a polymerase copies a circular DNA template
        2. Fluorescently labeled nucleotides emit light as they are incorporated
        3. HiFi reads are generated by sequencing the same molecule multiple times
    5. What is the output of your chosen sequencing technology?
      1. ONT
        1. Ultra-long reads (≤ 4 Mb)
        2. FAST5 → FASTQ
        3. Epigenetic modifications are directly detectable (methylation)
      2. PacBio HiFi
        1. Highly accurate (99.9%) for short to mid reads (10-25kb)
        2. Output is FASTQ / BAM / CCS (circular consensus sequencing)
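The idea behind HiFi reads, reading the same molecule several times and keeping the consensus, can be illustrated with a per-position majority vote. This is a deliberately simplified sketch: it assumes the passes are already aligned and of equal length, whereas the real circular consensus pipeline aligns passes and uses a statistical model. The toy reads and function names are made up. The second function converts an accuracy into a Phred quality score using the standard definition Q = -10 · log10(error probability), the scale FASTQ files use for per-base qualities.

```python
import math
from collections import Counter


def consensus(passes):
    """Majority vote at each position across equal-length, pre-aligned passes."""
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*passes))


def phred(accuracy):
    """Phred quality score for a given per-base accuracy (0 < accuracy < 1)."""
    return -10 * math.log10(1 - accuracy)


# Three hypothetical passes over the same molecule, each with at most one error.
passes = ["ACGTTAGC", "ACGTAAGC", "ACCTTAGC"]

print(consensus(passes))    # ACGTTAGC
print(round(phred(0.999)))  # 30
```

Errors that occur at different positions in different passes are outvoted, which is why repeated passes push accuracy up. An accuracy of 99.9% corresponds to roughly Q30.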

4.2 DNA Write

  1. What DNA would you want to synthesize (e.g., write) and why? Something that resembles a computer: a sensor, a genetic circuit for computation on that input, and a response. This circuit would function as a programmable sensor inside human cells, capable of detecting and responding to specific biomarkers (inflammatory cytokines, cancer markers, environmental toxins). The core elements include:
    1. Sensor module: a synthetic promoter that activates transcription in response to specific molecules (inflammatory cytokines like IL-6 or TNF-α)
    2. Processing module: a CRISPR-based logic gate (dCas9-based transcriptional control) that enables Boolean computation (AND, OR, NOT gates)
    3. Response module: a genetically encoded therapeutic output, such as an anti-inflammatory peptide, apoptosis inducer (for cancer), or fluorescent reporter for tracking
    4. Promoter -> IL-6 response element -> dCas9 + sgRNA (targeting repressor) -> GFP Reporter  
  2. What technology or technologies would you use to perform this DNA synthesis and why?
    1. Twist Bioscience DNA Synthesis (silicon-chip-based DNA writing)
    2. Cell-free TxTL (transcription-translation) for rapid testing
    3. Golden Gate & Gibson Assembly for assembly of large constructs (or something newer, not familiar enough yet with the algorithms)
      1. What are the essential steps of your chosen sequencing methods?
        1. Oligos are synthesized on silicon chips using phosphoramidite chemistry, so thousands can be synthesized in parallel
        2. Small oligos are error-corrected and assembled using ligation or PCR
        3. Assembled DNA is cloned into vectors
        4. Synthesized DNA is delivered as plasmids or linear DNA, ready for direct transformation into cells or cell-free systems
      2. What are the limitations of your sequencing method (if any) in terms of speed, accuracy, scalability?
        1. Speed: oligo synthesis + assembly takes 1-2 weeks, nowhere near real-time
        2. Scalability: limited to 300kb per construct, requiring modular assembly for larger genomes
        3. Accuracy: error rate = 1:10,000 bases, requiring sequencing verification (NGS, Sanger)
        4. Cost: expensive for long constructs, but scalable for small genetic circuits
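To see why the 1:10,000 error rate forces sequencing verification, here is a rough back-of-the-envelope sketch. It assumes errors are independent and equally likely at every base, which is a simplification, and the function name is made up. Under that assumption the chance that a construct of n bases is entirely error-free is (1 - 1/10,000)^n.

```python
def error_free_probability(length_bp, error_rate=1 / 10_000):
    """Probability that every base is correct, assuming independent errors."""
    return (1 - error_rate) ** length_bp


for length in (1_000, 10_000, 100_000):
    p = error_free_probability(length)
    print(f"{length:>7} bp: {p:.1%} of molecules expected to be error-free")
```

Under these assumptions a 1 kb part is error-free about 90% of the time, a 10 kb construct only about 37% of the time, and a 100 kb construct almost never, so longer constructs need clone screening or error correction.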

4.3 DNA Edit

  1.  What DNA would you want to edit and why? I would edit human immune system genes to enhance disease resistance and longevity. I constantly notice friends and family members developing allergies and inflammation, including myself. Some genes from a quick search:
    1. CCR5 Δ32 mutation: deleting CCR5 for HIV immunity and viral resistance
    2. PCSK9 knockout: disrupting PCSK9 to lower LDL cholesterol, preventing cardiovascular disease, the #1 cause of death globally
    3. MYC and TP53 regulation: fine-tuning expression of these genes to increase regenerative capacity while suppressing cancer (math problem)
    4. Mitochondrial gene editing: correcting mutations associated with aging and metabolic diseases in the mitochondria, which I believe are the engine of life
    5. Enhanced DNA repair mechanisms: increasing expression of genes like SIRT6, FOXO3, and GADD45 to extend lifespan by reducing genomic instability. It would be a positive for humanity if we could systematically accelerate healing
  2. What technology or technologies would you use to perform these DNA edits and why? Also answer the following questions:
    1. Prime Editing: For precise, efficient edits without double-strand breaks
    2. Base Editing: For single-letter mutations with ultra-high accuracy
    3. CRISPR-Cas9 & Cas12a: for targeted gene knockouts and insertions
      1. How does your technology of choice edit DNA? What are the essential steps?
        1. Prime Editing (human gene therapy)
          1. Uses a Cas9 nickase fused to a reverse transcriptase
          2. A pegRNA encodes the desired edit
          3. RT writes the new DNA directly into the target site without causing double-stranded breaks
          4. No need for donor DNA templates, reducing error rates
        2. Base Editing (SNPs, fixing disease mutations)
          1. Uses a Cas9 nickase fused to a deaminase enzyme
          2. Converts C → T or A → G without double-stranded breaks
          3. High accuracy, low off-target effects, minimal errors
        3. CRISPR-Cas9 & Cas12a (disrupting disease-causing mutations)
          1. Cas9 makes double-stranded breaks, enabling knockouts and insertions via HDR
          2. Cas12a has higher specificity and leaves staggered cuts, improving large DNA insertions
    4. What preparation do you need to do (e.g. design steps) and what is the input (e.g. DNA template, enzymes, plasmids, primers, guides, cells) for the editing?
      1. Gene target selection: identify mutation sites using NGS + clinical data
      2. Guide RNA design: generate pegRNAs and sgRNAs
      3. Delivery method selection: viral (AAV, lentivirus) or non-viral (lipid nanoparticles, electroporation) depending on the cell
    5. What are the limitations of your editing methods (if any) in terms of efficiency or precision?
      1. Prime editing: low efficiency (~5-50%) in some cell types, delivery still being optimized
      2. Base editing: only allows C→T and A→G changes, limiting its use for all mutations
      3. CRISPR-Cas9: higher risk of off-target mutations and requires a repair template for precise insertions
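The guide RNA design step can be sketched as a search for candidate target sites. The sketch assumes the commonly used SpCas9 enzyme, which needs a 20-nucleotide target followed directly by an NGG PAM. It only scans the top strand and does no off-target or efficiency scoring, both of which real design tools handle. The function name and toy sequence are made up for illustration.

```python
import re


def find_spcas9_targets(seq):
    """Return (start, protospacer, PAM) for every 20-nt target followed by NGG.

    Only the top strand is scanned; a lookahead is used so that
    overlapping candidate sites are all reported.
    """
    seq = seq.upper()
    pattern = re.compile(r"(?=([ACGT]{20})([ACGT]GG))")
    return [(m.start(), m.group(1), m.group(2)) for m in pattern.finditer(seq)]


# Hypothetical 28-letter sequence containing a single TGG PAM.
toy = "TT" + "GACCTGAAGCTGATCCGTAC" + "TGG" + "CAT"

for start, protospacer, pam in find_spcas9_targets(toy):
    print(start, protospacer, pam)  # 2 GACCTGAAGCTGATCCGTAC TGG
```

The 20-letter protospacer is what would go into the sgRNA spacer. A pegRNA for prime editing adds further elements on top of this, which the sketch does not cover.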

Notes

  • large molecules need models to interact with them, or specialized hardware + software to move them around more easily
  • codon optimization is used for maximizing the output protein yield
    • cell → DNA sequence→ corresponding mRNA → specific production site → protein
    • deep learning codon optimization: Codon optimization with deep learning to enhance protein expression
      • RNN variant that learns contextual dependencies between DNA sequences to predict codon sequences, optimizes for consistency
      • 17-40 hours of training (why the variance?)
      • test accuracy: 52%
      • Codon Adaptation Index of ~0.98, outperforming Genewiz (0.83) and ThermoFisher (0.93)
      • codon box encoding: groups synonymous codons into sets based on their nucleotide composition
  • sequence funnel: protein → protein sequence → reverse translation → codon optimization → organism
  • T7 and SP6 RNA polymerases: DNA-dependent RNA polymerases derived from bacteriophages T7 and SP6
    • they recognize specific promoter sequences (T7 or SP6) and transcribe RNA with high specificity and efficiency
    • widely used in in vitro transcription systems for synthesizing mRNA, ribozymes, and RNA probes. T7 RNA polymerase is more common due to its high processivity and fast transcription rate
  • DNA synthesis is mainly for shorter sequences: lots of opportunity to expand industrialization
  • codon optimization- optimize your sequence for the specific host that you are using