DNA Read, Write, and Edit
- Part 1: Benchling & In-silico Gel Art
- Part 3: DNA Design Challenge
- 3.1 Choose your protein.
- 3.2. Reverse Translate: Protein (amino acid) sequence to DNA (nucleotide) sequence.
- 3.3 Codon optimization.
- 3.4 You have a sequence! Now what?
- 3.5 How does it work in nature/biological systems?
- Part 4: DNA Read/Write/Edit
- 4.1 DNA Read
- 4.2 DNA Write
- 4.3 DNA Edit
- Notes
Part 1: Benchling & In-silico Gel Art
This week, we used Benchling to simulate restriction digests of E. coli phage lambda DNA and arrange the resulting bands into a design on a virtual gel.
First, I went to the NCBI website to search for the Phage Lambda’s DNA.
After downloading it locally, I then uploaded it to Benchling in a new project. I highlighted the whole genome with Cmd + A, and selected the restriction enzymes (EcoRI, HindIII, BamHI, KpnI, EcoRV, SacI, SalI).
After doing that, I was pretty stuck on how to actually change them to make a smiley face.
So, I used ChatGPT to help. After running it twice, I realized that we are building the design over multiple runs. I think it would be ideal to have a neural network that does this.
So, I reduced the number of restriction enzymes to 3 to see whether fewer bands would leave enough space for a smiley face. Obviously not. The next goal is to figure out how to get one band.
Let’s try one at a time.
Nice, it is working. Slot 2 is SwaI, slot 3 is PvuI, and slot 4 is NotI.
So what I am thinking is we can use SwaI and NotI as the eyes, then we can find a restriction enzyme with bands that sit lower. Let’s find one for the center of the mouth, which will sit between the eyes.
We will try:
PstI: often cuts around 4-5 kb
SpeI: can create a mid-range band
XhoI: sometimes makes a good middle cut
NdeI: can be an alternative
PstI not good- cuts in way too many spots.
SpeI is also not good- we would just use it for the eyes. A wider eye, I guess.
Ugh, same problem with XhoI.
NdeI is a step in the right direction, as there are fewer cuts, but still not one solely in the middle.
After trying 3 more (MluI, BsrGI, ApaLI), I needed a new strategy, as opposed to brute force.
We need one clean cut in the 3-5 kb range.
DraI (Common 4-5 kb cutter)
BglII (Mid-range cut)
BspEI (Cuts around 3-5 kb)
Not working. Let’s start over with our new knowledge and at least get the eyes.
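A less brute-force approach is to count cut sites in code before touching the gel. Below is a minimal Python sketch, assuming a linear genome given as a plain string. The names `ENZYMES`, `cut_positions`, `digest`, and `single_cutters` are made up for this example and are not Benchling features. The recognition sites and cut offsets are the standard ones for these enzymes, and all of them are palindromic, so scanning one strand is enough.

```python
# Recognition site and top-strand cut offset for each enzyme.
# Example: EcoRV cuts GAT^ATC, so the offset is 3.
ENZYMES = {
    "EcoRI": ("GAATTC", 1),
    "BamHI": ("GGATCC", 1),
    "HindIII": ("AAGCTT", 1),
    "EcoRV": ("GATATC", 3),
    "NotI": ("GCGGCCGC", 2),
    "SwaI": ("ATTTAAAT", 4),
    "PvuI": ("CGATCG", 4),
}


def cut_positions(seq, site, offset):
    """Return every position where the enzyme cuts the top strand."""
    seq = seq.upper()
    positions = []
    start = seq.find(site)
    while start != -1:
        positions.append(start + offset)
        start = seq.find(site, start + 1)
    return positions


def digest(seq, enzyme_names):
    """Return fragment lengths (largest first) for a linear molecule."""
    cuts = set()
    for name in enzyme_names:
        site, offset = ENZYMES[name]
        cuts.update(cut_positions(seq, site, offset))
    bounds = [0] + sorted(cuts) + [len(seq)]
    return sorted((b - a for a, b in zip(bounds, bounds[1:])), reverse=True)


def single_cutters(seq):
    """List the enzymes that cut the sequence exactly once."""
    return [name for name, (site, offset) in ENZYMES.items()
            if len(cut_positions(seq, site, offset)) == 1]
```

Running `single_cutters` on the downloaded lambda sequence would give a shortlist of enzymes that produce exactly two bands, and `digest` shows where those bands would land before any lane is set up.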
Simple. They work. Now, we’re going to do some genetic engineering + manipulation to make the biology adhere to our human image. We want a smiley face, so we will cut the DNA to make our desired image.
We’ll start with BamHI, then remove from the bottom up, until we have the spectrum of depth for one side of the smiley face. We will simplify and limit the restriction enzymes to just NotI for the eyes, and BamHI + sequence modification for the smile.
We want to remove an EcoRV recognition site, so we will mutate it so that EcoRV no longer recognizes it. EcoRV recognizes GAT^ATC (it cuts between the T and the A).
Mutation Strategy: Change GATATC → GATTTC
I noticed that there were 2 sequences, which I learned are the sub-sequences that EcoRV recognizes.
After editing, the EcoRV tag disappears.
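The same edit can be sketched in Python to confirm that the site is really gone. This is a toy illustration, assuming the sequence is a plain string; `count_sites` and `knock_out_site` are hypothetical helper names. Note that in a real coding region a single-base change like this can also change an amino acid, which the sketch does not check.

```python
def count_sites(seq, site="GATATC"):
    """Count occurrences of a recognition site, including overlapping ones."""
    seq, site = seq.upper(), site.upper()
    width = len(site)
    return sum(1 for i in range(len(seq) - width + 1) if seq[i:i + width] == site)


def knock_out_site(seq, site="GATATC", mutant="GATTTC"):
    """Replace every copy of `site` with `mutant` (same length, one base changed)."""
    if len(site) != len(mutant):
        raise ValueError("site and mutant must have the same length")
    return seq.upper().replace(site.upper(), mutant.upper())


toy = "CCGATATCGGTTGATATCAA"   # made-up sequence with two EcoRV sites
edited = knock_out_site(toy)
print(count_sites(toy), count_sites(edited))
```

Because GATATC is a palindrome, removing it from the top strand removes it from the bottom strand as well.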
We rerun with BamHI, with significantly shorter
After realizing that BamHI and EcoRV are different enzymes, and that my EcoRV edits in the genome were at different sites from the BamHI ones, I just kept going to see how it worked. After getting the hang of it, I saw how editing the genome changed the digests. Apparently, Benchling is not tracking state between runs; it sets up the enzymes per run and runs them in parallel on the newly modified genome.
…Due to time constraints, I needed to move onto the 3rd part of the homework. I will research better strategies for making them on YouTube.
…to be continued
Part 3: DNA Design Challenge
3.1 Choose your protein.
Botulinum Neurotoxin A (Botox)
I chose this because it is the most toxic protein known to humans, yet we use it for cosmetic facial injections and migraine treatment. Humans are wild.
Function: Cleaves SNARE proteins to block neurotransmitter release.
1,296 amino acids
~3,900 letters (3,888 bases, 3 per amino acid)
Uniprot
See below for protein sequence
GLP-1 (peptide hormone)
I chose this to understand the regulatory elements of it. It is clearly a popular drug, but doesn’t solve the root problem (like most drugs). I want to understand how it works to suppress appetite.
Function: Increases insulin secretion & slows gastric emptying.
~31 amino acids
90-93 letters
Uniprot
See below for protein sequence
Botulinum Neurotoxin A (Botox) protein sequence
>sp|P0DPI0|BXA1_CLOBO Botulinum neurotoxin type A OS=Clostridium botulinum OX=1491 GN=botA PE=1 SV=1
MPFVNKQFNYKDPVNGVDIAYIKIPNVGQMQPVKAFKIHNKIWVIPERDTFTNPEEGDLNPPPEAKQVPVSYYDSTYLSTDNEKDNYLKGVTKLFERIYSTDLGRMLLTSIVRGIPFWGGSTIDTELKVIDTNCINVIQPDGSYRSEELNLVIIGPSADIIQFECKSFGHEVLNLTRNGYGSTQYIRFSPDFTFGFEESLEVDTNPLLGAGKFATDPAVTLAHELIHAGHRLYGIAINPNRVFKVNTNAYYEMSGLEVSFEELRTFGGHDAKFIDSLQENEFRLYYYNKFKDIASTLNKAKSIVGTTASLQYMKNVFKEKYLLSEDTSGKFSVDKLKFDKLYKMLTEIYTEDNFVKFFKVLNRKTYLNFDKAVFKINIVPKVNYTIYDGFNLRNTNLAANFNGQNTEINNMNFTKLKNFTGLFEFYKLLCVRGIITSKTKSLDKGYNKALNDLCIKVNNWDLFFSPSEDNFTNDLNKGEEITSDTNIEAAEENISLDLIQQYYLTFNFDNEPENISIENLSSDIIGQLELMPNIERFPNGKKYELDKYTMFHYLRAQEFEHGKSRIALTNSVNEALLNPSRVYTFFSSDYVKKVNKATEAAMFLGWVEQLVYDFTDETSEVSTTDKIADITIIIPYIGPALNIGNMLYKDDFVGALIFSGAVILLEFIPEIAIPVLGTFALVSYIANKVLTVQTIDNALSKRNEKWDEVYKYIVTNWLAKVNTQIDLIRKKMKEALENQAEATKAIINYQYNQYTEEEKNNINFNIDDLSSKLNESINKAMININKFLNQCSVSYLMNSMIPYGVKRLEDFDASLKDALLKYIYDNRGTLIGQVDRLKDKVNNTLSTDIPFQLSKYVDNQRLLSTFTEYIKNIINTSILNLRYESNHLIDLSRYASKINIGSKVNFDPIDKNQIQLFNLESSKIEVILKNAIVYNSMYENFSTSFWIRIPKYFNSISLNNEYTIINCMENNSGWKVSLNYGEIIWTLQDTQEIKQRVVFKYSQMINISDYINRWIFVTITNNRLNNSKIYINGRLIDQKPISNLGNIHASNNIMFKLDGCRDTHRYIWIKYFNLFDKELNEKEIKDLYDNQSNSGILKDFWGDYLQYDKPYYMLNLYDPNKYVDVNNVGIRGYMYLKGPRGSVMTTNIYLNSSLYRGTKFIIKKYASGNKDNIVRNNDRVYINVVVKNKEYRLATNASQAGVEKILSALEIPDVGNLSQVVVMKSKNDQGITNKCKMNLQDNNGNDIGFIGFHQFNNIAKLVASNWYNRQIERSSRTLGCSWEFIPVDDGWGERPL
GLP-1 protein sequence (the UniProt entry is the full pro-glucagon precursor, which contains GLP-1)
>sp|P01275|GLUC_HUMAN Pro-glucagon OS=Homo sapiens OX=9606 GN=GCG PE=1 SV=3
MKSIYFVAGLFVMLVQGSWQRSLQDTEEKSRSFSASQADPLSDPDQMNEDKRHSQGTFTSDYSKYLDSRRAQDFVQWLMNTKRNRNNIAKRHDEFERHAEGTFTSDVSSYLEGQAAKEFIAWLVKGRGRRDFPEEVAIVEELGRRHADGSFSDEMNTILDNLAARDFINWLIQTKITDRK
3.2. Reverse Translate: Protein (amino acid) sequence to DNA (nucleotide) sequence.
Botulinum Neurotoxin A (Botox) reverse translate
GLP-1 reverse translate
reverse translation of sp|P0DPI0|BXA1_CLOBO Botulinum neurotoxin type A OS=Clostridium botulinum OX=1491 GN=botA PE=1 SV=1 to a 3888 base sequence of most likely codons.
atgccgtttgtgaacaaacagtttaactataaagatccggtgaacggcgtggatattgcgtatattaaaattccgaacgtgggccagatgcagccggtgaaagcgtttaaaattcataacaaaatttgggtgattccggaacgcgatacctttaccaacccggaagaaggcgatctgaacccgccgccggaagcgaaacaggtgccggtgagctattatgatagcacctatctgagcaccgataacgaaaaagataactatctgaaaggcgtgaccaaactgtttgaacgcatttatagcaccgatctgggccgcatgctgctgaccagcattgtgcgcggcattccgttttggggcggcagcaccattgataccgaactgaaagtgattgataccaactgcattaacgtgattcagccggatggcagctatcgcagcgaagaactgaacctggtgattattggcccgagcgcggatattattcagtttgaatgcaaaagctttggccatgaagtgctgaacctgacccgcaacggctatggcagcacccagtatattcgctttagcccggattttacctttggctttgaagaaagcctggaagtggataccaacccgctgctgggcgcgggcaaatttgcgaccgatccggcggtgaccctggcgcatgaactgattcatgcgggccatcgcctgtatggcattgcgattaacccgaaccgcgtgtttaaagtgaacaccaacgcgtattatgaaatgagcggcctggaagtgagctttgaagaactgcgcacctttggcggccatgatgcgaaatttattgatagcctgcaggaaaacgaatttcgcctgtattattataacaaatttaaagatattgcgagcaccctgaacaaagcgaaaagcattgtgggcaccaccgcgagcctgcagtatatgaaaaacgtgtttaaagaaaaatatctgctgagcgaagataccagcggcaaatttagcgtggataaactgaaatttgataaactgtataaaatgctgaccgaaatttataccgaagataactttgtgaaattttttaaagtgctgaaccgcaaaacctatctgaactttgataaagcggtgtttaaaattaacattgtgccgaaagtgaactataccatttatgatggctttaacctgcgcaacaccaacctggcggcgaactttaacggccagaacaccgaaattaacaacatgaactttaccaaactgaaaaactttaccggcctgtttgaattttataaactgctgtgcgtgcgcggcattattaccagcaaaaccaaaagcctggataaaggctataacaaagcgctgaacgatctgtgcattaaagtgaacaactgggatctgttttttagcccgagcgaagataactttaccaacgatctgaacaaaggcgaagaaattaccagcgataccaacattgaagcggcggaagaaaacattagcctggatctgattcagcagtattatctgacctttaactttgataacgaaccggaaaacattagcattgaaaacctgagcagcgatattattggccagctggaactgatgccgaacattgaacgctttccgaacggcaaaaaatatgaactggataaatataccatgtttcattatctgcgcgcgcaggaatttgaacatggcaaaagccgcattgcgctgaccaacagcgtgaacgaagcgctgctgaacccgagccgcgtgtataccttttttagcagcgattatgtgaaaaaagtgaacaaagcgaccgaagcggcgatgtttctgggctgggtggaacagctggtgtatgattttaccgatgaaaccagcgaagtgagcaccaccgataaaattgcggatattaccattattattccgtatattggcccggcgctgaacattggcaacatgctgtataaagatgattttgtgggcgcgctgatttttagcggcgcggtgattctgctggaatt
tattccggaaattgcgattccggtgctgggcacctttgcgctggtgagctatattgcgaacaaagtgctgaccgtgcagaccattgataacgcgctgagcaaacgcaacgaaaaatgggatgaagtgtataaatatattgtgaccaactggctggcgaaagtgaacacccagattgatctgattcgcaaaaaaatgaaagaagcgctggaaaaccaggcggaagcgaccaaagcgattattaactatcagtataaccagtataccgaagaagaaaaaaacaacattaactttaacattgatgatctgagcagcaaactgaacgaaagcattaacaaagcgatgattaacattaacaaatttctgaaccagtgcagcgtgagctatctgatgaacagcatgattccgtatggcgtgaaacgcctggaagattttgatgcgagcctgaaagatgcgctgctgaaatatatttatgataaccgcggcaccctgattggccaggtggatcgcctgaaagataaagtgaacaacaccctgagcaccgatattccgtttcagctgagcaaatatgtggataaccagcgcctgctgagcacctttaccgaatatattaaaaacattattaacaccagcattctgaacctgcgctatgaaagcaaccatctgattgatctgagccgctatgcgagcaaaattaacattggcagcaaagtgaactttgatccgattgataaaaaccagattcagctgtttaacctggaaagcagcaaaattgaagtgattctgaaaaacgcgattgtgtataacagcatgtatgaaaactttagcaccagcttttggattcgcattccgaaatattttaacagcattagcctgaacaacgaatataccattattaactgcatggaaaacaacagcggctggaaagtgagcctgaactatggcgaaattatttggaccctgcaggatacccaggaaattaaacagcgcgtggtgtttaaatatagccagatgattaacattagcgattatattaaccgctggatttttgtgaccattaccaacaaccgcctgaacaacagcaaaatttatattaacggccgcctgattgatcagaaaccgattagcaacctgggcaacattcatgcgagcaacaacattatgtttaaactggatggctgccgcgatacccatcgctatatttggattaaatattttaacctgtttgataaagaactgaacgaaaaagaaattaaagatctgtatgataaccagagcaacagcggcattctgaaagatttttggggcgattatctgcagtatgataaaccgtattatatgctgaacctgtatgatccgaacaaatatgtggatgtgaacaacgtgggcattcgcggctatatgtatctgaaaggcccgcgcggcagcgtgatgaccaccaacatttatctgaacagcagcctgtatcgcggcaccaaatttattattaaaaaatatgcgagcggcaacaaagataacattgtgcgcaacaacgatcgcgtgtatattaacgtggtggtgaaaaacaaagaatatcgcctggcgaccaacgcgagccaggcgggcgtggaaaaaattctgagcgcgctggaaattccggatgtgggcaacctgagccaggtggtggtgatgaaaagcaaaaacgatcagggcattaccaacaaatgcaaaatgaacctgcaggataacaacggcaacgatattggctttattggctttcatcagtttaacaacattgcgaaactggtggcgagcaactggtataaccgccagattgaacgcagcagccgcaccctgggctgcagctgggaatttattccggtggatgatggctggggcgaacgcccgctg
reverse translation of sp|P01275|GLUC_HUMAN Pro-glucagon OS=Homo sapiens OX=9606 GN=GCG PE=1 SV=3 to a 540 base sequence of most likely codons.
atgaaaagcatttattttgtggcgggcctgtttgtgatgctggtgcagggcagctggcagcgcagcctgcaggataccgaagaaaaaagccgcagctttagcgcgagccaggcggatccgctgagcgatccggatcagatgaacgaagataaacgccatagccagggcacctttaccagcgattatagcaaatatctggatagccgccgcgcgcaggattttgtgcagtggctgatgaacaccaaacgcaaccgcaacaacattgcgaaacgccatgatgaatttgaacgccatgcggaaggcacctttaccagcgatgtgagcagctatctggaaggccaggcggcgaaagaatttattgcgtggctggtgaaaggccgcggccgccgcgattttccggaagaagtggcgattgtggaagaactgggccgccgccatgcggatggcagctttagcgatgaaatgaacaccattctggataacctggcggcgcgcgattttattaactggctgattcagaccaaaattaccgatcgcaaa
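The reverse translation above can be reproduced with a one-codon-per-amino-acid lookup. Here is a minimal Python sketch. The table is read directly off the output above (for example M → atg, P → ccg, F → ttt), so it only reflects the "most likely codon" choices this tool made; `MOST_LIKELY_CODON` and `reverse_translate` are names invented for the example.

```python
# One codon per amino acid, taken from the reverse-translated output above.
MOST_LIKELY_CODON = {
    "A": "gcg", "R": "cgc", "N": "aac", "D": "gat", "C": "tgc",
    "Q": "cag", "E": "gaa", "G": "ggc", "H": "cat", "I": "att",
    "L": "ctg", "K": "aaa", "M": "atg", "F": "ttt", "P": "ccg",
    "S": "agc", "T": "acc", "W": "tgg", "Y": "tat", "V": "gtg",
}


def reverse_translate(protein):
    """Map each amino acid letter to a single codon and join them."""
    try:
        return "".join(MOST_LIKELY_CODON[aa] for aa in protein.upper())
    except KeyError as err:
        raise ValueError(f"unknown amino acid: {err.args[0]}") from None


print(reverse_translate("MPFVNK"))  # first six residues of the Botox sequence
```

The output length is always three times the protein length, which is why the 1,296-residue toxin becomes 3,888 bases and the 180-residue pro-glucagon becomes 540.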
3.3 Codon optimization.
Codon optimization swaps synonymous codons so the gene matches the codon preferences of the host organism, which tunes how efficiently the host translates it and therefore the protein yield. You are essentially optimizing the manufacturing part of the cell. I chose E. coli because that is what the Peptide 2.0 website defaults to, although there is definitely opportunity to choose something better (and I think there is opportunity to create a neural network for optimizing the cell <> protein output yield). //TODO: optimize cell choice based on output protein, input cost, and quality
Cell → DNA sequence→ corresponding mRNA → specific production site → protein
Optimizations are done with VectorBuilder.
Botulinum Neurotoxin A (Botox)
Improved DNA[1]: GC=78.31%, CAI=0.87
>BOTULINUM_NEUROTOXIN_A
GCGACCGGTTGCTGCGGCACCACCACCGGTACCGGTGCGGCCTGTGCAGCGGCCTGCGCGGGTACCACCACCGCAGCCTGTACGGCCACCGCCGCGGCAGGTGCCACGTGCTGCGGCGGCACCGGTGCCGCCTGCGGCGGCTGCGGCACCGGCGGCGCCACCGCGACCACTGGCTGCGGCACCGCGACCGCCACCACCGCGGCCGCCGCGACCACCTGCTGCGGCGCGGCGTGTGGCACCGGCGGCGGTTGCTGTGCCGGCGCGACAGGCTGTGCGGGCTGCTGCGGCGGCACCGGCGCAGCCGCGGGCTGCGGCACCACTACCGCCGCAGCGGCGACGACCTGTGCGACCGCGGCCTGCGCGGCGGCGGCGACCACCACCGGTGGCGGCACCGGTGCGACCACCTGCTGTGGCGGCGCGGCGTGTGGCTGCGGCGCCACCGCCTGCTGCACCACCACCGCGTGCTGCGCCGCCTGTTGTTGCGGCGGCGCCGCGGGCGCGGCGGGCGGTTGCGGCGCGACCTGTACCGGCGCGGCCTGCTGTTGCGGTTGTTGCGGCTGCTGCGGCGGTGCAGCGGGTTGCGGCGCGGCCGCGTGTGCCGGCGGCACCGGTTGCTGCGGCGGTACCGGCGCGGGTTGTACTGCGACCACCGCGACCGGCGCGACCGCCGGCTGCGCGTGCTGTACCGCAACCTGCACCGGTGCGGGCTGCGCGTGTTGCGGCGCGACCGCGGCGTGCGGCGCGGCGGCGGCAGCGGGCGCGACCGCGGCGTGCACGGCCACCTGCACCGGTGCGGCGGCGGGCGGCTGCGGCACCGGTGCATGCTGTGCGGCCGCATGTGCGGGTACCACCACTGGTGCGGCCTGCGGCTGTGCCACGACGACCGCGACCGCGGGCTGTGCCTGCTGCGGCGCGACGTGCACCGGCGGTGGTTGCTGTGGCTGCGCGACCGGCTGTACCGGATGCACCGGCGCGTGCTGTGCGGGCTGCGCCACCACAGGCACCGGCTGCGGCTGCGGTGGCTGTGCCACCACCTGCTGCGGCACCACCACCACCGGCGGCGGTGGCTGCGGTGGTTGTGCGGGCTGTGCGTGTTGCGCGACCACCGGCGCTACCGCGTGCTGTGGTGCGGCCGGCACCGGCGCCGCGGCGGGCACCGGCGCGACCACCGGTGCGACCGCCTGCTGCGCGGCGTGCACCGGCTGCGCGACGACCGCGGCCTGCGGTACCGGCGCCACCACCTGCGCAGGCTGTTGTGGTGGCGCCACCGGTGGCTGCGCGGGCTGCACCGCGACCTGTGGCTGCGCCGGCTGCGGCGCCGCGGGCGCGGCGTGCACCGGCGCGGCGTGCTGCACCGGCGGCACCGGCGCCACAACCGCGACCACCGGCGGCTGCTGCTGCGGCGCAGGCTGCGGCTGCGGCGGAGCCACGGCGACCACCGCGACCACCTGCGCAGGTACCACTACCGGCGCGGCGACCGGCTGTGCGGCGGCCGCTGGTTGCACCACCACCGGAGGCTGTTGCGCAACCGGTGCCGCGGGCACCGGCTGTACGGGTGCCGCCTGCTGCACCGGCGCCTGCTGTTGTGGTTGTGCCGCCTGTGGTGGCTGCACTGCGACCGGCGGCTGCGCGGGCTGCGCGTGCTGCTGCGCGGGTACCGCCACCGCGACCACTTGCGGCTGTACCACCACCGCAGGCTGCTGCTGTGGCGGCGCAACCACCACCACCGCCTGTTGCACCACCACGGGCGGTTGCACCACCACCGGCGCAGCCGGCGCCGCCGCCGGCTGCTGCACCGGCGGTGCCGCAGGCACCGGCGGCGCAACTGCCTGTTGCGCGGCCTGCTGCTGCGGCTGCACCGGCTGCACGGGTGGCGGCTGCGGCTGTGGCGGCGGCTGTGCAGCGGCAACCACTACCGGCTGCGGCGCATGCTGTGGCGCCACGTGCTGTGGTGGTTGTGGCGGCACCGGCGCGTGCTGTTGCACGGGTGGTTGCGGCTG
CGCCACCGGTGCGGCGTGCACCGGAGCGACCACCTGTGCGACCGGCTGCGGCGGCGGCTGCTGCGCGACCTGCGGCTGCTGCACCGGTACCGCGACCGGTGGTTGCGCCACCACCGGTTGCGGCGCCACCACCGCGGCGTGCTGCTGCGGTGCCGCGTGCTGCGGGTGCGGCACCGGCACCACCACCGCGGCAGCGGGTACCGGCGCGGCGTGCGCGTGCTGCGCGGCCTGTGGCTGCGGCACCGCCACCACGGCGACCGGCGCGGCGGCCACGGGTGCGGGTTGCGGCGGCTGCTGCACCGGCGGTGCCGCGGGTACCGGCGCGGGCTGTACCACCACCGGCGCAGCCGGTGCGGCGTGCACCGGCTGCGGTTGTGCGTGTTGTACCACCACCGGCGGCTGCGGCGGTTGCTGCGCGACCGGCGCGACCGGCTGCGGCGCGGCGGCGACCACCACCGCCACAACCGGCGCGACCGCGGGTTGCTGTACCGGCTGCGCAGGTGGCGCGGCGGCAGCGTGCGGCGCGGCCACCACCACCTGCGGCTGCTGTACCGGCACCGCCACCACCGCCACTACGGCCACCGCTGCCTGCGCGGCAGCGACCACCACCGCCGCGGCGGGCGCGACCGCCACCACCGGTTGCGGTGCGGGCTGTGCCTGTTGCTGCACCGGCGCAGCCTGTGCCGCAGCGGGTTGTGGCGCGGCCGCCGCCGGCTGCGCAACTACGGGCACGGGCGGCGGCTGCGCGTGTTGCGCCTGCTGCGGTTGCGGAGCCGGCTGCTGCACCGGCTGCGCGGGCACCGCCACCGCGACTGGCGCGGCCGCAGCCGCCTGCGGCACCGGCACCACCACCGCGGCGGCGGGCGCGGCGGCCGCAGCCACGGCGACCTGTACTGGCTGCACCGGTGCGGGTTGTGGTGCGGCGGGCGCAACCGCGTGCTGCGCAGGCTGCGGCGGTTGCGCGGCGGCGACAACCACCGCGGGCTGCGGCACCGGCGGCGCGACCGCCGCCGCGTGTACCGGTGCGGCGGCCACCACCACTGGCGCGACCGCGGCGGCCTGTACGGGCACCGCCACCGCGGCGGCCGCCACCGGCTGCACCGGCGCATGCTGCGGCGCCGCGACGACCACCACCGCGACCGCCTGCTGCGGCGCGGCCGGCGCGACGGCGGCTTGTACCACCACTGGCACGGGCGCCGCGGCGACTACCACCACCACCACCGCGGCGGCGGGTACCGGCTGCACCGGCGCCGCGTGTTGTGGCTGTGCGGCGGCGGCATGTTGTACCGCCACCTGTACCGGCGCCGCGACCTGCACTGGTGCGGCGTGCGCGGCGGCTGGTGGTTGCGGCGCGGCGGCCTGCGGTGGTGCGGCCTGTGCCACCACCACCGCCACCACCGGCTGTGGCACCGGCGGCTGCACCGGCGGCGGCTGTGGTGCGGCGGCCGGTGGTTGCTGTGGCTGCGGCGGCTGTTGCGGCTGTTGTGGCTGCGGTGCCACCACCACCACCTGCTGTGGTGGCGCCGCCGGCGCGGCTGGCACCGGTGGCTGCGGCGCGACCACGGGCACCGGCGGCGCAGCGGGCGCCGCGTGCACGGGCGGTGGCTGTTGCGGCTGCTGCGGTTGCTGTGCCACCGGCTGCGGTGGTGCGACCGGGGGTTGCGCGGGTTGCACCACTACCGCGGGCTGTGGTGCAACCGGTGCGGCCGCCACCGGCGCGGCCTGTGCCTGTTGCGCGACGACTTGTACCGGTGGCGCAACCGCGGCGTGCTGTACCGGCGGCTGCGGCGGTTGCGGCTGCGGCTGCGGGGCCACCACCACCACCGCAACCACCGCGGCGTGCGGTGGCTGTACCGGCGCGACGACCTGCGCGGGCGCCTGCTGCGCAGCGGCGGCGACCACCGCATGCTGCGGCGCCACGTGCGGCTGCGCGGCAGCCTAA
GLP-1
>GLP1
GCAACTGGTGCGGCAGCCGCGGGCTGCGCGACCACAACCGCCACCACTACTACGGGCACCGGCGGCTGTGGCGGCGGTTGCTGTACCGGCACCACCACCGGCACTGGCGCCACCGGTTGTACCGGCGGCACCGGCTGTGCGGGCGGCGGTTGCGCGGGTTGCACTGGCGGCTGCGCCGGCTGTGGCTGCGCCGGCTGTTGTACCGGCTGTGCGGGTGGCGCGACCGCGTGCTGCGGTGCCGCCGGCGCGGCGGCGGCGGCGGCCGGCTGCTGCGGCTGCGCGGGCTGCACCACCACCGCGGGCTGTGGCTGCGGTGCGGGTTGTTGTGCAGGCGGCTGTGGCGGCGCAACGTGCTGTGGTTGCACCGGCGCCGGTTGTGGAGCGACTTGTTGTGGCGGCGCGACGTGTGCGGGTGCCACCGGCGCGGCCTGCGGCGCGGCGGGCGCGACCGCGGCGGCGTGCGGCTGCTGTGCGACTGCCGGTTGCTGTGCGGGCGGCGGTTGTGCGTGCTGTACCACCACCGCGTGTTGCGCCGGCTGCGGCGCGACCACCGCGACTGCCGGCTGTGCAGCGGCGACCGCGACCTGCACCGGCGGCGCCACCGCGGGCTGCTGTGGTTGTTGCGGCTGCGGTTGCGGCTGCGCGGGCGGCGCGACCACCACCACCGGCACGGGTTGCGCGGGTACCGGCGGTTGTACCGGTGCGACTGGTGCGGCGTGCGCGTGCTGTGCAGCGGCGTGCGGCTGCGCCGCCTGCTGCGGCTGCGCCGCGTGCGCGGCGTGCGCGACCACAGGCTGTGGCGCGGCGGCGTGTGGTTGCTGCGCCACCGGCGCCACCGGCGCCGCGACCACGACCGGCGCCGCGTGTGGCTGCTGTGCGACCGGCTGCGGTGGCGCGGCCGGCGGCTGCGCATGCTGCACCACCACCGCGTGCTGCGCGGGCTGTGGCGCGACCGGCACCGGTGCGGGTTGTGCCGGTTGCACCGCCACCTGCACCGGTGGCGCGGCGGGCGGCTGCTGCGCGGGCGGCTGCGGCGGCTGTGGCGCGGCGGCGGGTGCCGCCACCACCACCGCCACCACCGGCTGCGGTACCGGCGGCTGCACCGGCGGCACCGGTGCGGCCGCAGGCGGCTGCTGCGGCTGTGGCGGCTGCTGTGGTTGTTGCGGCTGCGGTGCGACGACCACCACCTGCTGCGGCGGCGCGGCCGGCGCCGCGGGCACCGGTGGCTGCGGCGCGACCACGGGCACCGGTGGCGCGGCCGGCGCGGCGTGTACCGGTGGCGGCTGCTGCGGCTGTTGCGGCTGTTGCGCGACGGGCTGTGGTGGCGCGACCGGCGGCTGTGCGGGCTGCACCACTACCGCGGGTTGCGGCGCAACCGGTGCCGCGGCGACCGGTGCGGCGTGCGCGTGCTGCGCGACCACCTGTACCGGCGGCGCGACTGCGGCGTGCTGCACCGGTGGCTGTGGCGGCTGTGGTTGCGGATGCGGCGCGACCACGACCACCGCCACCACGGCGGCGTGTGGCGGCTGCACCGGTGCGACCACCTGCGCGGGCGCCTGTTGCGCCGCGGCTGCGACCACTGCGTGTTGCGGCGCGACCTGTGGCTGTGCGGCGGCGTAA
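A quick sanity check on any optimized sequence is to translate it back and compare it with the original protein. Below is a minimal Python sketch using the standard genetic code; `translate` and `gc_content` are hypothetical helper names. Running it on the first 18 bases of the optimized Botox output above gives ATGCCG rather than MPFVNK, and ATGCCG is exactly how the reverse-translated DNA begins. Together with the 78% GC content, that suggests the optimizer may have been given the DNA sequence as if it were a protein (A, T, G, C read as Ala, Thr, Gly, Cys), so the optimization is probably worth re-running with the amino acid sequence as input.

```python
# Standard genetic code, built from the usual TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna):
    """Translate a DNA coding sequence, ignoring any trailing partial codon."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def gc_content(dna):
    """Fraction of bases that are G or C."""
    dna = dna.upper()
    return sum(base in "GC" for base in dna) / len(dna)


print(translate("GCGACCGGTTGCTGCGGC"))   # start of the optimized output
print(translate("ATGCCGTTTGTGAACAAA"))   # start of the reverse translation
```

A correctly optimized sequence should translate to the same protein as the reverse translation, with only the codon choices differing.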
3.4 You have a sequence! Now what?
Cell-dependent
This method uses living cells to transcribe DNA into mRNA and translate it into a functional protein. Botulinum Neurotoxin A (Botox) is typically produced in Clostridium botulinum, but apparently, recombinant versions can be expressed in safer bacterial or mammalian systems. GLP-1, on the other hand, is often produced in E. coli or yeast for industrial-scale peptide drug production.
The process for doing this is as follows:
- The codon-optimized gene is introduced into an expression vector (a plasmid) under the control of a strong promoter
- The vector is introduced into a host system (like E. coli, Saccharomyces cerevisiae, or mammalian cells like HEK293 or CHO)
- The host cell machinery transcribes the gene into mRNA and translates it into protein using its ribosomes
- Proteins fold and undergo post-translational modifications in eukaryotic systems
- The protein is extracted using affinity chromatography (His-tag or antibody-based purification).
- are there more ways?
- Mass spectrometry and/or Western blotting verifies the structure, purity, and function as a QA process
Cell-free
This method uses ribosomes and enzymes outside of living cells to directly produce protein.
The process is as follows:
- Lysates from E. coli, rabbit reticulocytes, or wheat germ provide ribosomes, tRNAs, and enzymes
- Linear or plasmid DNA or mRNA containing the gene of interest is added
- A coupled (catalyst-enabled) transcription-translation reaction converts DNA into mRNA and translates it into protein using ribosomes
- Proteins fold
*this is way more efficient and controllable, cell-free manufacturing + personalized cell QA testing seems like the best way forward
3.5 How does it work in nature/biological systems?
- A single gene can code for multiple proteins through mechanisms like alternative splicing and RNA editing (both post-transcriptional) and alternative translation initiation (at the translational level).
- DNA sequence containing exons and introns (in eukaryotes) → U replaces T in mRNA → protein (3 nucleotides for one amino acid)
//TODO: actually align the 3 in Benchling
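The DNA → mRNA → codon step can be sketched in a few lines of Python. This is a simplification that assumes the input is the coding strand of an intron-free sequence (so no splicing is modeled); `transcribe` and `codons` are names made up for the example.

```python
def transcribe(coding_dna):
    """Coding-strand DNA to mRNA: same letters, with U replacing T."""
    return coding_dna.upper().replace("T", "U")


def codons(mrna):
    """Split an mRNA into consecutive triplets, dropping a trailing partial codon."""
    usable = len(mrna) - len(mrna) % 3
    return [mrna[i:i + 3] for i in range(0, usable, 3)]


mrna = transcribe("atgaaaagc")   # first nine bases of the pro-glucagon DNA
print(mrna, codons(mrna))
```

Each triplet in the list corresponds to one amino acid, which is the alignment between DNA, mRNA, and protein that the TODO refers to.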
Part 4: DNA Read/Write/Edit
4.1 DNA Read
- What DNA would you want to sequence (e.g., read) and why?
- In lecture, a variety of sequencing technologies were mentioned. What technology or technologies would you use to perform sequencing on your DNA and why? Also answer the following questions:
- Oxford Nanopore (ONT) + PacBio HiFi hybrid sequencing because it combines ultra-long reads for genome assembly & structural variants (ONT) and high accuracy for single-molecule resolution (PacBio HiFi)
- What is your input? How do you prepare your input (e.g. fragmentation, adapter ligation, PCR)? List the essential steps.
- Sample collection + DNA/RNA extraction using Qiagen or Nanobind
- Fragmentation for shorter HiFi reads OR left intact for ONT ultra-long reads, depending on the source and the sequence length
- Adapter ligation to enzymatically attach sequencing adapters
- Prepare the library by inserting barcodes for multiplexing samples
- Load into the ONT/PacBio machines
- What are the essential steps of your chosen sequencing technology, how does it decode the bases of your DNA sample (base calling)?
- ONT
- DNA is fed through a biological nanopore
- Changes in electrical current detect nucleotide identity
- Neural-network basecallers (Bonito, Guppy) convert current fluctuations into sequence data
- PacBio HiFi
- Single-molecule real-time sequencing (ultra fine-grained), where a polymerase fixed in a tiny well copies a circular DNA template
- Fluorescently labeled nucleotides emit light as they are incorporated
- HiFi reads are generated by sequencing the same molecule multiple times
- What is the output of your chosen sequencing technology?
- ONT
- Ultra-long reads (up to ~4 Mb)
- FAST5 → FASTQ
- Epigenetic modifications are directly detectable (methylation)
- PacBio HiFi
- Highly accurate (99.9%) for short to mid reads (10-25 kb)
- Output is FASTQ / BAM of CCS (circular consensus sequencing) reads
I would sequence the full human microbiome and metagenome from diverse environments, including the human gut, every organ, wastewater, soil, and extreme environments. Ideally we could also digitalize all input genomes from the food we consume, air inside + outside the home, animals in the home, all surfaces + their bacteria, and everyone that that individual interacts with. The goal would be to build a humanity-level immunity network, to detect disease at the source and stop it. You want to shorten the gap between disease creation and humanity consumption. Obviously, to do this, the details below are insufficient.
3rd generation- no need to do PCR amplification.
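The idea behind HiFi accuracy, reading the same molecule several times and combining the passes, can be illustrated with a per-position majority vote. This is a deliberately simplified Python sketch: it assumes the passes are already aligned and equal in length, whereas real circular consensus calling aligns the passes and uses a probabilistic model. `consensus` is a hypothetical name.

```python
from collections import Counter


def consensus(passes):
    """Majority vote per position across equal-length, pre-aligned reads."""
    if not passes:
        raise ValueError("need at least one pass")
    length = len(passes[0])
    if any(len(p) != length for p in passes):
        raise ValueError("passes must be the same length in this toy model")
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*passes))


# Five noisy passes over the same made-up molecule; two contain a single error.
passes = ["ACGTAC", "ACGTAC", "ACCTAC", "ACGTTC", "ACGTAC"]
print(consensus(passes))
```

Random errors rarely hit the same position in most passes, so they get outvoted, which is why more passes give higher per-base accuracy.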
4.2 DNA Write
- What DNA would you want to synthesize (e.g., write) and why? Something that resembles a computer, meaning a sensor, a genetic circuit for computation on that input, and a response. This circuit would function as a programmable sensor inside human cells, capable of detecting and responding to specific biomarkers (inflammatory cytokines, cancer markers, environmental toxins). The core elements include:
- Sensor module: a synthetic promoter that activates transcription in response to specific molecules (inflammatory cytokines like IL-6 or TNF-α)
- Processing module: a CRISPR-based logic gate (dCas9-based transcriptional control) that enables Boolean computation (AND, OR, NOT gates)
- Response module: a genetically encoded therapeutic output, such as an anti-inflammatory peptide, apoptosis inducer (for cancer), or fluorescent reporter for tracking
- What technology or technologies would you use to perform this DNA synthesis and why?
- Twist Bioscience DNA Synthesis (silicon-based DNA writing)
- Cell-free TxTL (transcription-translation) for rapid testing
- Golden Gate & Gibson Assembly for assembly of large constructs (or something newer, not familiar enough yet with the algorithms)
- What are the essential steps of your chosen sequencing methods?
- Oligos are synthesized on silicon chips using phosphoramidite chemistry, so thousands can be synthesized in parallel
- Small oligos are error-corrected and assembled using ligation or PCR
- Assembled DNA is cloned into vectors
- Synthesized DNA is delivered as plasmids or linear DNA, ready for direct transformation into cells or cell-free systems
- What are the limitations of your sequencing method (if any) in terms of speed, accuracy, scalability?
Promoter -> IL-6 response element -> dCas9 + sgRNA (targeting repressor) -> GFP Reporter
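The circuit line above can be read as Boolean logic. Here is a toy Python model under my own reading of that line: IL-6 switches on dCas9, dCas9 plus the sgRNA silences a repressor, and GFP appears only when the repressor is off. The function name `circuit` and the on/off abstraction are assumptions for illustration; real gene expression is graded and noisy, not binary.

```python
def circuit(il6_present, sgrna_present=True):
    """Toy on/off model of the sensor -> processing -> response chain."""
    # Sensor: the IL-6 response element drives dCas9 expression.
    dcas9_expressed = il6_present
    # Processing: dCas9 AND sgRNA are both needed to silence the repressor gene.
    repressor_silenced = dcas9_expressed and sgrna_present
    repressor_active = not repressor_silenced
    # Response: GFP is made only when the repressor is NOT active.
    gfp_on = not repressor_active
    return {
        "dcas9_expressed": dcas9_expressed,
        "repressor_active": repressor_active,
        "gfp_on": gfp_on,
    }


for il6 in (False, True):
    print(il6, circuit(il6)["gfp_on"])
```

In this reading the two inversions cancel out, so GFP simply reports IL-6, and the sgRNA acts as a second input that turns the whole thing into an AND gate.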
Speed | Oligo synthesis + assembly takes 1-2 weeks, nowhere near real-time |
Scalability | Limited to 300kb per construct, requiring modular assembly for larger genomes |
Accuracy | Error rate = 1:10,000 bases, requiring sequencing verification (NGS, Sanger) |
Cost | Expensive for long constructs, but scalable for small genetic circuits |
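The accuracy row explains why sequencing verification is needed: even a small per-base error rate adds up over a long construct. A rough Python estimate, assuming errors are independent and using the 1:10,000 figure from the table (both are simplifications, and `p_error_free` and `expected_errors` are names invented here):

```python
def p_error_free(length_bp, error_rate=1 / 10_000):
    """Probability that every base in the construct is correct."""
    return (1 - error_rate) ** length_bp


def expected_errors(length_bp, error_rate=1 / 10_000):
    """Average number of wrong bases per synthesized molecule."""
    return length_bp * error_rate


for name, length in [("pro-glucagon, 540 bp", 540), ("Botox, 3,888 bp", 3888)]:
    print(name, round(p_error_free(length), 3), round(expected_errors(length), 3))
```

Under these assumptions roughly two thirds of the 3,888-base molecules would be perfect, compared with about 95% of the 540-base ones, so longer constructs need more clones screened.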
4.3 DNA Edit
- What DNA would you want to edit and why? I would edit human immune system genes to enhance disease resistance and longevity. I constantly notice friends and family members developing allergies and inflammation, including myself. Some genes from a quick search:
- CCR5 Δ32 mutation: deleting CCR5 for HIV immunity and viral resistance
- PCSK9 knockout: disrupting PCSK9 to lower LDL cholesterol, preventing cardiovascular disease, the #1 cause of death globally
- MYC and TP53 regulation: fine-tuning expression of these genes to increase regenerative capacity while suppressing cancer (math problem)
- Mitochondrial gene editing: correcting mutations associated with aging and metabolic diseases in mitochondria, which I believe are the engine of life
- Enhanced DNA repair mechanisms: increasing expression of genes like SIRT6, FOXO3, and GADD45 to extend lifespan by reducing genomic instability. It would be a positive for humanity if we could systematically accelerate healing
- What technology or technologies would you use to perform these DNA edits and why? Also answer the following questions:
- Prime Editing: For precise, efficient edits without double-strand breaks
- Base Editing: For single-letter mutations with ultra-high accuracy
- CRISPR-Cas9 & Cas12a: for targeted gene knockouts and insertions
- How does your technology of choice edit DNA? What are the essential steps?
- What preparation do you need to do (e.g. design steps) and what is the input (e.g. DNA template, enzymes, plasmids, primers, guides, cells) for the editing?
- Gene target selection: identify mutation sites using NGS + clinical data
- Guide RNA design: generate pegRNAs and sgRNAs
- Delivery method selection: viral (AAV, lentivirus) or non-viral (lipid nanoparticles, electroporation) depending on the cell
- What are the limitations of your editing methods (if any) in terms of efficiency or precision?
- Prime editing: low efficiency (~5-50%) in some cell types, delivery still being optimized
- Base editing: only allows C→T and A→G changes, so it cannot correct every type of mutation
- CRISPR-Cas9: higher risk of off-target mutations and requires a repair template for precise insertions
Prime Editing (human gene therapy) | Base Editing (SNPs, fixing disease mutations) | CRISPR-Cas9 & Cas12a (disrupting disease-causing mutations) |
Uses a Cas9 nickase fused to a reverse transcriptase | Uses a Cas9 nickase fused to a deaminase enzyme | Cas9 makes blunt double-stranded breaks; knockouts happen via NHEJ, insertions via HDR |
A pegRNA encodes the desired edit | Converts C → T or A → G without double-stranded breaks | Cas12a has higher specificity and leaves staggered cuts, which can improve large DNA insertions |
RT writes the new DNA directly into the target site without causing double-stranded breaks | High accuracy, low off-target effects, minimal errors | |
No need for donor DNA templates, reducing error rates | | |
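The guide RNA design step can be sketched for the most common case, SpCas9, which needs a 20-nt target followed directly by an NGG PAM. The Python below is a minimal illustration, assuming only the forward strand is scanned and with no off-target scoring, which real design tools add; `find_spcas9_guides` is a hypothetical name.

```python
import re


def find_spcas9_guides(seq, guide_len=20):
    """Return (start, protospacer, PAM) for every forward-strand NGG site."""
    seq = seq.upper()
    # The lookahead lets overlapping candidates be reported.
    pattern = re.compile(r"(?=([ACGT]{%d})([ACGT]GG))" % guide_len)
    return [(m.start(), m.group(1), m.group(2)) for m in pattern.finditer(seq)]


# Made-up target: 5 nt of padding, a 20-nt protospacer, then a TGG PAM.
target = "AAAAA" + "ACGTACGTACGTACGTACGT" + "TGG" + "AAAA"
for start, protospacer, pam in find_spcas9_guides(target):
    print(start, protospacer, pam)
```

A fuller version would also scan the reverse complement and rank candidates by predicted off-target sites, and a pegRNA for prime editing would additionally need a primer binding site and a template for the reverse transcriptase.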
Notes
- large molecules need models to interact with them, or specialized hardware + software to move them around more easily
- codon optimization is used for maximizing the output protein yield
- cell → DNA sequence→ corresponding mRNA → specific production site → protein
- deep learning codon optimization: Codon optimization with deep learning to enhance protein expression
- RNN variant that learns contextual dependencies between DNA sequences to predict codon sequences, optimizes for consistency
- 17-40 hours of training (why the variance?)
- test accuracy: 52%
- Codon Adaptation Index of ~0.98, outperforming Genewiz (0.83) and ThermoFisher (0.93)
- codon box encoding: groups synonymous codons into sets based on their nucleotide composition
- sequence funnel: protein → protein sequence → reverse translation → codon optimization → organism
- T7 and SP6 RNA polymerases: DNA-dependent RNA polymerases derived from bacteriophages T7 and SP6
- they recognize specific promoter sequences (T7 or SP6) and transcribe RNA with high specificity and efficiency
- widely used in in vitro transcription systems for synthesizing mRNA, ribozymes, and RNA probes. T7 RNA polymerase is more common due to its high processivity and fast transcription rate
- DNA synthesis is mainly for shorter sequences- lots of opportunity to expand industrialization
- codon optimization- optimize your sequence for the specific host that you are using
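Since the Codon Adaptation Index shows up several times in these notes, here is how it is computed: each codon gets a weight equal to its usage relative to the most-used synonymous codon in the host, and the CAI is the geometric mean of those weights across the gene. The Python sketch below uses a tiny made-up weight table (`TOY_WEIGHTS`) purely for illustration; a real calculation needs the host's actual codon usage table, and `cai` is a name invented for this example.

```python
import math

# Hypothetical relative adaptiveness values, NOT real E. coli numbers.
TOY_WEIGHTS = {"AAA": 1.0, "AAG": 0.25, "GAA": 1.0, "GAG": 0.5}


def cai(cds, weights):
    """Geometric mean of codon weights; codons missing from the table are skipped."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    scored = [weights[cds[i:i + 3]] for i in range(0, usable, 3)
              if cds[i:i + 3] in weights]
    if not scored:
        raise ValueError("no codons with known weights")
    # Weights must be positive for the logarithm to be defined.
    return math.exp(sum(math.log(w) for w in scored) / len(scored))


print(cai("AAAGAA", TOY_WEIGHTS))   # only preferred codons
print(cai("AAAAAG", TOY_WEIGHTS))   # one preferred, one rare codon
```

A gene built entirely from the host's favorite codons scores 1.0, and every rarer codon pulls the score down, which is what the 0.87 and ~0.98 values in these notes are measuring.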