DNA Read, Write, and Edit

Part 1: Benchling & In-silico Gel Art
Part 3: DNA Design Challenge
3.1 Choose your protein.
3.2. Reverse Translate: Protein (amino acid) sequence to DNA (nucleotide) sequence.
3.3 Codon optimization.
3.4 You have a sequence! Now what?
3.5 How does it work in nature/biological systems?
Part 4: DNA Read/Write/Edit
4.1 DNA Read
4.2 DNA Write
4.3 DNA Edit
Notes

Part 1: Benchling & In-silico Gel Art

This week, we used Benchling to simulate restriction enzyme digests of E. coli phage Lambda DNA and make a design in a gel.

First, I went to the NCBI website to search for the Phage Lambda’s DNA.

After downloading it locally, I then uploaded it to Benchling in a new project. I highlighted the whole genome with Cmd + A, and selected the restriction enzymes (EcoRI, HindIII, BamHI, KpnI, EcoRV, SacI, SalI).

After doing that, I was pretty stuck on how to actually change them to make a smiley face.

So, I used ChatGPT to help. After running it twice, I realized that we are building the design over multiple runs. I think it would be ideal to have a neural network that does this.

So, I reduced the number of restriction enzymes to 3 just to see if there were fewer bands, meaning enough to create space for a smiley face. Obviously not. The next goal is to figure out how to get one band.

Let’s try one at a time.

Nice, it is working. Slot 2 is SwaI, slot 3 is PvuI, and slot 4 is NotI.

So what I am thinking is we can use SwaI and NotI as the eyes, then we can find a restriction enzyme with bars that are lower. Let’s find one for the center of the mouth, which will sit between the eyes.

We will try:

PstI: often cuts around 4-5 kb

SpeI: can create a mid-range band

XhoI: sometimes makes a good middle cut

NdeI: can be an alternative

PstI not good- cuts in way too many spots.

SpeI is also not good- we would just use it for the eyes. A wider eye, I guess.

Ugh, same problem with XhoI.

NdeI is a step in the right direction, as there are fewer cuts, but still not one solely in the middle.

After trying 3 more (MluI, BsrGI, ApaLI), I needed a new strategy, as opposed to brute force.

We need one clean cut in the 3-5 kb range.

DraI (Common 4-5 kb cutter)

BglII (Mid-range cut)

BspEI (Cuts around 3-5 kb)

Not working. Let’s start over with our new knowledge and at least get the eyes.
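Instead of trying enzymes one at a time, the search for a single cutter can be sketched in a few lines of Python. This is a minimal sketch and not how Benchling does it: the function names are made up for illustration, the toy sequence is hypothetical (the real input would be the Lambda genome downloaded from NCBI), and the DNA is treated as linear. The recognition sites listed are palindromic, so scanning one strand is enough.

```python
# Recognition site and cut offset (bases from the start of the site) per enzyme.
ENZYMES = {
    "EcoRI": ("GAATTC", 1),
    "BamHI": ("GGATCC", 1),
    "EcoRV": ("GATATC", 3),
    "XhoI": ("CTCGAG", 1),
    "NdeI": ("CATATG", 2),
}


def cut_positions(seq, site, offset):
    """Return every cut position for one enzyme on a linear sequence."""
    seq = seq.upper()
    positions = []
    start = seq.find(site)
    while start != -1:
        positions.append(start + offset)
        start = seq.find(site, start + 1)
    return positions


def fragment_lengths(seq, cuts):
    """Fragment sizes (in bases) produced by cutting at the given positions."""
    edges = [0] + sorted(cuts) + [len(seq)]
    return [b - a for a, b in zip(edges, edges[1:])]


def single_cutters(seq, enzymes, lo, hi):
    """Enzymes that cut exactly once and leave a fragment between lo and hi bases."""
    hits = {}
    for name, (site, offset) in enzymes.items():
        cuts = cut_positions(seq, site, offset)
        if len(cuts) == 1:
            frags = fragment_lengths(seq, cuts)
            if any(lo <= f <= hi for f in frags):
                hits[name] = frags
    return hits


# Hypothetical toy sequence: two EcoRI sites and one BamHI site.
toy = "A" * 10 + "GAATTC" + "A" * 20 + "GGATCC" + "A" * 5 + "GAATTC" + "A" * 4
print(single_cutters(toy, ENZYMES, 15, 25))  # {'BamHI': [37, 20]}
```

On the real genome, `lo` and `hi` would be 3000 and 5000 to match the 3-5 kb target.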

Simple. They work. Now, we’re going to do some genetic engineering + manipulation to make the biology adhere to our human image. We want a smiley face, so we will cut the DNA to make our desired image.

We’ll start with BamHI, then remove from the bottom up, until we have the spectrum of depth for one side of the smiley face. We will simplify and limit the restriction enzymes to just NotI for the eyes, and BamHI + sequence modification for the smile.

We want to remove the EcoRV recognition sequence, so we will mutate the recognition site so EcoRV no longer recognizes it. EcoRV recognizes GAT^ATC and cuts between the T and the A.

Mutation Strategy: Change GATATC → GATTTC

I noticed that there were 2 sequences, which I learned are the sub-sequences that EcoRV recognizes.

After editing, the EcoRV tag disappears.
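The same edit can be sketched outside Benchling. This is only an illustration: the toy sequence is hypothetical and the helper names are mine, not part of any tool.

```python
ECORV_SITE = "GATATC"


def count_sites(seq, site=ECORV_SITE):
    """Count occurrences of a recognition site, including overlapping ones."""
    seq = seq.upper()
    return sum(
        1 for i in range(len(seq) - len(site) + 1) if seq[i:i + len(site)] == site
    )


def remove_sites(seq, site=ECORV_SITE, replacement="GATTTC"):
    """Swap every copy of the site for the one-base mutant (A -> T at position 4)."""
    return seq.upper().replace(site, replacement)


toy = "CCGGATATCTTAAGGATATCGG"  # hypothetical sequence with two EcoRV sites
edited = remove_sites(toy)
print(count_sites(toy), count_sites(edited))  # 2 0
```

For gel art the base change is harmless, but inside a real gene GATATC → GATTTC could change an amino acid, so a synonymous change would probably be the safer choice there.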

We rerun with BamHI, with significantly shorter

After realizing that BamHI and EcoRV are different enzymes, and that my EcoRV edits in the genome did not touch the BamHI sites, I just kept going to see how it worked. After getting the hang of it, I saw how editing the genome changed the digests. Apparently, Benchling is not tracking state between runs; it sets up the enzymes per run and applies them in parallel to the newly modified genome.

…Due to time constraints, I needed to move onto the 3rd part of the homework. I will research better strategies for making them on YouTube.

…to be continued

Part 3: DNA Design Challenge

3.1 Choose your protein.

Botulinum Neurotoxin A (Botox)
I chose this because it is the most toxic protein known to humans, yet we use it in cosmetics, facial injections, and migraine treatment. Humans are wild.
Function: Cleaves SNARE proteins to block neurotransmitter release.
Length: 1,296 amino acids (3,888 letters)
Source: Uniprot (see below for protein sequence)

GLP-1 (peptide hormone)
I chose this to understand its regulatory elements. It is clearly a popular drug, but doesn’t solve the root problem (like most drugs). I want to understand how it works to suppress appetite.
Function: Increases insulin secretion & slows gastric emptying.
Length: ~31 amino acids (90-93 letters) for the active peptide; the Uniprot entry is the 180-amino-acid pro-glucagon precursor that GLP-1 is cut from
Source: Uniprot (see below for protein sequence)

Botulinum Neurotoxin A (Botox) protein sequence

>sp|P0DPI0|BXA1_CLOBO Botulinum neurotoxin type A OS=Clostridium botulinum OX=1491 GN=botA PE=1 SV=1
MPFVNKQFNYKDPVNGVDIAYIKIPNVGQMQPVKAFKIHNKIWVIPERDTFTNPEEGDLNPPPEAKQVPVSYYDSTYLSTDNEKDNYLKGVTKLFERIYSTDLGRMLLTSIVRGIPFWGGSTIDTELKVIDTNCINVIQPDGSYRSEELNLVIIGPSADIIQFECKSFGHEVLNLTRNGYGSTQYIRFSPDFTFGFEESLEVDTNPLLGAGKFATDPAVTLAHELIHAGHRLYGIAINPNRVFKVNTNAYYEMSGLEVSFEELRTFGGHDAKFIDSLQENEFRLYYYNKFKDIASTLNKAKSIVGTTASLQYMKNVFKEKYLLSEDTSGKFSVDKLKFDKLYKMLTEIYTEDNFVKFFKVLNRKTYLNFDKAVFKINIVPKVNYTIYDGFNLRNTNLAANFNGQNTEINNMNFTKLKNFTGLFEFYKLLCVRGIITSKTKSLDKGYNKALNDLCIKVNNWDLFFSPSEDNFTNDLNKGEEITSDTNIEAAEENISLDLIQQYYLTFNFDNEPENISIENLSSDIIGQLELMPNIERFPNGKKYELDKYTMFHYLRAQEFEHGKSRIALTNSVNEALLNPSRVYTFFSSDYVKKVNKATEAAMFLGWVEQLVYDFTDETSEVSTTDKIADITIIIPYIGPALNIGNMLYKDDFVGALIFSGAVILLEFIPEIAIPVLGTFALVSYIANKVLTVQTIDNALSKRNEKWDEVYKYIVTNWLAKVNTQIDLIRKKMKEALENQAEATKAIINYQYNQYTEEEKNNINFNIDDLSSKLNESINKAMININKFLNQCSVSYLMNSMIPYGVKRLEDFDASLKDALLKYIYDNRGTLIGQVDRLKDKVNNTLSTDIPFQLSKYVDNQRLLSTFTEYIKNIINTSILNLRYESNHLIDLSRYASKINIGSKVNFDPIDKNQIQLFNLESSKIEVILKNAIVYNSMYENFSTSFWIRIPKYFNSISLNNEYTIINCMENNSGWKVSLNYGEIIWTLQDTQEIKQRVVFKYSQMINISDYINRWIFVTITNNRLNNSKIYINGRLIDQKPISNLGNIHASNNIMFKLDGCRDTHRYIWIKYFNLFDKELNEKEIKDLYDNQSNSGILKDFWGDYLQYDKPYYMLNLYDPNKYVDVNNVGIRGYMYLKGPRGSVMTTNIYLNSSLYRGTKFIIKKYASGNKDNIVRNNDRVYINVVVKNKEYRLATNASQAGVEKILSALEIPDVGNLSQVVVMKSKNDQGITNKCKMNLQDNNGNDIGFIGFHQFNNIAKLVASNWYNRQIERSSRTLGCSWEFIPVDDGWGERPL

GLP-1 protein sequence

>sp|P01275|GLUC_HUMAN Pro-glucagon OS=Homo sapiens OX=9606 GN=GCG PE=1 SV=3
MKSIYFVAGLFVMLVQGSWQRSLQDTEEKSRSFSASQADPLSDPDQMNEDKRHSQGTFTSDYSKYLDSRRAQDFVQWLMNTKRNRNNIAKRHDEFERHAEGTFTSDVSSYLEGQAAKEFIAWLVKGRGRRDFPEEVAIVEELGRRHADGSFSDEMNTILDNLAARDFINWLIQTKITDRK
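The UniProt entry above is the whole pro-glucagon precursor, not just GLP-1. Here is a small sketch of locating the active peptide inside it. The GLP-1(7-37) sequence is the commonly cited 31-residue form, and the variable names are illustrative.

```python
# Pro-glucagon (P01275), copied from the entry above and split only for readability.
PROGLUCAGON = (
    "MKSIYFVAGLFVMLVQGSWQ"
    "RSLQDTEEKSRSFSASQADPLSDPDQMNEDKR"
    "HSQGTFTSDYSKYLDSRRAQDFVQWLMNT"
    "KRNRNNIAKRHDEFER"
    "HAEGTFTSDVSSYLEGQAAKEFIAWLVKGRG"
    "RRDFPEEVAIVEELGRR"
    "HADGSFSDEMNTILDNLAARDFINWLIQTKITD"
    "RK"
)

GLP1_7_37 = "HAEGTFTSDVSSYLEGQAAKEFIAWLVKGRG"

start = PROGLUCAGON.find(GLP1_7_37)  # 0-based index, -1 if absent
end = start + len(GLP1_7_37)

print("precursor length:", len(PROGLUCAGON))
print("GLP-1(7-37) residues:", start + 1, "to", end)  # 1-based positions
print("peptide length:", len(GLP1_7_37), "aa =", 3 * len(GLP1_7_37), "nt")
```

This is where the "~31 amino acids, 90-93 letters" estimate comes from, while reverse translating the full entry gives 540 bases.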

3.2. Reverse Translate: Protein (amino acid) sequence to DNA (nucleotide) sequence.

Botulinum Neurotoxin A (Botox) reverse translate

GLP-1 reverse translate

reverse translation of sp|P0DPI0|BXA1_CLOBO Botulinum neurotoxin type A OS=Clostridium botulinum OX=1491 GN=botA PE=1 SV=1 to a 3888 base sequence of most likely codons.
atgccgtttgtgaacaaacagtttaactataaagatccggtgaacggcgtggatattgcgtatattaaaattccgaacgtgggccagatgcagccggtgaaagcgtttaaaattcataacaaaatttgggtgattccggaacgcgatacctttaccaacccggaagaaggcgatctgaacccgccgccggaagcgaaacaggtgccggtgagctattatgatagcacctatctgagcaccgataacgaaaaagataactatctgaaaggcgtgaccaaactgtttgaacgcatttatagcaccgatctgggccgcatgctgctgaccagcattgtgcgcggcattccgttttggggcggcagcaccattgataccgaactgaaagtgattgataccaactgcattaacgtgattcagccggatggcagctatcgcagcgaagaactgaacctggtgattattggcccgagcgcggatattattcagtttgaatgcaaaagctttggccatgaagtgctgaacctgacccgcaacggctatggcagcacccagtatattcgctttagcccggattttacctttggctttgaagaaagcctggaagtggataccaacccgctgctgggcgcgggcaaatttgcgaccgatccggcggtgaccctggcgcatgaactgattcatgcgggccatcgcctgtatggcattgcgattaacccgaaccgcgtgtttaaagtgaacaccaacgcgtattatgaaatgagcggcctggaagtgagctttgaagaactgcgcacctttggcggccatgatgcgaaatttattgatagcctgcaggaaaacgaatttcgcctgtattattataacaaatttaaagatattgcgagcaccctgaacaaagcgaaaagcattgtgggcaccaccgcgagcctgcagtatatgaaaaacgtgtttaaagaaaaatatctgctgagcgaagataccagcggcaaatttagcgtggataaactgaaatttgataaactgtataaaatgctgaccgaaatttataccgaagataactttgtgaaattttttaaagtgctgaaccgcaaaacctatctgaactttgataaagcggtgtttaaaattaacattgtgccgaaagtgaactataccatttatgatggctttaacctgcgcaacaccaacctggcggcgaactttaacggccagaacaccgaaattaacaacatgaactttaccaaactgaaaaactttaccggcctgtttgaattttataaactgctgtgcgtgcgcggcattattaccagcaaaaccaaaagcctggataaaggctataacaaagcgctgaacgatctgtgcattaaagtgaacaactgggatctgttttttagcccgagcgaagataactttaccaacgatctgaacaaaggcgaagaaattaccagcgataccaacattgaagcggcggaagaaaacattagcctggatctgattcagcagtattatctgacctttaactttgataacgaaccggaaaacattagcattgaaaacctgagcagcgatattattggccagctggaactgatgccgaacattgaacgctttccgaacggcaaaaaatatgaactggataaatataccatgtttcattatctgcgcgcgcaggaatttgaacatggcaaaagccgcattgcgctgaccaacagcgtgaacgaagcgctgctgaacccgagccgcgtgtataccttttttagcagcgattatgtgaaaaaagtgaacaaagcgaccgaagcggcgatgtttctgggctgggtggaacagctggtgtatgattttaccgatgaaaccagcgaagtgagcaccaccgataaaattgcggatattaccattattattccgtatattggcccggcgctgaacattggcaacatgctgtataaagatgattttgtgggcgcgctgatttttagcggcgcggtgattctgctggaatt
tattccggaaattgcgattccggtgctgggcacctttgcgctggtgagctatattgcgaacaaagtgctgaccgtgcagaccattgataacgcgctgagcaaacgcaacgaaaaatgggatgaagtgtataaatatattgtgaccaactggctggcgaaagtgaacacccagattgatctgattcgcaaaaaaatgaaagaagcgctggaaaaccaggcggaagcgaccaaagcgattattaactatcagtataaccagtataccgaagaagaaaaaaacaacattaactttaacattgatgatctgagcagcaaactgaacgaaagcattaacaaagcgatgattaacattaacaaatttctgaaccagtgcagcgtgagctatctgatgaacagcatgattccgtatggcgtgaaacgcctggaagattttgatgcgagcctgaaagatgcgctgctgaaatatatttatgataaccgcggcaccctgattggccaggtggatcgcctgaaagataaagtgaacaacaccctgagcaccgatattccgtttcagctgagcaaatatgtggataaccagcgcctgctgagcacctttaccgaatatattaaaaacattattaacaccagcattctgaacctgcgctatgaaagcaaccatctgattgatctgagccgctatgcgagcaaaattaacattggcagcaaagtgaactttgatccgattgataaaaaccagattcagctgtttaacctggaaagcagcaaaattgaagtgattctgaaaaacgcgattgtgtataacagcatgtatgaaaactttagcaccagcttttggattcgcattccgaaatattttaacagcattagcctgaacaacgaatataccattattaactgcatggaaaacaacagcggctggaaagtgagcctgaactatggcgaaattatttggaccctgcaggatacccaggaaattaaacagcgcgtggtgtttaaatatagccagatgattaacattagcgattatattaaccgctggatttttgtgaccattaccaacaaccgcctgaacaacagcaaaatttatattaacggccgcctgattgatcagaaaccgattagcaacctgggcaacattcatgcgagcaacaacattatgtttaaactggatggctgccgcgatacccatcgctatatttggattaaatattttaacctgtttgataaagaactgaacgaaaaagaaattaaagatctgtatgataaccagagcaacagcggcattctgaaagatttttggggcgattatctgcagtatgataaaccgtattatatgctgaacctgtatgatccgaacaaatatgtggatgtgaacaacgtgggcattcgcggctatatgtatctgaaaggcccgcgcggcagcgtgatgaccaccaacatttatctgaacagcagcctgtatcgcggcaccaaatttattattaaaaaatatgcgagcggcaacaaagataacattgtgcgcaacaacgatcgcgtgtatattaacgtggtggtgaaaaacaaagaatatcgcctggcgaccaacgcgagccaggcgggcgtggaaaaaattctgagcgcgctggaaattccggatgtgggcaacctgagccaggtggtggtgatgaaaagcaaaaacgatcagggcattaccaacaaatgcaaaatgaacctgcaggataacaacggcaacgatattggctttattggctttcatcagtttaacaacattgcgaaactggtggcgagcaactggtataaccgccagattgaacgcagcagccgcaccctgggctgcagctgggaatttattccggtggatgatggctggggcgaacgcccgctg

reverse translation of sp|P01275|GLUC_HUMAN Pro-glucagon OS=Homo sapiens OX=9606 GN=GCG PE=1 SV=3 to a 540 base sequence of most likely codons.
atgaaaagcatttattttgtggcgggcctgtttgtgatgctggtgcagggcagctggcagcgcagcctgcaggataccgaagaaaaaagccgcagctttagcgcgagccaggcggatccgctgagcgatccggatcagatgaacgaagataaacgccatagccagggcacctttaccagcgattatagcaaatatctggatagccgccgcgcgcaggattttgtgcagtggctgatgaacaccaaacgcaaccgcaacaacattgcgaaacgccatgatgaatttgaacgccatgcggaaggcacctttaccagcgatgtgagcagctatctggaaggccaggcggcgaaagaatttattgcgtggctggtgaaaggccgcggccgccgcgattttccggaagaagtggcgattgtggaagaactgggccgccgccatgcggatggcagctttagcgatgaaatgaacaccattctggataacctggcggcgcgcgattttattaactggctgattcagaccaaaattaccgatcgcaaa
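Reverse translation with "most likely codons" is essentially a lookup: each amino acid is replaced by one fixed codon. A minimal sketch follows. The table was read off the tool output above rather than taken from an official codon-usage table, so treat it as an assumption.

```python
# One codon per amino acid, as observed in the reverse-translation output above.
MOST_LIKELY_CODON = {
    "A": "gcg", "C": "tgc", "D": "gat", "E": "gaa", "F": "ttt",
    "G": "ggc", "H": "cat", "I": "att", "K": "aaa", "L": "ctg",
    "M": "atg", "N": "aac", "P": "ccg", "Q": "cag", "R": "cgc",
    "S": "agc", "T": "acc", "V": "gtg", "W": "tgg", "Y": "tat",
}


def reverse_translate(protein):
    """Map each amino acid to its single chosen codon."""
    return "".join(MOST_LIKELY_CODON[aa] for aa in protein.upper())


print(reverse_translate("MPFVNKQFNY"))  # first 10 residues of the Botox sequence
print(reverse_translate("MKSIYFVAGL"))  # first 10 residues of pro-glucagon
```

Because every amino acid always gets the same codon, this ignores codon context entirely, which is one reason a separate optimization step exists.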

3.3 Codon optimization.

Codon optimization swaps synonymous codons for the ones the host cell uses most often, so the same protein is translated more efficiently and with higher yield. You are essentially optimizing the manufacturing part of the cell. I chose E. coli because that is what the Peptide 2.0 website defaults to, although a different host may well be a better choice (I also think there is an opportunity to create a neural network for optimizing the cell <> protein output yield). //TODO: optimize cell choice based on output protein, input cost, and quality

Cell → DNA sequence→ corresponding mRNA → specific production site → protein

Optimizations are done with VectorBuilder.

Botulinum Neurotoxin A (Botox)

Improved DNA[1]: GC=78.31%, CAI=0.87

>BOTULINUM_NEUROTOXIN_A
GCGACCGGTTGCTGCGGCACCACCACCGGTACCGGTGCGGCCTGTGCAGCGGCCTGCGCGGGTACCACCACCGCAGCCTGTACGGCCACCGCCGCGGCAGGTGCCACGTGCTGCGGCGGCACCGGTGCCGCCTGCGGCGGCTGCGGCACCGGCGGCGCCACCGCGACCACTGGCTGCGGCACCGCGACCGCCACCACCGCGGCCGCCGCGACCACCTGCTGCGGCGCGGCGTGTGGCACCGGCGGCGGTTGCTGTGCCGGCGCGACAGGCTGTGCGGGCTGCTGCGGCGGCACCGGCGCAGCCGCGGGCTGCGGCACCACTACCGCCGCAGCGGCGACGACCTGTGCGACCGCGGCCTGCGCGGCGGCGGCGACCACCACCGGTGGCGGCACCGGTGCGACCACCTGCTGTGGCGGCGCGGCGTGTGGCTGCGGCGCCACCGCCTGCTGCACCACCACCGCGTGCTGCGCCGCCTGTTGTTGCGGCGGCGCCGCGGGCGCGGCGGGCGGTTGCGGCGCGACCTGTACCGGCGCGGCCTGCTGTTGCGGTTGTTGCGGCTGCTGCGGCGGTGCAGCGGGTTGCGGCGCGGCCGCGTGTGCCGGCGGCACCGGTTGCTGCGGCGGTACCGGCGCGGGTTGTACTGCGACCACCGCGACCGGCGCGACCGCCGGCTGCGCGTGCTGTACCGCAACCTGCACCGGTGCGGGCTGCGCGTGTTGCGGCGCGACCGCGGCGTGCGGCGCGGCGGCGGCAGCGGGCGCGACCGCGGCGTGCACGGCCACCTGCACCGGTGCGGCGGCGGGCGGCTGCGGCACCGGTGCATGCTGTGCGGCCGCATGTGCGGGTACCACCACTGGTGCGGCCTGCGGCTGTGCCACGACGACCGCGACCGCGGGCTGTGCCTGCTGCGGCGCGACGTGCACCGGCGGTGGTTGCTGTGGCTGCGCGACCGGCTGTACCGGATGCACCGGCGCGTGCTGTGCGGGCTGCGCCACCACAGGCACCGGCTGCGGCTGCGGTGGCTGTGCCACCACCTGCTGCGGCACCACCACCACCGGCGGCGGTGGCTGCGGTGGTTGTGCGGGCTGTGCGTGTTGCGCGACCACCGGCGCTACCGCGTGCTGTGGTGCGGCCGGCACCGGCGCCGCGGCGGGCACCGGCGCGACCACCGGTGCGACCGCCTGCTGCGCGGCGTGCACCGGCTGCGCGACGACCGCGGCCTGCGGTACCGGCGCCACCACCTGCGCAGGCTGTTGTGGTGGCGCCACCGGTGGCTGCGCGGGCTGCACCGCGACCTGTGGCTGCGCCGGCTGCGGCGCCGCGGGCGCGGCGTGCACCGGCGCGGCGTGCTGCACCGGCGGCACCGGCGCCACAACCGCGACCACCGGCGGCTGCTGCTGCGGCGCAGGCTGCGGCTGCGGCGGAGCCACGGCGACCACCGCGACCACCTGCGCAGGTACCACTACCGGCGCGGCGACCGGCTGTGCGGCGGCCGCTGGTTGCACCACCACCGGAGGCTGTTGCGCAACCGGTGCCGCGGGCACCGGCTGTACGGGTGCCGCCTGCTGCACCGGCGCCTGCTGTTGTGGTTGTGCCGCCTGTGGTGGCTGCACTGCGACCGGCGGCTGCGCGGGCTGCGCGTGCTGCTGCGCGGGTACCGCCACCGCGACCACTTGCGGCTGTACCACCACCGCAGGCTGCTGCTGTGGCGGCGCAACCACCACCACCGCCTGTTGCACCACCACGGGCGGTTGCACCACCACCGGCGCAGCCGGCGCCGCCGCCGGCTGCTGCACCGGCGGTGCCGCAGGCACCGGCGGCGCAACTGCCTGTTGCGCGGCCTGCTGCTGCGGCTGCACCGGCTGCACGGGTGGCGGCTGCGGCTGTGGCGGCGGCTGTGCAGCGGCAACCACTACCGGCTGCGGCGCATGCTGTGGCGCCACGTGCTGTGGTGGTTGTGGCGGCACCGGCGCGTGCTGTTGCACGGGTGGTTGCGGCTG
CGCCACCGGTGCGGCGTGCACCGGAGCGACCACCTGTGCGACCGGCTGCGGCGGCGGCTGCTGCGCGACCTGCGGCTGCTGCACCGGTACCGCGACCGGTGGTTGCGCCACCACCGGTTGCGGCGCCACCACCGCGGCGTGCTGCTGCGGTGCCGCGTGCTGCGGGTGCGGCACCGGCACCACCACCGCGGCAGCGGGTACCGGCGCGGCGTGCGCGTGCTGCGCGGCCTGTGGCTGCGGCACCGCCACCACGGCGACCGGCGCGGCGGCCACGGGTGCGGGTTGCGGCGGCTGCTGCACCGGCGGTGCCGCGGGTACCGGCGCGGGCTGTACCACCACCGGCGCAGCCGGTGCGGCGTGCACCGGCTGCGGTTGTGCGTGTTGTACCACCACCGGCGGCTGCGGCGGTTGCTGCGCGACCGGCGCGACCGGCTGCGGCGCGGCGGCGACCACCACCGCCACAACCGGCGCGACCGCGGGTTGCTGTACCGGCTGCGCAGGTGGCGCGGCGGCAGCGTGCGGCGCGGCCACCACCACCTGCGGCTGCTGTACCGGCACCGCCACCACCGCCACTACGGCCACCGCTGCCTGCGCGGCAGCGACCACCACCGCCGCGGCGGGCGCGACCGCCACCACCGGTTGCGGTGCGGGCTGTGCCTGTTGCTGCACCGGCGCAGCCTGTGCCGCAGCGGGTTGTGGCGCGGCCGCCGCCGGCTGCGCAACTACGGGCACGGGCGGCGGCTGCGCGTGTTGCGCCTGCTGCGGTTGCGGAGCCGGCTGCTGCACCGGCTGCGCGGGCACCGCCACCGCGACTGGCGCGGCCGCAGCCGCCTGCGGCACCGGCACCACCACCGCGGCGGCGGGCGCGGCGGCCGCAGCCACGGCGACCTGTACTGGCTGCACCGGTGCGGGTTGTGGTGCGGCGGGCGCAACCGCGTGCTGCGCAGGCTGCGGCGGTTGCGCGGCGGCGACAACCACCGCGGGCTGCGGCACCGGCGGCGCGACCGCCGCCGCGTGTACCGGTGCGGCGGCCACCACCACTGGCGCGACCGCGGCGGCCTGTACGGGCACCGCCACCGCGGCGGCCGCCACCGGCTGCACCGGCGCATGCTGCGGCGCCGCGACGACCACCACCGCGACCGCCTGCTGCGGCGCGGCCGGCGCGACGGCGGCTTGTACCACCACTGGCACGGGCGCCGCGGCGACTACCACCACCACCACCGCGGCGGCGGGTACCGGCTGCACCGGCGCCGCGTGTTGTGGCTGTGCGGCGGCGGCATGTTGTACCGCCACCTGTACCGGCGCCGCGACCTGCACTGGTGCGGCGTGCGCGGCGGCTGGTGGTTGCGGCGCGGCGGCCTGCGGTGGTGCGGCCTGTGCCACCACCACCGCCACCACCGGCTGTGGCACCGGCGGCTGCACCGGCGGCGGCTGTGGTGCGGCGGCCGGTGGTTGCTGTGGCTGCGGCGGCTGTTGCGGCTGTTGTGGCTGCGGTGCCACCACCACCACCTGCTGTGGTGGCGCCGCCGGCGCGGCTGGCACCGGTGGCTGCGGCGCGACCACGGGCACCGGCGGCGCAGCGGGCGCCGCGTGCACGGGCGGTGGCTGTTGCGGCTGCTGCGGTTGCTGTGCCACCGGCTGCGGTGGTGCGACCGGGGGTTGCGCGGGTTGCACCACTACCGCGGGCTGTGGTGCAACCGGTGCGGCCGCCACCGGCGCGGCCTGTGCCTGTTGCGCGACGACTTGTACCGGTGGCGCAACCGCGGCGTGCTGTACCGGCGGCTGCGGCGGTTGCGGCTGCGGCTGCGGGGCCACCACCACCACCGCAACCACCGCGGCGTGCGGTGGCTGTACCGGCGCGACGACCTGCGCGGGCGCCTGCTGCGCAGCGGCGGCGACCACCGCATGCTGCGGCGCCACGTGCGGCTGCGCGGCAGCCTAA

GLP-1

>GLP1
GCAACTGGTGCGGCAGCCGCGGGCTGCGCGACCACAACCGCCACCACTACTACGGGCACCGGCGGCTGTGGCGGCGGTTGCTGTACCGGCACCACCACCGGCACTGGCGCCACCGGTTGTACCGGCGGCACCGGCTGTGCGGGCGGCGGTTGCGCGGGTTGCACTGGCGGCTGCGCCGGCTGTGGCTGCGCCGGCTGTTGTACCGGCTGTGCGGGTGGCGCGACCGCGTGCTGCGGTGCCGCCGGCGCGGCGGCGGCGGCGGCCGGCTGCTGCGGCTGCGCGGGCTGCACCACCACCGCGGGCTGTGGCTGCGGTGCGGGTTGTTGTGCAGGCGGCTGTGGCGGCGCAACGTGCTGTGGTTGCACCGGCGCCGGTTGTGGAGCGACTTGTTGTGGCGGCGCGACGTGTGCGGGTGCCACCGGCGCGGCCTGCGGCGCGGCGGGCGCGACCGCGGCGGCGTGCGGCTGCTGTGCGACTGCCGGTTGCTGTGCGGGCGGCGGTTGTGCGTGCTGTACCACCACCGCGTGTTGCGCCGGCTGCGGCGCGACCACCGCGACTGCCGGCTGTGCAGCGGCGACCGCGACCTGCACCGGCGGCGCCACCGCGGGCTGCTGTGGTTGTTGCGGCTGCGGTTGCGGCTGCGCGGGCGGCGCGACCACCACCACCGGCACGGGTTGCGCGGGTACCGGCGGTTGTACCGGTGCGACTGGTGCGGCGTGCGCGTGCTGTGCAGCGGCGTGCGGCTGCGCCGCCTGCTGCGGCTGCGCCGCGTGCGCGGCGTGCGCGACCACAGGCTGTGGCGCGGCGGCGTGTGGTTGCTGCGCCACCGGCGCCACCGGCGCCGCGACCACGACCGGCGCCGCGTGTGGCTGCTGTGCGACCGGCTGCGGTGGCGCGGCCGGCGGCTGCGCATGCTGCACCACCACCGCGTGCTGCGCGGGCTGTGGCGCGACCGGCACCGGTGCGGGTTGTGCCGGTTGCACCGCCACCTGCACCGGTGGCGCGGCGGGCGGCTGCTGCGCGGGCGGCTGCGGCGGCTGTGGCGCGGCGGCGGGTGCCGCCACCACCACCGCCACCACCGGCTGCGGTACCGGCGGCTGCACCGGCGGCACCGGTGCGGCCGCAGGCGGCTGCTGCGGCTGTGGCGGCTGCTGTGGTTGTTGCGGCTGCGGTGCGACGACCACCACCTGCTGCGGCGGCGCGGCCGGCGCCGCGGGCACCGGTGGCTGCGGCGCGACCACGGGCACCGGTGGCGCGGCCGGCGCGGCGTGTACCGGTGGCGGCTGCTGCGGCTGTTGCGGCTGTTGCGCGACGGGCTGTGGTGGCGCGACCGGCGGCTGTGCGGGCTGCACCACTACCGCGGGTTGCGGCGCAACCGGTGCCGCGGCGACCGGTGCGGCGTGCGCGTGCTGCGCGACCACCTGTACCGGCGGCGCGACTGCGGCGTGCTGCACCGGTGGCTGTGGCGGCTGTGGTTGCGGATGCGGCGCGACCACGACCACCGCCACCACGGCGGCGTGTGGCGGCTGCACCGGTGCGACCACCTGCGCGGGCGCCTGTTGCGCCGCGGCTGCGACCACTGCGTGTTGCGGCGCGACCTGTGGCTGTGCGGCGGCGTAA
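A quick sanity check on any optimized sequence is to translate it back and compare it with the original protein, and to look at GC content. The sketch below uses the standard genetic code; the function names are illustrative.

```python
# Standard genetic code, built from the usual T, C, A, G ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna):
    """Translate complete codons; unknown codons become 'X'."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, usable, 3))


def gc_fraction(dna):
    """Fraction of bases that are G or C."""
    dna = dna.upper()
    return sum(base in "GC" for base in dna) / len(dna)


reverse_translated = "atgccgtttgtgaacaaa"  # start of the reverse translation
optimized = "GCGACCGGTTGCTGCGGC"           # start of the optimized output above

print(translate(reverse_translated))  # MPFVNK
print(translate(optimized))           # ATGCCG
print(round(gc_fraction(optimized), 2))
```

The first 18 bases of the optimized Botox output translate to ATGCCG rather than MPFVNK. Those letters spell the start of the DNA sequence itself, which suggests the nucleotide sequence was submitted to the optimizer as if it were a protein (A, T, G, C read as Ala, Thr, Gly, Cys). The very high GC content would be consistent with that, so the optimization is probably worth rerunning with the amino acid sequence as input.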

3.4 You have a sequence! Now what?

Cell-dependent

This method uses living cells to transcribe DNA into mRNA and translate it into a functional protein. Botulinum Neurotoxin A (Botox) is typically produced in Clostridium botulinum, but apparently, recombinant versions can be expressed in safer bacterial or mammalian systems. GLP-1, on the other hand, is often produced in E. coli or yeast for industrial-scale peptide drug production.

The process for doing this is as follows:

The codon-optimized gene is introduced into an expression vector (a plasmid) under the control of a strong promoter
The vector is introduced into a host system (like E. coli, Saccharomyces cerevisiae, or mammalian cells like HEK293 or CHO)
The host cell machinery transcribes the gene into mRNA and translates it into protein using its ribosomes
Proteins fold and undergo post-translational modifications in eukaryotic systems
The protein is extracted using affinity chromatography (His-tag or antibody-based purification).

are there more ways?

Mass spectrometry and/or Western blotting verifies the structure, purity, and function as a QA process

Cell-free

This method uses ribosomes and enzymes outside of living cells to directly produce protein.

The process is as follows:

Lysates from E. coli, rabbit reticulocytes, or wheat germ provide ribosomes, tRNAs, and enzymes
Linear or plasmid DNA or mRNA containing the gene of interest is added
A coupled (enzyme-catalyzed) reaction transcribes DNA into mRNA and translates it into protein using ribosomes
Proteins fold

*this is way more efficient and controllable, cell-free manufacturing + personalized cell QA testing seems like the best way forward

3.5 How does it work in nature/biological systems?

A single gene can code for multiple proteins through mechanisms like alternative splicing and RNA editing (both post-transcriptional) and alternative translation initiation.
DNA sequence containing exons and introns (in eukaryotes) → transcribed into mRNA, where U replaces T and introns are spliced out → translated into protein (3 nucleotides for one amino acid)
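A toy sketch of how one gene can give two different mRNAs through alternative splicing. The exon sequences are hypothetical and the function names are illustrative; real splicing depends on splice-site signals that are not modeled here.

```python
def transcribe(dna):
    """Coding-strand DNA to mRNA: U replaces T."""
    return dna.upper().replace("T", "U")


def splice(exons, keep):
    """Join the selected exons, in order, into one transcript."""
    return "".join(exons[i] for i in keep)


# Hypothetical gene with three exons (introns already left out).
exons = ["ATGGCT", "GAATTT", "CCGTAA"]

isoform_full = transcribe(splice(exons, [0, 1, 2]))
isoform_skip = transcribe(splice(exons, [0, 2]))  # exon 2 skipped

print(isoform_full)  # AUGGCUGAAUUUCCGUAA
print(isoform_skip)  # AUGGCUCCGUAA
```

Both transcripts start with AUG and end with the same stop codon, but they encode proteins of different lengths.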

//TODO: actually align the 3 in Benchling

Part 4: DNA Read/Write/Edit

4.1 DNA Read

What DNA would you want to sequence (e.g., read) and why?

I would sequence the full human microbiome and metagenome from diverse environments, including the human gut, every organ, wastewater, soil, and extreme environments. Ideally we could also digitize all input genomes from the food we consume, the air inside + outside the home, animals in the home, all surfaces + their bacteria, and everyone the individual interacts with. The goal would be to build a humanity-level immunity network, to detect disease at the source and stop it. You want to shorten the gap between disease creation and humanity consumption. Obviously, to do this, the details below are insufficient.

In lecture, a variety of sequencing technologies were mentioned. What technology or technologies would you use to perform sequencing on your DNA and why? Also answer the following questions:

Oxford Nanopore (ONT) + PacBio HiFi hybrid sequencing because it combines ultra-long reads for genome assembly & structural variants (ONT) and high accuracy for single-molecule resolution (PacBio HiFi)

3rd generation- no need to do PCR amplification.

What is your input? How do you prepare your input (e.g. fragmentation, adapter ligation, PCR)? List the essential steps.

Sample collection + DNA/RNA extraction using Qiagen or Nanobind
Fragmentation for shorter HiFi reads OR left intact for ONT ultra-long reads, depending on the source and how long its DNA molecules are
Adapter ligation to enzymatically attach sequencing adapters
Prepare the library by inserting barcodes for multiplexing samples
Load into the ONT/PacBio machines

What are the essential steps of your chosen sequencing technology, how does it decode the bases of your DNA sample (base calling)?

Oxford Nanopore (ONT)

DNA is fed through a biological nanopore
Changes in electrical current detect nucleotide identity
Neural-network basecallers (Bonito, Guppy) convert current fluctuations into sequence data

PacBio HiFi

Single-molecule real-time sequencing (ultra fine-grained), where a polymerase repeatedly copies a circularized DNA template
Fluorescently labeled nucleotides emit light as they are incorporated
HiFi reads are generated by sequencing the same molecule multiple times
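The idea behind reading the same molecule multiple times can be sketched as a per-position majority vote. This is a deliberate simplification and not PacBio's actual consensus algorithm: it assumes the passes are already aligned and equal in length, and the example reads are made up.

```python
from collections import Counter


def consensus(passes):
    """Majority vote per position across equal-length, pre-aligned passes."""
    if len({len(p) for p in passes}) != 1:
        raise ValueError("passes must be aligned to the same length")
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*passes)
    )


# Five hypothetical passes over one molecule, two of them with a single error.
passes = ["ACGTTGCA", "ACGATGCA", "ACGTTGCA", "ACCTTGCA", "ACGTTGCA"]
print(consensus(passes))  # ACGTTGCA
```

Random errors rarely land on the same position in most passes, so the vote cancels them out as the number of passes grows.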

What is the output of your chosen sequencing technology?

Oxford Nanopore (ONT)

Ultra-long reads (up to ~4 Mb)
FAST5 → FASTQ
Epigenetic modifications are directly detectable (methylation)

PacBio HiFi

Highly accurate (99.9%) for short to mid reads (10-25kb)
Output is FASTQ / BAM / CCS (circular consensus sequencing)
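FASTQ files store a per-base quality as a Phred score, Q = -10 * log10(error probability), so the 99.9% accuracy figure corresponds to Q30. A small sketch of the conversion and of decoding a quality string follows; the quality string is a made-up example and the standard offset of 33 is assumed.

```python
import math


def accuracy_to_phred(accuracy):
    """Phred score for a given per-base accuracy (e.g. 0.999)."""
    return -10 * math.log10(1 - accuracy)


def phred_to_accuracy(q):
    """Per-base accuracy implied by a Phred score."""
    return 1 - 10 ** (-q / 10)


def mean_quality(quality_string, offset=33):
    """Arithmetic mean of the Phred scores encoded in a FASTQ quality line."""
    scores = [ord(ch) - offset for ch in quality_string]
    return sum(scores) / len(scores)


print(round(accuracy_to_phred(0.999)))  # 30
print(phred_to_accuracy(20))            # 0.99
print(mean_quality("IIII5555"))         # 30.0 ('I' encodes Q40, '5' encodes Q20)
```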

4.2 DNA Write

What DNA would you want to synthesize (e.g., write) and why?

Something that resembles a computer, meaning a sensor, a genetic circuit for computation on that input, and a response. This circuit would function as a programmable sensor inside human cells, capable of detecting and responding to specific biomarkers (inflammatory cytokines, cancer markers, environmental toxins). The core elements include:

Sensor module: a synthetic promoter that activates transcription in response to specific molecules (inflammatory cytokines like IL-6 or TNF-α)
Processing module: a CRISPR-based logic gate (dCas9-based transcriptional control) that enables Boolean computation (AND, OR, NOT gates)
Response module: a genetically encoded therapeutic output, such as an anti-inflammatory peptide, apoptosis inducer (for cancer), or fluorescent reporter for tracking

Promoter -> IL-6 response element -> dCas9 + sgRNA (targeting repressor) -> GFP Reporter
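As a thought experiment, the circuit can be written as Boolean logic. This is an abstraction under my own assumptions, not a validated design: it treats every part as fully on or off, and the two-input AND variant (with the sgRNA under a TNF-α-responsive promoter) is hypothetical.

```python
def gfp_reporter(il6_present):
    """IL-6 turns on dCas9 + sgRNA, which silences the repressor, which frees GFP."""
    dcas9_active = il6_present
    repressor_expressed = not dcas9_active  # first NOT: dCas9 blocks the repressor
    gfp_on = not repressor_expressed        # second NOT: repressor blocks GFP
    return gfp_on


def gfp_reporter_and(il6_present, tnf_present):
    """Hypothetical AND gate: dCas9 needs IL-6, the sgRNA needs TNF-alpha."""
    dcas9_active = il6_present and tnf_present
    return not (not dcas9_active)


for il6 in (False, True):
    print("IL-6:", il6, "-> GFP:", gfp_reporter(il6))

for il6 in (False, True):
    for tnf in (False, True):
        print("IL-6:", il6, "TNF:", tnf, "-> GFP:", gfp_reporter_and(il6, tnf))
```

Two inversions in a row give back the input, so the single-input circuit is a buffer: GFP is on exactly when IL-6 is present.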

What technology or technologies would you use to perform this DNA synthesis and why?

Twist Bioscience DNA Synthesis (silicon-chip-based DNA writing)
Cell-free TxTL (transcription-translation) for rapid testing
Golden Gate & Gibson Assembly for assembly of large constructs (or something newer, not familiar enough yet with the algorithms)

What are the essential steps of your chosen synthesis methods?

Oligos are synthesized on silicon chips using phosphoramidite chemistry, so thousands can be synthesized in parallel
Small oligos are error-corrected and assembled using ligation or PCR
Assembled DNA is cloned into vectors
Synthesized DNA is delivered as plasmids or linear DNA, ready for direct transformation into cells or cell-free systems

What are the limitations of your synthesis method (if any) in terms of speed, accuracy, scalability?

Speed	Oligo synthesis + assembly takes 1-2 weeks, nowhere near real-time
Scalability	Oligos are limited to a few hundred bases and synthesized genes to a few kb per construct, requiring modular assembly for larger genomes
Accuracy	Error rate = 1:10,000 bases, requiring sequencing verification (NGS, Sanger)
Cost	Expensive for long constructs, but scalable for small genetic circuits
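To see why sequence verification is needed, the error rate can be turned into the chance that a construct comes out perfect. A minimal sketch, assuming errors are independent and using the 1 in 10,000 figure from the table:

```python
def p_error_free(length_bp, error_rate=1e-4):
    """Probability that every base in a construct is correct."""
    return (1 - error_rate) ** length_bp


for name, length in [("pro-glucagon gene", 540), ("Botox gene", 3888)]:
    print(name, length, "bp:", round(p_error_free(length), 2))
```

Under these assumptions a 540 bp construct is error-free about 95% of the time, but a 3,888 bp one only about 68% of the time, so roughly one in three long clones would carry at least one mistake.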

4.3 DNA Edit

What DNA would you want to edit and why?

I would edit human immune system genes to enhance disease resistance and longevity. I constantly notice friends and family members developing allergies and inflammation, including myself. Some genes from a quick search:

CCR5 Δ32 mutation: introducing the 32-base-pair deletion in CCR5 for HIV resistance
PCSK9 knockout: disrupting PCSK9 to lower LDL cholesterol, preventing cardiovascular disease, the #1 cause of death globally
MYC and TP53 regulation: fine-tuning expression of these genes to increase regenerative capacity while suppressing cancer (math problem)
Mitochondrial gene editing: correcting mutations associated with aging and metabolic diseases in the mitochondria, which I believe are the engine of life
Enhanced DNA repair mechanisms: increasing expression of genes like SIRT6, FOXO3, and GADD45 to extend lifespan by reducing genomic instability. It would be a positive for humanity if we could systematically accelerate healing

What technology or technologies would you use to perform these DNA edits and why? Also answer the following questions:

Prime Editing: For precise, efficient edits without double-strand breaks
Base Editing: For single-letter mutations with ultra-high accuracy
CRISPR-Cas9 & Cas12a: for targeted gene knockouts and insertions

How does your technology of choice edit DNA? What are the essential steps?

Prime Editing (human gene therapy)	Base Editing (SNPs, fixing disease mutations)	CRISPR-Cas9 & Cas12a (disrupting disease-causing mutations)
Uses a Cas9 nickase fused to a reverse transcriptase	Uses a Cas9 nickase fused to a deaminase enzyme	Cas9 makes blunt double-strand breaks; knockouts arise via NHEJ and insertions via HDR
A pegRNA encodes the desired edit	Converts C → T or A → G without making double-strand breaks	Cas12a has higher specificity and leaves staggered cuts, improving large DNA insertions
RT writes the new DNA directly into the target site without causing double stranded breaks	High accuracy, low off-target effects, minimal errors
No need for donor DNA templates, reducing error rates

What preparation do you need to do (e.g. design steps) and what is the input (e.g. DNA template, enzymes, plasmids, primers, guides, cells) for the editing?

Gene target selection: identify mutation sites using NGS + clinical data
Guide RNA design: generate pegRNAs and sgRNAs
Delivery method selection: viral (AAV, lentivirus) or non-viral (lipid nanoparticles, electroporation) depending on the cell
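The guide design step can be sketched as a scan for 20-base targets that sit directly next to an NGG PAM, which is what SpCas9 requires. This is a minimal sketch under stated assumptions: it only looks at the forward strand, does no off-target scoring, and the example sequence and function name are made up.

```python
import re


def find_spcas9_guides(seq, guide_len=20):
    """Return (start, protospacer, PAM) for every forward-strand NGG site."""
    seq = seq.upper()
    # Lookahead so that overlapping candidates are all reported.
    pattern = re.compile(r"(?=([ACGT]{%d})([ACGT]GG))" % guide_len)
    return [(m.start(), m.group(1), m.group(2)) for m in pattern.finditer(seq)]


toy = "ACGTACGTACGTACGTACGT" + "AGG" + "TTT"  # hypothetical target region
for start, protospacer, pam in find_spcas9_guides(toy):
    print(start, protospacer, pam)  # 0 ACGTACGTACGTACGTACGT AGG
```

A real design tool would also scan the reverse complement and rank candidates by predicted off-target sites, and a pegRNA needs a primer binding site and an RT template on top of the spacer.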

What are the limitations of your editing methods (if any) in terms of efficiency or precision?

Prime editing: low efficiency (~5-50%) in some cell types, delivery still being optimized
Base editing: only allows C→T and A→G changes, so it cannot fix every type of mutation
CRISPR-Cas9: higher risk of off-target mutations and requires a repair template for precise insertions

Notes

large molecules need models to interact with them, or specialized hardware + software to move them around more easily
codon optimization is used for maximizing the output protein yield

cell → DNA sequence→ corresponding mRNA → specific production site → protein
deep learning codon optimization: Codon optimization with deep learning to enhance protein expression

RNN variant that learns contextual dependencies between DNA sequences to predict codon sequences, optimizes for consistency
17-40 hours of training (why the variance?)
test accuracy: 52%
Codon Adaptation Index of ~0.98, outperforming Genewiz (0.83) and ThermoFisher (0.93)
codon box encoding: groups synonymous codons into sets based on their nucleotide composition
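For reference, the Codon Adaptation Index is the geometric mean of per-codon weights, where each weight is a codon's usage relative to the most used synonymous codon in the host. A minimal sketch follows; the weights are hypothetical numbers for illustration and not a real E. coli table.

```python
import math


def cai(seq, weights):
    """Geometric mean of codon weights; codons missing from the table are skipped."""
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    logs = [math.log(weights[c]) for c in codons if c in weights]
    if not logs:
        raise ValueError("no scorable codons in sequence")
    return math.exp(sum(logs) / len(logs))


# Hypothetical relative-adaptiveness weights for two alanine codons.
toy_weights = {"GCG": 1.0, "GCC": 0.5}

print(cai("GCGGCG", toy_weights))            # 1.0
print(round(cai("GCGGCC", toy_weights), 3))  # 0.707
```

A CAI of 1.0 means every codon is the host's top choice, so values like 0.98 versus 0.83 compare how closely different optimizers follow the host's preferences.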

sequence funnel: protein → protein sequence → reverse translation → codon optimization → organism
T7 and SP6 RNA polymerases: DNA-dependent RNA polymerases derived from bacteriophages T7 and SP6

they recognize specific promoter sequences (T7 or SP6) and transcribe RNA with high specificity and efficiency
widely used in in vitro transcription systems for synthesizing mRNA, ribozymes, and RNA probes. T7 RNA polymerase is more common due to its high processivity and fast transcription rate

DNA synthesis is mainly for shorter sequences- lots of opportunity to expand industrialization
codon optimization- optimize your sequence for the specific host that you are using

Week 2 Homework