- Part A: Deepening our understanding of proteins
- Part B: Analyzing and visualizing a target protein
- Part C. Using AI-based protein tools
Part A: Deepening our understanding of proteins
How many molecules of amino acids do you take with a piece of 500 grams of meat? (on average an amino acid is ~100 Daltons)
1 amino acid ≈ 100 Daltons, and 1 Dalton = 1 g/mol
500 g / (100 g/mol) × 6 × 10^23 molecules/mol (Avogadro's number) = 3 × 10^24 molecules
Real meat is only 20-25% protein by mass, so the actual count is lower, roughly 6 × 10^23 to 7.5 × 10^23 molecules.
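The arithmetic above can be checked with a short script. This is only a sketch: the function name `amino_acid_count` and its parameters are illustrative, and the 100 Da average mass is the approximation given in the question.

```python
AVOGADRO = 6.022e23  # molecules per mole

def amino_acid_count(mass_g, protein_fraction=1.0, avg_residue_mass_da=100.0):
    """Estimate how many amino acid molecules a sample contains.

    mass_g: sample mass in grams
    protein_fraction: fraction of the mass that is protein (assumed value)
    avg_residue_mass_da: average amino acid mass in Daltons (1 Da = 1 g/mol)
    """
    moles = mass_g * protein_fraction / avg_residue_mass_da
    return moles * AVOGADRO

if __name__ == "__main__":
    # Upper bound: treat all 500 g as amino acids.
    print(f"pure protein: {amino_acid_count(500):.1e}")
    # More realistic: meat is roughly 20-25% protein by mass.
    for fraction in (0.20, 0.25):
        print(f"{fraction:.0%} protein: {amino_acid_count(500, fraction):.1e}")
```

Using the more precise Avogadro constant changes the result only in the second significant digit, so the order-of-magnitude answer stays the same.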
Why do humans eat beef but do not become a cow, and eat fish but do not become fish?
When we eat meat, the body breaks the animal's muscle and fat cells down into their chemical constituents. The lowest common denominator between us and our food is the amino acid building blocks, which we repurpose for our own metabolic processes. Our cells use human DNA as the template to rebuild human proteins and cells, so the building blocks from cow and fish simply replenish those that have been used up.
Why are there only 20 natural amino acids?
Evolution likely settled on 20 because that set offers enough chemical diversity to build complex proteins while staying small enough to avoid unnecessary complexity. Physical and environmental constraints presumably shaped the rest. From what I have researched, there is no strict proof of why yet.
Can you make other non-natural amino acids? Design some new amino acids.
You can create new ones by tweaking side chains and backbones. We can swap an amino acid's side chain for something that adds a new function, like fluorescence, metal binding, or altered reactivity. We can modify alanine by attaching a benzene ring with an electron-donating group, or add a bulky group to create a kink in a protein chain. For example, we can create a fluoro-alanine, where a hydrogen is replaced by a fluorine atom.
Where did amino acids come from before enzymes that make them, and before life started?
They likely formed abiotically on early Earth: energy sources like lightning or UV light drove chemical reactions among simple gases, producing amino acids (as shown in the Miller-Urey experiments). Some may have arrived from space on meteorites.
If you make an alpha-helix using D-amino acids, what handedness (right or left) would you expect?
Natural proteins are made of L-amino acids and form right-handed alpha helices. A helix built from D-amino acids would be left-handed, because it is the mirror image of the right-handed helix formed by L-amino acids.
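The mirror-image argument can be illustrated numerically. The sketch below builds an idealized helix trace, reflects it, and reads the handedness from the sign of a triple product. The geometry values (about 2.3 Å radius, 1.5 Å rise, 100° per residue) are approximate textbook numbers for an alpha helix, and all function names are made up for this example.

```python
import math

def helix_points(n=8, radius=2.3, rise=1.5, turn_deg=100.0):
    """Points along an idealized right-handed helix (units: angstroms)."""
    points = []
    for i in range(n):
        angle = math.radians(turn_deg * i)
        points.append((radius * math.cos(angle), radius * math.sin(angle), rise * i))
    return points

def mirror(points):
    """Reflect through the yz-plane; a reflection inverts chirality."""
    return [(-x, y, z) for x, y, z in points]

def handedness(points):
    """Classify by the sign of the triple product of three consecutive steps."""
    steps = [tuple(b[k] - a[k] for k in range(3)) for a, b in zip(points, points[1:])]
    (ax, ay, az), (bx, by, bz), (cx, cy, cz) = steps[0], steps[1], steps[2]
    triple = (ax * (by * cz - bz * cy)
              - ay * (bx * cz - bz * cx)
              + az * (bx * cy - by * cx))
    return "right" if triple > 0 else "left"

if __name__ == "__main__":
    l_helix = helix_points()      # stands in for a helix of L-amino acids
    d_helix = mirror(l_helix)     # stands in for the D-amino acid mirror image
    print(handedness(l_helix), handedness(d_helix))
```

This only models the backbone trace, not real atoms, but it shows why reflecting every chiral center necessarily flips the twist of the helix.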
Can you discover additional helices in proteins?
Yes, but it would require structural analysis to identify new forms. Proteins have more helix types than just the classic alpha helix. For example, there are 3₁₀ helices (tighter, with 3 residues per turn), pi helices (looser, about 4.4 residues per turn), and rare left-handed helices.
Why are most molecular helices right-handed?
Right-handed helices form because the geometry of L-amino acids fits best in a right-handed twist. Their bond angles and dihedral constraints (as shown in Ramachandran plots) favor a structure that minimizes steric clashes and optimizes hydrogen bonds, which makes the right-handed helix nature's low-energy, stable option.
Why do β-sheets tend to aggregate? What is the driving force for β-sheet aggregation?
They aggregate because the aligned strands form a network of intermolecular hydrogen bonds, and hydrophobic side chains stick together. This lowers the free energy and drives tight, stable stacking, as often seen in amyloid fibrils. The driving forces include backbone hydrogen bonding, hydrophobic interactions, and van der Waals forces.
Why do many amyloid diseases form β-sheets? Can you use amyloid β-sheets as materials?
Amyloid diseases happen when proteins misfold into β-sheets that are held together by stable hydrogen bonds and hydrophobic interactions. The misfolded proteins aggregate into strong but insoluble fibrils, which contribute to disease. Because these fibrils are robust, self-assembling structures, they are being explored as materials in nanotech and biomaterials, although controlling their assembly is tricky.
Design a β-sheet motif that forms a well-ordered structure.
Take four 8-residue strands with an alternating hydrophobic-polar pattern (e.g., V-K-V-K-V-K-V-K). Connect the strands with sharp β-turns (like D-Pro-Gly) to enforce the sheet geometry. Cap the ends (acetyl and amide) to stabilize the hydrogen bonding network. The design yields an amphipathic sheet: one side packs hydrophobic residues, the other displays charged residues, minimizing clashes and ensuring order.
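The design recipe above can be written out as a sequence string. This is a minimal sketch with a hypothetical helper, `design_beta_sheet`; it assumes the common shorthand of a lowercase letter for a D-amino acid (so `pG` stands for D-Pro-Gly) and writes the caps as `Ac-` and `-NH2`. It only assembles the sequence and does not check strand registry or predict folding.

```python
def design_beta_sheet(n_strands=4, strand_len=8,
                      hydrophobic="V", polar="K", turn="pG"):
    """Assemble a capped, alternating hydrophobic/polar beta-sheet sequence.

    turn: two-residue beta-turn; lowercase 'p' denotes D-proline (assumed notation).
    """
    # Alternate hydrophobic and polar residues so each strand is amphipathic.
    strand = "".join(hydrophobic if i % 2 == 0 else polar
                     for i in range(strand_len))
    # Join the strands with turns, then add N-terminal acetyl and C-terminal amide caps.
    return "Ac-" + turn.join([strand] * n_strands) + "-NH2"

if __name__ == "__main__":
    print(design_beta_sheet())
```

Swapping `hydrophobic` or `polar` (for example I/T or F/E) gives variants of the same motif to compare.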
Part B: Analyzing and visualizing a target protein
- I selected H5 Hemagglutinin because it was featured as the Molecule of the Month on PDB. I thought it would be interesting to learn about the Bird Flu.
- Amino acid sequence: UniProt link
- Length = 566
- Most frequent amino acid: N
- Protein sequence
- Protein sequence homologs: at least 100, mainly from Influenza A strains, with extremely high query coverage (98-100%)
- Family: Hemagglutinin family, Orthomyxoviridae
- When was the structure solved?
- 2019
- Is it a good quality structure? A good-quality structure is one with good resolution; the smaller the value, the better (resolution < 2.0 Å).
- Not really: the resolution is 2.39 Å
- Are there any other molecules in the solved structure apart from protein?
- The solved structure of 6PD5 contains N-acetyl-D-glucosamine (NAG) as a ligand
- Does your protein belong to any structure classification family?
- Based on using 6PD5, the family is Influenza hemagglutinin headpiece.
- Open the structure of your protein in any 3D molecule visualization software.
- Examine and analyze your protein, visually:
- Visualize the protein as "cartoon", "ribbon" and "ball and stick".
- Color the protein by secondary structure. Does it have more helices or sheets?
- The protein has more β-sheets than α-helices (helices are red, sheets are blue).
- Color the protein by residue type. What can you tell about the distribution of hydrophobic vs hydrophilic residues?
- The hydrophobic residues (orange) appear to be mostly in the core of the protein, stabilizing the structure. The hydrophilic residues (pink) are predominantly on the outer surface, interacting with the solvent. This follows a typical hydrophobic core, hydrophilic surface pattern, consistent with a water-soluble protein.
- Visualize the surface of the protein. Does it have any "holes" (aka binding pockets)?
- Yes, the protein has binding pockets, which are likely important for its receptor-binding function (hemagglutinin binds sialic acid receptors on host cells rather than acting as an enzyme).
MEEIVLLFAIVSLARSDQICIGYHANNSTKQVDTIMEKNVTVTHAQDILEKTHNGKLCSLNGVKPLILRDCSVAGWLLGNPMCDEFLNVPEWSYIVEKDNPVNGLCYPGDFNDYEELKHLLSCTKHFEKIRIIPRDSWPNHEASLGVSSACPYNGRSSFFRNVVWLIKKDNAYPTIKRSYNNTNKEDLLILWGIHHPNDAAEQTKLYQNPTTYVSVGTSTLNQRSIPKIATRPKLNGQSGRMEFFWTILKPSDTINFESNGNFIAPEYAYKIVKKGDSAIMKSGLEYGNCNTKCQTPIGAINSSMPFHNIHPLTIGECPKYVKSDRLVLATGLRNTPQRKRKKRGLFGAIAGFIEGGWQGMVDGWYGYHHSNEQGSGYAADKESTQKAIDGITNKVNSIIDKMNTQFEAVGKEFNNLERRIENLNKILEDGFLDVWTYNAELLVLMENERTLDFHEANVKSLYDKVRLQLKDNARELGNGCFEFYHKCDNECMESIRNGTYNYPQYSEEARLNREEISGIKLESMGIYQILSIYSTVASSLALAIMIAGLSFWMCSNGSLQCRICI
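The length and most frequent residue reported in Part B can be recomputed from the sequence. The sketch below uses only the first 18 residues of the sequence as a stand-in; paste the full sequence into `sequence` to check the full-length values. The helper name `sequence_stats` is illustrative.

```python
from collections import Counter

def sequence_stats(seq):
    """Return (length, most frequent residue, its count) for a protein sequence."""
    seq = "".join(seq.split()).upper()  # drop whitespace and line breaks
    counts = Counter(seq)
    residue, count = counts.most_common(1)[0]
    return len(seq), residue, count

if __name__ == "__main__":
    # Short fragment for illustration; replace with the full sequence.
    sequence = "MEEIVLLFAIVSLARSDQ"
    length, residue, count = sequence_stats(sequence)
    print(f"length={length}, most frequent={residue} ({count} occurrences)")
```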
Structural information
Part C. Using AI-based protein tools
- What is a prediction model?
a. Close, but no. There is this large tail that comes off of it.
b. I couldn't get the proper file format from AlphaFold, so I just aligned the same file that I used from the PDB against itself. Of course, the RMSD is 0 Å, since it's the same file. But I learned what the metric is and how to compute it!
c. Yes, it seems like there are low confidence scores in the center of it. These areas might need experimental validation.
d. Yes, the low-confidence regions (yellow/red in PyMOL) can impact engineering efforts because they are likely flexible/disordered, making them unreliable for structural stability. If they are in binding sites or active sites, they could affect functionality. AlphaFold predictions in these areas are less accurate, meaning they might not fold as expected in reality.
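The two metrics used in these answers, RMSD and per-residue confidence, can be sketched in a few lines. The code makes two assumptions: the two coordinate sets are already superposed and matched atom for atom (which is what an alignment tool does before reporting RMSD), and the model is a PDB-format file with pLDDT stored in the B-factor column, as AlphaFold output files do. The function names and the cutoff of 70 are illustrative choices.

```python
import math

def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two matched, superposed coordinate lists."""
    if not coords_a or len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must be non-empty and the same length")
    squared = sum((a[k] - b[k]) ** 2
                  for a, b in zip(coords_a, coords_b) for k in range(3))
    return math.sqrt(squared / len(coords_a))

def low_confidence_residues(pdb_text, cutoff=70.0):
    """List residue numbers whose C-alpha pLDDT (B-factor column) is below cutoff."""
    flagged = []
    for line in pdb_text.splitlines():
        # Fixed-width PDB columns: atom name 13-16, residue number 23-26, B-factor 61-66.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            if float(line[60:66]) < cutoff:
                flagged.append(int(line[22:26]))
    return flagged

if __name__ == "__main__":
    model = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0)]
    print(rmsd(model, model))  # a structure compared with itself
```

Comparing a structure with itself gives an RMSD of exactly zero, which matches the result in answer b; a meaningful comparison needs the experimental structure and the predicted model as the two inputs.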
// TODO
- Sequence recovery models
Using a sequence recovery model (MPNN):
a) Generate sequence proposals for PDB ID: 1BCF chain A.
b) Fold 1-2 of the generated sequences using a protein structure model from Question 1.
c) Is there a way to enable the newly designed sequences to preserve their binding to di-iron? You can check page 34 of this reference link to find 1BCF (di-iron binding motif). To answer this question, all you have to do is keep some parts of the protein sequence of 1BCF constant. You can find which regions in the document linked above; just search for 1BCF. [*Extra Credit*]
- Generative model
Using a Generative Model:
a) You are a scientist trying to design a new drug binder for COVID-19. Design a protein backbone that can bind to SARS-CoV-2 spike protein. Use PDB ID: 6M0J or any other target for identifying a binding pocket.
b) Generate sequences for your newly sampled backbone and fold 1 or 2 of them. Visualize them using your favorite protein visualization tool.
c) How can you rank and select the new protein sequences to test in the lab?
d) How can you experimentally verify if your newly designed binder binds to the target? [Eg: Yeast Surface Display, Degradation Assays etc]. [*Extra Credit*]
e)* If you design a binder that strongly binds to SARS-CoV-2, what's the next step in your design pipeline? What potential issues could arise from its application as a drug in humans? [*Extra Credit*]
f)* Here, using RFdiffusion, we designed a mini-protein binder. However, many therapeutic protein binders are antibodies. What are some advantages of antibody binders? [*Extra Credit*]
- How to improve enzyme thermostability
Engineering thermostability of enzymes:
a) Pick an enzyme you are interested in [eg: PETase].
b) Summarize the function of this protein.
c) Can you engineer a version of your protein that functions at high temperatures?
d)* How can you utilize machine learning tools for designing this protein? [*Extra Credit*]
e)* How would you test the thermostability of your newly designed enzyme? [*Extra Credit*]