Bioinformatics Code Challenges with Python

Summary

Problem: Manual tasks in molecular biology, like designing PCR fragments or normalizing sample concentrations, are repetitive and error-prone, especially at scale.

Approach: We build Python scripts for lab automation:

  • Gibson fragment design: Splits plasmid sequences into Gibson-compatible 400–800 bp fragments with 20 bp overlap.
  • Primer design: Scans sequences (e.g., Melittin in pET-28a) to yield 18–24 bp primers within a Tₘ window of 55–65 °C using SantaLucia’s nearest-neighbor model (via Biopython).
  • Pipetting simulator: Calculates sample + water volumes to achieve 20 ng/µL for 96-well plates, capped at 50 µL.

Impact: This automates everyday molecular biology preparation by translating standard protocols into scripts that reduce human error and streamline wet-lab readiness, which is especially valuable for high-throughput or automated lab workflows.

Challenge 1: PCR Fragment Design for Gibson Assembly

What It Does

This script splits a DNA sequence (e.g. a plasmid) into 3 PCR fragments that are:

  • 400–800 bp long
  • Connected by 20 bp overlaps for Gibson Assembly

It simulates the process of prepping fragments for seamless DNA assembly in the lab.

Why It Matters

Gibson Assembly requires overlapping DNA ends so enzymes can stitch fragments into a continuous sequence.

This tool automatically:

  • Chooses valid fragment sizes
  • Adds required overlaps
  • Verifies the overlaps match exactly
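To make those three steps concrete, here is a minimal sketch of one way to do the split. It is not the repo's actual implementation: the function names `split_with_overlaps` and `overlaps_match` are hypothetical, the fragments are simply made as even as possible, and the sequence is treated as linear (a circular plasmid would also need an overlap between the last and first fragment).

```python
import random


def split_with_overlaps(seq, n_fragments=3, overlap=20, min_len=400, max_len=800):
    """Split seq into n_fragments pieces; adjacent pieces share `overlap` bases.

    Returns a list of (start, end, fragment) tuples with 0-based, end-exclusive
    coordinates. Assumption: the sequence is treated as linear.
    """
    # Each junction is counted twice, so the fragment lengths must sum to this.
    total = len(seq) + overlap * (n_fragments - 1)
    base, extra = divmod(total, n_fragments)
    lengths = [base + (1 if i < extra else 0) for i in range(n_fragments)]

    if not all(min_len <= n <= max_len for n in lengths):
        raise ValueError(
            f"fragment lengths {lengths} fall outside {min_len}-{max_len} bp"
        )

    fragments = []
    start = 0
    for length in lengths:
        end = start + length
        fragments.append((start, end, seq[start:end]))
        start = end - overlap  # step back so the next fragment re-covers the overlap
    return fragments


def overlaps_match(fragments, overlap=20):
    """Check that each fragment's tail equals the next fragment's head."""
    return all(
        left[2][-overlap:] == right[2][:overlap]
        for left, right in zip(fragments, fragments[1:])
    )


# Demo on a made-up 1,500 bp sequence (not the real plasmid).
rng = random.Random(0)
demo_seq = "".join(rng.choice("ACGT") for _ in range(1500))
fragments = split_with_overlaps(demo_seq)
for start, end, frag in fragments:
    print(start, end, len(frag))
print("overlaps verified:", overlaps_match(fragments))
```

The sketch raises a `ValueError` when the requested number of fragments cannot satisfy the 400–800 bp window, rather than silently producing out-of-range fragments.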

Example Input

We use a synthetic antivenom plasmid (pET-28a backbone + SHRT gene insert) as our DNA input (/data/pET28a_SHRT.fasta in the repo):

📄 pET28a_SHRT.fasta

How to Use

  1. Run the script:
python pcr_fragment_design/design_gibson_fragments.py
  2. It will:
    • Load the FASTA
    • Split it into 3 Gibson-ready fragments
    • Print fragment coordinates and sequences
  3. Output:
    • 📄 Gibson_Fragment_Design.csv with fragment metadata
    • Verified 20 bp overlaps between adjacent fragments

Challenge 2: Primer Design with Melting Temperature

What It Does

This script finds a pair of PCR primers that:

  • Are 18–24 bp long
  • Have melting temperatures (Tm) between 55 and 65 °C
  • Flank a 500 bp amplicon in a plasmid sequence

It scans the DNA sequence from the start and returns the first 500 bp region whose flanking primers meet these constraints.

Why It Matters

To amplify DNA by PCR, you need two primers that:

  • Bind to opposite ends of your target region
  • Are thermodynamically stable (right Tm)
  • Point toward each other (forward/reverse)

This script mimics, in simplified form, what real-world primer design tools like Primer3 do, and checks that both primers fall within a workable Tm range.

Example Input

We use a clean synthetic plasmid (pET28a_Melittin_clean.fasta) with the Melittin bee venom gene inserted into a pET-28a expression vector:

📄 pET28a_Melittin_clean.fasta

This plasmid is designed to express the Melittin peptide in E. coli under a T7 promoter.

How Melting Temperature (Tm) Is Calculated

We use the SantaLucia 1998 Nearest-Neighbor Thermodynamic Model, which calculates Tm using this formula:

$$T_m = \frac{\Delta H}{\Delta S + R \ln(C)} - 273.15$$

Where:

  • ΔH and ΔS are summed over each dinucleotide pair (e.g. AA/TT, GC/CG)
  • R is the gas constant
  • C is strand concentration

This model is used by Tm_NN() from Biopython:

from Bio.SeqUtils.MeltingTemp import Tm_NN
# Returns Tm in °C using Biopython's default table, salt, and strand concentrations
print(Tm_NN("ATGCGTACGTAGCTAGCTA"))
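To see what the formula itself does, here is a minimal sketch that evaluates it directly from already-summed ΔH and ΔS values. The function name `tm_celsius` is hypothetical, and the units are an assumption based on how nearest-neighbor tables are commonly tabulated (ΔH in kcal/mol, ΔS in cal/(mol·K)). It leaves out the salt corrections and concentration conventions that Biopython handles, so it will not reproduce `Tm_NN()` output exactly.

```python
import math

R = 1.987  # gas constant in cal/(mol*K)


def tm_celsius(delta_h_kcal, delta_s_cal, conc_molar):
    """Evaluate Tm = dH / (dS + R * ln(C)) - 273.15.

    delta_h_kcal: summed enthalpy in kcal/mol (assumed unit)
    delta_s_cal:  summed entropy in cal/(mol*K) (assumed unit)
    conc_molar:   the concentration term C in mol/L
    """
    delta_h_cal = delta_h_kcal * 1000.0  # convert kcal to cal so units match dS and R
    kelvin = delta_h_cal / (delta_s_cal + R * math.log(conc_molar))
    return kelvin - 273.15  # Kelvin to Celsius


# Illustrative inputs only, not values from a real primer.
print(round(tm_celsius(-150.0, -420.0, 1e-7), 1))
```

Because ln(C) is negative for realistic concentrations, raising the strand concentration shrinks the denominator's magnitude and pushes Tm up, which is why the same primer melts at a different temperature at a different concentration.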

How to Use

  1. Run the script:
python primer_design.py --fasta pET28a_Melittin_clean.fasta --amplicon_length 500
  2. The script will:
    • Slide a 500 bp window along the DNA
    • Scan for forward/reverse primers at the window edges
    • Return the first pair with valid Tm
  3. Output:
    • Primer sequences
    • Positions
    • Melting temperatures
    • Amplicon boundaries
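The sliding-window search described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the script's actual code: all function names are hypothetical, and to keep the block free of dependencies the default Tm function is the simple Wallace rule (2 °C per A/T, 4 °C per G/C), which is a stand-in for the nearest-neighbor model. To match the approach in this post, you would pass Biopython's `Tm_NN` as `tm_func` instead.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def wallace_tm(primer):
    """Rough stand-in Tm (Wallace rule), NOT the nearest-neighbor model."""
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


def pick_primer(candidates, tm_func, tm_min, tm_max):
    """Return the first (primer, tm) whose Tm is inside the window, else None."""
    for primer in candidates:
        tm = tm_func(primer)
        if tm_min <= tm <= tm_max:
            return primer, tm
    return None


def find_primer_pair(seq, amplicon_length=500, min_len=18, max_len=24,
                     tm_min=55.0, tm_max=65.0, tm_func=wallace_tm):
    """Slide a window along seq and return the first window with two valid primers."""
    seq = seq.upper()
    lengths = range(min_len, max_len + 1)
    for start in range(len(seq) - amplicon_length + 1):
        window = seq[start:start + amplicon_length]
        # Forward primer: the first n bases of the window, read as-is.
        fwd = pick_primer((window[:n] for n in lengths), tm_func, tm_min, tm_max)
        # Reverse primer: reverse complement of the last n bases, so it points inward.
        rev = pick_primer(
            (reverse_complement(window[-n:]) for n in lengths),
            tm_func, tm_min, tm_max,
        )
        if fwd and rev:
            return {
                "start": start,
                "end": start + amplicon_length,
                "forward": fwd,
                "reverse": rev,
            }
    return None


# Demo on a made-up repetitive sequence (not the Melittin plasmid).
demo = "ACGT" * 200
print(find_primer_pair(demo))
```

Taking the reverse complement of the window's 3' end is what makes the two primers point toward each other. Returning the first hit keeps the search simple, at the cost of ignoring other checks a tool like Primer3 would apply.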

Challenge 3: Simulating a Robotic Pipetting Protocol

We are basically diluting samples to a target concentration of 20 ng/µL while keeping the total volume at or below 50 µL. This one was really simple. It could be extended to handle more complex input data, but the challenge called for a simple CSV.

What It Does

In molecular biology labs, normalizing DNA concentrations across 96 samples is a common task—especially before pooling for sequencing. Robots usually handle this, but they need instructions.

This script calculates how much sample and water to pipette to reach a final concentration of 20 ng/µL, based on measured DNA concentrations per well.

Inputs & Outputs

Input:

  • CSV with sample IDs and measured concentrations (e.g., 45.7 ng/µL)

Output:

  • New CSV with pipetting instructions:
    • Volume of DNA sample (µL)
    • Volume of water (µL)
  • Volumes are rounded to the nearest 0.1 µL
  • Max total volume = 50 µL
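As a rough illustration of that input-to-output mapping, here is a minimal sketch. The column names (`sample_id`, `concentration_ng_ul`, `sample_ul`, `water_ul`) and function names are hypothetical, since the real CSV layout isn't shown here. The handling of samples at or below 20 ng/µL (transfer them undiluted) is my assumption, not something the challenge specified.

```python
import csv
import io

TARGET_NG_UL = 20.0   # target concentration
MAX_TOTAL_UL = 50.0   # cap on final volume per well


def dilution_volumes(conc, target=TARGET_NG_UL, total=MAX_TOTAL_UL):
    """Return (sample_ul, water_ul), rounded to 0.1 µL, from C1*V1 = C2*V2."""
    if conc <= 0:
        raise ValueError("concentration must be positive")
    if conc <= target:
        # Assumption: a sample at or below target can't be diluted to it,
        # so it is transferred undiluted with no water added.
        return round(total, 1), 0.0
    sample = round(target * total / conc, 1)
    water = round(total - sample, 1)
    return sample, water


def build_instructions(csv_text):
    """Turn CSV text with sample IDs and concentrations into pipetting rows."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        sample, water = dilution_volumes(float(row["concentration_ng_ul"]))
        rows.append(
            {"sample_id": row["sample_id"], "sample_ul": sample, "water_ul": water}
        )
    return rows


# Made-up input for illustration; a real plate would have 96 rows.
demo_csv = "sample_id,concentration_ng_ul\nA1,40.0\nA2,45.7\nA3,12.5\n"
for instruction in build_instructions(demo_csv):
    print(instruction)
```

Computing water as the remainder after rounding the sample volume keeps each well's total at 50 µL, instead of letting two independent roundings drift past the cap.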

Why It Matters

This kind of script saves hours in the lab and reduces pipetting errors. It’s trivial for a computer, but error-prone when done manually—especially at scale.

If sample concentration = 40 ng/µL, to reach 20 ng/µL in 50 µL:

Sample volume = 25 µL
Water volume = 25 µL
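Those two numbers come from the standard dilution relation C₁V₁ = C₂V₂. Written out for this example (the subscript names are mine, chosen for readability):

```latex
V_{\text{sample}} = \frac{C_{\text{target}} \cdot V_{\text{total}}}{C_{\text{sample}}}
                  = \frac{20 \times 50}{40} = 25\ \mu\text{L},
\qquad
V_{\text{water}} = V_{\text{total}} - V_{\text{sample}} = 50 - 25 = 25\ \mu\text{L}
```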

The script outputs that automatically for all 96 samples. You can imagine the same calculation running at much larger scale, across many robots and many plates.

Next Steps

  • Integrate with a real robot's CSV format
  • Add error checking for low-concentration edge cases
  • Simulate batch pooling for NGS workflows

Final Thoughts

This was a quick one, but even small scripts like this can become part of much larger automation pipelines in biotech. If you're working in a wet lab, automating these tasks is entirely about leverage and predictability.