[WIP] How do you analyze your own genome in AWS?

9/17/25

GitHub repo: (embedded link)

I had my genome sequenced by Nucleus and received 54 GB worth of files. To run some analyses, I started reading this book, but went off-script once I got the gist of it. The authors are great: they really took the time to understand everything. I recommend buying the book.

This write-up is intended for software engineers who have some experience with AWS/cloud, so I am going to skip over a lot of implementation details and expect that you can fill in the gaps. If you need help, just reach out to me, though ChatGPT should be able to answer most of it pretty easily.

Lastly, I’ve provided this whole project as a public Terraform repo, so you can follow the README.md if you want to spin it up in your own cloud environment. I used a c7i.4xlarge instance (Intel Xeon Platinum 8488C, 8 physical cores (16 vCPUs), 32 GB RAM), so you may want to change the size in terraform.tfvars. The whole project should fit under the $100 in free credits AWS gives you, especially if you do it in one sitting; I budgeted for 2 weeks of full compute, coming to a little over $100.

Project

To start, you need an AWS account. In order to qualify for the free $100 credits, you’ll need to sign up with a new email + credit card, ones that you have never used before with AWS.

Then, you will need to get your genome sequenced. I recommend Nucleus or Nebula Genomics. They allow you to download your genome, sent via email as one VCF file declaring your variants and 16 FASTQ files containing the raw paired-end reads (R1/R2 files, the two reads of each sequenced fragment).

After downloading your whole genome, you will need to configure your AWS account locally with aws configure. You can install the CLI with Homebrew, curl, etc.
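For example, on macOS with Homebrew (other platforms have installers in the AWS docs):

# install the AWS CLI, then wire it to your account
brew install awscli
aws configure   # prompts for your access key, secret key, default region, and output format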

You will then need to manually create an EC2 key-pair, either from the aws CLI or in the EC2 console page, then download it locally (I did it in the console). Add the name of it to ssh_key_name in the terraform.tfvars file.
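If you prefer the CLI route, something like this works (wgs-keypair is a placeholder name; whatever you pick must match ssh_key_name):

# create the key pair and save the private key locally
aws ec2 create-key-pair --key-name wgs-keypair --query 'KeyMaterial' --output text > wgs-keypair.pem
chmod 400 wgs-keypair.pem   # SSH refuses keys with loose permissions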

You will then need your computer’s public IP address, which you can get by running curl ifconfig.me (this works on macOS as well as Linux). Put that in the home_ip_cidr variable in terraform.tfvars (keep the /32 suffix as-is).
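A one-liner that prints it already in CIDR form:

# public IP formatted for the home_ip_cidr variable
echo "$(curl -s ifconfig.me)/32"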

Then just fill out the rest of the terraform.tfvars:

region              = "us-east-1"                # use whatever is closest to you
project_name        = "your-project-name"        # like "wgs-analysis-your-name"
home_ip_cidr        = "YOUR.IP.ADDRESS/32"       # your public IP with /32
ssh_key_name        = "your-ec2-keypair-name"
s3_bucket_name      = "your-unique-bucket-name"  # must be globally unique
instance_type       = "c7i.4xlarge"              # or "g5.2xlarge" for GPU
disk_gb             = 500
use_spot            = true
alert_email         = "your-email@example.com"
database_password   = "your-secure-database-password-here"
application_secret  = "your-application-secret-key-for-auth"

Once you have completed all of these steps, run terraform init, terraform plan, and terraform apply -auto-approve; this deploys the entire stack to your AWS account. Make sure that you have the free $100 credits in your account, because the second this is deployed, AWS starts charging you.
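For reference, the three commands in order:

terraform init                   # download the AWS provider and modules
terraform plan                   # review the resources about to be created
terraform apply -auto-approve    # deploy; billing starts the moment this finishes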

The final step in setting up the environment is actually uploading your whole genome to the S3 bucket you created. You can do this by running the script provided in upload_template.sh:

#!/bin/bash

# your S3 bucket name
S3_BUCKET="s3://your-bucket-name-here/"

# your source directory (where your genetic files are located, usually Downloads)
SOURCE_DIR="${HOME}/Downloads"

# file patterns (customize based on your data, but should be the same)
FASTQ_PATTERN="*.fastq.gz"      # FASTQ files
VCF_PATTERN="*.vcf.gz"          # VCF files

# directory structure in S3
FASTQ_DIR="raw-data/"           # where FASTQ files go
VCF_DIR="results/"              # where VCF files go

echo "uploading genomic data to S3"
echo "bucket: ${S3_BUCKET}"
echo "source: ${SOURCE_DIR}"

# upload FASTQ files (--exclude "*" must come BEFORE --include: later filters take precedence)
echo "uploading FASTQ files"
aws s3 cp "${SOURCE_DIR}/" "${S3_BUCKET}${FASTQ_DIR}" --recursive --exclude "*" --include "${FASTQ_PATTERN}" --cli-read-timeout 0 --cli-write-timeout 0

# upload VCF files
echo "uploading VCF files"
aws s3 cp "${SOURCE_DIR}/" "${S3_BUCKET}${VCF_DIR}" --recursive --exclude "*" --include "${VCF_PATTERN}" --cli-read-timeout 0 --cli-write-timeout 0

echo "upload complete!"
echo "Check your S3 bucket: $(echo ${S3_BUCKET} | sed 's|s3://||' | sed 's|/||')"

Analysis

Now that you have everything in AWS, what do you do? Our analysis follows the outline below:

[Figure: overview of the analysis pipeline]

The pipeline consists of creating a VCF file containing the scored and annotated variants in your genome relative to a reference genome, doing some computations, and returning some scores. If you want to skip the walkthrough and just read or copy/paste the code, go right ahead.

The analysis follows 5 simple steps:

  1. install tools: htslib, BWA, samtools, bedtools, GATK
  2. align reads: FASTQ to BAM with BWA
  3. improve quality: mark duplicates, BQSR
  4. call variants: SNVs/indels with GATK
  5. annotate results: VEP, VPOT, SV/CNV calling

I. Call variants (GATK → VCF file)

  1. install htslib for compressing/decompressing the FASTQ files
  2. install BWA to align the FASTQ reads against the reference; its output is piped through samtools to produce a BAM file (variant calling and annotation are done on aligned BAMs, not raw FASTQ)
  3. download the GRCh38 no-alt reference genome (the analysis set without alternate contigs)
  4. install samtools for compression/decompression of those BAM files
    1. in our code, we align the FASTQ reads to the reference genome and pipe BWA's output straight into samtools for sorting and compression, so no bulky intermediate SAM file is ever written to disk
    2. install bedtools only if your genome comes from the sequencing company as a BAM file and you need to extract the reads back out as FASTQ; I didn't need this since Nucleus provided FASTQ
  5. use Java 8 to run GATK up through ~4.3.0.0, or Java 17 for GATK 4.4.0.0 and later
  6. install GATK to compute a whole bunch of stuff
    1. BQSR (Base Quality Score Recalibration): uses machine learning to correct systematic errors in the per-base quality scores
    2. VQSR (Variant Quality Score Recalibration): helps determine whether variants are false-positive sequencing errors or true-positive variants
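Condensed, steps 2-6 look roughly like this. Treat it as a sketch: the file names (GRCh38_noalt.fa, reads_R1/R2.fastq.gz, dbsnp.vcf.gz) are placeholders for whatever you downloaded, and the thread counts assume the c7i.4xlarge:

# one-time: index the reference for BWA and GATK
bwa index GRCh38_noalt.fa
samtools faidx GRCh38_noalt.fa
gatk CreateSequenceDictionary -R GRCh38_noalt.fa

# align paired reads and pipe straight into a sorted BAM (no intermediate SAM on disk)
bwa mem -t 16 -R '@RG\tID:sample1\tSM:sample1\tPL:ILLUMINA' GRCh38_noalt.fa \
    reads_R1.fastq.gz reads_R2.fastq.gz | samtools sort -@ 8 -o aligned.sorted.bam -
samtools index aligned.sorted.bam

# mark PCR/optical duplicates so they don't bias variant calls
gatk MarkDuplicates -I aligned.sorted.bam -O dedup.bam -M dup_metrics.txt

# BQSR: model systematic base-quality errors against known variant sites, then apply
gatk BaseRecalibrator -I dedup.bam -R GRCh38_noalt.fa --known-sites dbsnp.vcf.gz -O recal.table
gatk ApplyBQSR -I dedup.bam -R GRCh38_noalt.fa --bqsr-recal-file recal.table -O recal.bam

# call SNVs and indels
gatk HaplotypeCaller -I recal.bam -R GRCh38_noalt.fa -O variants.vcf.gz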

II. Annotate with VEP (adds scores)

  1. VEP (Variant Effect Predictor) is a tool developed by Ensembl that takes a VCF file (with indels and SNVs) and annotates each variant with biological information. VEP gives you a giant table of annotations (where the variant is in the genome, what effect it might have, predicted consequences, population frequency, CADD/REVEL scores, ClinVar status, etc.), but it doesn't tell you which variants matter most.
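A typical offline run against the GRCh38 cache looks something like this (the cache has to be downloaded first, and CADD/REVEL scores require their own VEP plugins and data files):

vep --input_file variants.vcf.gz --output_file annotated.vcf \
    --vcf --cache --offline --assembly GRCh38 --everything --fork 8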

III. Rank variants with VPOT

  1. VPOT (Variant Prioritisation Ordering Tool), developed by the Victor Chang Cardiac Research Institute, is a post-processing tool used after VEP annotation. VEP provides raw annotations and pathogenicity scores (CADD, REVEL, etc. from above); VPOT then applies a user-defined set of prioritization rules (via a manually created PPF file) to score and rank variants, returning a ranked list of the most impactful ones (the idea is sketched after this list).
  2. You can also do something called Trio Analysis, which involves comparing your genome to your mother's and father's; I describe it through the different inheritance models in the next section (What useful information can we get from our genome?).
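VPOT's exact invocation and the PPF format are documented in its README, so I won't reproduce them here. As a toy illustration of the underlying idea, you can already rank VEP-annotated variants by a single score; here the CADD_PHRED field, which assumes the CADD plugin was enabled during annotation:

# sort variants by CADD score pulled out of VEP's CSQ annotation (a one-column version of what VPOT generalizes)
bcftools +split-vep annotated.vcf -d -f '%CHROM\t%POS\t%REF\t%ALT\t%CADD_PHRED\n' \
  | sort -t$'\t' -k5,5nr | head -n 20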

IV. Annotate SVs and CNVs

Above, we looked for single-nucleotide variants (SNVs), but there are more kinds of mutation than single letters. Structural variants (SVs) are larger than SNVs, defined as being >50 base pairs in length. You can identify them by breakends, where one DNA sequence is connected to another in a nonadjacent region:

[Figure: a breakend joining two nonadjacent DNA sequences]

The software that identifies breakends is called Manta.

CNVs (copy-number variants) are a subtype of SVs (the long, >50 base pair mutations) that refer to duplications and deletions changing the actual number of copies of a DNA segment. They can be detected by spotting variations in read depth. One tool for this is CNVnator; CNVpytor is its Python successor. To spot CNVs >1M base pairs in length, you can use Conanvarvar.
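Both callers run directly on the recalibrated BAM from earlier. A minimal sketch, assuming recal.bam and the same reference as before (the 1000 bp bin size is a common starting point, not a tuned value):

# Manta: configure, then run the SV/breakend-calling workflow
configManta.py --bam recal.bam --referenceFasta GRCh38_noalt.fa --runDir manta_run
manta_run/runWorkflow.py -j 8

# CNVpytor: build a read-depth profile, partition it, and call CNVs
cnvpytor -root sample.pytor -rd recal.bam
cnvpytor -root sample.pytor -his 1000
cnvpytor -root sample.pytor -partition 1000
cnvpytor -root sample.pytor -call 1000 > cnv_calls.tsv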

What useful information can we get from our genome?

Trio analysis involves looking at the child alongside both parents, and VPOT can tell us about the inheritance pattern. This is helpful in contextualizing and understanding a disease.

Autosomal dominant means that the child and one of the parents are affected by the genetic disease, while the other parent is not. This occurs when the gene variant can cause the disease while present on only one of the two chromosome copies (like an OR gate in boolean logic). Huntington's disease (recently treatable with gene therapy!) is an example of an autosomal dominant disease.

Autosomal recessive means that the child is affected but neither parent is; however, both parents are carriers. This occurs when the gene variant causes the disease only when it is present on both chromosome copies. Each parent carries one copy of the variant, and the child inherits both (like an AND gate in boolean logic). Cystic fibrosis is an example of this.

Compound heterozygous is similar to autosomal recessive, except it involves two different variants in the same gene. Each parent has one good copy of the gene and one bad copy, each carrying a different damaging variant. The child, unfortunately, inherits one damaged copy from each parent, and the two complementary variants combine to cause the genetic disease.

De-novo means that the child has a new variant that the parents don't have; it arose spontaneously in the embryo. There are roughly 40-80 de-novo variants in each embryo, but they usually land in regions that don't manifest disease.
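If you have all three genomes merged into a single VCF, candidate de-novo variants fall out of a simple genotype filter (the sample order child, father, mother is an assumption here; dedicated trio tools handle sequencing error far more carefully):

# child heterozygous while both parents are homozygous reference -> candidate de-novo variant
bcftools view -i 'GT[0]="het" && GT[1]="RR" && GT[2]="RR"' trio.vcf.gz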

Case-control means building a graph that compares all of the family members. It is essentially what Nucleus is doing with their family offering.