9/17/25
GitHub repo: (embedded link)
I had my genome sequenced by Nucleus and received 54GB of compressed files. Decompressed, this turns into 273GB. In order to run some analyses, I started reading this book, but went off-script once I got the gist of it. The authors are great; they really took the time to understand everything. I recommend buying the book.
This write-up is intended for software engineers who have some experience with AWS/cloud, so I am going to skip over a lot of implementation details and expect that you can fill in the gaps. If you need help, just reach out to me, but ChatGPT should be able to answer everything pretty easily.
Lastly, I've provided this whole project as a public Terraform repo, so you can follow the README.md if you want to spin this up in your own cloud environment. I used a c7i.4xlarge instance (Intel Xeon Platinum 8488C, 8 physical cores (16 vCPUs), 32 GB RAM), so you may want to change the size of it in terraform.tfvars. However, this whole project should fit under the $100 in free credits they give you, especially if you do it in one sitting. I budgeted for 2 weeks of full compute, coming to a little over $100.
Project
First, you will need to get your genome sequenced. I recommend Nucleus or Nebula Genomics. They let you download your genome: they send an email with one VCF file declaring your variants and 16 FASTQ files containing your raw reads (the files come in pairs, one for each end of the sequenced fragments).
Then, you need an AWS account. In order to qualify for the free $100 credits, you’ll need to sign up with a new email + credit card, ones that you have never used before with AWS. You can get an additional $100 by doing 5 of their really short tutorials, which is 100% worth it (it shows up on the homepage after you have created an account).
Once your genome is sequenced, you can request that they send you the compressed version of your genome. Depending on the company, they will send you the files either in FASTQ or BAM format. There are other file types, but they are not popular. After downloading your whole genome, you will need to configure your AWS account locally with aws configure. You can install the AWS CLI with Homebrew, curl, etc.
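If it helps, a minimal sketch of the CLI setup on macOS (Homebrew shown; the curl-based installer from the AWS docs works the same way on Linux):

brew install awscli
aws configure                  # prompts for access key, secret key, region, and output format
aws sts get-caller-identity    # sanity check that the credentials actually work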
You will then need to follow the instructions in the README.md to set up the environment and terraform.tfvars:
region = "us-east-1" # use whatever is closest to you
project_name = "your-project-name" # like "wgs-analysis-your-name"
home_ip_cidr = "YOUR.IP.ADDRESS/32" # your public IP with /32
ssh_key_name = "your-ec2-keypair-name"
s3_bucket_name = "your-unique-bucket-name" # must be globally unique
instance_type = "c7i.4xlarge" # or "g5.2xlarge" for GPU
disk_gb = 500
use_spot = true
alert_email = "your-email@example.com"
database_password = "your-secure-database-password-here"
application_secret = "your-application-secret-key-for-auth"
Once you have completed all of these steps, run terraform init, terraform plan, and terraform apply -auto-approve, and this will deploy the entire repo to your AWS account. Make sure that you have the free $100 credits in your account, because the second this is deployed, AWS starts charging you.
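For reference, the deploy loop is just the standard Terraform workflow; a minimal sketch from the repo root (terraform destroy is the easiest way to stop the charges once you're done):

terraform init                  # download providers, initialize state
terraform plan                  # review what will be created
terraform apply -auto-approve   # create everything (billing starts here)
terraform destroy               # tear it all down when you're finished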
The final step for setting up the environment is actually uploading your whole genome to the S3 bucket you created. You can do this by running the script provided in the upload_template.sh file. The script is templatized for you to just copy and paste your values.
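If you'd rather do it by hand, a rough equivalent with the AWS CLI looks like this (the bucket name should match s3_bucket_name from terraform.tfvars; the local path is a placeholder):

aws s3 cp ./fastq/ s3://your-unique-bucket-name/raw/ \
  --recursive --exclude "*" --include "*.fastq.gz"   # upload only the FASTQ files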
Analysis
Now that you have everything on AWS, what do you do? The general outline of our analysis will follow what is below:
The pipeline creates a VCF file containing the scored and annotated variants in your genome, called against a reference genome, does some computations, then returns some scores. If you want to skip the steps below and just read or copy/paste the code, go ahead (a condensed command sketch also follows the diagram).
Complete Whole-Genome, CPU-based Analysis Pipeline Summary
Pipeline Overview
Total runtime: ~25 hours
Sample name: Bradley_Woolf (for file reference below, yours will be your name)
Reference: GRCh38
Hardware: 24-core CPU, 32GB RAM, 1.6TB of disk
input: decompressed FASTQ files (273GB)
↓
1. quality control (MultiQC, 30 min)
↓
quality metrics, coverage statistics
↓
2. alignment (BWA-MEM, 12 hours) → BAM
↓
Bradley_Woolf_merged.bam (93GB)
↓
3. mark duplicates (GATK Picard, 1 hour)
↓
Bradley_Woolf_dedup_rg.bam
↓
4. base quality score recalibration (GATK BQSR, 2-3 hours)
↓
Bradley_Woolf_recal.bam (93GB, final BAM)
↓
5. variant calling (GATK HaplotypeCaller, 3-4 hours)
↓
Bradley_Woolf_raw.vcf.gz (4.97M variants)
↓
6. output VCF!
↓
7. variant filtering (GATK VariantFiltration, 30 min)
↓
Bradley_Woolf_filtered.vcf.gz (4.88M variants)
↓
8. variant annotation (SnpEff v5.3, 30-45 min)
↓
Bradley_Woolf_annotated.vcf.gz (214MB)
Bradley_Woolf_annotated.tsv (871MB)
Bradley_Woolf_snpeff_summary.html (622KB)
Success: 98% (99,673 annotation errors out of 4.95M variants)
↓
9. variant prioritization (bcftools + awk, QUAL>=30, 1 minute)
↓
Bradley_Woolf_prioritized_variants.tsv (656MB)
4,581,727 high-quality variants (94% pass rate)
↓
10. structural variant detection (Delly, 1-2 hours)
↓
Bradley_Woolf.sv.vcf.gz (4.5MB)
↓
11. copy number variant detection (CNVkit, 30 min)
↓
Bradley_Woolf_recal.cns (81KB)
Bradley_Woolf_recal.call.cns (40KB)
Bradley_Woolf_recal-diagram.pdf (11MB)
Bradley_Woolf_recal-scatter.png (79KB)
Results: 593 significant CNV regions detected
Coverage: 24,674 significant bins (4.21% of genome)
Sex: Male (based on chrX/Y ratios)
↓
output: lots of information about your genome!
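To make the diagram concrete, here is a condensed, hedged sketch of the core commands. The actual scripts in the repo differ; the reference path, known-sites file, SnpEff database name, filter thresholds, and thread counts below are placeholders, not the repo's exact values.

REF=GRCh38.fa            # reference genome (placeholder path)
SAMPLE=Bradley_Woolf     # your sample name
THREADS=16

# 1. QC on the raw reads
fastqc -t "$THREADS" *.fastq.gz
multiqc .

# 2. Alignment with BWA-MEM, piped straight into a sorted BAM
bwa mem -t "$THREADS" -R '@RG\tID:1\tSM:'"$SAMPLE"'\tPL:ILLUMINA' \
  "$REF" R1.fastq.gz R2.fastq.gz \
  | samtools sort -@ "$THREADS" -o "${SAMPLE}_merged.bam"
samtools index "${SAMPLE}_merged.bam"

# 3. Mark duplicates
gatk MarkDuplicates -I "${SAMPLE}_merged.bam" -O "${SAMPLE}_dedup_rg.bam" -M dup_metrics.txt

# 4. Base quality score recalibration
gatk BaseRecalibrator -I "${SAMPLE}_dedup_rg.bam" -R "$REF" \
  --known-sites dbsnp.vcf.gz -O recal.table
gatk ApplyBQSR -I "${SAMPLE}_dedup_rg.bam" -R "$REF" \
  --bqsr-recal-file recal.table -O "${SAMPLE}_recal.bam"

# 5-6. Variant calling
gatk HaplotypeCaller -I "${SAMPLE}_recal.bam" -R "$REF" -O "${SAMPLE}_raw.vcf.gz"

# 7. Hard filtering (the expression here is illustrative)
gatk VariantFiltration -V "${SAMPLE}_raw.vcf.gz" \
  --filter-expression "QD < 2.0" --filter-name "LowQD" \
  -O "${SAMPLE}_filtered.vcf.gz"

# 8. Annotation with SnpEff (database name is a placeholder)
java -Xmx16g -jar snpEff.jar GRCh38.105 "${SAMPLE}_filtered.vcf.gz" \
  | bgzip > "${SAMPLE}_annotated.vcf.gz"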
Final results
These are the outputs from above, summarized. You can use my outputs as a comparison point for your own!
Metrics
- Total variants processed: 4,879,865
- High-quality variants: 4,581,727 (94% pass rate)
- Annotation success rate: 98%
- Structural variants: multiple DEL/DUP/INV/BND calls
- Copy number variants: 593 significant regions
- Processing speed: ~23,000 variants/sec (SnpEff)
Output Files
- Bradley_Woolf_snpeff_summary.html: interactive annotation report
- Bradley_Woolf_prioritized_variants.tsv: 4.58M high-quality variants
- Bradley_Woolf_annotated.tsv: gene annotations with impact levels
- Bradley_Woolf.sv.vcf.gz: structural variants
- Bradley_Woolf_recal.call.cns: copy number variants
- Bradley_Woolf_recal-diagram.pdf: CNV visualization
Analysis Quality
- Coverage: high-quality alignment across genome
- Variant Quality: 94% pass rate (QUAL≥30, FILTER=PASS)
- Annotation: comprehensive gene-level annotations
- Structural variants: large variants >50bp detected
- Copy number: significant CNV regions identified
That being said, the output stats are whatever; I have no idea if they are high quality or not. This was mainly an exercise in programming around DNA. The biggest bottleneck, by far, was the BWA-MEM alignment step, which took 12 hours. I had to make performance improvements in the analysis part, mainly parallelizing and updating old packages: I parallelized Delly (4x speedup by running the 4 SV types simultaneously), parallelized SnpEff (4.4x speedup), increased SnpEff's memory to 16GB, and used all 24 cores.
The remaining time was spent debugging and adding additional analysis.
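As an illustration of that Delly parallelization, here is a sketch that assumes an older Delly release that still accepts -t to restrict the SV type (newer releases drop -t and call all types in a single run), using file names from the pipeline above:

# Run the four SV types as independent background processes, then wait for all of them
for SVTYPE in DEL DUP INV BND; do
  delly call -t "$SVTYPE" -g GRCh38.fa \
    -o "Bradley_Woolf.${SVTYPE}.bcf" Bradley_Woolf_recal.bam &
done
wait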
What useful information can we get from our genome?
This is more of an add-on section, but it seems like most of the genomic information we can derive is Mendelian rather than molecular. This means that we can get inheritance information and compare based on that, but not the full gene → phenotype mapping, which would be more personally valuable. For example, trio analysis involves looking at the parents and child, and VPOT can tell us some information about the inheritance. This is helpful in contextualizing and understanding a disease.
Autosomal dominant means that the child and one of the parents are affected by the genetic disease, while the other parent is not. This occurs when the gene variant is able to cause the disease when it is present on just one of the two copies of the chromosome, rather than needing to be on both (like an OR gate in boolean logic). Huntington's disease (recently treatable with gene therapy!) is an example of an autosomal dominant disease.
Autosomal recessive means that the child is affected, but neither of the parents is. However, both parents are carriers. This occurs when the gene variant causes the disease only when it is present on both copies of the chromosome. Each parent carries one copy of the variant, and the child inherits both (like an AND gate in boolean logic). Cystic fibrosis is an example of this.
Compound heterozygous is similar to autosomal recessive, except it involves two different variants in the same gene. Each parent has one good copy of the gene and one bad copy, but their bad copies carry different variants. The child, unfortunately, inherits a bad copy from each parent, so neither copy of the gene works, and the combination causes the genetic disease.
De-novo means that the child has a new variant that the parents don't have; it arose spontaneously in the embryo. There are 40-80 de-novo variants in each embryo, but usually they land in regions that don't manifest a disease.
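As a rough illustration of how you might flag de-novo candidates from a merged trio VCF with bcftools (this assumes the samples were merged in the order child, mother, father, and that simple hard filters are enough; dedicated trio callers do much more):

# Child heterozygous, both parents homozygous reference = de-novo candidate
bcftools query -i 'GT[0]="het" && GT[1]="RR" && GT[2]="RR"' \
  -f '%CHROM\t%POS\t%REF\t%ALT\t%QUAL\n' trio.vcf.gz \
  > denovo_candidates.tsv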
Case-control means comparing the affected family members (cases) against the unaffected ones (controls) across the whole family. It is essentially what Nucleus is doing with their family offering.