What information can we actually get from a whole genome once it has been sequenced? There are many ways to slice it, but I would personally group all the available information into three buckets:
- Isolation: what can we get from this specific genome on its own? What structural observations, patterns, functions, phenotypes, and integrity checks can we make from the sequence itself? Much of this involves comparing it to the reference genome, GRCh38 (the scientific “average”).
- Comparative: what information can we get when we compare our genome to a graph of other humans and other species? What is the history of our genome? Where did we diverge from, say, the monarch butterfly? Are there biochemical compounds shared between our species and others that we have overlooked as potential drugs?
- Engineering: what phenotypic changes can we predict when we knock out specific genes? What about when we insert some base pairs at a specific location? What about using a specific CRISPR variant with a specific attached compound on our genome? How will this genome respond to some drug, targeting some cause, nullifying some effect, and so on?
Sidenote: one glaring hole I have observed involves the metadata around systems beyond the genome: we don’t yet have the data and tools to predict changes at a systemic level, far outside of the genome. That includes multicellular, multilevel biological changes, which is how you actually solve disease completely throughout an individual: rather than targeting the effect, you target the underlying cause, which is currently beyond the state of our technology.
General biopharma is incentivized to minimize the feeling of pain via small molecules, rather than treating the underlying systems-level problem (i.e., block the signal that is causing the headache, versus stretching out the knotted part of the neck that is causing it). Antibiotics are different: they actually kill whatever is trying to kill us. I’m not yet experienced enough to comment on the rest of biochemistry.
Isolation
Gathering information about an individual genome is primarily about observing its literal structure, predicting effects from that structure, and comparing it to the reference genome, GRCh38. There is also a newer assembly called T2T-CHM13, which is a complete telomere-to-telomere build.
This section primarily involves:
- Variants: single-nucleotide polymorphisms (single-base changes), indels (small insertions or deletions), structural variants (changes beyond the single-nucleotide level, usually affecting entire regions), copy number variations (sections of DNA ≥1 kb that are duplicated or deleted), and mobile element insertions (sequences of DNA that can copy or move themselves into new locations in the genome).
  - software tools: BWA, GATK, DeepVariant, DRAGEN, Sentieon, Manta, Lumpy, CNVnator, NVIDIA Parabricks
- Functions: once you have variants called against the reference, you want to annotate them. These annotations include gene context (coding, intronic, intergenic, UTR regions), predicted effects (missense, nonsense, synonymous, frameshift, splicing), pathogenicity scores (SIFT, PolyPhen-2, CADD, REVEL), regulatory regions (enhancers, promoters, TFBSs, eQTLs), and finally pathway mapping (via Reactome, KEGG, GO, BioCyc). A small sketch of reading these annotations from a VCF follows this list.
  - software tools: ENCODE, GTEx, VEP, SnpEff, ANNOVAR, Funcotator
- Clinical: now that you have the functions, you need to see how their effects manifest in a clinical setting. How will you react to specific treatments? To answer these questions, we look at Mendelian disease analysis (identifying known pathogenic variants via ClinVar, OMIM), polygenic risk scores (genome-wide additive risk estimation), pharmacogenomics (drug metabolism predictions, like CYP450 genotypes), and carrier screening (recessive allele identification). This often overlaps with the pathway databases mentioned in the Functions section.
  - databases: ClinVar, gnomAD, HGMD, PharmGKB
- Quality control: quality control of the genome includes coverage analysis (checking for uneven coverage that would bias variant calls), contamination detection (VerifyBamID), and sex inference (based on X/Y coverage). This is usually done right after sequencing.
  - software tools: Picard, FastQC, QualiMap, VerifyBamID
- Epigenetic context prediction: the epigenome is the collection of chemical modifications that determine which genes are turned on and off, without changing the underlying sequence. Here we infer gene regulation and cell activity patterns from the genome sequence alone using ML/DL. This includes DNA methylation propensity (CpG islands, sequence motifs), histone modification likelihood, and open chromatin regions (promoter/enhancer sequence features).
  - software models: Enformer, Basenji2, EpiGePT
- Non-coding and regulatory logic: deep sequence embeddings capture enhancer-promoter syntax and regulatory grammar from DNA, while repeatome and structural bias analysis detects genome-architecture constraints and chromatin patterns (RepeatMasker, TRF, DeepRepeat).
  - software tools: Evo2, HyenaDNA, Enformer v2, Basenji2, EpiGePT
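As promised in the Functions item, here is a minimal sketch of the "call variants, then annotate them" step, assuming a single-sample VCF that has already been annotated with VEP (so a CSQ INFO field is present) and that pysam is installed. The file path is a placeholder, not a prescribed pipeline:

```python
# Minimal sketch: tally variant types and VEP-predicted consequences from an
# annotated single-sample VCF (e.g., caller output run through VEP).
# "sample.vep.vcf.gz" is a placeholder path.
from collections import Counter

import pysam

vcf = pysam.VariantFile("sample.vep.vcf.gz")

# VEP documents its per-transcript annotation columns in the CSQ header
# description, e.g. "... Format: Allele|Consequence|IMPACT|SYMBOL|...".
csq_fields = vcf.header.info["CSQ"].description.split("Format: ")[-1].split("|")

variant_types = Counter()
consequences = Counter()

for rec in vcf:
    for alt in rec.alts or ():
        if len(rec.ref) == 1 and len(alt) == 1:
            variant_types["SNV"] += 1
        else:
            variant_types["indel/other"] += 1
    # Each CSQ entry is one transcript-level annotation for this variant.
    for entry in rec.info.get("CSQ", ()):
        ann = dict(zip(csq_fields, entry.split("|")))
        consequences[ann.get("Consequence", "unknown")] += 1

print("Variant types:", dict(variant_types))
print("Most common consequences:", consequences.most_common(10))
```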
Comparative
In this section, we look for additional information by comparing our genome to other humans and other species. Whereas the previous section combined looking at the physical structure, the surrounding chemistry, and variants relative to the reference genome (GRCh38/T2T-CHM13), this one deals mainly with graphs and additional metadata.
There is a graph of human genomes called the HPRC Pangenome: a graph of human haplotypes capturing population diversity. Aligning reads to this population graph reduces reference bias and exposes variants missing from linear builds.
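For concreteness, here is a rough sketch of what graph alignment looks like with vg Giraffe. The graph and read file names are placeholders, and the exact flags differ between vg releases, so treat this as an assumed outline rather than a verified recipe:

```python
# Rough sketch: map short reads against an HPRC pangenome graph with vg giraffe
# instead of a linear reference. File names are placeholders; exact vg flags
# vary by release, so check `vg giraffe --help` for your version.
import subprocess

graph = "hprc-pangenome.gbz"        # assumed pre-built HPRC graph (GBZ format)
reads = ["sample_R1.fastq.gz", "sample_R2.fastq.gz"]

# Align to the graph; giraffe writes graph alignments (GAM) to stdout.
with open("sample.gam", "wb") as out:
    subprocess.run(
        ["vg", "giraffe", "-Z", graph, "-f", reads[0], "-f", reads[1]],
        stdout=out,
        check=True,
    )

# Summary statistics over the graph alignments (total reads, aligned, etc.).
subprocess.run(["vg", "stats", "-a", "sample.gam"], check=True)
```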
Anyways, this section includes:
- Population-level analysis: ancestry composition (population admixture, PCA projection against 1000 Genomes or HGDP) tells you how your genome moved and evolved across time; haplotype phasing resolves which alleles co-occur on the same chromosome; runs of homozygosity (ROH) reveal inbreeding or selection; and population structure analysis uncovers founder events, bottlenecks, and migrations. A small PCA sketch follows this list.
  - software tools: PLINK, ADMIXTURE, Beagle, SHAPEIT, KING, EIGENSOFT, vg, minigraph, HISAT2-graph (HPRC), Giraffe
- Evolutionary genomics: conservation scores to infer functional constraint (PhyloP, PhastCons), positive selection (via dN/dS ratios, selective sweeps), and ortholog/homolog detection (identifying gene conservation across species).
  - software tools: PAML, OrthoFinder, PHYLIP, BEDTools with conservation tracks, Foldseek, GNPS, BiG-SCAPE 2.0, antiSMASH 7, DeepCons, Zoonomia 2024, PhyloP, PhastCons
- Structural genomics: this part includes predicting 3D chromatin structure (via Hi-C data integration), inferring enhancer-promoter loops from sequence plus Hi-C, and predicting protein structure/function for novel variants (found by comparing many, many genomes).
  - software tools: AlphaFold, Evo2, Foldseek, DeepSEA, Basenji
- Multi-omics integration: lastly, we can check for multi-level information via expression quantitative trait loci (eQTLs, which link variants to gene expression), methylation quantitative trait loci (meQTLs, which link variants to methylation states), proteome/metabolome QTLs (variant → protein abundance), and causal inference (Mendelian randomization, linking variant → biomarker → phenotype).
  - software tools: FastQTL, TensorQTL, COLOC, SuSiE, MOFA+, mixOmics, TwoSampleMR, MR-Base
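As referenced in the population-level item, here is a small sketch of the PCA projection step, assuming your genotypes have already been merged with a reference panel (e.g., 1000 Genomes) into a PLINK .bed/.bim/.fam fileset. The file names are placeholders:

```python
# Small sketch: ancestry-style PCA with PLINK 1.9, then loading the components
# for inspection. "merged_with_1kg" is a placeholder for a genotype fileset
# that already contains your sample plus reference-panel samples.
import subprocess

import pandas as pd

# Compute the top 10 principal components of the genotype matrix.
subprocess.run(
    ["plink", "--bfile", "merged_with_1kg", "--pca", "10", "--out", "pca_out"],
    check=True,
)

# pca_out.eigenvec holds FID, IID, then PC1..PC10, whitespace-separated.
cols = ["FID", "IID"] + [f"PC{i}" for i in range(1, 11)]
pcs = pd.read_csv("pca_out.eigenvec", sep=r"\s+", header=None, names=cols)

# Samples that cluster together in PC space typically share ancestry; your
# genome's coordinates can then be read off against the reference populations.
print(pcs.head())
```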
Engineering
Once we have the informational base, what can we do with it? What new technologies, techniques, industries, and sci-fi fantasies can we make real? This is the most compelling part: what we can do with the genome once we understand it.
- Synthetic design and prediction: predict CRISPR off-target sites, design gene edits with minimal off-target activity, predict transcriptional outcomes of mutations, and model expression from sequence (sequence-to-expression transformers). A toy off-target scan follows this list.
  - software tools: CRISPRitz, DeepCRISPR, PredictDB, Evo2, BPNet, BE-Deep, PrimeDesign, Enformer v2, xTrimoGene, EVE, AlphaMissense, PrimateAI-3D
- Engineering-level prediction and modeling: predict protein folding/misfolding after mutation, simulate functional domain disruptions, estimate phenotypic consequences of genetic changes (knockouts, insertions, promoter swaps), and model drug-target interactions (how this genome responds to this drug, targeting this cause, nullifying that effect, etc.).
  - software tools: AlphaFold, Evo2, DeepDrug, MolFormer, AutoDock Vina
- Design-test-learn loops: this one is more about workflow: integrate genome → expression → phenotype modeling pipelines; optimize edits using CRISPR screens, directed evolution, and deep mutational scanning data; feed results back into predictive ML models for improved next-round design; and test structure, regulation, and experimental constraints in silico before wet-lab validation.
  - software tools: Airflow/Nextflow/Snakemake orchestrating the tools above, like ColabFold, Foldseek, Rosetta Scripts, LabMate, Synthace, and the Benchling API
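To show the core idea behind off-target prediction at a toy level: scan a sequence for near-matches to a 20-nt guide followed by an NGG PAM. Real tools like CRISPRitz add genome-scale indexing, DNA/RNA bulges, reverse-strand search, and learned scoring; the guide and target below are invented for illustration:

```python
# Toy sketch of CRISPR off-target scanning: find sites within a few mismatches
# of the guide that sit next to an NGG PAM (SpCas9). Reverse-strand search,
# bulges, and scoring models are omitted for brevity.

def hamming(a: str, b: str) -> int:
    """Number of mismatched positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def find_off_targets(guide: str, sequence: str, max_mismatches: int = 3):
    """Yield (position, site, mismatches) for candidate NGG-adjacent sites."""
    glen = len(guide)
    for i in range(len(sequence) - glen - 2):
        site = sequence[i:i + glen]
        pam = sequence[i + glen:i + glen + 3]
        if pam[1:] == "GG":                      # NGG PAM check
            mm = hamming(guide, site)
            if mm <= max_mismatches:
                yield i, site, mm

guide = "GACGTTAGCCTAGGATCAAC"                   # hypothetical 20-nt guide
target = "TTGACGTTAGCCAAGGATCAACAGGCTTTACGTGACGTTAGCCTAGGATCAACTGGAA"
for pos, site, mm in find_off_targets(guide, target):
    print(f"pos={pos} site={site} mismatches={mm}")
```

A production pipeline would run this kind of search genome-wide and score and rank the candidate sites rather than simply listing them.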
As you can see, the second and third sections are smaller than the first. That is because these areas are the least developed, which also makes them the most opportune.
Implementation
Now it is time to implement our pipeline. We’ll decide what to execute, when, and how. The plan is below:
[WIP for implementation- waiting for the book Statistical Genomics to come in]