I’ve downloaded my personal whole genome sequence from Nucleus and the Nucleus Origin dataset. In this project, I’ll calculate my own polygenic risk scores (PRS) based on my sequenced genome and the Nucleus Origin model weights to find out what diseases I am at risk for.
I’ll also break down some genetic concepts using the Nucleus Origin model and my own genome.
The GitHub repo is here. It has an AGENTS.md file so your local AI code editor can execute everything without you needing to understand the code.
Nucleus Origin
From an initial review, Nucleus Origin isn’t a model with newly discovered links between base pair mutations and diseases, but rather, it’s an “average of averages.” Nucleus most likely took multiple GWAS studies with millions of single-nucleotide polymorphisms, harmonized those with their own data, and put it into one table.
I assume the goal of this is to create a predictive model that consumes genetic variant data, forecasts disease effect sizes, and outputs a score of someone’s genetic predisposition to disease.
The table itself is a 1.4GB .tsv reference matrix for polygenic risk scores (PRS) of common variants. The overall table has 2+ million SNP effect sizes across nine diseases and traits.
On the far left of the table is a single nucleotide polymorphism (SNP) identifier, such as rs12132974. SNP is a fancy term for “this base letter is different from the human average that we have collected.” The table’s X axis lists the common diseases, and the Y axis lists the SNPs contributing to each disease. The sign (+/-) of each value indicates whether the SNP on the Y axis increases or decreases the X axis phenotype.
For example, using the -8.25609722264165e-06 value in the Alzheimer’s column for the SNP rs1213974 (first row), we can tell that each copy of T (the effect allele) reduces Alzheimer’s risk by about 0.00000826 units on the polygenic risk score scale.
So is this good? Yes, since a negative value indicates a decrease in the expression of the specific disease phenotype described on the X axis (negative = decrease, positive = increase).
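To make the sign convention concrete, here is a minimal sketch of what that single table cell contributes when someone carries 0, 1, or 2 copies of the effect allele. The function name `snp_contribution` is my own for illustration and is not part of the Nucleus files.

```python
# Effect size (beta) for the first-row SNP in the Alzheimer's column, copied from the table.
ALZHEIMERS_BETA = -8.25609722264165e-06


def snp_contribution(beta: float, effect_allele_copies: int) -> float:
    """Return one SNP's contribution to a raw polygenic score."""
    if effect_allele_copies not in (0, 1, 2):
        raise ValueError("a diploid genotype carries 0, 1, or 2 copies")
    return beta * effect_allele_copies


# Print the contribution for each possible number of effect-allele copies.
for copies in (0, 1, 2):
    print(copies, snp_contribution(ALZHEIMERS_BETA, copies))
```

With a negative beta, more copies push the score further down; with a positive beta, they push it up.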
What are Polygenic Risk Scores?
A genome is made up of A’s, T’s, C’s, and G’s. When you are conceived, the ATCG’s from your mom and the ATCG’s from your dad come together to form your unique ATCG’s. As you grow into an adult, these letters guide all the cells in your body on what to become and when. The genome, in this context, is a literal program for replication. The resulting “growth” is what makes you, you. Eye color, hair shape, height, metabolism, electrical sensitivity, etc. are all influenced by the genome.
Just as the attributes that make you “you” are emergent, so are the “diseases” that manifest. These byproducts of malformed base pairs can have long-term consequences that we consider harmful, as compared to a healthy, high energy, fast thinking human. On the other hand, they can also be beneficial for preventing harmful outcomes (a double negative makes a positive).
Therefore, we want to measure and calculate those malformed base pairs. The earlier we catch them, the more time we have to prevent the long-term compounding of those diseases. If we do not, the disease reaches a state where it is no longer treatable, so we must resort to extreme measures (such as consuming lots of pills, heavy chemical flushing like chemo, etc).
In order to know what is a disease and what is not a disease, we need to compare the genes in question with a resulting phenotype (the thing that makes you, a disease, X axis, etc).
To do this, we have a function called a polygenic score. A polygenic score is essentially a weighted sum that estimates how thousands or millions of small genetic differences (those SNPs) add up to influence a particular trait or disease risk. Each SNP in the Nucleus Origin table comes with a tiny “effect size” (the +/- number you see), which tells us whether having a particular letter (the effect allele) nudges the risk up or down (the +/-), and by how much (the number).
The score is just an approximation, but when you add up all of these tiny contributions across your genome, you get a personalized estimate of “genetic liability” for that condition.
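As a rough sketch of that weighted sum, the snippet below adds up beta times copies for three made-up SNPs. The rsIDs, effect sizes, and genotypes are invented purely for illustration, and `raw_polygenic_score` is a hypothetical helper name.

```python
# Toy effect sizes for one disease; the rsIDs and values are invented for illustration.
toy_betas = {"rsA": 0.02, "rsB": -0.01, "rsC": 0.005}
# Toy genotype: copies of the effect allele carried at each SNP (0, 1, or 2).
toy_dosages = {"rsA": 1, "rsB": 2, "rsC": 1}


def raw_polygenic_score(betas: dict, dosages: dict) -> float:
    """Sum beta * dosage over the SNPs present in both inputs."""
    shared = betas.keys() & dosages.keys()
    return sum(betas[rsid] * dosages[rsid] for rsid in shared)


score = raw_polygenic_score(toy_betas, toy_dosages)
print(score)
```

The real table does the same thing, just with millions of rows and far smaller effect sizes per row.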
How to Calculate Your Polygenic Risk Score
Remember, the goal of the Origin model is to identify associations, not mechanisms. It does not encode biological laws, but rather makes observations on probabilities. It predicts disease risk from genetic variants and forecasts outcomes from input data without explaining underlying mechanisms. Because we aren’t modeling the in-between molecular biology, we just use basic statistics and compare against everyone else’s diseases.
For example, if we wanted to compare our scores against a population dataset, we would follow the equation below:

$$Z_d = \frac{\sum_{i=1}^{N} \beta_{i,d} \, G_i - \mu_d}{\sigma_d}$$

Where:
- $Z_d$ = my personal genetic risk score for disease $d$ (z-score)
- $\sum_{i=1}^{N}$ = sum of every SNP flowing through the following function, from the 1st to the Nth
- $\beta_{i,d}$ = how much SNP number $i$ increases/decreases risk for disease $d$ (GWAS effect size/weight)
- $G_i$ = how many risk-increasing copies you personally have of SNP $i$
- $N$ = total number of genetic variants (SNPs) included in the score
- $\mu_d$ = the average genetic risk score for disease $d$ across a large reference population
- $\sigma_d$ = how much the genetic risk scores for disease $d$ typically vary (spread) in that reference population
- $G_i \in \{0, 1, 2\}$ = alleles at each SNP
  - 0 = homozygous reference (AA)
  - 1 = heterozygous (AT)
  - 2 = homozygous alternate (TT)
The risk score for a specific disease is calculated by taking every genetic variant (SNP) that I share with the Nucleus Origin model, multiplying the known effect size of that variant on the disease by the number of risk alleles I carry at that variant, and then adding up all of those individual contributions across all variants. If we wanted to see how far we are from a typical population, we would then subtract the population mean $\mu_d$ and divide by the population spread $\sigma_d$.
In even simpler terms, my genetic risk score for a specific disease is the sum of (how strongly that location affects the disease) × (how many risk versions of that location I personally have) across all DNA locations. So there is one score calculated per disease.
Thus, the output of this experiment will have 9 scores for my genome: the formula is applied to each disease, and each disease produces its own score.
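The "how many copies" part is the fiddly bit in practice, because a VCF stores genotypes as indices like `0/1` rather than as a count of the effect allele. Below is a minimal sketch of one way to turn a biallelic VCF genotype into a 0/1/2 count. It assumes the effect allele is either the REF or the ALT base; `effect_allele_dosage` is a hypothetical helper, not a function from my repo, and the real script may handle edge cases differently.

```python
from typing import Optional


def effect_allele_dosage(ref: str, alt: str, gt: str, effect_allele: str) -> Optional[int]:
    """Count copies of the effect allele in a biallelic VCF genotype such as "0/1".

    Returns None when the genotype is missing or the effect allele is neither REF nor ALT.
    """
    alleles = {"0": ref, "1": alt}
    calls = gt.replace("|", "/").split("/")  # treat phased and unphased calls alike
    if any(call not in alleles for call in calls):
        return None  # missing call (".") or a multi-allelic index this sketch ignores
    if effect_allele not in (ref, alt):
        return None  # allele mismatch; a caller would drop this SNP
    return sum(alleles[call] == effect_allele for call in calls)


print(effect_allele_dosage("A", "T", "0/1", "T"))
```

Note that when the effect allele is the REF base, a `0/0` genotype counts as two copies, which is why counting ALT alleles alone is not always the same thing.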
Computing the PRS’s
Before we start, there is a problem: the VCF that comes from Nucleus doesn’t have rsIDs, which are the SNP identifiers that the Nucleus Origin dataset uses (the first column on the left). So, we need to map our genome to these rsIDs. The full process is below:
- Obtain your genome’s VCF
- Download the Nucleus Origin weights
- Download the dbSNP reference (GRCh38 VCF)
- Annotate our genome with rsIDs
- Validate mapping accuracy
- Compute PRS for each disease
Download your personal whole-genome variant file (.vcf or .vcf.gz) from Nucleus. You can request this through your personal account. The VCF contains your variants in coordinate form (chromosome, position, REF, ALT), which we then map to rsID form.
The Origin weights are also requested through the Nucleus site. The NUCLEUS_ORIGIN_V1.tsv file provides per-SNP effect sizes (β coefficients) for nine diseases, indexed by rsID and effect allele (A1).
Get the official dbSNP VCF from NCBI matching your genome build. You most likely also have GRCh38/hg38. This is the reference file that maps rsIDs ↔ genomic coordinates (chr, pos, ref, alt).
After EXTENSIVE searching, I found this index that has the full 15GB VCF of the full human genome. It should have common and rare variants we can use as reference.
Use bcftools annotate to map your variants by coordinate to dbSNP and insert the rsIDs into your VCF (or just execute the script in the repo). You should have a VCF with rsIDs aligned to Nucleus’s indexing scheme after running.
In order to do this efficiently, we filter the dbSNP data with the rsIDs from the Nucleus Origin dataset to get positions, then match our VCF to Nucleus Origin by position (chr:pos). We need to match these three sets to look up the effect sizes. It’s annoying, but necessary. We can represent this with set notation.
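Here is a minimal sketch of that three-way match using plain dictionaries. All of the data is invented, and `join_sources` is a hypothetical name; the repo's script works on the real files and may be organized differently.

```python
# Toy stand-ins for the three sources; every value here is invented.
origin_betas = {"rs1": 0.01, "rs2": -0.02, "rs3": 0.03}  # rsID -> beta for one disease
dbsnp_positions = {  # rsID -> (chrom, pos), as looked up in dbSNP
    "rs1": ("chr1", 1000),
    "rs2": ("chr1", 2000),
    "rs9": ("chr2", 500),
}
my_dosages = {("chr1", 1000): 2, ("chr2", 500): 1}  # (chrom, pos) -> allele count


def join_sources(betas, positions, dosages):
    """Return {rsID: (beta, dosage)} for SNPs present in all three sources."""
    joined = {}
    for rsid, beta in betas.items():
        locus = positions.get(rsid)  # rsID -> chr:pos via dbSNP
        if locus is None or locus not in dosages:
            continue  # not in dbSNP, or not present in my VCF
        joined[rsid] = (beta, dosages[locus])
    return joined


matched = join_sources(origin_betas, dbsnp_positions, my_dosages)
print(matched)
```

Only rs1 survives here: rs2 has a position but no genotype, rs3 has no position, and rs9 is not in the weights table.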
It should be fine, but spot-check a few variants to confirm that rsIDs, positions, and alleles match (no strand flips or coordinate mismatches).
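One way to spot-check alleles is to classify each record by whether the effect allele appears in REF/ALT directly or only after complementing the base. This is a simplified sketch for single-base alleles, and `classify_allele_match` is a hypothetical helper. A thorough check would also need the non-effect allele, which I am not assuming the weights table provides.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def classify_allele_match(ref: str, alt: str, effect_allele: str) -> str:
    """Classify how a single-base effect allele relates to a VCF record's REF/ALT."""
    if effect_allele in (ref, alt):
        return "match"
    if COMPLEMENT.get(effect_allele) in (ref, alt):
        return "possible_strand_flip"  # matches only on the opposite strand
    return "mismatch"


print(classify_allele_match("A", "G", "G"))
print(classify_allele_match("A", "G", "C"))
print(classify_allele_match("A", "T", "C"))
```

A/T and C/G SNPs are their own complements, so this check cannot detect a flip for them; those would need a closer look.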
Join your annotated VCF (rsIDs + genotypes) with Nucleus’s effect-size table on rsID.
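For each disease $d$:

$$\text{PRS}_d = \sum_{i=1}^{N} \beta_{i,d} \, G_i$$

where $\beta_{i,d}$ is the effect size of SNP $i$ for disease $d$ and $G_i$ is the number of risk-allele copies I carry at SNP $i$.
Note: we don’t compare my personal genome to a population one, so we don’t use $\mu_d$ and $\sigma_d$.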
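A minimal sketch of this step, computing one raw score per disease column from a tab-separated weights table. The tiny table, the column names `rsid` and `A1`, and all numbers are assumptions for illustration; check the header of the real file before reusing anything like this.

```python
import csv
import io

# A toy stand-in for the weights table. The column names "rsid" and "A1" and the
# numbers are assumptions for illustration, not the real file's contents.
TOY_TSV = (
    "rsid\tA1\tALZHEIMERS\tTYPE2DIABETES\n"
    "rs1\tT\t-0.5\t0.25\n"
    "rs2\tG\t0.125\t-0.75\n"
    "rs3\tA\t0.25\t0.5\n"
)
# Toy dosages: copies of A1 carried at each rsID (rs3 is absent from this genome).
toy_dosages = {"rs1": 1, "rs2": 2}


def prs_per_disease(tsv_handle, dosages):
    """Accumulate sum(beta * dosage) separately for every disease column."""
    reader = csv.DictReader(tsv_handle, delimiter="\t")
    diseases = [name for name in reader.fieldnames if name not in ("rsid", "A1")]
    totals = dict.fromkeys(diseases, 0.0)
    for row in reader:
        dosage = dosages.get(row["rsid"])
        if dosage is None:
            continue  # SNP not matched in this genome, so it contributes nothing
        for disease in diseases:
            totals[disease] += float(row[disease]) * dosage
    return totals


scores = prs_per_disease(io.StringIO(TOY_TSV), toy_dosages)
print(scores)
```

Reading row by row keeps memory flat, which matters more for a 1.4GB table than for this toy one.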
To implement this whole process yourself, use the code on my GitHub account. It’s easier to do it with Cursor/Claude than for me to copy and paste code snippets in here.
Results
Like I mentioned earlier, we essentially need to do a join operation between the 3 sources: the 15GB dbSNP we downloaded, the Nucleus Origin dataset, and our personal genome VCF. So, we filter the dbSNP records by the Nucleus Origin rsIDs (far left column on the table) to get positions from the dbSNP (rsID → chr:pos), then match our VCF to those positions (since our VCF only has position data, not rsIDs), then use the rsIDs (from the mapping) to look up effect sizes in the Origin table for our personal genome.
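Because the dbSNP file is so large, the filtering step is best done as a stream rather than by loading everything into memory. The sketch below shows the idea on a tiny in-memory gzipped VCF; `stream_rsid_positions` is a hypothetical helper and the records are invented.

```python
import gzip
import io


def stream_rsid_positions(vcf_gz_handle, wanted_rsids):
    """Yield (rsid, chrom, pos) for wanted rsIDs while streaming a gzipped VCF."""
    with gzip.open(vcf_gz_handle, "rt") as lines:
        for line in lines:
            if line.startswith("#"):
                continue  # skip header lines
            chrom, pos, rsid = line.split("\t", 3)[:3]
            if rsid in wanted_rsids:
                yield rsid, chrom, int(pos)


# Build a tiny in-memory gzipped VCF so the sketch runs without the real download.
toy_vcf = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "chr1\t1000\trs1\tA\tT\t.\t.\t.\n"
    "chr1\t2000\trs7\tC\tG\t.\t.\t.\n"
)
toy_handle = io.BytesIO(gzip.compress(toy_vcf.encode()))
positions = {
    rsid: (chrom, pos)
    for rsid, chrom, pos in stream_rsid_positions(toy_handle, {"rs1"})
}
print(positions)
```

One thing to watch for: the chromosome names in a dbSNP file may not match the ones in your own VCF (for example RefSeq accessions versus `chr1`), in which case a renaming step would be needed before matching on position.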
Again, have Cursor/Claude execute it. It’s easier.
Of the 7,356,518 rsIDs from Nucleus Origin, 7,355,667 were found in the dbSNP, meaning their genomic positions are known. Only 851 were missing (0.01%), likely because they are newly discovered, merged, or unvalidated variants. This 99.99% mapping rate is excellent: at most 7,355,667 SNPs can enter our PRS instead of 7,356,518, which is an insignificant difference at this level of granularity.
Now we can calculate the PRS’s for the 9 diseases. The process is simple and takes roughly 8 to 15 minutes in total (on my 24 core CPU):
- load the Nucleus Origin reference (~30 seconds)
- load mapping file (~30 seconds)
- index our VCF (~1-2 minutes)
- match SNPs by position (~5-10 minutes)
- calculate the PRS scores for all 9 diseases (~1 minute)
- save results
After running scripts/calculate_prs.py, we get the following results:
- Matched: 2,778,521 SNPs (37.77% match rate)
- Dropped: 296 SNPs (allele mismatches)
- PRS scores calculated for all 9 diseases
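As a quick sanity check on the percentages quoted in this section, the arithmetic can be reproduced from the raw counts. Only the counts come from my run; the `percent` helper is just for illustration.

```python
origin_rsids = 7_356_518    # rsIDs in Nucleus Origin
found_in_dbsnp = 7_355_667  # rsIDs with a dbSNP position
matched_in_vcf = 2_778_521  # SNPs matched in my VCF


def percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100 * part / whole, 2)


missing = origin_rsids - found_in_dbsnp
print(missing, percent(missing, origin_rsids))  # rsIDs absent from dbSNP
print(percent(found_in_dbsnp, origin_rsids))    # dbSNP mapping rate
print(percent(matched_in_vcf, origin_rsids))    # VCF match rate
```

The match rate is taken against the full Origin rsID count, which is how the 37.77% figure comes out.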
| Disease | PRS |
| --- | --- |
| ALZHEIMERS | -0.942 |
| BREASTCANCER | -0.102 |
| CORONARYARTERYDISEASE | -1.033 |
| ENDOMETRIOSIS | 1.891 |
| HYPERTENSION | 0.157 |
| PROSTATECANCER | -2.967 |
| RHEUMATOIDARTHRITIS | 0.162 |
| TYPE1DIABETES | 1.321 |
| TYPE2DIABETES | -3.722 |
These are raw PRS scores (sum of β × genotype). For interpretation, we would normalize using population statistics (mean/standard deviation), like in the extended formula from the “How to Calculate Your Polygenic Risk Score” section.
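For illustration only, here is what that normalization step would look like if we had raw scores for a reference population. The reference values below are invented; real ones would come from running the same weights over a reference cohort, which I have not done here. The choice of population standard deviation (`pstdev`) over sample standard deviation is also an assumption.

```python
import statistics


def z_score(raw_score: float, reference_scores: list) -> float:
    """Standardize a raw score against a reference population's mean and spread."""
    mu = statistics.mean(reference_scores)
    sigma = statistics.pstdev(reference_scores)  # population standard deviation
    return (raw_score - mu) / sigma


# Invented raw scores for a tiny reference "population"; not real data.
toy_reference = [-2.0, -1.0, 0.0, 1.0, 2.0]
z = z_score(1.891, toy_reference)
print(z)
```

Without a real reference distribution, the raw numbers in the table can be compared only by sign and rough magnitude, not read as percentiles.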
Conclusion
I believe that Nucleus Origin attempts to start the flywheel for mapping the in-between states from DNA to expressed phenotype. If we wish to make genetic engineering a commodity, this is the logical next step we need to take as a species: quantifying the relationships between individual base pairs and their phenotypes. There is a ton of biology in between the two, but I see this table as an initial stepping stone, which, of course, builds off of the GWAS stepping stones and all of the genetic testing technology that came before them.
If we truly want to put human logic in charge of our evolution, and preferably not just for newborns, we need vastly more data. We need every human to contribute every byte, not for the greater good, but for their own individual self-benefit. With models, we have a way of collecting global information, bounded, of course, by what the sensor technologies can collect, our programs can process, and our silicon can store.
I imagine there will be countless datasets like this. Origin is extremely primitive: even the diseases are defined in English rather than expressed as biochemical classification graphs, and there are almost infinitely more dimensions we can add onto this.
But it does show a future that is already here!
- It really depends on the environment, which is why evolution in the context of environmental change is so important to understand.
- This is essentially the root cause of our bloated, inefficient, and expensive healthcare system. The current incentive structure is designed to fix the effect instead of eliminating the root cause (which, I would argue, is a technology + engineering problem).
- We can express the full process of connecting the Nucleus Origin dataset, dbSNP reference, and our personal genome using set notation:

Let:
- $O$ = set of all rsIDs in the Nucleus Origin dataset
- $D = \{(i, c, p)\}$ = set of all variants in the dbSNP reference (rsID $i$, chromosome $c$, position $p$)
- $V = \{(c, p, G)\}$ = set of variants in our personal genome VCF

where $G \in \{0, 1, 2\}$ represents the number of alternate alleles we carry at that locus.

1. Filter dbSNP by Nucleus rsIDs

$$D_O = \{(i, c, p) \in D : i \in O\}$$

This isolates only the variants in dbSNP that are relevant to the Nucleus model.

2. Match our genome to these variants

$$M = \{\, i : (i, c, p) \in D_O \text{ and } (c, p, G_i) \in V \,\}$$

This creates the intersection of all variants shared between my genome and the Nucleus model.

3. Compute Polygenic Risk Scores (core formula from above)

For each disease $d$ in Origin:

$$\text{PRS}_d = \sum_{i \in M} \beta_{i,d} \, G_i$$

where $\beta_{i,d}$ is the effect size for SNP $i$ and disease $d$.

Interpretation
- $O$: defines which variants matter (Nucleus Origin model)
- $D$: translates rsIDs into coordinates
- $V$: provides our genotype data
- $M$: intersection of all three: the set of variants used to compute our personal polygenic risk scores