Sources:
- UConn class (GitHub)
- Paper (NCBI)
In this project, we use scVI to reduce dimensionality and find latent patterns in our leukemia datasets. This strategy surfaces hidden factors that lie within the data itself, as opposed to manually searching through it. In my opinion, this is the most important part of deep learning.
Intro
I attended SynBioBeta in May, where I briefly spoke with Daniel Ives, the founder of Shift Biosciences and a former extreme AI skeptic. He mentioned that despite his initial skepticism, his team kept experimenting and found previously unknown patterns with deep learning.
This project showed me that this new combination of DL software and parallelized hardware architectures really does change biology. In a more economic context, it dramatically changes the unit economics of biology. But that is a topic for later.
Manual vs Deep Learning Analysis
So how does one actually use deep learning to detect hidden patterns, rather than applying linear transformations to the datasets until new information surfaces?
Each approach breaks down into four steps:
Manual Analysis
1. Data Processing (HPC Pipeline)
- Download FASTQ files from SRA
- Quality control (FastQC, MultiQC)
- STAR alignment (CPU, 4-8 hours per sample)
- Generate count matrices
2. Basic Preprocessing (Jupyter; sketched in code after this list)
- Load count matrices into AnnData
- Normalize and log-transform
- Filter cells/genes
- Select highly variable genes
3. Manual Analysis (Jupyter)
- Manual cell type annotation (CellTypist)
- Manual differential expression (edgeR)
- Manual visualization (Seurat plots)
- Manual pattern discovery (pick genes by hand)
4. Results & Reporting
- Generate statistical tables
- Create publication figures
- Write analysis reports
- Document findings
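To make step 2 concrete, here is a minimal sketch of the manual preprocessing in scanpy. The file path and parameter values are illustrative, not the pipeline's actual settings:

```python
# A minimal sketch of the "Basic Preprocessing" step, assuming 10x count
# matrices produced by the HPC pipeline. The path is hypothetical.
import scanpy as sc

adata = sc.read_10x_mtx("counts/sample1/")  # load a count matrix into AnnData

# Filter out low-quality cells and rarely detected genes
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)

# Normalize per cell, then log-transform
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)

# Keep highly variable genes for downstream analysis
sc.pp.highly_variable_genes(adata, n_top_genes=2000)
adata = adata[:, adata.var.highly_variable].copy()
```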
Deep Learning Analysis
1. Data Loading & Preprocessing
- Load pre-processed 10x count matrices
- Quality control + filtering
- Prepare both datasets (scVI-ready + PCA-ready)
- Validate data quality
2. Deep Learning Analysis (scVI; sketched in code after this list)
- Train NN on GPU (minutes vs hours)
- Automatic pattern discovery (10 latent dimensions)
- Automatic batch effect removal
- Generate embeddings
3. Automated Analysis Pipeline
- Auto cell type annotation (with fallbacks)
- Automated differential expression
- Pseudotime analysis
- Drug target identification
4. Output Generation
- Generate all visualizations automatically
- Create analysis reports
- Export results in multiple formats
- Save intermediate data objects
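For step 2, here is a minimal sketch of what the scVI calls look like in scvi-tools. The layer and batch keys ("counts", "batch") are my assumptions, not the pipeline's actual names:

```python
# A minimal sketch of scVI training, assuming raw counts live in
# adata.layers["counts"] and sample labels in adata.obs["batch"].
import scvi

scvi.model.SCVI.setup_anndata(adata, layer="counts", batch_key="batch")

model = scvi.model.SCVI(adata, n_latent=10)  # 10 latent dimensions, as above
model.train()                                # minutes on a GPU, if available

# Batch-corrected embedding used for clustering, UMAP, and annotation
adata.obsm["X_scVI"] = model.get_latent_representation()
```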
UConn Pipeline Results
- 5,584 DE genes (healthy vs cancer)
- 3,283 DE genes (cancer subtypes)
- Top genes: MYO7B (-6.5 logFC), CASC15 (-5.2 logFC), TPO (+10.3 logFC)
- Cell types: B cells, T cells, Monocytes, etc.
scVI Pipeline Results
- ~15,000 significant DE genes in total (healthy vs cancer), summed across all clusters
- ~2,000-4,400 DE genes (cancer subtypes), per cluster
- Top genes:
  - TCL1B (+5.86 logFC) [ETV6.RUNX1.4]
  - HLA-DQA2 (+5.08 logFC) [ETV6.RUNX1.1]
  - HBB (+8.20 logFC) [Erythrocytes]
  - CHI3L2 (+7.23 logFC) [PRE-T.1]
  - EPHB6 (+7.03 logFC) [PRE-T.2]
- Cell types:
- Neutrophil: 6,633 cells
- NK_cell: 6,583 cells
- Monocyte: 6,565 cells
- T_cell: 6,477 cells
- B_cell: 6,467 cells
- Erythroblast: 5,332 cells
- Platelet: 1,154 cells
Key Differences
- More DE genes in our pipeline (likely due to scVI's better batch effect removal)
- Different top genes (TCL1B vs MYO7B): scVI found different patterns
- Similar cell type distributions: both found B cells, T cells, and monocytes
- Additional cell types: we also found neutrophils, NK cells, erythroblasts, and platelets
Conclusion
The analysis wasn't perfect, at all. I was simultaneously learning NVIDIA HPC concepts, deep learning math, various Python and R packages, full end-to-end scRNA-seq pipeline workflows, statistics for scRNA-seq, and the molecular biology needed to understand what we are looking for under the hood. The analysis probably has major flaws. But I think it does show the power of these models, even at this small scale: 15 minutes of training on a single GPU for those insights. Awesome!
Imagine if scVI had access to all cell types, in every person, and across all species. That would be an interesting model, and I wonder what molecular biology we would uncover that could lead to new drugs, therapeutic delivery systems, and perhaps entirely new lifeforms.
Computer Stuff
→ when we trigger analyze()
What calculations is the GPU doing?
- Matrix multiplication for neural network forward/backward passes
- Gradient computation via backpropagation
- Loss function evaluation (negative binomial likelihood)
- Optimization updates (Adam optimizer)
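A toy PyTorch sketch of those four operations, not scVI's actual training loop; the shapes and the fixed dispersion (total_count=10.0) are made up for illustration:

```python
# One training step: forward GEMMs, negative binomial likelihood,
# backpropagation, and an Adam update.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randint(0, 50, (128, 2000), device=device).float()  # fake counts

encoder = torch.nn.Linear(2000, 10).to(device)
decoder = torch.nn.Linear(10, 2000).to(device)
opt = torch.optim.Adam([*encoder.parameters(), *decoder.parameters()])

z = encoder(x)                               # matrix multiplication (GEMM)
logits = decoder(z)                          # another GEMM
nb = torch.distributions.NegativeBinomial(total_count=10.0, logits=logits)
loss = -nb.log_prob(x).mean()                # negative binomial likelihood
loss.backward()                              # gradients via backpropagation
opt.step()                                   # Adam optimizer update
```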
How many ALUs, cores, threads, warps, blocks, and grids are being used for this?
```bash
nvidia-smi --query-gpu=compute_cap,memory.total,memory.used,utilization.gpu,utilization.memory --format=csv
```
GPU utilization
- RTX 4070 Ti: 8.9 compute capability
- Memory: 790MiB/12,282MiB (6.4% utilization)
- GPU: 28% utilization
- Memory bandwidth: 4% utilization
Hardware usage
- CUDA cores: ~7,680 active (RTX 4070 Ti has 7,680 total)
- Tensor cores: up to ~240 active (for mixed precision; the RTX 4070 Ti has 240 total)
- Memory bandwidth: ~22 GB/s (of 504 GB/s theoretical)
- Warps: ~1,200 active (of ~2,400 total)
- Blocks: ~150 active (varies by kernel)
- Grids: Multiple concurrent kernels
Lowest-level math operations
The scVI model primarily uses matrix multiplications (GEMM kernels), which are optimized into FMA operations on tensor cores, plus activation functions (exponential/log) and statistical operations (mean, variance) for batch normalization.
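A tiny sketch of what that means in practice, with illustrative shapes: a single GEMM of the kind the decoder issues, which PyTorch hands to a cuBLAS kernel that executes it as huge numbers of fused multiply-adds (d = a * b + c):

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
A = torch.randn(128, 10, device=device)    # e.g. a batch of latent vectors
B = torch.randn(10, 2000, device=device)   # e.g. decoder weights
C = torch.zeros(128, 2000, device=device)  # accumulator
out = torch.addmm(C, A, B)                 # C + A @ B in one GEMM kernel
```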
Low memory utilization (6.4%) suggests the model fits easily in VRAM, while 28% GPU utilization indicates moderate computational intensity typical of scVI training. This utilization fluctuated between 21-71% throughout multiple pipeline execution rounds.
UMAP (pretty)
Definitions
Latent representations: "latent" comes from the Latin latēre, meaning "to lie hidden." We use the trained scVI model to infer hidden factors, meaning they are not directly observed; they are "latent" in the data itself. This doesn't tell us how the cell was built, or why (what caused what), but it helps us navigate and understand the main structure of this particular system level.
UMAP representation: a 2D or 3D embedding of high-dimensional data that preserves local structure (and some global structure). It places the cells in a low-dimensional space where nearby points have similar expression profiles.
HuggingFace: like GitHub, but for model weights that are too heavy to push to a GitHub repo.
CellTypist annotations: used to automatically fill in cell type labels on the AnnData objects.
Differential expression analysis: the process of figuring out which genes are over- or under-expressed and in what environments, i.e. gene expression under different conditions/circumstances. mRNA abundance is used as a proxy for which genes are being turned on (see the sketch after these definitions).
→ this is going to be the end state of consumer health technology: not just DNA, obviously; all chemical expression will be modeled, and under what conditions.
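To tie several of these definitions together, a minimal hedged sketch: UMAP on the scVI latent space, CellTypist annotation, and differential expression in scanpy. The CellTypist model name and the obs/obsm keys are assumptions, not the pipeline's exact configuration:

```python
# Assumes `adata` already holds scVI embeddings in adata.obsm["X_scVI"]
# (see the earlier scVI sketch) and log-normalized expression in adata.X.
import scanpy as sc
import celltypist
from celltypist import models

# UMAP: embed the scVI latent space into 2D for visualization
sc.pp.neighbors(adata, use_rep="X_scVI")
sc.tl.umap(adata)

# CellTypist: autofill cell type labels on the AnnData object
models.download_models(model="Immune_All_Low.pkl")
predictions = celltypist.annotate(adata, model="Immune_All_Low.pkl",
                                  majority_voting=True)
adata = predictions.to_adata()

# Differential expression: which genes are over/under-expressed per cell type
sc.tl.rank_genes_groups(adata, groupby="majority_voting", method="wilcoxon")
sc.pl.rank_genes_groups(adata, n_genes=10)
```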