Sources:
- UConn class (GitHub)
- Paper (NCBI)
In this project, we use scVI to reduce dimensionality and find latent patterns in our leukemia datasets. This strategy surfaces hidden factors that lie within the data itself, as opposed to manually searching through it. In my opinion, this is the most important part of deep learning.
Intro
I attended SynBioBeta in May, where I briefly spoke with Daniel Ives, the founder of Shift Biosciences and a former extreme AI skeptic. He mentioned that despite his initial skepticism, his team kept experimenting and found previously unknown patterns with deep learning.
This project showed me that this new combination of DL software and parallelized hardware architectures really does change biology. In a more economic context, it dramatically changes the unit economics of biology. But that is a topic for later.
Manual vs Deep Learning Analysis
So how does one actually use deep learning to detect hidden patterns, rather than applying linear transformations to the datasets until new information surfaces?
Each approach breaks down into four steps:
Manual Analysis
1. Data Processing (HPC Pipeline)
- Download FASTQ files from SRA
- Quality control (FastQC, MultiQC)
- STAR alignment (CPU, 4-8 hours per sample)
- Generate count matrices
2. Basic Preprocessing (Jupyter; sketched in code after this list)
- Load count matrices into AnnData
- Normalize and log-transform
- Filter cells/genes
- Select highly variable genes
3. Manual Analysis (Jupyter)
- Manual cell type annotation (CellTypist)
- Manual differential expression (edgeR)
- Manual visualization (Seurat plots)
- Manual pattern discovery (pick genes by hand)
4. Results & Reporting
- Generate statistical tables
- Create publication figures
- Write analysis reports
- Document findings
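To make step 2 concrete, here is a minimal sketch of the manual preprocessing in scanpy. The file path and parameter values are illustrative, not the pipeline's actual settings:

```python
# A minimal sketch of the "Basic Preprocessing" step, assuming 10x count
# matrices produced by the HPC pipeline. The path is hypothetical.
import scanpy as sc

adata = sc.read_10x_mtx("counts/sample1/")  # load a count matrix into AnnData

# Filter out low-quality cells and rarely detected genes
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)

# Normalize per cell, then log-transform
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)

# Keep highly variable genes for downstream analysis
sc.pp.highly_variable_genes(adata, n_top_genes=2000)
adata = adata[:, adata.var.highly_variable].copy()
```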
Deep Learning Analysis
1. Data Loading & Preprocessing
- Load pre-processed 10x count matrices
- Quality control + filtering
- Prepare both datasets (scVI-ready + PCA-ready)
- Validate data quality
2. Deep Learning Analysis (scVI; sketched in code after this list)
- Train NN on GPU (minutes vs hours)
- Automatic pattern discovery (10 latent dimensions)
- Automatic batch effect removal
- Generate embeddings
3. Automated Analysis Pipeline
- Auto cell type annotation (with fallbacks)
- Automated differential expression
- Pseudotime analysis
- Drug target identification
4. Output Generation
- Generate all visualizations automatically
- Create analysis reports
- Export results in multiple formats
- Save intermediate data objects
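For step 2, here is a minimal sketch of what the scVI calls look like in scvi-tools. The layer and batch keys ("counts", "batch") are my assumptions, not the pipeline's actual names:

```python
# A minimal sketch of scVI training, assuming raw counts live in
# adata.layers["counts"] and sample labels in adata.obs["batch"].
import scvi

scvi.model.SCVI.setup_anndata(adata, layer="counts", batch_key="batch")

model = scvi.model.SCVI(adata, n_latent=10)  # 10 latent dimensions, as above
model.train()                                # minutes on a GPU, if available

# Batch-corrected embedding used for clustering, UMAP, and annotation
adata.obsm["X_scVI"] = model.get_latent_representation()
```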
UConn Pipeline Results
- 5,584 DE genes (healthy vs cancer)
- 3,283 DE genes (cancer subtypes)
- Top genes: MYO7B (-6.5 logFC), CASC15 (-5.2 logFC), TPO (+10.3 logFC)
- Cell types: B cells, T cells, Monocytes, etc.
scVI Pipeline Results
- ~15,000 significant DE genes in total (healthy vs cancer), summed across all clusters
- ~2,000-4,400 DE genes (cancer subtypes), per cluster
- Top genes:
  - TCL1B (+5.86 logFC) [ETV6.RUNX1.4]
  - HLA-DQA2 (+5.08 logFC) [ETV6.RUNX1.1]
  - HBB (+8.20 logFC) [Erythrocytes]
  - CHI3L2 (+7.23 logFC) [PRE-T.1]
  - EPHB6 (+7.03 logFC) [PRE-T.2]
- Cell types:
- Neutrophil: 6,633 cells
- NK_cell: 6,583 cells
- Monocyte: 6,565 cells
- T_cell: 6,477 cells
- B_cell: 6,467 cells
- Erythroblast: 5,332 cells
- Platelet: 1,154 cells
Key Differences
- More DE genes in our pipeline (likely due to scVI's better batch effect removal)
- Different top genes (TCL1B vs MYO7B): scVI found different patterns
- Similar cell type distributions: both found B cells, T cells, and monocytes
- Additional cell types: we also found neutrophils, NK cells, erythroblasts, and platelets
Conclusion
The analysis wasn't perfect, at all. I was simultaneously learning NVIDIA HPC concepts, deep learning math, various Python and R packages, full end-to-end scRNA-seq pipeline workflows, statistics for scRNA-seq, and the molecular biology needed to understand what we are looking for under the hood. The analysis probably has major flaws. But I think it does show the power of these models, even at this small scale: 15 minutes of training on a single GPU for those insights. Awesome!
Imagine if scVI had access to all cell types, in every person, and across all species. That would be an interesting model, and I wonder what molecular biology we would uncover that could lead to new drugs, therapeutic delivery systems, and perhaps entirely new lifeforms.
Computer Stuff
→ when we trigger analyze()
What calculations is the GPU doing?
- Matrix multiplication for neural network forward/backward passes
- Gradient computation via backpropagation
- Loss function evaluation (negative binomial likelihood)
- Optimization updates (Adam optimizer)
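A toy PyTorch sketch of those four operations, not scVI's actual training loop; the shapes and the fixed dispersion (total_count=10.0) are made up for illustration:

```python
# One training step: forward GEMMs, negative binomial likelihood,
# backpropagation, and an Adam update.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randint(0, 50, (128, 2000), device=device).float()  # fake counts

encoder = torch.nn.Linear(2000, 10).to(device)
decoder = torch.nn.Linear(10, 2000).to(device)
opt = torch.optim.Adam([*encoder.parameters(), *decoder.parameters()])

z = encoder(x)                               # matrix multiplication (GEMM)
logits = decoder(z)                          # another GEMM
nb = torch.distributions.NegativeBinomial(total_count=10.0, logits=logits)
loss = -nb.log_prob(x).mean()                # negative binomial likelihood
loss.backward()                              # gradients via backpropagation
opt.step()                                   # Adam optimizer update
```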
How many ALUs, cores, threads, warps, blocks, and grids are being used for this?
```bash
nvidia-smi --query-gpu=compute_cap,memory.total,memory.used,utilization.gpu,utilization.memory --format=csv
```
GPU utilization
- RTX 4070 Ti: 8.9 compute capability
- Memory: 790MiB/12,282MiB (6.4% utilization)
- GPU: 28% utilization
- Memory bandwidth: 4% utilization
Hardware usage
- CUDA cores: ~7,680 active (RTX 4070 Ti has 7,680 total)
- Tensor cores: up to ~240 active (for mixed precision; the RTX 4070 Ti has 240 total)
- Memory bandwidth: ~22 GB/s (of 504 GB/s theoretical)
- Warps: ~1,200 active (of ~2,400 total)
- Blocks: ~150 active (varies by kernel)
- Grids: Multiple concurrent kernels
Lowest-level math operations
The scVI model primarily uses matrix multiplications (GEMM kernels), which are optimized into FMA operations on tensor cores, plus activation functions (exponential/log) and statistical operations (mean, variance) for batch normalization.
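A tiny sketch of what that means in practice, with illustrative shapes: a single GEMM of the kind the decoder issues, which PyTorch hands to a cuBLAS kernel that executes it as huge numbers of fused multiply-adds (d = a * b + c):

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
A = torch.randn(128, 10, device=device)    # e.g. a batch of latent vectors
B = torch.randn(10, 2000, device=device)   # e.g. decoder weights
C = torch.zeros(128, 2000, device=device)  # accumulator
out = torch.addmm(C, A, B)                 # C + A @ B in one GEMM kernel
```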
Low memory utilization (6.4%) suggests the model fits easily in VRAM, while 28% GPU utilization indicates moderate computational intensity typical of scVI training. This utilization fluctuated between 21-71% throughout multiple pipeline execution rounds.
UMAP (pretty)
Definitions
Latent representations: "latent" comes from the Latin latēre, meaning "to lie hidden." We use the trained scVI model to infer hidden factors, meaning they are not directly observed; they are "latent" in the data itself. This doesn't tell us how the cell was built, or why (what caused what), but it helps us navigate and understand the main structure of this particular system level.
UMAP representation: a 2D or 3D embedding of high-dimensional data that preserves local structure (and some global structure). It places the cells in a low-dimensional space where nearby points have similar expression profiles.
HuggingFace: like GitHub, but for model weights that are too heavy to push to a GitHub repo.
CellTypist annotations: used to automatically fill in cell type labels on the AnnData objects.
Differential expression analysis: the process of figuring out which genes are over- or under-expressed and in what environments, i.e. gene expression under different conditions/circumstances. mRNA abundance is used as a proxy for which genes are being turned on (see the sketch after these definitions).
→ this is going to be the end state of consumer health technology: not just DNA, obviously; all chemical expression will be modeled, and under what conditions.
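To tie several of these definitions together, a minimal hedged sketch: UMAP on the scVI latent space, CellTypist annotation, and differential expression in scanpy. The CellTypist model name and the obs/obsm keys are assumptions, not the pipeline's exact configuration:

```python
# Assumes `adata` already holds scVI embeddings in adata.obsm["X_scVI"]
# (see the earlier scVI sketch) and log-normalized expression in adata.X.
import scanpy as sc
import celltypist
from celltypist import models

# UMAP: embed the scVI latent space into 2D for visualization
sc.pp.neighbors(adata, use_rep="X_scVI")
sc.tl.umap(adata)

# CellTypist: autofill cell type labels on the AnnData object
models.download_models(model="Immune_All_Low.pkl")
predictions = celltypist.annotate(adata, model="Immune_All_Low.pkl",
                                  majority_voting=True)
adata = predictions.to_adata()

# Differential expression: which genes are over/under-expressed per cell type
sc.tl.rank_genes_groups(adata, groupby="majority_voting", method="wilcoxon")
sc.pl.rank_genes_groups(adata, n_genes=10)
```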