Summary
Problem: Counting k‑mers over large genomic datasets (55 GB of whole‑genome sequence) is computationally intensive and inefficient on standard CPU pipelines, and even on API‑heavy GPU pipelines, which carry a lot of memory overhead.
Approach: We implement a custom CUDA kernel to accelerate k‑mer counting directly on the GPU. It’s a focused exercise in converting a simple k‑mer counting algorithm into a GPU-native function for high-throughput whole-genome sequencing (WGS) processing.
Impact: This unlocks orders-of-magnitude speed-ups for k‑mer–based analyses on massive genomic datasets. It’s aimed at people building ultrafast bioinformatics tools or investigating GPU-accelerated sequence analysis, not necessarily hobbyists.
GitHub repo:
Project
This is a brief write-up for a brief project. My goal was to learn what CUDA kernels are and when you would need to build one. The algorithm I chose to turn into a kernel is a simple k-mer counter run over a 54 GB WGS sequence from Nucleus.
That being said, CUDA kernels are literally just functions that run directly on the GPU’s threads. Each thread executes its own instance of the kernel and has private registers and local memory that you manage explicitly. Writing kernels is just writing programs that interface directly with the hardware through a C++ interface.
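To make that concrete, here is a minimal sketch of what a k-mer counting kernel can look like. This is not the exact code from the project: the names are hypothetical, and it uses the simplest possible scheme (2-bit base encoding, assuming k ≤ 32, with an atomic increment into a pre-allocated table) rather than a production-grade hash table.

```cuda
#include <cstdint>

// Map a base to a 2-bit code: A=0, C=1, G=2, T=3; anything else (e.g. 'N') -> -1.
__device__ inline int encode_base(char b) {
    switch (b) {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        case 'T': return 3;
        default:  return -1;
    }
}

// Hypothetical kernel: each thread handles one window position in the sequence,
// packs the k-mer into a 64-bit integer (assumes k <= 32), and atomically bumps
// a counter. A real counter would resolve collisions; this sketch just bins.
__global__ void count_kmers(const char* seq, size_t seq_len, int k,
                            unsigned int* counts, size_t table_size) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i + k > seq_len) return;          // window would run past the end

    uint64_t kmer = 0;
    for (int j = 0; j < k; ++j) {
        int code = encode_base(seq[i + j]);
        if (code < 0) return;             // skip windows with ambiguous bases
        kmer = (kmer << 2) | (uint64_t)code;
    }

    atomicAdd(&counts[kmer % table_size], 1u);
}
```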
This lets you get the most out of your hardware. Kernels basically turn your general-purpose GPU into a deterministic ASIC, tuned to the specifications of the device you are running on.
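That tuning mostly lives on the host side, where you size the launch against the device you actually have. A hedged sketch, building on the hypothetical count_kmers kernel above (a real launcher would also cap the grid size or use a grid-stride loop for truly huge inputs):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical host-side launcher: queries the device and sizes the launch.
void launch_count_kmers(const char* d_seq, size_t seq_len, int k,
                        unsigned int* d_counts, size_t table_size) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);              // the GPU we are running on

    // 256 threads per block is a common default; clamp to the device limit.
    int threads = prop.maxThreadsPerBlock < 256 ? prop.maxThreadsPerBlock : 256;

    size_t windows = (seq_len >= (size_t)k) ? seq_len - (size_t)k + 1 : 0;
    if (windows == 0) return;                       // nothing to count
    size_t blocks = (windows + threads - 1) / threads;

    printf("Launching on %s: %zu blocks x %d threads\n",
           prop.name, blocks, threads);
    count_kmers<<<(unsigned int)blocks, threads>>>(d_seq, seq_len, k,
                                                   d_counts, table_size);
    cudaDeviceSynchronize();                        // wait and surface errors
}
```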
You typically write a kernel if you are going to use a specific type of hardware, like reserved cloud instances, and the libraries currently available just aren’t cutting it. Kernels, in my opinion, are for squeezing extreme performance out of offline devices or massive clusters, not for the single-GPU projects that I have built so far.
Thus, I probably won’t be building more of these until whatever product I build requires that kind of performance. If you would like to learn more about GPUs through a couple of stories, check out my piece on it.
The GPU Glossary blog is also an awesome resource, and it goes into extreme detail about how GPUs work.
This post from the Cursor team is also a great example of raw CUDA kernels used in production: https://cursor.com/blog/kernels. We would do the same with our k-mer kernel, but obviously much better.