A brief post for the layperson on how GPUs work, when to use them, and what CUDA kernels are.
How a GPU works
GPUs started out as technology with a single, simple focus: improving the frame rate for computer animation. They solely focused on throwing pixels at a screen efficiently so programmers could build geometric shapes on top of them. GPU design focuses on processing large blocks of data (triangles or polygons) in parallel, which is a requirement for any device showing ‘real world’ stuff on a screen.
They are simpler to manufacture than CPUs, so their design cycle is roughly half that of CPUs. Since about 2012, they have been improving at roughly twice the rate of CPUs.
A GPU-accelerated system has a main CPU and can stitch together multiple GPUs to increase acceleration. At a high level, it’s actually pretty simple:
(Figure from Parallel and High Performance Computing)
This includes the parts below, which are entirely familiar to anyone who has built a crypto mining rig or a gaming PC (a short sketch after the list shows how data moves between them):
- CPU: the main processor that is installed in the socket of the motherboard
- CPU RAM: the memory sticks of DRAM (Dynamic Random Access Memory) that you insert into the memory slots on the motherboard
- GPU: a large card installed in the PCIe (Peripheral Component Interconnect Express) slot on the motherboard
- GPU RAM: memory modules on the GPU card for the exclusive use of the GPU (they come from the factory inside the big rectangle with the fans)
- PCIe bus: the wiring that connects the peripheral cards to the other components of the motherboard
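To make the CPU RAM / GPU RAM / PCIe relationship concrete, here is a minimal sketch of how data typically moves in a CUDA program (the array names and sizes are made up for illustration):

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

int main() {
    const size_t N = 1 << 20;                 // 1M floats, an arbitrary example size
    const size_t bytes = N * sizeof(float);

    // CPU RAM: an ordinary host allocation
    float *h_data = (float *)malloc(bytes);
    for (size_t i = 0; i < N; ++i) h_data[i] = 1.0f;

    // GPU RAM: an allocation in the GPU's global memory
    float *d_data = nullptr;
    cudaMalloc(&d_data, bytes);

    // PCIe bus: this copy physically travels across the bus to the card
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);

    // ... kernels would run here, operating on d_data ...

    // Results come back over the same bus
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);
    printf("first element after round trip: %f\n", h_data[0]);

    cudaFree(d_data);
    free(h_data);
    return 0;
}
```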
Because maxing out computation is the equivalent of tuning a drug or a car’s engine, computational scientists, engineers, and entrepreneurs started maxing out GPUs to solve problems and create new consumer experiences (as with any new general-purpose technology). This led to CUDA (Compute Unified Device Architecture) for NVIDIA chips in 2007, then OpenCL came along as an abstraction layer across GPU vendors. OpenACC and OpenMP’s offload directives were later built as abstraction layers on top of those two because scientists found them too difficult to work with.
Components of a GPU and How it Processes Data
A GPU is primarily composed of GPU RAM (global memory), a workload distributor, and compute units (called streaming multiprocessors, or SMs, in CUDA).
A compute unit (an SM for NVIDIA) has its own architecture, composed of the parts below (the short program after this list queries most of them on a real card):
- warp schedulers: dispatch instructions for groups of 32 threads
- register file: per-thread private storage
- shared memory/L1 cache: shared across threads in a block
- tensor cores and FP32/FP64 cores: where the math happens
- instruction dispatch logic: routes decoded instructions to the right execution units
- thread execution context slots: stores active thread states
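If you want to see these components on your own card, the little program below queries the CUDA runtime’s device properties; the fields are standard, but the numbers will of course depend on your GPU:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // device 0

    printf("SMs (compute units):    %d\n", prop.multiProcessorCount);
    printf("Warp size:              %d threads\n", prop.warpSize);
    printf("Registers per SM:       %d\n", prop.regsPerMultiprocessor);
    printf("Shared memory per SM:   %zu bytes\n", prop.sharedMemPerMultiprocessor);
    printf("Max threads per SM:     %d\n", prop.maxThreadsPerMultiProcessor);
    return 0;
}
```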
The overall performance of a GPU is determined by its global memory bandwidth, the throughput of its compute units, and the number of compute units. The compute units are really just organized wrappers around threads.
To measure the raw arithmetic workload of a GPU, we use FLOPs (floating-point operations). 1 FLOP = one floating-point add or multiply (a fused multiply-add is conventionally counted as two).
Peak theoretical throughput (GFLOPs/s) = clock rate (GHz) × compute units × processing units per compute unit × FLOPs/cycle
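As a rough worked example with made-up round numbers: a GPU running at 1.5 GHz with 80 compute units, 128 processing units per compute unit, and 2 FLOPs per cycle (counting a fused multiply-add as two operations) would peak at 1.5 × 80 × 128 × 2 = 30,720 GFLOPs/s, or about 30.7 TFLOPs/s. Real parts vary by architecture and precision, so treat this as an illustration of the formula, not a spec.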
Within the compute units, the GPU has multiple graphics processors called processing elements (PE). NVIDIA calls these CUDA cores and the graphics community calls them shader processors.
Inside of the CUDA core lies a scalar arithmetic logic unit (ALU). This is a logic unit that executes one floating-point instruction per cycle for one thread. In addition to the ALU, there are registers for local storage for the active thread, and control logic that executes decoded instructions. It’s all resources for doing the underlying FLOP operation.
Inside of a CUDA core there is NOT a warp scheduler, shared memory, an instruction decoder (it receives already-decoded instructions), or any parallelism: it is a scalar pipeline. Each core handles 1 thread’s math, driven by the warp scheduler above it.
warp scheduler ⇒ warp (32 threads) ⇒ 1 CUDA core (x32) ⇒ 1 thread (x32) ⇒ math op (SIMT, NVIDIA’s flavor of SIMD)
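To make that chain concrete, here is a minimal kernel sketch (the name scale_add and the operation are invented for illustration). Each thread computes exactly one element, so each CUDA core is doing one thread’s math while the warp scheduler marches warps of 32 threads through the same instruction:

```cuda
// Each thread performs one scalar multiply-add: one thread, one core, one math op.
// A launch like scale_add<<<(n + 255) / 256, 256>>>(x, y, a, n) covers n elements.
__global__ void scale_add(const float *x, float *y, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's global index
    if (i < n) {
        y[i] = a * x[i] + y[i];  // the warp executes this in lockstep, 32 threads at a time
    }
}
```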
There are a lot of memory layers in a GPU too (the sketch after this list shows how they look from inside a kernel).
- private memory (register memory): immediately accessible by a CUDA core (a single thread) and only that core (thread)
- local memory (what CUDA calls shared memory): accessible to a single SM (CU) and all of the CUDA cores (threads) on that SM (CU); about 64-96KB and can be used as a scratchpad for programming if need be
- constant memory: read-only memory accessible from all SMs (CUs); written to by the CPU
- global memory: memory that is located on the GPU and accessible by all of the SMs
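Here is a rough sketch of how those layers look from inside a kernel; the names (coeffs, tile, memory_layers) are invented for illustration:

```cuda
// Constant memory: read-only on the device; the CPU fills it with cudaMemcpyToSymbol.
__constant__ float coeffs[4];

// Assumes a launch with 256 threads per block, e.g. memory_layers<<<blocks, 256>>>(...).
__global__ void memory_layers(const float *in, float *out, int n) {
    // Local (shared) memory: a per-SM scratchpad visible to every thread in this block.
    __shared__ float tile[256];

    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Private (register) memory: local variables live here, one copy per thread.
    float value = (i < n) ? in[i] : 0.0f;   // 'in' and 'out' point into global memory

    tile[threadIdx.x] = value;
    __syncthreads();                        // make the scratchpad consistent across the block

    if (i < n) {
        out[i] = tile[threadIdx.x] * coeffs[0];
    }
}
```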
Memory bandwidth is a big concern, especially because GPUs do so many calculations. We won’t dive into its calculation for the sake of this post, but check out CUDA By Example and Parallel and High Performance Computing.
Thread Programming Story
A scientist named Ana wants to speed up her experiment. She has to pour 1,000 test tubes of liquid. If she did it alone, it would take all day. So, she calls in some helpers. Each helper gets the same instruction set and works on their own set of tubes. They all share the same table, which limits the actions they can take, but each has their own tools (stack, counter, consumables). These individual helpers are threads.
Ana notices that if she gives very clear, step by step execution instructions to her helpers, and makes sure they don’t bump into each other or get in each other’s space, they finish the job 100x faster.
She is able to finish her experiment in an hour thanks to these helpers + instructions. Awesome! However, biology is hard and the data she collects isn’t statistically significant enough. In order to find significant data, she’ll need to pour 100,000,000 test tubes of liquid. Damn. She’ll need an insane number of workers to do this.
So she gets a robotic lab with that many pipette tips, each doing its part of the work. To pull this off, she must program the robot to execute the instructions correctly, just as she did with the helpers. However, instead of English, she will be using CUDA or OpenCL. She now has to write clear, careful instructions so that each tip executes in the same way. This plan is thread programming.
This system is used for massive jobs where she needs huge amounts of pouring. For small tasks, she will still work alone, pouring liquid sequentially, like a CPU.
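If you squint, Ana’s instructions map almost one-to-one onto a kernel. A rough sketch, where pour_one_tube is a stand-in for whatever work a single tube actually represents:

```cuda
// A stand-in for whatever one test tube's worth of work actually is.
__device__ void pour_one_tube(float *tubes, long long i) {
    tubes[i] += 1.0f;
}

// Each thread is one "helper": same instructions, its own slice of the tubes.
__global__ void pour_all(float *tubes, long long n_tubes) {
    long long i = blockIdx.x * (long long)blockDim.x + threadIdx.x;
    long long stride = (long long)gridDim.x * blockDim.x;
    // Grid-stride loop: having far fewer threads than tubes is fine,
    // each helper just keeps picking up the next tube assigned to it.
    for (; i < n_tubes; i += stride) {
        pour_one_tube(tubes, i);
    }
}
```

A launch like pour_all<<<1024, 256>>>(tubes, 100000000LL) puts roughly 260,000 helpers on the job at once, each looping through its share of the 100,000,000 tubes.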
Neural Network Thread Programming Story
(From our recent project, Frances.)
Ana, in addition to being a biologist, is also a synthetic biologist. She has learned so much about the general rules of biology that she now knows how to engineer it. She is designing a new strain of bacteria to produce custom medicine faster and cheaper.
She has a massive list of possible DNA edits she can make to the organism that will output the medicines. If she wants to make one edit, that’s easy. Even up to a dozen. But if she wants to really test the organism → DNA edit combinations thoroughly, she’ll need to test 100,000 edits and the combinations of them. In the worst case, this is a combinatorial problem: she has N genes that she tests independently plus K-combinations of edits, meaning on the order of N-choose-K edit sets for each K, and roughly 2^N across all of them. That’s exponential!
Good thing she has learned how to use parallelism. She writes a program that essentially tells each thread, “Try this set of edits, simulate the pathway, the resulting output molecule, and tell me the yield.”
But, she realized, if she tested ALL possible outcomes, it would still take too long! Even with just 100 genes, there are trillions upon trillions of possible outcomes! She needs some way of limiting the amount of computation the threads need to do. She needs to reduce the problem x computation space.
So Ana decides to train a neural network to guess which combinations are promising, then only tests those combinations. She relies on neural networks to find some generalizable pattern in the sequence of gene edits to desirable output so she doesn’t have to test everything, just the ones that matter.
This neural network is also built via programming these threads, so she has to program these threads to do something new. This ‘pattern matching algorithm’ requires lots of math: multiplying huge matrices, passing numbers back and forth through the underlying data structure (that tries to find structure in the gene edit x yield combination space), calculating errors, and updating the ‘pattern’ it finds.
Each thread handles a very small piece: one applies PEMDAS to numbers, one passes them through a rule, another updates a guess. Together, they form the process of training her neural network.
Thousands of these threads run at once, doing lots of computations through that reduced edits space.
When the worker threads have stopped, she receives a statement from the GPU saying “we are done!” with the generalized pattern of gene edit → yield that she originally wanted to find!
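The ‘lots of math’ in this story is mostly matrix multiplication, and it parallelizes the same way the tube pouring did: one thread per output number. A deliberately naive sketch (real frameworks use heavily tiled, tensor-core versions of this idea):

```cuda
// C = A * B, with A (m x k), B (k x n), C (m x n), all row-major.
// Each thread computes a single entry of C.
__global__ void naive_matmul(const float *A, const float *B, float *C,
                             int m, int k, int n) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < m && col < n) {
        float acc = 0.0f;
        for (int p = 0; p < k; ++p) {
            acc += A[row * k + p] * B[p * n + col];  // one tiny piece of the network's math
        }
        C[row * n + col] = acc;
    }
}
```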
Custom Kernels (bonus)
A kernel is just a function written in CUDA C++ that the GPU executes in parallel across many threads. This shows up as a .cu file in your text editor. ‘Kernel’ is a bit of a misnomer, since it is not a single unit of execution, it is not tied to one SM, and it is not an OS kernel. It is a parallel program template that’s cloned across tens of thousands of GPU threads, each running its own instance on different data.
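For concreteness, a minimal .cu file might look like the sketch below: a toy vector add, with arbitrary names and sizes, just to show the kernel/launch split rather than anything production-worthy:

```cuda
// add.cu -- compile with: nvcc add.cu -o add
#include <cuda_runtime.h>
#include <cstdio>

// The kernel: one function, cloned across every thread in the launch.
__global__ void add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int N = 1 << 20;
    const size_t bytes = N * sizeof(float);

    float *a, *b, *c;
    cudaMallocManaged(&a, bytes);   // unified memory keeps the example short
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);
    for (int i = 0; i < N; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    // Launch: 256 threads per block, enough blocks to cover N elements.
    int threads = 256;
    int blocks = (N + threads - 1) / threads;
    add<<<blocks, threads>>>(a, b, c, N);
    cudaDeviceSynchronize();

    printf("c[0] = %f\n", c[0]);    // expect 3.0
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```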
Ok, so what benefits do I get instead of just using raw C++?
Really it’s about control: writing programs directly in CUDA gives you compilation optimized for the GPU, and you can control what gets calculated where, when, and how. It really is just a more 1:1 mapping between the programming model for the problem you are solving and the GPU hardware.
You build a kernel when you are doing something at large scale (production LLMs, HFT, etc.), when existing libraries don’t fit your memory layout, indexing logic, branching pattern, etc., or when you require a tighter implementation than the library you are depending on for some intense operation.
In these cases, you would want to use CUDA to build your own kernel. Since most of the potential algorithms in bioinformatics have not been built out, there is a case to be made for writing them directly in CUDA and deploying them as kernels, versus building them in Rust or raw C++. But it just depends on how much performance you get relative to the time spent developing it.
TL;DR: it’s just a C++ function that runs on the GPU, not the CPU; it gets compiled into GPU assembly (PTX/SASS); and it is launched across many threads in parallel. You build a kernel when an off-the-shelf package written in a high-performance language (C++, Rust) doesn’t saturate your hardware (leverage * efficiency).