After building a lot of software over the past few months, the process has started to feel predictable. Sure, I still need to fix a lot of things and go deeper on language syntax and blah blah blah. But I like when things are simple, because from simple tools you can build an incredibly complex system. Just like everything in nature.
When I first learned JavaScript, we had to learn map(), reduce(), and filter(). I didn’t really understand why at the time, beyond the fact that they were native functions and taught you something about the algorithms being executed under the hood. But in hindsight, this was excellent training in understanding primitives. Every program ever written, ChatGPT included, can be boiled down to the core functions below.
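As a quick refresher on what those three do, here they are in Python (used for all the examples in this post, since the later ones lean on NumPy and PyTorch; the JavaScript versions behave the same way):

```python
from functools import reduce

nums = [3, 1, 4, 1, 5, 9, 2, 6]

# Map: apply the same function to every element
squares = list(map(lambda n: n * n, nums))        # [9, 1, 16, 1, 25, 81, 4, 36]

# Filter: keep only elements that pass a test
evens = list(filter(lambda n: n % 2 == 0, nums))  # [4, 2, 6]

# Reduce: fold many values into a single result
total = reduce(lambda acc, n: acc + n, nums, 0)   # 31
```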
I. The core 80/20 primitives
| Primitive | Mental role | Common names |
| --- | --- | --- |
| Map | do the same work on every element in a data structure | parfor, elementwise, “embarrassingly parallel” |
| Reduce | combine many results into one | sum, max, product, all-reduce |
| Scan | compute running totals or prefixes | prefix-sum, cumulative ops |
| Filter | keep only elements that pass a specific test | stream compaction, predicate masking |
| Gather / Scatter | reorder or move data by index | index-based load/store |
| Sync | wait until all threads catch up | join, barrier, fence |
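As a rough sketch, here is what each of these looks like in NumPy. Sync has no array-level equivalent, since NumPy hides the threads from you; on a GPU it is a barrier such as CUDA’s __syncthreads().

```python
import numpy as np

x = np.array([3, 1, 4, 1, 5])
idx = np.array([4, 0, 2])

mapped   = x * 2          # Map: same operation on every element -> [6, 2, 8, 2, 10]
reduced  = x.sum()        # Reduce: many values folded into one -> 14
scanned  = np.cumsum(x)   # Scan: running totals -> [3, 4, 8, 9, 14]
filtered = x[x > 2]       # Filter: keep elements passing a predicate -> [3, 4, 5]
gathered = x[idx]         # Gather: move data by index -> [5, 3, 4]
# Sync: no NumPy equivalent; with explicit threads it is a join/barrier/fence
```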
II. Domain-specific primitives
| Derived primitive | Built from | Used for |
| --- | --- | --- |
| Stencil / Convolution | Map + Gather + Reduce | PDEs, image filters, CNNs |
| Segmented Scan / Reduce | Scan + Map | sparse or grouped data |
| Sort / Merge | Filter + Scan + Gather | ordering, joins |
| Histogram / Bucket | Map + Reduce + Scan | distributions, quantization |
| Prefix / Stream Ops | Scan + Map | streaming, temporal data |
| Broadcast / All-to-All / ReduceScatter | Gather + Scatter + Barrier | distributed systems (multi-GPU) |
| Task Queue / Work Stealing | Map + Barrier + atomic ops | dynamic load balancing |
| Atomic Update | Reduce (at memory-level granularity) | locks, accumulators, counters |
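To make one row concrete, here is a rough decomposition of Histogram / Bucket into its ingredients, sketched in NumPy (the data and bucket edges are made up for illustration):

```python
import numpy as np

data = np.array([0.1, 0.4, 0.35, 0.8, 0.95, 0.5])
edges = np.linspace(0.0, 1.0, 5)            # 4 buckets of width 0.25

# Map: assign every value to a bucket index
buckets = np.digitize(data, edges) - 1      # [0, 1, 1, 3, 3, 2]

# Reduce: count how many values landed in each bucket
counts = np.bincount(buckets, minlength=4)  # [1, 2, 1, 2]

# Scan: prefix-sum the counts to get each bucket's starting offset
offsets = np.cumsum(counts) - counts        # [0, 1, 3, 4]
```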
Libraries and compilers (TensorFlow, PyTorch, cuDNN, Thrust, NumPy, SQL, Spark) translate our high-level code into combinations of those primitives on the hardware’s execution graph.
We never see it, but:
- torch.sum(x) → a reduce kernel
- torch.cumsum(x) → a scan kernel
- df.groupby() → sort + scan + reduce
- GPU sort() → radix scan + scatter
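One way to convince yourself of the first two mappings is to reproduce their contracts by hand. The sketch below checks torch.sum against a sequential fold and torch.cumsum against a running total; the real kernels are free to compute the same result as a parallel tree reduction or parallel scan instead.

```python
import torch

x = torch.tensor([3.0, 1.0, 4.0, 1.0, 5.0])

# Reduce: torch.sum computes the same fold a hand-rolled loop would
total = torch.sum(x)
manual_total = torch.tensor(0.0)
for v in x:
    manual_total = manual_total + v
assert torch.allclose(total, manual_total)

# Scan: torch.cumsum carries a running total along the dimension
prefix = torch.cumsum(x, dim=0)
manual_prefix = torch.zeros_like(x)
running = torch.tensor(0.0)
for i, v in enumerate(x):
    running = running + v
    manual_prefix[i] = running
assert torch.allclose(prefix, manual_prefix)
```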
Likewise, parallel algorithms (sorting, FFTs, matrix multiply, convolution, BFS, joins, etc.) are just patterns built from those primitives. For example (matrix multiply is sketched in code after this list):
- Sorting = Filter + Scan + Gather (radix) or Merge + Reduce (mergesort)
- Matrix multiply = Map (element products) + Reduce (sum across k)
- Graph traversal = Filter (frontier) + Map (expand) + Scan (prefix offsets)
- ML training = Map (compute gradients) + Reduce (aggregate)
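Here is matrix multiply written literally as Map (all the elementwise products) followed by Reduce (sum across k), in NumPy. This is a sketch of the pattern, not how a real GEMM kernel is implemented:

```python
import numpy as np

A = np.random.rand(4, 3)   # shape (m, k)
B = np.random.rand(3, 5)   # shape (k, n)

# Map: compute every elementwise product A[i, k] * B[k, j] via broadcasting
products = A[:, :, np.newaxis] * B[np.newaxis, :, :]   # shape (m, k, n)

# Reduce: sum across the shared k dimension
C = products.sum(axis=1)                               # shape (m, n)

assert np.allclose(C, A @ B)
```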
In fact, a core reason ML is so parallelizable is that you are running Map over a large set of input values. The computation executed on each input is similar enough, and small enough, that the same kernel can be reused across inputs, so each GPU thread can handle one piece of it.
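A minimal sketch of that pattern, using a toy linear model in NumPy (the model, data, and step size are made up for illustration): each example independently maps to its own gradient, and a reduce averages them into one update.

```python
import numpy as np

# Toy linear model y ≈ X @ w with squared-error loss, 8 examples
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
y = rng.normal(size=8)
w = np.zeros(3)

# Map: each example independently produces its own gradient
errors = X @ w - y                           # per-example residuals, shape (8,)
per_example_grads = 2 * errors[:, None] * X  # shape (8, 3), one gradient per example

# Reduce: average the per-example gradients into a single update
grad = per_example_grads.mean(axis=0)        # shape (3,)
w -= 0.1 * grad
```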