Bradley Woolf

The core primitives all computer programs are built on

After building a lot of software over the past few months, the work has started to feel predictable. Sure, I still need to fix a lot and go deeper on language syntax and blah blah blah. But I like when things are simple, because with a few simple tools you can build an incredibly complex system. Just like everything in nature.

When I first learned JavaScript, we had to learn map(), reduce(), and filter(). I didn't really understand why, beyond the fact that they were native functions and taught us about the algorithms being executed under the hood. But in hindsight, it was excellent training in thinking in primitives. Every program ever written, including ChatGPT, can be boiled down to the core functions below.
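
As a refresher, here's a minimal sketch of those three primitives in Python (the data and functions are just illustrative):

```python
from functools import reduce

nums = [3, 1, 4, 1, 5, 9, 2, 6]

# Map: do the same work on every element
squares = list(map(lambda x: x * x, nums))        # [9, 1, 16, 1, 25, 81, 4, 36]

# Filter: keep only elements that pass a test
evens = list(filter(lambda x: x % 2 == 0, nums))  # [4, 2, 6]

# Reduce: combine many values into one
total = reduce(lambda acc, x: acc + x, nums, 0)   # 31
```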

I. The core 80/20 primitives

| Primitive | Role | Common names |
| --- | --- | --- |
| Map | do the same work on every element in a data structure | parfor, elementwise, “embarrassingly parallel” |
| Reduce | combine many results into one | sum, max, product, all-reduce |
| Scan | compute running totals or prefixes | prefix-sum, cumulative ops |
| Filter | keep only elements that pass a specific test | stream compaction, predicate masking |
| Gather | reorder or move data by index | index-based load/store |
| Sync | wait until all threads catch up | join, barrier, fence |
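
To make these concrete, here's a minimal NumPy sketch of the same primitives on the CPU (the array is made up; Sync has no single-process NumPy analogue, so it's noted in a comment):

```python
import numpy as np

x = np.array([3, 1, 4, 1, 5, 9, 2, 6])

mapped   = x * x              # Map: same op on every element
reduced  = x.sum()            # Reduce: many values -> one (31)
scanned  = np.cumsum(x)       # Scan: running totals [3, 4, 8, 9, 14, 23, 25, 31]
filtered = x[x % 2 == 0]      # Filter: keep elements passing a predicate -> [4, 2, 6]
gathered = x[[7, 0, 3]]       # Gather: move data by index -> [6, 3, 1]
# Sync: a barrier like CUDA's __syncthreads() has no NumPy equivalent;
# here each line implicitly completes before the next one starts.
```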

II. Domain-specific primitives

| Derived primitive | Built from | Used for |
| --- | --- | --- |
| Stencil / Convolution | Map + Gather + Reduce | PDEs, image filters, CNNs |
| Segmented Scan / Reduce | Scan + Map | sparse or grouped data |
| Sort / Merge | Filter + Scan + Gather | ordering, joins |
| Histogram / Bucket | Map + Reduce + Scan | distributions, quantization |
| Prefix / Stream Ops | Scan + Map | streaming, temporal data |
| Broadcast / All-to-All / ReduceScatter | Gather + Scatter + Barrier | distributed systems (multi-GPU) |
| Task Queue / Work Stealing | Map + Barrier + atomic ops | dynamic load balancing |
| Atomic Update | Reduce (at memory-level granularity) | locks, accumulators, counters |
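
As an example of how a derived primitive decomposes, here's a rough NumPy sketch of a 1-D stencil (a 3-tap moving average) built from Gather + Map + Reduce, matching the first row of the table; the signal and weights are made up:

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0, 8.0, 16.0, 32.0])

# Gather: index-shifted copies of the input (left, center, right neighbors)
idx = np.arange(1, len(x) - 1)
left, center, right = x[idx - 1], x[idx], x[idx + 1]

# Map: weight each neighbor elementwise; Reduce: sum across the three taps
smoothed = 0.25 * left + 0.5 * center + 0.25 * right
# smoothed == [2.25, 4.5, 9.0, 18.0]
```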

Libraries and compilers (TensorFlow, PyTorch, cuDNN, Thrust, NumPy, SQL, Spark) translate our high-level code into combinations of those primitives on the hardware’s execution graph.

We never see it, but:

  • torch.sum(x) → a reduce kernel
  • torch.cumsum(x) → a scan kernel
  • df.groupby() → sort + scan + reduce (sketched below)
  • GPU sort() → radix scan + scatter
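
To see one of these in slow motion, here's a hedged, hand-rolled sketch of the groupby decomposition in NumPy (sort + scan + reduce); real Pandas internals are more involved, but the shape is the same:

```python
import numpy as np

keys = np.array([2, 0, 1, 0, 2, 1])
vals = np.array([10., 1., 5., 2., 20., 7.])

# Sort: reorder so equal keys are adjacent (a Gather by sorted index)
order = np.argsort(keys, kind="stable")
k, v = keys[order], vals[order]

# Scan: mark positions where the key changes to find segment starts
starts = np.flatnonzero(np.r_[True, k[1:] != k[:-1]])

# Reduce: sum each segment independently
sums = np.add.reduceat(v, starts)
# k[starts] == [0, 1, 2], sums == [3.0, 12.0, 30.0]
```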

Likewise, parallel algorithms (sorting, FFTs, matrix multiply, convolution, BFS, joins, etc.) are just patterns built from those primitives. For example:

  • Sorting = Filter + Scan + Gather (radix; one pass is sketched below) or Merge + Reduce (mergesort)
  • Matrix multiply = Map (element products) + Reduce (sum across k)
  • Graph traversal = Filter (frontier) + Map (expand) + Scan (prefix offsets)
  • ML training = Map (compute gradients) + Reduce (aggregate)
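
To illustrate the first decomposition, here's a rough sketch of a single pass of a binary radix sort in NumPy, which is literally Filter + Scan + Gather (looping the pass over every bit yields a full sort):

```python
import numpy as np

def radix_pass(x: np.ndarray, bit: int) -> np.ndarray:
    """Stable-partition x by one bit: zeros first, ones after."""
    is_one = (x >> bit) & 1                   # Filter: predicate mask per element
    zeros_before = np.cumsum(is_one == 0)     # Scan: prefix counts of zero-bit elements
    num_zeros = zeros_before[-1]
    # Destination index: zeros keep their order up front,
    # ones keep their order after all the zeros.
    dest = np.where(is_one == 0,
                    zeros_before - 1,
                    num_zeros + np.cumsum(is_one) - 1)
    out = np.empty_like(x)
    out[dest] = x                             # Gather/Scatter: move data by index
    return out

x = np.array([5, 3, 6, 2, 7, 0])
for bit in range(3):                          # 3 bits cover values 0..7
    x = radix_pass(x, bit)
# x == [0, 2, 3, 5, 6, 7]
```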

In fact, a core reason ML is so parallelizable is that training runs Map over a large set of input values. The computation executed on each input is similar enough, and small enough, to be reused across inputs, so each GPU thread can handle one piece of it.
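
As a toy illustration of that Map + Reduce structure, here's a hedged sketch of one step of linear-regression training in NumPy; the per-example gradients are the Map, and averaging them into a single update is the Reduce (the model and data are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1024, 4))                # 1024 examples, 4 features
y = X @ np.array([1.0, -2.0, 0.5, 3.0])       # made-up targets
w = np.zeros(4)

# Map: every example independently computes its own gradient
errors = X @ w - y                            # per-example prediction error
per_example_grads = 2 * errors[:, None] * X   # one gradient row per example

# Reduce: aggregate all per-example gradients into a single update
grad = per_example_grads.mean(axis=0)
w -= 0.1 * grad
```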