After building a lot of software over the past few months, the process has started to feel predictable. Sure, I still need to fix a lot of things and go deeper on language syntax and blah blah blah. But I like when things are simple, because from simple tools you can build an incredibly complex system. Just like everything in nature.
When I first learned JavaScript, we had to learn map(), reduce(), and filter(). I didn’t really understand why at the time, beyond the fact that they were native functions and taught you something about the algorithms being executed under the hood. But in hindsight, this was excellent training in understanding primitives. Every program ever written, ChatGPT included, can be boiled down to the core functions below.
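As a quick refresher on what those three do, here they are in Python (used for all the examples in this post, since the later ones lean on NumPy and PyTorch; the JavaScript versions behave the same way):

```python
from functools import reduce

nums = [3, 1, 4, 1, 5, 9, 2, 6]

# Map: apply the same function to every element
squares = list(map(lambda n: n * n, nums))        # [9, 1, 16, 1, 25, 81, 4, 36]

# Filter: keep only elements that pass a test
evens = list(filter(lambda n: n % 2 == 0, nums))  # [4, 2, 6]

# Reduce: fold many values into a single result
total = reduce(lambda acc, n: acc + n, nums, 0)   # 31
```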
I. The core 80/20 primitives
| Primitive | Mental role | Common names |
| --- | --- | --- |
| Map | do the same work on every element in a data structure | parfor, elementwise, “embarrassingly parallel” |
| Reduce | combine many results into one | sum, max, product, all-reduce |
| Scan | compute running totals or prefixes | prefix-sum, cumulative ops |
| Filter | keep only elements that pass a specific test | stream compaction, predicate masking |
| Gather / Scatter | reorder or move data by index | index-based load/store |
| Sync | wait until all threads catch up | join, barrier, fence |
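As a rough sketch, here is what each of these looks like in NumPy. Sync has no array-level equivalent, since NumPy hides the threads from you; on a GPU it is a barrier such as CUDA’s __syncthreads().

```python
import numpy as np

x = np.array([3, 1, 4, 1, 5])
idx = np.array([4, 0, 2])

mapped   = x * 2          # Map: same operation on every element -> [6, 2, 8, 2, 10]
reduced  = x.sum()        # Reduce: many values folded into one -> 14
scanned  = np.cumsum(x)   # Scan: running totals -> [3, 4, 8, 9, 14]
filtered = x[x > 2]       # Filter: keep elements passing a predicate -> [3, 4, 5]
gathered = x[idx]         # Gather: move data by index -> [5, 3, 4]
# Sync: no NumPy equivalent; with explicit threads it is a join/barrier/fence
```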
II. Domain-specific primitives
| Derived primitive | Built from | Used for |
| --- | --- | --- |
| Stencil / Convolution | Map + Gather + Reduce | PDEs, image filters, CNNs |
| Segmented Scan / Reduce | Scan + Map | sparse or grouped data |
| Sort / Merge | Filter + Scan + Gather | ordering, joins |
| Histogram / Bucket | Map + Reduce + Scan | distributions, quantization |
| Prefix / Stream Ops | Scan + Map | streaming, temporal data |
| Broadcast / All-to-All / ReduceScatter | Gather + Scatter + Barrier | distributed systems (multi-GPU) |
| Task Queue / Work Stealing | Map + Barrier + atomic ops | dynamic load balancing |
| Atomic Update | Reduce (at memory-level granularity) | locks, accumulators, counters |
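To make one row concrete, here is a rough decomposition of Histogram / Bucket into its ingredients, sketched in NumPy (the data and bucket edges are made up for illustration):

```python
import numpy as np

data = np.array([0.1, 0.4, 0.35, 0.8, 0.95, 0.5])
edges = np.linspace(0.0, 1.0, 5)            # 4 buckets of width 0.25

# Map: assign every value to a bucket index
buckets = np.digitize(data, edges) - 1      # [0, 1, 1, 3, 3, 2]

# Reduce: count how many values landed in each bucket
counts = np.bincount(buckets, minlength=4)  # [1, 2, 1, 2]

# Scan: prefix-sum the counts to get each bucket's starting offset
offsets = np.cumsum(counts) - counts        # [0, 1, 3, 4]
```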
Libraries and compilers (TensorFlow, PyTorch, cuDNN, Thrust, NumPy, SQL, Spark) translate our high-level code into combinations of those primitives on the hardware’s execution graph.
We never see it, but:
- torch.sum(x) → a reduce kernel
- torch.cumsum(x) → a scan kernel
- df.groupby() → sort + scan + reduce
- GPU sort() → radix scan + scatter
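One way to convince yourself of the first two mappings is to reproduce their contracts by hand. The sketch below checks torch.sum against a sequential fold and torch.cumsum against a running total; the real kernels are free to compute the same result as a parallel tree reduction or parallel scan instead.

```python
import torch

x = torch.tensor([3.0, 1.0, 4.0, 1.0, 5.0])

# Reduce: torch.sum computes the same fold a hand-rolled loop would
total = torch.sum(x)
manual_total = torch.tensor(0.0)
for v in x:
    manual_total = manual_total + v
assert torch.allclose(total, manual_total)

# Scan: torch.cumsum carries a running total along the dimension
prefix = torch.cumsum(x, dim=0)
manual_prefix = torch.zeros_like(x)
running = torch.tensor(0.0)
for i, v in enumerate(x):
    running = running + v
    manual_prefix[i] = running
assert torch.allclose(prefix, manual_prefix)
```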
Likewise, parallel algorithms (sorting, FFTs, matrix multiply, convolution, BFS, joins, etc.) are just patterns built from those primitives. For example (matrix multiply is sketched in code after this list):
- Sorting = Filter + Scan + Gather (radix) or Merge + Reduce (mergesort)
- Matrix multiply = Map (element products) + Reduce (sum across k)
- Graph traversal = Filter (frontier) + Map (expand) + Scan (prefix offsets)
- ML training = Map (compute gradients) + Reduce (aggregate)
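Here is matrix multiply written literally as Map (all the elementwise products) followed by Reduce (sum across k), in NumPy. This is a sketch of the pattern, not how a real GEMM kernel is implemented:

```python
import numpy as np

A = np.random.rand(4, 3)   # shape (m, k)
B = np.random.rand(3, 5)   # shape (k, n)

# Map: compute every elementwise product A[i, k] * B[k, j] via broadcasting
products = A[:, :, np.newaxis] * B[np.newaxis, :, :]   # shape (m, k, n)

# Reduce: sum across the shared k dimension
C = products.sum(axis=1)                               # shape (m, n)

assert np.allclose(C, A @ B)
```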
In fact, a core reason ML is so parallelizable is that you are running Map over a large set of input values. The computation executed on each input is similar enough, and small enough, that the same kernel can be reused across inputs, so each GPU thread can handle one piece of it.
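A minimal sketch of that pattern, using a toy linear model in NumPy (the model, data, and step size are made up for illustration): each example independently maps to its own gradient, and a reduce averages them into one update.

```python
import numpy as np

# Toy linear model y ≈ X @ w with squared-error loss, 8 examples
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
y = rng.normal(size=8)
w = np.zeros(3)

# Map: each example independently produces its own gradient
errors = X @ w - y                           # per-example residuals, shape (8,)
per_example_grads = 2 * errors[:, None] * X  # shape (8, 3), one gradient per example

# Reduce: average the per-example gradients into a single update
grad = per_example_grads.mean(axis=0)        # shape (3,)
w -= 0.1 * grad
```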