A neural network has a lot of components, and it can become overwhelming to decide what to use when, where, how, and why. I struggled with this, and it helped to frame it in an analogy.
Recently, I was at the Getty Villa in Malibu, which is an almost-exact replica of a wealthy Roman merchant’s home. Inside, hundreds of years of Roman and Greek art is available for viewing. The prized possession of the collection is this statue of Heracles:
Now, we don't know for certain whether the statue was carved from a single marble block. That is fine, and even useful for our purposes: we want to understand how neural networks can be shaped, just like this statue, either from one marble block (one dataset) or from many blocks joined together (many datasets that are ensembled).
This analogy is not beginner-friendly and assumes some experience building models. It is not perfect, but it has served me well. If you persevere, it may help you build these models systematically, or at least with some sequence that helps you make better decisions.
I. Target function = the sculptor's vision
Before starting, the sculptor has a clear vision of what the statue should look like, feel like, and be. They hold a reference model in their head of the ideal form, and every stroke, cut, and chip is in service of that image. This is the target function: the relationship we are attempting to discover in the input dataset.
II. Input = marble block
The raw data is the unshaped marble block. The final form is in there somewhere, surrounded by everything that needs to be removed: the noise and the irrelevant patterns.
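To make the rest of the analogy concrete, here is a minimal toy setup in PyTorch. It is only a sketch: the target function, the noise level, and the shapes are all made up for illustration.

```python
import torch

# The "sculptor's vision": the hidden relationship we want the network to recover.
def target_function(x):
    return torch.sin(3 * x)

# The raw "marble block": inputs plus noisy observations of that relationship.
x = torch.linspace(-1, 1, 200).unsqueeze(1)          # shape (200, 1)
y = target_function(x) + 0.1 * torch.randn_like(x)   # signal plus the noise to be chipped away
```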
III. Hidden layers = stages of shaping
Each hidden layer removes a bit of unnecessary information (material). The early layers, like the first rough chips and strokes, remove large chunks; the deeper layers, like the finer strokes, refine the details, leaving only the features and small signals that matter. The result is a highly detailed, accurate, and beautiful final form, where the observer can tell a lot of effort went into the piece because of the small details.*
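Continuing the toy sketch above, a few stacked layers might look like the following. The layer sizes are arbitrary; the point is only that each stage reshapes what the previous one left behind.

```python
import torch.nn as nn

# A small network whose hidden layers act as successive stages of shaping.
model = nn.Sequential(
    nn.Linear(1, 32),   # early layer: broad strokes over the raw input
    nn.ReLU(),
    nn.Linear(32, 16),  # deeper layer: finer refinement of the features
    nn.ReLU(),
    nn.Linear(16, 1),   # final form: the current state of the sculpture
)
```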
IV. Weights and biases = the artist's tools and techniques
The artist's chisel size and corresponding hand pressure are the neural network's weights and biases. These parameters control how much to remove from the marble (dataset) and from where. And just like the pressure, angle, and depth of each chisel stroke, the weights and biases are tuned over many, many passes.
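In the sketch above, each layer carries its own weight matrix and bias vector, and you can inspect them directly. The shapes come from the arbitrary layer sizes chosen earlier.

```python
# Each Linear layer carries its own "chisel settings": a weight matrix and a bias vector.
first_layer = model[0]
print(first_layer.weight.shape)   # torch.Size([32, 1])
print(first_layer.bias.shape)     # torch.Size([32])

# Every one of these numbers is what gets tuned, pass after pass, during training.
print(sum(p.numel() for p in model.parameters()))
```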
V. Activation function = artistic style, decision rules and taste
In the end, it is the artist's taste and decision-making that determine what should be left after every stroke. The weights and biases decide how much to chisel and from where, so they are tools of that taste: extensions of the artist's experience, downstream of the vision of what should remain. The activation function is the neural network's taste. After the weighted sums are computed, after the stroke is made and the piece chiseled away, it decides which information is worth keeping. It is the filter that determines which signal passes through.
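A common choice of "taste" is ReLU, which simply zeroes out anything negative. A tiny illustration:

```python
import torch
import torch.nn as nn

relu = nn.ReLU()
pre_activation = torch.tensor([-2.0, -0.5, 0.0, 0.7, 3.0])

# The network's taste: negative signal is chiseled away (set to zero),
# and only what survives the filter passes on to the next layer.
print(relu(pre_activation))   # tensor([0.0000, 0.0000, 0.0000, 0.7000, 3.0000])
```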
VI. Loss/objective function = critique of reality vs idealized vision
After every substantial change, the artist steps back to compare how far the current sculpture is from the idealized version in their head (the target function). In neural network training, this happens after the forward pass: once all the shaping for that step is done, the output is compared against the target values (the idealized version in the artist's head), and the loss measures the gap.
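Continuing the sketch, with mean squared error standing in for the critique (any loss appropriate to the problem would do):

```python
import torch.nn as nn

loss_fn = nn.MSELoss()

prediction = model(x)           # the sculpture as it stands after this forward pass
loss = loss_fn(prediction, y)   # the critique: the distance between reality and the vision
print(loss.item())
```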
VII. Optimizer = feedback mechanism
Once the artist has critiqued how much closer the last stroke brought them to the idealized version of the sculpture, they need to adjust their tools, pressure, and technique for the next iteration (the next forward pass). This is the backward pass and the optimizer step: the critique is propagated backward through the network as gradients, and the weights and biases are nudged toward the target function (the ideal version) after each forward pass.
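In the toy sketch, one round of feedback looks like this. Plain SGD is used only for illustration; any optimizer follows the same pattern.

```python
import torch

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

optimizer.zero_grad()   # clear the feedback from the previous stroke
loss.backward()         # backward pass: how should each weight and bias change?
optimizer.step()        # adjust the tools and pressure for the next iteration
```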
VIII. Output = the evolving structure
The output is the visible result after each cycle of shaping, critique, adjustment, and refinement. It is continuously updated through many, many iterations of chipping away the noise and excess marble until nothing but the original vision is left (the target function, the relationship hidden in the dataset).
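Putting the whole analogy together, the training loop for the toy sketch is just this cycle repeated. The epoch count and learning rate here are arbitrary.

```python
# The full cycle, repeated many times: shape (forward pass), critique (loss),
# adjust (backward pass + optimizer step), and an output that keeps evolving.
for epoch in range(2000):
    prediction = model(x)
    loss = loss_fn(prediction, y)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if epoch % 500 == 0:
        print(f"epoch {epoch}: loss {loss.item():.4f}")
```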
And even then, the marble statue will need refinement: a new arm, a fresh coating, and other fixes over time against entropy. The same is true of the neural network. The process of refinement is never-ending, especially if you can attach a real-world feedback loop to improve its results. The neural network evolves over time as data-capture tools and techniques advance.
*In the real world, especially in luxury goods, there is a mistake often made: confusing complexity with beauty. Just because something has a lot of parts does not make it beautiful. It must be simple, yet detailed enough that you can recognize the excruciating energy and time that went into making it. That is what makes someone feel something. Not the number of components and the intricacies between them, but the amount of blood, sweat, and tears that were shed to produce the end result. As an aside, the feeling of beauty may just be an acknowledgment of incredible sacrifice and remarkably high standards.