Back when I was in college, getting a Computer Science degree meant taking a bunch of somewhat advanced math courses. My math brain topped out at Calc2, so clearly I was going to have to work the system somehow. Thus was born my custom-made “Cognitive Science” degree, a combination of the cool parts of Psychology with the cool parts of CS. Woot! My advisor in the degree was Jamshed Bharucha, who has done a ton of really cool work trying to understand how we perceive music.

In retrospect it was an awesome time to be learning about artificial intelligence. Most approaches still didn’t work very well (except in limited domains). The late-80s hype around expert systems had petered out, and the field overall was pretty demoralized. But blackboards and perceptrons were still bopping around, and I was particularly enamored with the stuff that Rodney Brooks (eventually of Roomba fame) was doing. What was great for a student was that all of these ideas were still relatively **simple** — you could intuitively talk about how they worked, and the math was approachable enough that you could actually implement them. Today it’s much harder to develop that kind of intuition from first principles, because everything is about mind-numbing linear algebra and layer upon layer of derivatives (on the other hand, today the algorithms actually work I guess).

Most notably for me, the classic 1986 backpropagation paper by Rumelhart / Hinton / Williams was just gaining traction as I was learning all of this. Backprop basically restarted the entire field and, coupled with Moore’s Law, set the stage for the pretty incredible AI performance we take for granted today. Dr. Bharucha saw this happening, and tapped me to write a graphical neural net simulator on the Mac that we used to teach classes. Sadly, while you can still find a few obscure mentions of DartNet around the web (including a tiny screenshot), it seems that the code is lost — ah well.

Nostalgia aside, I have been noodling an idea for a project that would require a network implementation. There are a metric ton of really, really good open source options to choose from, but I realized I didn’t really remember the details of how it all worked, and I don’t like that. So with the help of the original paper and some really nice, simple reference code I set about to get refreshed, and figured that others might enjoy a “101” as well, so here we go.

## Real Neurons are Really Cool

We do our thinking thanks to cells called Neurons. In combination they are unfathomably complex, but individually they’re pretty simple, at least at a high level. Neurons basically have three parts:

- The cell body or **soma**, which holds the nucleus and DNA and is a lot like any other cell.
- **Dendrites**, branching tendrils which extend from the cell body and receive signals *from* other cells.
- The **axon**, a (usually) single long extension from the cell body that sends signals *to* other cells.

Neurons are packed together so that axons from some neurons are really close to dendrites from others. When a neuron is “activated”, its axon releases chemicals called neurotransmitters, which travel across the gap (called a synapse) to nearby dendrites. When those dendrites sense neurotransmitters, they send an electrical pulse up towards their cell body. If enough dendrites do this at the same time, that neuron also “activates”, sending the pulse up the axon which responds by releasing more neurotransmitters. And she tells two friends, and she tells two friends… and suddenly you remember your phone number from 1975.

It doesn’t quite work this way, but imagine you’ve got a neuron in your head dedicated to recognizing Ed Sheeran. Dendrites from this neuron might be connected to axons from the neuron that recognizes redheads, and the one for Cherry Seaborn, and the one for British accents, and dozens of others. No single dendrite is enough to make the Ed Sheeran neuron fire; it takes a critical mass of these inputs firing at the same time to do the job. And some dendrites are more important than others — the “shabbily dressed” neuron probably nudges you towards recognizing Ed, but isn’t nearly as powerful as “hearing Galway Girl”.

Pile up enough neurons with enough connections and you end up with a brain. “Learning” is just the process of creating synapses and adjusting their strengths. All of our memories are encoded in these things too. They’re why I think of my grandparents’ house every time I smell petrichor, and why I start humming Rocky Mountain High whenever I visit Overlake Hospital. It’s just too freaking amazing.

## Fake Neurons

People have been thinking about how to apply these concepts to AI since the 1940s. That evolution itself is fascinating but a bit of a side trip. If we fast-forward to the early 1980s, the state of the art was more-or-less represented in Minsky and Papert’s book Perceptrons. In brief (and I hope I don’t mess this up too much):

- Coding a fake neuron is pretty easy.
- Coding a network of fake neurons is also pretty easy, albeit computationally intense to run.
- Two-layer, fully-connected networks that link “input” to “output” neurons can learn a lot of things by example, but their scope is limited.
- Multi-layer networks that include “hidden” neurons between the inputs and outputs can solve many more problems.
- But while we understood how to train the networks in #3, we didn’t know how to train the hidden connections in #4.

The difference between the networks in #3 and #4 is about “linearity”. Imagine your job is to sort a pile of random silverware into “forks” and “spoons”. Unfortunately, you discover that while many pieces are pretty obviously one or the other, there are also a bunch of “sporks” in the mix. How do you classify these sporkish things? One super-nerdy way would be to identify some features that make something “forky” vs. “spoony” and plot examples on a chart (hot tip: whenever you see somebody building “graphs” in PowerPoint as I’ve done here, you should generally assume they’re full of crap):

If we measure the “tine length” and “bowl depth” of each piece, we can plot it on this graph. And lo and behold, we can draw a straight line (the dotted one) across this chart to separate the universe quite accurately into forks and spoons. Sure, the true sporks in the lower-left are tough, as are weirdo cases like the “spaghetti fork” represented by the mischaracterized red dot on the right. But by and large, we’ve got a solid classifier here. The dividing line itself is pretty interesting — you can see that “tine length” is far more “important” to making the call than the bowl depth. This makes intuitive sense — I’ve seen a lot of spoon-shaped forks, but not a lot of spoons with long tines.

This kind of classification is called “linear classification,” and it is super-super-powerful. While it’s hard for us to visualize, it will work with any number of input parameters, not just two. If you imagine adding a third dimension (z axis) to the chart above, a flat plane could still split the universe in two. A whole bunch of AI you see in the world is based on multi-dimensional linear classifiers (even the T-Detect COVID-19 test created by my friends at Adaptive).
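To make the fork/spoon idea concrete, here’s a minimal sketch of a two-feature linear classifier in Java. The weights and threshold are made up for illustration (a real model would learn them from labeled examples), but the shape of the computation is exactly the weighted-sum-and-threshold idea behind that dividing line:

```java
// A minimal linear classifier sketch for the fork/spoon example.
// The weights and threshold are invented for illustration; a real
// model would learn them from labeled training data.
public class SilverwareClassifier {

    // Tine length matters much more than bowl depth, so it gets
    // a bigger weight (and deeper bowls push toward "spoon").
    static final double TINE_WEIGHT = 0.9;
    static final double BOWL_WEIGHT = -0.4;
    static final double THRESHOLD = 0.3;

    // Inputs are assumed normalized to the 0..1 range.
    public static boolean isFork(double tineLength, double bowlDepth) {
        double score = (TINE_WEIGHT * tineLength) + (BOWL_WEIGHT * bowlDepth);
        return score > THRESHOLD; // everything on one side of the line is a fork
    }
}
```

The weights here are literally the slope of that dotted line on the chart; changing them tilts the line.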

But there are a bunch of things linear classifiers can’t do — a good example being complex image recognition (dog or fried chicken?). Enter the multi-layered neural network (#4 in the list above). Instead of straight lines, these networks can draw distinctions using complex curves and even disjoint shapes. Super-cool … except that back in the early 80s we didn’t know how to train them. Since carefully hand-crafting a network with thousands of connections is laughable, we were kind of stuck.

I already gave away the punchline — in 1986 some super-smart folks solved this dilemma with a technique they called “backpropagation.” But before we dig into that, let’s look a little more closely at how artificial neural nets are typically put together.

## Network Layers and Forward Propagation

I alluded to the fact that our brains are generally a jumble of interconnected neurons. Some connections are predetermined, but most come about as we learn stuff about the world. The interconnectedness is massively complex — our artificial versions are much simpler, because we’re just not as smart as Nature. Still, there is a lot going on.

Fake neurons are arranged into “layers”, starting with the input layer. This input layer is where features (tine length, etc.) are presented to the system, usually as floating point numbers and ideally normalized to a consistent range like 0 to 1 (normalizing the inputs lets the network assess the importance of each feature on its own). The last layer in a network is the “output” layer, which is where we read out results. The output layer might be a single neuron that provides a yes/no answer; or it might be a set of neurons, each of which assesses the probability of the inputs representing a particular thing, or something in between.

In between these two layers is usually at least one “hidden” layer. The number of neurons in these layers is up to the network designers — and there aren’t a ton of “rules” about what will work best in any specific situation. This is true of most “hyperparameters” used to tune AI systems, and selection usually comes down somewhere between a random guess and trying a whole bunch to see what performs the best. And we think we’re so smart.

Every neuron in layer N is connected via an artificial synapse to every neuron in layer N + 1. The “strength” of each synapse is represented by a floating-point value called a “weight”. Generating an output from a given set of inputs is called a “forward pass” or “forward propagation” and works like this:

1. Assign each input value to the input neurons.
2. For each neuron *N* in the first hidden layer:
   - For each neuron *N’* in the layer below, calculate the value sent from *N’* to *N* by multiplying the value of *N’* with the weight of the synapse between them.
   - Sum these values together to get the total input value for neuron *N*.
   - Add the “bias” value of *N* to the sum. Intuitively this bias allows each neuron to have a “base level” importance to the system.
   - Apply an “activation” function to the sum to determine the final output value of *N* (see discussion of activation functions below).
3. Repeat step 2 for each subsequent network layer until the output layer is reached.

Activation functions are interesting. In a very practical sense, we need to normalize the output values of each neuron — if we didn’t, the “sum” part of the algorithm would just keep growing with every layer. Using a non-linear function to perform that normalization is what enables the non-linear classification we’re trying to build. Remember that real neurons are binary — they either fire or they do not — a very non-linear operation. Artificial networks tend to use something smoother like the sigmoid function (or, these days, ever more complicated alternatives), which has the added benefit of being efficient for a learning approach called gradient descent (I know, more terms … sorry, we’ll get there).
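For reference, here’s the sigmoid as a small Java sketch (not the repo’s implementation, just the math). The handy trick that makes it cheap during training is that its derivative can be computed directly from its own output value:

```java
// The classic sigmoid activation: squashes any input into (0, 1).
public class Sigmoid {

    public static double function(double x) {
        return 1.0 / (1.0 + Math.exp(-x));
    }

    // When y = sigmoid(x), the derivative dy/dx is just y * (1 - y),
    // so during backprop we never need to recompute the exponential.
    public static double derivative(double y) {
        return y * (1.0 - y);
    }
}
```

Plug in 0 and you get 0.5; big positive sums saturate toward 1, big negative sums toward 0 — a smooth, differentiable stand-in for a neuron’s fire / don’t-fire decision.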

It’s hard to describe algorithms in English. Hopefully that all made sense, but said more simply: artificial neural networks arrange neurons in layers. Activation of the neurons at each layer is calculated by adding up the activations from the layer below, scaled by weights that capture the relative importance of each synapse. Non-linear activation functions then transform those sums into each neuron’s final output value. That’s good enough.

## Training the Network / Enter Backpropagation

Backprop is one of those algorithms that I can fight through with pen and paper and really understand for about five seconds before it’s lost again. I take some pride in those five seconds, but I wish I was better at retaining this stuff. Ah well — I can at least hold onto an intuitive sense of what is going on — and that’s what I’ll share here.

We can train a network by showing it a whole bunch of input samples where we know what the output should be (this is called supervised learning). The network is initialized with a set of totally random weights and biases, then a forward pass is done on the first sample. If we subtract the (pretty much random) outputs we get from the correct/expected results, we get an “error” value for each output node. Our goal is to use that error plus a technique called “gradient descent” to adjust the weights coming into the node so that the total error is smaller next time. Then we run the other samples the same way until the network either gets smart or we give up.

Gradient descent is a very simple idea. Considering one synapse (weight) in our network, imagine a chart like the one here that plots all the possible weight values against the errors they produce. Unless the error is totally unrelated to the weight (certainly possible but then it is all random and what’s the meaning of life anyways), you’ll end up with a curve, maybe one like the dotted line shown below. Our job is to find the weight value that minimizes error, so we’re trying to hit that lower trough where the green dot is.

If our initial random stab is the red dot, we want to move the weight “downhill” to the left. We don’t know how far, but we can see that the slope of the curve is pretty steep where we are, so we can take a pretty big step. But oops, if we go too far we end up missing the bottom and land somewhere like the gold dot. That’s ok — the slope is shallower now, so we’ll try again, taking a smaller step to the right this time. And we just keep doing that, getting closer and closer to the bottom where we want to be.
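That downhill walk fits in a few lines of Java. The error curve here is a made-up parabola, error(w) = (w − 3)², which bottoms out at w = 3; real networks descend a far messier surface, but the update rule is the real thing: step opposite the slope, scaled by a learning rate, so steep slopes mean big steps and shallow slopes mean small ones.

```java
// One-dimensional gradient descent sketch on a toy error curve.
public class Descent {

    // Derivative (slope) of the toy error curve (w - 3)^2.
    static double slope(double w) {
        return 2.0 * (w - 3.0);
    }

    // Walk downhill from a starting weight: each step moves opposite
    // the slope, scaled by the learning rate, just like the red dot.
    public static double minimize(double w, double learningRate, int steps) {
        for (int i = 0; i < steps; ++i) {
            w -= learningRate * slope(w);
        }
        return w;
    }
}
```

Starting from a “random” weight of 10 with a learning rate of 0.1, a hundred steps lands essentially on the minimum at 3. Crank the learning rate too high and you overshoot back and forth, exactly the red-dot-to-gold-dot bounce described above.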

Alas, the big problem with gradient descent is represented by the purple dot, called a “local minimum”. If our initial random weight puts us near that part of the curve, we might accidentally follow it downhill to the purple and get “stuck” because the slope there is zero and we never take a big enough step to escape. There are various ways to minimize (ha) this problem, all of which amount in practice to jiggling the dot to try to shake it loose. Fun stuff, but I’m just going to ignore it here.

Anyways, it turns out that something called the “chain rule” lets us figure out the rate of change of the error at each output node with respect to each incoming weight value. And once we know that, we can use gradient descent to adjust those weights just like we did with the red dot. And it also enables us to iteratively distribute errors through the lower layers, repeating the process. I would just embarrass myself trying to explain all the math that gets us there, but I grasped it just long enough to implement it here.

Again trying to wrap all this up, in short we train a network by (a) computing how poorly it performs on known input/output combinations, (b) divvying up that error between the synapses leading to the outputs and using that to update weight values, then (c) iteratively pushing the error to the next lower layer in the network and repeating the process until we get to the bottom. Do this enough times and we end up with a network that (usually) does a good job of classification, even on inputs it hasn’t seen before.

## Show Me the Money (Code)

Matrix math bugs me. OK, mostly I’m jealous of the way other folks toss around “transposes” and “dot products” and seemingly know what they’re talking about without sketching out rows and columns on scrap paper. I suspect I’m not alone. But it turns out that having a solid Matrix class *really* simplifies the code required for working with fake neurons. So that’s where we’ll start, in Matrix.java. There is absolutely nothing exciting in this file — it just defines a Matrix as a 2D array of doubles and provides a bunch of super-mechanical operations and converters. I like the vibe of the iterate and transform methods, and it’s helpful to understand how I coded up equality tests, but really let’s move on.
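For flavor, here’s roughly what the workhorse operation looks like. This is just a sketch of matrix multiplication over 2D double arrays, not the actual Matrix.java — but it’s the operation that does most of the heavy lifting in a forward pass:

```java
// Sketch of the core matrix multiply: rows of a dotted with columns
// of b. Assumes a's column count equals b's row count.
public class TinyMatrix {

    public static double[][] multiply(double[][] a, double[][] b) {
        int rows = a.length, inner = b.length, cols = b[0].length;
        double[][] out = new double[rows][cols];

        for (int r = 0; r < rows; ++r) {
            for (int c = 0; c < cols; ++c) {
                double sum = 0.0;
                for (int i = 0; i < inner; ++i) {
                    sum += a[r][i] * b[i][c]; // dot product of row r and column c
                }
                out[r][c] = sum;
            }
        }
        return out;
    }
}
```

Multiply a weights matrix by a column of activations and each output row is exactly the “sum of incoming values times weights” step from the forward-pass pseudocode — which is why having a Matrix class makes the network code so compact.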

Network.java is where all the magic really happens. Network.Config defines parameters and can also fully hydrate/dehydrate the state of weights and biases. One thing to be careful of — I put in a little hook to provide custom activation functions, but right now the code ignores that and always uses sigmoid. Beyond all of that housekeeping, there are three bits of code worth a closer look: forwardPass, trainOne and trainAndTest:

There are two versions of the forwardPass method: a public one that just returns an output array, and an internal one that returns activation values for all neurons in the network. That internal one does the real work and looks like this:

```java
private List<Matrix> forwardPassInternal(double[] input) {

    List<Matrix> results = new ArrayList<Matrix>();
    results.add(new Matrix(input, 0, cfg.Layers[0]));

    for (int i = 0; i < weights.size(); ++i) {
        Matrix layer = weights.get(i).multiply(results.get(i));
        layer.add(biases.get(i));
        layer.transform(v -> activation.function(v));
        results.add(layer);
    }

    return(results);
}
```

The “results” list has one entry for each layer in the network, starting with input and ending with output. Each entry is a Matrix, but keep in mind that it’s really just a simple array of activation values for each neuron at that layer (rows = # of neurons, columns = 1). We initialize the list by copying over the input activation values, then iterate over each layer computing its activation values until we get to the output. This is just an actual implementation of the forward propagation pseudocode we discussed earlier.

Training is also just a few lines of code, but it is a bit harder on the brain:

```java
public void trainOne(double[] vals) {

    // forwardprop
    List<Matrix> results = forwardPassInternal(vals);

    // backprop
    Matrix errors = new Matrix(vals, numInputs(), numInputs() + numOutputs());
    errors.subtract(results.get(results.size() - 1));

    for (int i = weights.size() - 1; i >= 0; --i) {

        // figure out the gradient for each weight in the layer
        Matrix gradient = new Matrix(results.get(i+1));
        gradient.transform(v -> activation.derivative(v));
        gradient.scale(errors);
        gradient.scale(cfg.LearningRate);

        // do this before updating weights
        errors = weights.get(i).transpose().multiply(errors);

        // the actual learning part!
        Matrix weightDeltas = gradient.multiply(results.get(i).transpose());
        weights.get(i).add(weightDeltas);
        biases.get(i).add(gradient);
    }
}
```

The input to this function is a single array that holds both input and expected output values. Having both in one array is kind of crappy from an interface design point of view, but you’ll see later that it makes some other code a lot easier to manage. Just hold in your head that the inputs start at index 0, and the expected outputs start at index numInputs().

In brief, we take the output from forwardPassInternal and compute errors at the output layer. We then iterate backwards over each set of synapses / weights, computing the rate of change of each error with respect to its incoming weight, scaling that by our learning rate and the incoming activation, and finally adjusting the weights and bias. All of this crap is where the Matrix operations actually help us stay sane — but remember underneath each of them is just a bunch of nested array traversals.

If you’re still with me, the last important bit is really just scaffolding to help us run it all. I won’t copy all of that code here, but to help you navigate:

- Input is provided to the method with a TrainAndTestConfig object that defines key parameters and points at the training data file. The data file must be in TSV (tab-separated value text) format, with one row per test case. The columns should be inputs followed by expected outputs — all double values. Note you can provide additional columns to the right that will be passed through to output — these are meaningless to the algorithms but can be a useful tracking tool as we’ll see in the Normalization section later.
- The HoldBackPercentage specifies how much of the training set should be excluded from training and used to test performance. If this value is “0”, we train **and** test with the full set. This is useful for simple cases, but is typically considered bad practice because we’re trying to build a generalized model, not just one that can spit back cases it’s seen before. The train and test sets are selected randomly.
- Once we get the train and test sets figured out, starting at line 412 we finally instantiate a network and train for the number of iterations specified in the configuration file. The code tries to cover all of the training cases while keeping the order of presentation random; probably could be better here but it does the job.
- Then at line 428 we run each row from the testSet and produce an array that contains the inputs, outputs, expected outputs, computed errors and (if provided) extra fields. That array gets written to standard output as a new TSV. If cfg.FinalNetworkStatePath is non-null, we dehydrate the network to yet another file, and we’re done.

## Let’s Run It Already

You can pretty easily build and run this code yourself. You’ll need git, maven and a JDK installation, then just run:

```shell
git clone https://github.com/seanno/shutdownhook.git
cd shutdownhook/toolbox && mvn clean package install
cd ../evolve && mvn clean package
cd datasets
./trainAndTest.sh xor
```

Here’s some slightly-abridged output from doing just that:

“XOR” is the classic “hello-world” case for backpropagation. It’s a non-linear function that takes two binary inputs and outputs “1” when exactly one input is 1, otherwise “0” (represented by in0, in1, and exp0 in the left highlighted section above). The test files xor-config.json and xor-data.tsv in the datasets directory configure an XOR test that uses a network with one hidden layer of eight neurons, trains over 100,000 iterations and tests with the full data set.
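The full XOR truth table is tiny, which is part of its charm as a “hello world”. Here it is in the same inputs-then-expected-output layout the training code expects (two inputs, then exp0):

```java
// The four XOR training cases: in0, in1, then the expected output.
// Exactly one input set to 1 yields 1; otherwise 0.
public class XorData {
    public static final double[][] CASES = {
        { 0, 0, 0 },
        { 0, 1, 1 },
        { 1, 0, 1 },
        { 1, 1, 0 }
    };
}
```

Notice there’s no straight line that separates the 1s from the 0s on a 2D plot of these four points — that’s precisely why XOR needs a hidden layer, and why it makes such a satisfying first test of backprop.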

Our little network did pretty good work! The “out0” column shows final predictions from the network, which are very very close to the ideal 0 and 1 values. The right-side highlight gives a good sense of how the network learned over time. It shows the average error at intervals during training: our initial random weights were basically a coin flip (.497), with rapid improvement that flattens out towards the end.

I mentioned earlier that setting “hyperparameters” in these models is as much an art as a science. It’s fun to play with the variables — try changing the learning rate, the number of hidden layers and how many neurons are in each of them, and so on. I’ve found I can burn a lot of hours twiddling this stuff to see what happens.

## Normalization

So XOR is cute, but not super-impressive — let’s look at something more interesting. Way back in 1987 Jeff Schlimmer extracted observable data on a bunch of poisonous and edible mushrooms from the Audubon Society Field Guide. More recently in 2020, some folks refreshed and expanded this dataset and have made the new version available under a Creative Commons license — 61,069 training samples, woo hoo! Many thanks to Wagner, Heider, Hattab and all of the other folks that do the often-underappreciated job of assembling data. Everybody loves sexy algorithms, but they’re useless without real-world inputs to test them on.

OK, go big or go home — let’s see if our network can tell us if mushrooms are poisonous. Before we can do that, though, we need to think about normalization.

The mushroom data set has twenty input columns — some are measurements like “cap-diameter,” others are labels like “gill-spacing” (“c” = close; “w” = crowded; “d” = distant). But our Network model requires that all inputs are floating-point values. Before we can train on the mushroom set, we’ll have to somehow convert each input variable into a double.

The measurements are already doubles, so you might think we can just pass them through. And we can, but there’s a problem. An example — the maximum cap-diameter in the data set is about 62cm, while the max stem-height is just over half of that at 34cm. If we pass through these values unaltered, we will bias the network to treat cap-diameter as more important to classification than stem-height, simply because we’re just pushing more activation into the machine for one vs the other.

To be fair, even if we are naïve about this, over time the network should learn to mute the effect of “louder” inputs. But it would take a ton of training iterations to get there — much better to start from an even playing field. We do this by normalizing all of the inputs to a consistent range like 0 to 1. This is simple to do — just find the minimum and maximum values for each numeric input, and scale inputs at runtime to fit within those bounds. Not too bad.
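That min/max scaling really is just a one-liner. A sketch (Normalize.java does the real work in the repo): using the cap-diameter numbers above, a 31cm cap lands right at 0.5 on a 0-to-62 scale.

```java
// Min-max normalization: scale a raw measurement into 0..1 using the
// minimum and maximum values observed for that column in the data.
public class MinMax {

    public static double normalize(double value, double min, double max) {
        if (max == min) return 0.0; // degenerate column carries no signal
        return (value - min) / (max - min);
    }
}
```

After this, a huge cap and a tall stem both speak at the same “volume”, and the network can work out their relative importance on its own through the weights.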

But what about the label-based inputs like “gill-spacing?” “Close”, “crowded” and “distant” don’t naturally fit any numeric scale — they’re just descriptions of a feature. One option is to search the input for all unique values in each set, and then distribute them evenly between 0 and 1, e.g.:

- Close = c = **0**
- Crowded = w = **0.5**
- Distant = d = **1**

The only problem with this approach is that the numbers are assigned in arbitrary order. Remember back to forks and spoons — we’re trying to find lines and curves that segment the input space into categories, which works best when “similar” inputs are close to each other. This makes intuitive sense — a mushroom with a 1cm cap-diameter is more likely to be related to one measuring 1.2cm vs 40cm.

In the gill-spacing case above, the order shown is actually pretty good. But how would you arrange a “cap-surface” value of fibrous, grooved, scaly or smooth? Sometimes we just have to do the best we can and hope it all works out. And surprisingly, with neural networks it usually does.
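The even-distribution idea is simple enough to sketch in a few lines (a hypothetical helper, not the repo’s Normalize.java): given a column’s unique labels in whatever order you’ve chosen, spread them evenly across 0 to 1.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Map each categorical label to an evenly spaced value in 0..1,
// in the order provided. The order is arbitrary unless you know a
// natural one (as with close / crowded / distant gill spacing).
public class LabelEncoder {

    public static Map<String, Double> encode(List<String> labels) {
        Map<String, Double> map = new HashMap<>();
        int n = labels.size();
        for (int i = 0; i < n; ++i) {
            // a single label maps to 0; otherwise space evenly from 0 to 1
            double v = (n == 1) ? 0.0 : (double) i / (n - 1);
            map.put(labels.get(i), v);
        }
        return map;
    }
}
```

Feed it the gill-spacing labels in close / crowded / distant order and you get exactly the 0 / 0.5 / 1 assignment from the list above.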

Of course, in most real-world data sets you also have to decide how to deal with missing or malformed data in your set. But we’ve covered a ton of ground already; let’s leave that for another day.

Normalize.java has our implementation of all of this as used for the mushroom dataset. It tries to auto-detect column types, assigns values, and outputs the normalized data to a new file. It can also provide a data dictionary to aid in reverse engineering of network performance.

## So are we mushroom experts now?

I ran trainAndTest.sh with the normalized data and a network configured for two hidden layers of forty neurons and a 5% holdback — the results after 2M training iterations are pretty amazing! The graph shows average error by iteration along the way — just as with XOR we start with a coinflip and finish far better; our average error is just 0.009. If we bucket our outputs using .5 as the threshold, we’d be incorrect in only 8 out of 3,054 samples in the test set — a 99.7% success rate — not too shabby.

Of the eight we missed, the good news I suppose is that we failed “the right way” — we called edible mushrooms poisonous rather than vice versa. Unfortunately this wasn’t always true in other runs — I saw cases where the network blew it both ways. So do you eat the mushroom using a classifier that is 99.7% accurate? It turns out that in this case you don’t have to face that hard question — the authors got to 100% success using a more complex learning model called a “random forest” — but risk and philosophical grey areas always lurk around the edges of this stuff, just like with human experts.

Whew — a lot there! But you know, we’re only caught up to about 1990 in the world of artificial intelligence. The last few years especially have seen an explosion in new ideas and improved performance. I am perfectly confident reaching into the back seat to grab a soda while my Tesla does the driving for me — at 75mph on a crowded highway. Seriously, can you believe we live in a world where that happens? I can’t wait to see what comes next.