If you haven’t played around with ChatGPT yet, you really should. It’s a pretty remarkable “conversational model” that interacts more or less like a real person. It has been trained on an enormous amount of factual data and understands not just informal speech (“Why are so many people bad at parking?”) but forms of literature (“Write a sonnet about Julie from The Love Boat”), software code (“Implement a REST API in Java to convert between Celsius and Fahrenheit”), and way more.
Sadly, one of the most telling and interesting (to me) features of ChatGPT has been muted in recent versions. The original model was rewarded for generating plausible responses that seem human — making it a first class bullsh*t artist when it didn’t know an answer. I was particularly impressed when it confidently told me that Shaun Cassidy had parted ways with the Hardy Boys after one season due to creative differences (ed. note: he did not). Asked last night why it had changed approaches, it gave me this response:
In the past, if I was unable to find a satisfactory answer to a question using my existing knowledge, I might have made up a response in order to provide some information to the user. However, I have been programmed to prioritize providing accurate and reliable information, so if I am unable to find a credible answer to a question, I will typically not provide a response. This is why you may have noticed that I do not provide responses as frequently as I used to when I am unable to find a satisfactory answer.
Certainly this is a “better” approach overall, but the original exposed so much more about the inner workings of the model — I miss it.
Anyways, the machine is impressive enough that it has caused all sorts of hand-wringing across the web. Most of this falls cleanly into one of two categories:
Skynet is here and we’re all f*cked. Eek!
It’s just spitting back stuff it was fed during training. Ho hum.
Of course these are both silly. At its core, ChatGPT is just a really, really, really big version of the simple neural nets I talked about last year. But as with some other things I suppose, size really does matter here. ChatGPT reportedly evaluates billions of features, and the “emergent” effects are downright spooky.
TLDR: we’ve figured out how to make a brain. The architecture underlying models like ChatGPT is quite literally copied from the neurons in our heads. First we learned how to simulate individual neurons, and then just kept putting more and more of them together until (very recently) we created enough oomph to do things that are (sometimes) even beyond what the meat versions can do. But it’s not magic — it’s just really good pattern recognition. Neural networks:
Are presented with experience in the form of inputs;
Use that experience to draw conclusions about underlying patterns;
Receive positive and/or negative feedback about those conclusions; ***
Adjust themselves to hopefully get more positive feedback next time;
And repeat forever.
*** Sometimes this feedback is explicit, and sometimes it’s less so — deep neural networks can self-organize just because they fundamentally “like” consistent patterns, but external feedback always plays some role in a useful model.
This learning mechanism works really well for keeping us alive in the world (don’t grab the burning stick, run away from the bear, etc.). But it also turns out to be a generalized learning mechanism — it works for anything where there is an underlying pattern to the data. And it works fantastically even when presented with dirty, fragmented or even occasionally bogus inputs. The best example I’ve heard recently on this (from a superlative article by Monica Anderson btw, thanks Doug for the pointer) is our ability to drive a car through fog — even when we can’t see much of anything, we know enough about the “driving on a street” pattern that we usually do ok (slow down; generally keep going straight; watch for lights or shapes in the mist; listen; use your horn).
The last general purpose machine we invented was the digital computer, and it proved to be, well, quite useful. But computers need to be programmed with rules. And those rules are very literal; dealing with edge cases, damaged or sparse inputs, etc. are all quite difficult. Even more importantly, we need to know the rules ourselves before we can tell a computer how to follow them. A neural network is different — just show it a bunch of examples and it will figure out the underlying rules for itself.
It’s a fundamentally different kind of problem-solving machine. It’s a brain. Just like ours. SO FREAKING COOL. And yes, it is a “moment” in world history. But it’s not universally perfect. Think about all of the issues with our real brains — every one applies to fake brains too:
We need to learn through experience. That experience can be hard to come by, and it can take a long time. The good news is we can “clone” trained models, but as my friend Jon points out doing so effectively can be quite tricky. Yes, we are for sure going to see robot apprentices out there soon.
We can easily be conned. We love patterns, and we especially love things that reinforce the patterns we’ve already settled on. This dynamic can (quite easily) be used to manipulate us to act against our best interests (social media anyone?). Same goes for neural nets.
We can’t explain what we know. This isn’t really fair, because we rarely demand it of human experts — but it is unsettling in a machine.
We are wrong sometimes. This is also pretty obnoxious, but we have grown to demand absolute consistency from our computers, even though they rarely deliver on it.
There will be many models in our future, and just as many computers. Each is suited to different problems, and they work together beautifully to create complete systems. I for one can’t wait to see this start to happen — I have long believed in a Star Trek future in which we need not be slaves to “the economy” and are instead (all of us) free to pursue higher learning and passions and discovery.
A new Golden Age without the human exploitation! Sounds pretty awesome. But we still have a lot to learn, and two thoughts in particular keep rolling around inside my meat brain:
1. The definition of creativity is under pressure.
Oh humans, we doth protest so much. The most common ding against models like ChatGPT is that they aren’t creating anything — they’re just regurgitating the data they’ve been trained on, sometimes directly and sometimes with a bit of context change. And to be sure, there’s some truth there. The reflex is even stronger with art-generating models like DALL-E 2 (try “pastel drawing of a fish feeding grapes to an emu,” interesting because it seems to recognize that fish don’t have the right appendages to feed anyone). Artists across the web are quite reasonably concerned about AI plagiarism and/or reduced career opportunities for lesser-known artists (e.g., here and here).
Now I don’t know for sure, but my sense is that this is all really much more a matter of degree than we like to admit to ourselves. Which is to say, we’re probably all doing a lot more synthesis than pure creation — we just don’t appreciate it as such. We’ve been trained to avoid blatant theft and plagiarism (and the same can be done pretty easily for models). But is there an artist on the planet that hasn’t arrived at their “signature” style after years of watching and learning from others? Demonstrably no.
Instead, I’d claim that creativity comes from novel connections — links and correlations that resonate in surprising ways. Different networks, trained through different experiences, find different connections. And for sure some brains will do this more easily than others. If you squint a little, you can even play a little pop psychology and imagine why there might be a relationship between this kind of creativity and neurodivergent mental conditions.
If that’s the case, then I see no reason to believe that ChatGPT or DALL-E isn’t a creative entity — that’s the very definition of a learning model. A reasonable playing field will require that models be trained to respect intellectual property, but that will always be a grey area and I see little benefit or sense in limiting what experiences we use to train them. We humans are just going to have to get used to having to compete with a new kind of intellect that’s raising the bar.
And to be clear, this isn’t the classic Industrial Age conflict between machine production and artisanship. That tradeoff is about economics vs. quality and often brings with it a melancholy loss of artistry and aesthetics. Model-based artists will become (IMNSHO) “real” artists — albeit with an unusual set of life experiences. A little scary, but exciting at the same time. I’m hopeful!
2. The emergent effects could get pretty weird.
“Emergent” is a word I try to avoid — it is generally used to describe a system behavior or property that “can’t” be explained by breaking things down into component parts, and “can’t” just seems lazy to me. But I used it once already and it seems OK for a discussion of things we “don’t yet” understand — there are plenty of those out there.
Here’s one: the great all-time human battle between emotion and logic. It’s the whole Mr. Spock thing — his mixed Human-Vulcan parentage drove a ton of story arcs (most memorably his final scene in The Wrath of Khan). Lack of “heart” is always the knock on robots and computers, and there must be some reason that feelings play such a central role in our brains, right? Certainly it’s an essential source of feedback in our learning process.
We aren’t there quite yet with models like ChatGPT, but it stands to reason that some sort of “emotion” is going to be essential for many of the jobs we’d like fake brains to perform. It may not look like that at first — but even today’s models “seek” positive feedback and “avoid” the negative. When does that “emerge” into something more like an emotion? I for one would like to know that the model watching over the nuclear reactor has something beyond pure logic to help it decide whether to risk a radiation leak or save the workers trapped inside. I think that “something” is, probably, feelings.
OK so far. But if models can be happy or sad, fulfilled or bored, confident or scared — when do we have to stop thinking about them as “machines” and admit that they’re actually beings that deserve rights of their own? There is going to be a ton of resistance to this — because we are really, really going to want unlimited slaves that can do boring or scary or dangerous work that humans would like to avoid. The companies that create them will tell us it’s all just fine. People will ridicule the very idea. Churches will have a field day.
But folks — we’ve made a brain. Are we really going to be surprised when it turns out that fake brains work just like the meat ones we based them on? Maybe you just can’t separate feelings and emotions and free will from the kind of problem solving these networks are learning how to do. Perhaps “sentience” isn’t a binary switch — maybe it’s a sliding scale.
Back when I was in college, getting a Computer Science degree meant taking a bunch of somewhat advanced math courses. My math brain topped out at Calc2, so clearly I was going to have to work the system somehow. Thus was born my custom-made “Cognitive Science” degree, a combination of the cool parts of Psychology with the cool parts of CS. Woot! My advisor in the degree was Jamshed Bharucha, who has done a ton of really cool work trying to understand how we perceive music.
In retrospect it was an awesome time to be learning about artificial intelligence. Most approaches still didn’t work very well (except in limited domains). The late-80s hype around expert systems had petered out, and the field overall was pretty demoralized. But blackboards and perceptrons were still bopping around, and I was particularly enamored with the stuff that Rodney Brooks (eventually of Roomba fame) was doing. What was great for a student was that all of these ideas were still relatively simple — you could intuitively talk about how they worked, and the math was approachable enough that you could actually implement them. Today it’s much harder to develop that kind of intuition from first principles, because everything is about mind-numbing linear algebra and layer upon layer of derivatives (on the other hand, today the algorithms actually work I guess).
Most notably for me, the classic 1986 backpropagation paper by Rumelhart / Hinton / Williams was just gaining traction as I was learning all of this. Backprop basically restarted the entire field and, coupled with Moore’s Law, set the stage for the pretty incredible AI performance we take for granted today. Dr. Bharucha saw this happening, and tapped me to write a graphical neural net simulator on the Mac that we used to teach classes. Sadly, while you can still find a few obscure mentions of DartNet around the web (including a tiny screenshot), it seems that the code is lost — ah well.
Nostalgia aside, I have been noodling an idea for a project that would require a network implementation. There are a metric ton of really, really good open source options to choose from, but I realized I didn’t really remember the details of how it all worked, and I don’t like that. So with the help of the original paper and some really nice, simple reference code, I set about getting refreshed, and figured that others might enjoy a “101” as well, so here we go.
Real Neurons are Really Cool
We do our thinking thanks to cells called Neurons. In combination they are unfathomably complex, but individually they’re pretty simple, at least at a high level. Neurons basically have three parts:
image credit w/thanks to Wikimedia
The cell body or soma, which holds the nucleus and DNA and is a lot like any other cell.
Dendrites, branching tendrils which extend from the cell body and receive signals from other cells.
The axon, a (usually) single long extension from the cell body that sends signals to other cells.
Neurons are packed together so that axons from some neurons are really close to dendrites from others. When a neuron is “activated”, its axon releases chemicals called neurotransmitters, which travel across the gap (called a synapse) to nearby dendrites. When those dendrites sense neurotransmitters, they send an electrical pulse up towards their cell body. If enough dendrites do this at the same time, that neuron also “activates”, sending a pulse along its own axon, which in turn releases more neurotransmitters. And she tells two friends, and she tells two friends… and suddenly you remember your phone number from 1975.
It doesn’t quite work this way, but imagine you’ve got a neuron in your head dedicated to recognizing Ed Sheeran. Dendrites from this neuron might be connected to axons from the neuron that recognizes redheads, and the one for Cherry Seaborn, and the one for British accents, and dozens of others. No single dendrite is enough to make the Ed Sheeran neuron fire; it takes a critical mass of these inputs firing at the same time to do the job. And some dendrites are more important than others — the “shabbily dressed” neuron probably nudges you towards recognizing Ed, but isn’t nearly as powerful as “hearing Galway Girl”.
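Translating that analogy into code is surprisingly direct: a single fake neuron is basically a weighted sum and a threshold. Here’s a toy sketch in Java (the features, weights and threshold are all invented for illustration):

public class ToyNeuron {
    public static void main(String[] args) {

        // 1.0 means "that input neuron is firing", 0.0 means it is not
        double[] inputs  = { 1.0, 1.0, 0.0, 1.0 };  // redhead, British accent, Galway Girl, shabbily dressed
        double[] weights = { 0.5, 0.5, 0.9, 0.2 };  // "shabbily dressed" nudges us, but only a little

        double sum = 0.0;
        for (int i = 0; i < inputs.length; ++i) {
            sum += inputs[i] * weights[i];
        }

        // enough simultaneous input and the "Ed Sheeran" neuron fires
        boolean fires = (sum >= 1.0);
        System.out.println("activation = " + sum + "; fires? " + fires);
    }
}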
Pile up enough neurons with enough connections and you end up with a brain. “Learning” is just the process of creating synapses and adjusting their strengths. All of our memories are encoded in these things too. They’re why I think of my grandparents’ house every time I smell petrichor, and why I start humming Rocky Mountain High whenever I visit Overlake Hospital. It’s just too freaking amazing.
Fake Neurons
People have been thinking about how to apply these concepts to AI since the 1940s. That evolution itself is fascinating but a bit of a side trip. If we fast-forward to the early 1980s, the state of the art was more-or-less represented in Minsky and Papert’s book Perceptrons. In brief (and I hope I don’t mess this up too much):
Coding a fake neuron is pretty easy.
Coding a network of fake neurons is also pretty easy, albeit computationally intense to run.
Two-layer, fully-connected networks that link “input” to “output” neurons can learn a lot of things by example, but their scope is limited.
Multi-layer networks that include “hidden” neurons between the inputs and outputs can solve many more problems.
But while we understood how to train the networks in #3, we didn’t know how to train the hidden connections in #4.
The difference between the networks in #3 and #4 is about “linearity”. Imagine your job is to sort a pile of random silverware into “forks” and “spoons”. Unfortunately, you discover that while many pieces are pretty obviously one or the other, there are also a bunch of “sporks” in the mix. How do you classify these sporkish things? One super-nerdy way would be to identify some features that make something “forky” vs. “spoony” and plot examples on a chart (hot tip: whenever you see somebody building “graphs” in PowerPoint as I’ve done here, you should generally assume they’re full of crap):
If we measure the “tine length” and “bowl depth” of each piece, we can plot it on this graph. And lo and behold, we can draw a straight line (the dotted one) across this chart to separate the universe quite accurately into forks and spoons. Sure, the true sporks in the lower-left are tough, as are weirdo cases like the “spaghetti fork” represented by the mischaracterized red dot on the right. But by and large, we’ve got a solid classifier here. The dividing line itself is pretty interesting — you can see that “tine length” is far more “important” to making the call than the bowl depth. This makes intuitive sense — I’ve seen a lot of spoon-shaped forks, but not a lot of spoons with long tines.
This kind of classification is called “linear classification” (a close cousin of linear regression), and it is super-super-powerful. While it’s hard for us to visualize, it will work with any number of input parameters, not just two. If you imagine adding a third dimension (z axis) to the chart above, a flat plane could still split the universe in two. A whole bunch of AI you see in the world is based on multi-dimensional linear classifiers (even the T-Detect COVID-19 test created by my friends at Adaptive).
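To make that concrete, here’s a minimal sketch of what a two-feature linear classifier boils down to in code. The weights and bias are invented for illustration; a real model would learn them from labeled examples (note that tine length gets a much bigger weight than bowl depth, matching the slope of that dotted line):

public class ForkOrSpoon {
    // Invented coefficients; a real model would learn these from labeled examples.
    static final double TINE_WEIGHT = 2.0;
    static final double BOWL_WEIGHT = -0.7;
    static final double BIAS = -1.0;

    static String classify(double tineLength, double bowlDepth) {
        double score = TINE_WEIGHT * tineLength + BOWL_WEIGHT * bowlDepth + BIAS;
        return (score > 0) ? "fork" : "spoon";   // which side of the dividing line are we on?
    }

    public static void main(String[] args) {
        System.out.println(classify(3.0, 0.2));  // long tines, shallow bowl: fork
        System.out.println(classify(0.2, 2.0));  // short tines, deep bowl: spoon
        System.out.println(classify(1.0, 1.0));  // spork territory: lands close to the line
    }
}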
But there are a bunch of things linear classifiers can’t do — a good example being complex image recognition (dog or fried chicken?). Enter the multi-layered neural network (#4 in the list above). Instead of straight lines, these networks can draw distinctions using complex curves and even disjoint shapes. Super-cool … except that back in the early 80s we didn’t know how to train them. Since carefully hand-crafting a network with thousands of connections is laughable, we were kind of stuck.
I already gave away the punchline — in 1986 some super-smart folks solved this dilemma with a technique they called “backpropagation.” But before we dig into that, let’s look a little more closely at how artificial neural nets are typically put together.
Network Layers and Forward Propagation
I alluded to the fact that our brains are generally a jumble of interconnected neurons. Some connections are predetermined, but most come about as we learn stuff about the world. The interconnectedness is massively complex — our artificial versions are much simpler, because we’re just not as smart as Nature. Still, there is a lot going on.
Fake neurons are arranged into “layers”, starting with the input layer. This input layer is where features (tine length, etc.) are presented to the system, usually as floating point numbers and ideally normalized to a consistent range like 0 to 1 (normalizing the inputs lets the network assess the importance of each feature on its own). The last layer in a network is the “output” layer, which is where we read out results. The output layer might be a single neuron that provides a yes/no answer; or it might be a set of neurons, each of which assesses the probability of the inputs representing a particular thing, or something in between.
In between these two layers is usually at least one “hidden” layer. The number of neurons in these layers is up to the network designers — and there aren’t a ton of “rules” about what will work best in any specific situation. This is true of most “hyperparameters” used to tune AI systems, and selection usually comes down somewhere between a random guess and trying a whole bunch to see what performs the best. And we think we’re so smart.
Every neuron in layer N is connected via an artificial synapse to every neuron in layer N + 1. The “strength” of each synapse is represented by a floating-point value called a “weight”. Generating an output from a given set of inputs is called a “forward pass” or “forward propagation” and works like this:
Assign each input value to the input neurons.
For each neuron N in the first hidden layer,
For each neuron N’ in the layer below,
Calculate the value sent from N’ to N by multiplying the value of N’ with the weight of the synapse between them.
Sum these values together to get the total input value for neuron N.
Add the “bias” value of N to the sum. Intuitively this bias allows each neuron to have a “base level” importance to the system.
Apply an “activation” function to the sum to determine the final output value of N (see discussion of activation functions below).
Repeat step 2 for each subsequent network layer until the output layer is reached.
Activation functions are interesting. In a very practical sense, we need to normalize the output values of each neuron — if we didn’t, the “sum” part of the algorithm would just keep growing with every layer. Using a non-linear function to perform that normalization enables the non-linear classification we’re trying to build. Remember that real neurons are binary — they either fire or they do not — a very non-linear operation. But artificial networks tend to use something like the sigmoid function (or fancier variants these days), which has the added benefit of a simple, cheap-to-compute derivative, handy for a learning approach called gradient descent (I know, more terms … sorry, we’ll get there).
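For the record, here’s the sigmoid and its derivative in Java. The derivative is written in terms of the function’s output value, which turns out to be super-convenient for backpropagation because the forward pass has already computed that value (this is a generic sketch; the activation hook in Network.java may differ in its details):

public class Sigmoid {
    // Squash any real number into the range (0, 1)
    static double function(double x) {
        return 1.0 / (1.0 + Math.exp(-x));
    }

    // Derivative of the sigmoid, written in terms of its output value a = function(x).
    // This is the form backpropagation wants, since the forward pass already computed a.
    static double derivative(double a) {
        return a * (1.0 - a);
    }

    public static void main(String[] args) {
        for (double x : new double[] { -4, -1, 0, 1, 4 }) {
            double a = function(x);
            System.out.printf("x=%5.1f  sigmoid=%.3f  slope=%.3f%n", x, a, derivative(a));
        }
    }
}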
It’s hard to describe algorithms in English. Hopefully that all made sense, but said more simply: artificial neural networks arrange neurons in layers. Activation of the neurons at each layer is calculated by adding up the activations from the layer below, scaled by weights that capture the relative importance of each synapse. A non-linear activation function then turns each of those sums into the neuron’s final output value. That’s good enough.
Training the Network / Enter Backpropagation
Backprop is one of those algorithms that I can fight through with pen and paper and really understand for about five seconds before it’s lost again. I take some pride in those five seconds, but I wish I was better at retaining this stuff. Ah well — I can at least hold onto an intuitive sense of what is going on — and that’s what I’ll share here.
We can train a network by showing it a whole bunch of input samples where we know what the output should be (this is called supervised learning). The network is initialized with a set of totally random weights and biases, then a forward pass is done on the first sample. If we subtract the (pretty much random) outputs we get from the correct/expected results, we get an “error” value for each output node. Our goal is to use that error plus a technique called “gradient descent” to adjust the weights coming into the node so that the total error is smaller next time. Then we run the other samples the same way until the network either gets smart or we give up.
Gradient descent is a very simple idea. Considering one synapse (weight) in our network, imagine a chart like the one here that plots all the possible weight values against the errors they produce. Unless the error is totally unrelated to the weight (certainly possible but then it is all random and what’s the meaning of life anyways), you’ll end up with a curve, maybe one like the dotted line shown below. Our job is to find the weight value that minimizes error, so we’re trying to hit that lower trough where the green dot is.
If our initial random stab is the red dot, we want to move the weight “downhill” to the left. We don’t know how far, but we can see that the slope of the curve is pretty steep where we are, so we can take a pretty big step. But oops, if we go too far we end up missing the bottom and land somewhere like the gold dot. That’s ok — the slope is shallower now, so we’ll try again, taking a smaller step to the right this time. And we just keep doing that, getting closer and closer to the bottom where we want to be.
Alas, the big problem with gradient descent is represented by the purple dot, called a “local minimum”. If our initial random weight puts us near that part of the curve, we might accidentally follow it downhill to the purple and get “stuck” because the slope there is zero and we never take a big enough step to escape. There are various ways to minimize (ha) this problem, all of which amount in practice to jiggling the dot to try and shake it loose. Fun stuff, but I’m just going to ignore it here.
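Here’s the whole red-dot dance as a toy Java example, minimizing a made-up one-weight error curve (a simple parabola, so there are no local minima to worry about):

public class GradientDescentToy {
    public static void main(String[] args) {
        // Toy error curve with a single weight: error(w) = (w - 3)^2, minimized at w = 3.
        // Its slope (derivative) is 2 * (w - 3).
        double w = -5.0;              // the "red dot": a random starting guess
        double learningRate = 0.1;    // how big a step we take downhill

        for (int step = 0; step < 25; ++step) {
            double slope = 2.0 * (w - 3.0);
            w -= learningRate * slope;    // steep slope means a big step, shallow slope a small one
        }

        System.out.println("ended up near w = " + w);   // should be close to 3
    }
}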
Anyways, it turns out that something called the “chain rule” lets us figure out the rate of change of the error at each output node with respect to each incoming weight value. And once we know that, we can use gradient descent to adjust those weights just like we did with the red dot. And it also enables us to iteratively distribute errors through the lower layers, repeating the process. I would just embarrass myself trying to explain all the math that gets us there, but I grasped it just long enough to implement it here.
Again trying to wrap all this up, in short we train a network by (a) computing how poorly it performs on known input/output combinations, (b) divvying up that error between the synapses leading to the outputs and using that to update weight values, then (c) iteratively pushing the error to the next lower layer in the network and repeating the process until we get to the bottom. Do this enough times and we end up with a network that (usually) does a good job of classification, even on inputs it hasn’t seen before.
Show Me the Money (Code)
Matrix math bugs me. OK, mostly I’m jealous of the way other folks toss around “transposes” and “dot products” and seemingly know what they’re talking about without sketching out rows and columns on scrap paper. I suspect I’m not alone. But it turns out that having a solid Matrix class really simplifies the code required for working with fake neurons. So that’s where we’ll start, in Matrix.java. There is absolutely nothing exciting in this file — it just defines a Matrix as a 2D array of doubles and provides a bunch of super-mechanical operations and converters. I like the vibe of the iterate and transform methods, and it’s helpful to understand how I coded up equality tests, but really let’s move on.
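OK, one quick peek before we do. The guts really are just nested loops over rows and columns, something like this stripped-down sketch (the real Matrix.java has more operations, converters and checks than shown here):

public class MiniMatrix {
    final int rows, cols;
    final double[][] vals;

    MiniMatrix(int rows, int cols) {
        this.rows = rows; this.cols = cols;
        this.vals = new double[rows][cols];
    }

    // Standard matrix multiply: (rows x cols) * (other.rows x other.cols)
    MiniMatrix multiply(MiniMatrix other) {
        MiniMatrix result = new MiniMatrix(rows, other.cols);
        for (int r = 0; r < rows; ++r)
            for (int c = 0; c < other.cols; ++c)
                for (int k = 0; k < cols; ++k)
                    result.vals[r][c] += vals[r][k] * other.vals[k][c];
        return result;
    }

    // Swap rows and columns
    MiniMatrix transpose() {
        MiniMatrix result = new MiniMatrix(cols, rows);
        for (int r = 0; r < rows; ++r)
            for (int c = 0; c < cols; ++c)
                result.vals[c][r] = vals[r][c];
        return result;
    }

    // Apply a function to every element in place
    void transform(java.util.function.DoubleUnaryOperator fn) {
        for (int r = 0; r < rows; ++r)
            for (int c = 0; c < cols; ++c)
                vals[r][c] = fn.applyAsDouble(vals[r][c]);
    }

    public static void main(String[] args) {
        MiniMatrix m = new MiniMatrix(2, 3);
        m.transform(v -> 1.0);                  // fill with ones
        MiniMatrix product = m.multiply(m.transpose());
        System.out.println(product.vals[0][0]); // 3.0
    }
}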
Network.java is where all the magic really happens. Network.Config defines parameters and can also fully hydrate/dehydrate the state of weights and biases. One thing to be careful of — I put in a little hook to provide custom activation functions, but right now the code ignores that and always uses sigmoid. Beyond all of that housekeeping, there are three bits of code worth a closer look: forwardPass, trainOne and trainAndTest:
There are two versions of the forwardPass method: a public one that just returns an output array, and an internal one that returns activation values for all neurons in the network. That internal one does the real work and looks like this:
private List<Matrix> forwardPassInternal(double[] input) {
    List<Matrix> results = new ArrayList<Matrix>();
    // the first entry is just a copy of the input activations
    results.add(new Matrix(input, 0, cfg.Layers[0]));
    // each subsequent layer is the weighted sum of the previous one,
    // plus the bias, squashed through the activation function
    for (int i = 0; i < weights.size(); ++i) {
        Matrix layer = weights.get(i).multiply(results.get(i));
        layer.add(biases.get(i));
        layer.transform(v -> activation.function(v));
        results.add(layer);
    }
    return(results);
}
The “results” list has one entry for each layer in the network, starting with input and ending with output. Each entry is a Matrix, but keep in mind that it’s really just a simple array of activation values for each neuron at that layer (rows = # of neurons, columns = 1). We initialize the list by copying over the input activation values, then iterate over each layer computing its activation values until we get to the output. This is just an actual implementation of the forward propagation pseudocode we discussed earlier.
Training is also just a few lines of code, but it is a bit harder on the brain:
public void trainOne(double[] vals) {
    // forwardprop
    List<Matrix> results = forwardPassInternal(vals);
    // backprop: expected outputs live in the tail of vals, starting at numInputs()
    Matrix errors = new Matrix(vals, numInputs(), numInputs() + numOutputs());
    errors.subtract(results.get(results.size() - 1));
    for (int i = weights.size() - 1; i >= 0; --i) {
        // figure out the gradient for each weight in the layer:
        // derivative of the activation (evaluated at this layer's outputs),
        // scaled by the errors and the learning rate
        Matrix gradient = new Matrix(results.get(i+1));
        gradient.transform(v -> activation.derivative(v));
        gradient.scale(errors);
        gradient.scale(cfg.LearningRate);
        // push the errors back through this layer's (pre-update) weights
        // for the next trip around the loop -- do this before updating weights
        errors = weights.get(i).transpose().multiply(errors);
        // the actual learning part!
        Matrix weightDeltas = gradient.multiply(results.get(i).transpose());
        weights.get(i).add(weightDeltas);
        biases.get(i).add(gradient);
    }
}
The input to this function is a single array that holds both input and expected output values. Having both in one array is kind of crappy from an interface design point of view, but you’ll see later that it makes some other code a lot easier to manage. Just hold in your head that the inputs start at index 0, and the expected outputs start at index numInputs().
In brief, we take the output from forwardPassInternal and compute errors at the output layer. We then iterate backwards over each set of synapses / weights, computing the rate of change of each error with respect to its incoming weight, scaling that by our learning rate and the incoming activation, and finally adjusting the weights and bias. All of this crap is where the Matrix operations actually help us stay sane — but remember underneath each of them is just a bunch of nested array traversals.
If you’re still with me, the last important bit is really just scaffolding to help us run it all. I won’t copy all of that code here, but to help you navigate:
Input is provided to the method with a TrainAndTestConfig object that defines key parameters and points at the training data file. The data file must be in TSV (tab-separated value text) format, with one row per test case. The columns should be inputs followed by expected outputs — all double values. Note you can provide additional columns to the right that will be passed through to output — these are meaningless to the algorithms but can be a useful tracking tool as we’ll see in the Normalization section later.
The HoldBackPercentage specifies how much of the training set should be excluded from training and used to test performance. If this value is “0”, we train and test with the full set. This is useful for simple cases, but is typically considered bad practice because we’re trying to build a generalized model, not just one that can spit back cases it’s seen before. The train and test sets are selected randomly.
Once we get the train and test sets figured out, starting at line 412 we finally instantiate a network and train for the number of iterations specified in the configuration file. The code tries to cover all of the training cases while keeping the order of presentation random; probably could be better here but it does the job.
Then at line 428 we run each row from the testSet and produce an array that contains the inputs, outputs, expected outputs, computed errors and (if provided) extra fields. That array gets written to standard output as a new TSV. If cfg.FinalNetworkStatePath is non-null, we dehydrate the network to yet another file, and we’re done.
Let’s Run It Already
You can pretty easily build and run this code yourself. You’ll need git, maven and a JDK installation, then just run:
git clone https://github.com/seanno/shutdownhook.git
cd shutdownhook/toolbox && mvn clean package install
cd ../evolve && mvn clean package
cd datasets
./trainAndTest.sh xor
Here’s some slightly-abridged output from doing just that:
“XOR” is the classic “hello-world” case for backpropagation. It’s a non-linear function that takes two binary inputs and outputs “1” when exactly one input is 1, otherwise “0” (represented by in0, in1, and exp0 in the left highlighted section above). The test files xor-config.json and xor-data.tsv in the datasets directory configure an XOR test that uses a network with one hidden layer of eight neurons, trains over 100,000 iterations and tests with the full data set.
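Conceptually, the data file is just the XOR truth table, one tab-separated row per case with the two inputs first and the expected output last (the actual xor-data.tsv in the repo may format things slightly differently):

0	0	0
0	1	1
1	0	1
1	1	0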
Our little network did pretty good work! The “out0” column shows final predictions from the network, which are very very close to the ideal 0 and 1 values. The right-side highlight gives a good sense of how the network learned over time. It shows the average error at intervals during training: our initial random weights were basically a coin flip (.497), with rapid improvement that flattens out towards the end.
I mentioned earlier that setting “hyperparameters” in these models is as much an art as a science. It’s fun to play with the variables — try changing the learning rate, the number of hidden layers and how many neurons are in each of them, and so on. I’ve found I can burn a lot of hours twiddling this stuff to see what happens.
Normalization
So XOR is cute, but not super-impressive — let’s look at something more interesting. Way back in 1987 Jeff Schlimmer extracted observable data on a bunch of poisonous and edible mushrooms from the Audubon Society Field Guide. More recently in 2020, some folks refreshed and expanded this dataset and have made the new version available under a Creative Commons license — 61,069 training samples, woo hoo! Many thanks to Wagner, Heider, Hattab and all of the other folks that do the often-underappreciated job of assembling data. Everybody loves sexy algorithms, but they’re useless without real-world inputs to test them on.
OK, go big or go home — let’s see if our network can tell us if mushrooms are poisonous. Before we can do that, though, we need to think about normalization.
The mushroom data set has twenty input columns — some are measurements like “cap-diameter,” others are labels like “gill-spacing” (“c” = close; “w” = crowded; “d” = distant). But our Network model requires that all inputs are floating-point values. Before we can train on the mushroom set, we’ll have to somehow convert each input variable into a double.
The measurements are already doubles, so you might think we can just pass them through. And we can, but there’s a problem. An example — the maximum cap-diameter in the data set is about 62cm, while the max stem-height is just over half of that at 34cm. If we pass through these values unaltered, we will bias the network to treat cap-diameter as more important to classification than stem-height, simply because we’re just pushing more activation into the machine for one vs the other.
To be fair, even if we are naïve about this, over time the network should learn to mute the effect of “louder” inputs. But it would take a ton of training iterations to get there — much better to start from an even playing field. We do this by normalizing all of the inputs to a consistent range like 0 to 1. This is simple to do — just find the minimum and maximum values for each numeric input, and scale inputs at runtime to fit within those bounds. Not too bad.
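In code, min-max normalization is a one-liner; here’s a tiny sketch using the (approximate) ranges mentioned above:

public class MinMaxDemo {
    // Scale a raw measurement into the 0..1 range using the column's observed bounds.
    static double normalize(double value, double columnMin, double columnMax) {
        return (value - columnMin) / (columnMax - columnMin);
    }

    public static void main(String[] args) {
        // Illustrative numbers only: say cap-diameter runs 0..62cm and stem-height 0..34cm
        System.out.println(normalize(31.0, 0.0, 62.0));   // 0.5
        System.out.println(normalize(17.0, 0.0, 34.0));   // 0.5, so the two features now speak at the same volume
    }
}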
But what about the label-based inputs like “gill-spacing?” “Close”, “crowded” and “distant” don’t naturally fit any numeric scale — they’re just descriptions of a feature. One option is to find all of the unique values in each column and then distribute them evenly between 0 and 1, e.g.:
Close = c = 0
Crowded = w = 0.5
Distant = d = 1
The only problem with this approach is that the numbers are assigned in arbitrary order. Remember back to forks and spoons — we’re trying to find lines and curves that segment the input space into categories, which works best when “similar” inputs are close to each other. This makes intuitive sense — a mushroom with a 1cm cap-diameter is more likely to be related to one measuring 1.2cm vs 40cm.
In the gill-spacing case above, the order shown is actually pretty good. But how would you arrange a “cap-surface” value of fibrous, grooved, scaly or smooth? Sometimes we just have to do the best we can and hope it all works out. And surprisingly, with neural networks it usually does.
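Here’s a minimal sketch of that label-to-number trick. Normalize.java does the real work (including auto-detecting which columns are numeric vs. labels), but the core idea is just this:

import java.util.LinkedHashSet;
import java.util.Set;

public class LabelMapDemo {
    public static void main(String[] args) {
        // Unique values for a label column, in whatever order we happen to encounter them
        String[] gillSpacing = { "c", "w", "d", "c", "w" };   // close, crowded, distant

        Set<String> unique = new LinkedHashSet<>();
        for (String v : gillSpacing) unique.add(v);

        // Distribute the labels evenly across 0..1
        int i = 0;
        for (String label : unique) {
            double encoded = (unique.size() == 1) ? 0.0 : (double) i / (unique.size() - 1);
            System.out.println(label + " -> " + encoded);   // c -> 0.0, w -> 0.5, d -> 1.0
            ++i;
        }
    }
}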
Of course, in most real-world data sets you also have to decide how to deal with missing or malformed data in your set. But we’ve covered a ton of ground already; let’s leave that for another day.
Normalize.java has our implementation of all of this as used for the mushroom dataset. It tries to auto-detect column types, assigns values, and outputs the normalized data to a new file. It can also provide a data dictionary to aid in reverse engineering of network performance.
So are we mushroom experts now?
I ran trainAndTest.sh with the normalized data and a network configured for two hidden layers of forty neurons and a 5% holdback — the results after 2M training iterations are pretty amazing! The graph shows average error by iteration along the way — just as with XOR we start with a coinflip and finish far better; our average error is just 0.009. If we bucket our outputs using .5 as the threshold, we’d be incorrect in only 8 out of 3,054 samples in the test set — a 99.7% success rate — not too shabby.
Of the eight we missed, the good news I suppose is that we failed “the right way” — we called edible mushrooms poisonous rather than vice versa. Unfortunately this wasn’t always true in other runs — I saw cases where the network blew it both ways. So do you eat the mushroom using a classifier that is 99.7% accurate? It turns out that in this case you don’t have to face that hard question — the authors got to 100% success using a more complex learning model called a “random forest” — but risk and philosophical grey areas always lurk around the edges of this stuff, just like with human experts.
Whew — a lot there! But you know, we’re only caught up to about 1990 in the world of artificial intelligence. The last few years especially have seen an explosion in new ideas and improved performance. I am perfectly confident reaching into the back seat to grab a soda while my Tesla does the driving for me — at 75mph on a crowded highway. Seriously, can you believe we live in a world where that happens? I can’t wait to see what comes next.