This is the second in a two-part series; part one is here.
This has been a tough piece to finish — not because of the subject itself, which is super-fun, but because I keep getting distracted by unexpected behavior I want to understand. At nearly every turn, there’s something neat to see in this little world of evolving 2D cellular automata we’ve created. So bear with me as I try to boil down a lot of wandering into a few key points. There will be pictures!
Vertical Stripes and Hyperparameters
At the end of part one we taught our organisms to “black out” the grid (a simple task that could be optimally achieved with a single rule) and they did great. For the next few rounds I’ve made the goal a bit more difficult: turn the grid into a set of vertical one-pixel stripes, alternating black and white.
Our first fitness calculation for this is pretty straightforward: the first stripe can be either black or white, and the total number of correct pixels is divided by total pixels to get a fraction. Using a Von Neumann neighborhood and conservative parameters, the outcome was … horrible. Over three runs (details here, here and here):

Green is the best performance, red the worst and blue the average. A few pops but results regressed to 0.5 on every run — which is effectively a random grid (one out of every two pixels correct).
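For the curious, here is a minimal Python sketch of that pixel-counting fitness. It is not the project's actual code: the function name `vstripes_fitness` and the 0 = white, 1 = black encoding are made up for illustration, and I'm assuming "the first stripe can be either black or white" means the expected pattern is anchored to whatever color sits in the top-left pixel.

```python
def vstripes_fitness(grid):
    """Fraction of pixels that match one-pixel vertical stripes.

    Assumption: the expected color of column x is the top-left pixel's
    color for even x and the opposite color for odd x.
    """
    height, width = len(grid), len(grid[0])
    first = grid[0][0]
    correct = 0
    for row in grid:
        for x, pixel in enumerate(row):
            expected = first ^ (x & 1)  # flip color on every odd column
            correct += (pixel == expected)
    return correct / (height * width)


stripes = [[0, 1, 0, 1]] * 4   # perfect stripes, white first
solid = [[1, 1, 1, 1]] * 4     # all black

print(vstripes_fitness(stripes))  # every pixel correct
print(vstripes_fitness(solid))    # every other column correct
```

Note that a solid block gets half credit under this measure, the same as a random grid.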
My first thought was, perhaps we’re just not getting enough variation. So let’s start tweaking the hyperparameters, i.e., the values that drive evolution. Mutation rate is an easy one, so we’ll increase that from 0-5% to 5-10% on each reproduction. Three more runs (here, here and here):

No love. Our changes did make a difference: there are more “pops” as we find potentially good solutions, but they don’t last and we regress again back to 0.5. But why? My next theory was that perhaps good solutions were being lost because they weren’t consistent. That is, a “random” rule is likely to get around 0.5 every time. But a rule that produces perfect stripes most of the time may perform terribly once in a while. This corresponds nicely with real life: we don’t (usually) kick a decades-long good performer to the curb for a single failure.
To account for this I added a hyperparameter LastFitnessWeight, which attributes some fraction of fitness from the last iteration to the current one — the idea being that a success yesterday will lift your score today even if it’s an off day. Setting this to 25% gave these results (here, here and here):

Sad trombone noise. This is getting annoying — maybe the middle one showed some increased consistency, but really that’s just wishful thinking.
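For reference, the LastFitnessWeight idea can be sketched as a simple weighted blend. This is an assumption about the arithmetic, not a copy of the real implementation; the function name `blended_fitness` is hypothetical, and the real code may mix in history differently.

```python
def blended_fitness(current, last, last_fitness_weight=0.25):
    """Blend this cycle's raw score with the previous cycle's score.

    Assumption: a plain weighted average. A newborn organism has no
    history (last is None), so it keeps its raw score.
    """
    if last is None:
        return current
    return (1.0 - last_fitness_weight) * current + last_fitness_weight * last


# A perfect score yesterday softens a bad day today, but only so much.
print(blended_fitness(0.125, 1.0))  # 0.34375
```

With a 25% weight, one terrible cycle still drags a formerly perfect organism below the 0.5 that a blank grid earns every time.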
What we’re seeing here is one of the first rules (and a bit of a dirty secret) of digital evolution, and machine learning in general — hyperparameters don’t matter nearly as much as it seems like they should. With the right features and feedback you almost can’t help but succeed — and without them you’re usually hosed.
Fitness matters
Our fitness metric seems to make perfect sense — we know what each pixel should be, so the more pixels that are “correct,” the closer we are to a solution. But it turns out that that’s not quite right. Let’s look more closely at the history of one organism that did really well and then imploded:

This organism is the offspring of two parents that were basically generating random fields. About half of their pixels were correct, giving them fitness around 0.5 (see the blue highlights). For some reason this match created a really capable organism that for its first two generations delivered absolutely perfect (yellow highlight) scores — amazing!
But look what happened in the third generation (green highlight). It’s visually obvious that this is still a pretty good result, but because of the column skip on the left side (the double-wide white bar), all the pixels to the right were incorrect, so this promising organism was killed off (even with the history-preserving hyperparameter).
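To see how brutal a single column skip is, here is a tiny worked example in Python. It assumes a pixel-counting score whose expected pattern is anchored to the first column (my reading of the metric, not confirmed against the real code), and since every row is identical it only looks at one row. The name `pixel_fraction` is made up; 0 is white and 1 is black.

```python
def pixel_fraction(row):
    """Share of pixels matching stripes anchored to the first pixel."""
    first = row[0]
    hits = sum(pixel == (first ^ (x & 1)) for x, pixel in enumerate(row))
    return hits / len(row)


perfect = [0, 1, 0, 1, 0, 1, 0, 1]
skipped = [0, 0, 1, 0, 1, 0, 1, 0]  # double-wide white bar on the left

print(pixel_fraction(perfect))  # 1.0
print(pixel_fraction(skipped))  # 0.125
```

The second row still looks like stripes to a human eye, but it scores far worse than a blank grid's 0.5.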
Tyranny of the mediocre
The end result of this dynamic is that over time the “interesting” organisms get squeezed out by mediocre but consistent ones (in particular all-white and all-black). This page details the final cycle of one such run: short-lived mostly random organisms at the top, newly-born random ones at the bottom, and a huge swath of 0.5 fitness blanks in the middle.
We can address this in two ways — both are pretty effective. The first is to simply use a better fitness metric. VStripesCombo combines two measures for a more balanced assessment:
- “Stripey-ness” assesses the average length of a correct vertical stripe.
- “Even-ness” rewards an even split between black and white pixels.
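Here is one plausible reading of those two measures as a Python sketch. Everything in it is an assumption for illustration: the function names, the choice to score each column by its longest run of correct pixels, anchoring the expected pattern to the top-left pixel, and combining the two measures with a plain average. The real VStripesCombo may differ on any of these.

```python
def stripeyness(grid):
    """Average (over columns) of the longest correct vertical run,
    as a fraction of grid height. Expected colors alternate by column,
    starting from the top-left pixel (an assumption)."""
    height, width = len(grid), len(grid[0])
    first = grid[0][0]
    total = 0.0
    for x in range(width):
        want = first ^ (x & 1)
        longest = run = 0
        for y in range(height):
            run = run + 1 if grid[y][x] == want else 0
            longest = max(longest, run)
        total += longest / height
    return total / width


def evenness(grid):
    """1.0 for an even black/white split, 0.0 for a solid grid."""
    cells = [pixel for row in grid for pixel in row]
    black = sum(cells)
    return 1.0 - abs(2 * black - len(cells)) / len(cells)


def vstripes_combo(grid):
    """Assumed combination: the plain average of the two measures."""
    return 0.5 * (stripeyness(grid) + evenness(grid))


solid = [[1, 1, 1, 1]] * 4
stripes = [[0, 1, 0, 1]] * 4
print(vstripes_combo(solid))    # 0.25
print(vstripes_combo(stripes))  # 1.0
```

The point of the design is that a blank grid no longer earns the same 0.5 as a near miss, so it stops being a safe place to hide.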
With this new metric, a solid block has fitness 0.25 (0.5 for stripey-ness, 0 for even-ness), “interesting” organisms have a chance to succeed, and stripes emerge quickly. Finally, some success (here, here and here):

Another approach is to be more picky about who gets to reproduce. Our initial implementation kills off the bottom third of the population with each cycle, allowing the top two-thirds to reproduce. Since two-thirds includes that middle belt of consistent mediocrity, it can persist and grow.
Instead we can kill off the bottom half of the population, and allow each organism in the top half to mate twice. Just as with biological siblings, each mating crosses over and mutates differently, providing more chances for the strengths of the parents to compound.
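A rough Python sketch of that top-half, mate-twice scheme follows. It is not the project's code: genomes are plain bit lists, partners are paired by shuffling, each pairing yields one child, and `crossover`, `mutate` and `next_generation` are hypothetical helpers with a made-up mutation rate. Under those assumptions the population size stays constant.

```python
import random


def crossover(a, b, rng):
    """Single-point crossover (an assumed operator)."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]


def mutate(genome, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return [bit ^ 1 if rng.random() < rate else bit for bit in genome]


def next_generation(population, fitness, rng, mutation_rate=0.05):
    """Keep the top half; every survivor takes part in two matings."""
    ranked = sorted(population, key=fitness, reverse=True)
    survivors = ranked[: len(ranked) // 2]
    children = []
    for _ in range(2):  # two rounds of pairing = two matings each
        order = survivors[:]
        rng.shuffle(order)
        for a, b in zip(order[0::2], order[1::2]):
            children.append(mutate(crossover(a, b, rng), mutation_rate, rng))
    return survivors + children


rng = random.Random(0)
population = [[rng.randint(0, 1) for _ in range(8)] for _ in range(12)]
population = next_generation(population, fitness=sum, rng=rng)
print(len(population))  # still 12: 6 survivors plus 6 children
```

Because the same two parents can meet in both rounds and still produce different children, the good ones get more shots at compounding their strengths.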
As it turns out, this mode of reproduction also wins the day (here, here and here):

Strategies and weaknesses
The hallmark of evolved learning is solutions that our conscious, logical minds would never think of and often can’t really comprehend even after the fact. It’s frankly a little spooky. To wit, watch this organism solve the vertical stripes problem from random, along with the rules it employs. WTF man? (I have to say I do love the back and forth “wiggle” once it hits a final solution.)
All of these organisms were trained from a random starting grid. Running a few of them (all winners during training) from a single black pixel in the middle highlights two things: (1) their strategies are wildly divergent; (2) sometimes a strategy that tends to work in one case is an utter fail with a different starting configuration (last two examples below):
That second point can’t be overstated: you get what you train for — and we didn’t train for a single pixel initial state. Environment, fitness, reproduction rules, they all are critically important to the final product. This is going to come up again and again in the emerging world of AI. LLMs hallucinate because they have been rewarded for answering questions, not for saying they don’t know. We’d better get really, really good at this if we’re going to make it as a species (some more thoughts on that here).
You only know what you know
OK, enough with the stripes. For our next trick, let’s try to learn how to draw a frame around the edges of the grid: all white except for a one-pixel black rim around the edge. Seems pretty simple! Results are here, here and here:

Doh. It’s not even that it just doesn’t learn well — it doesn’t seem to learn at all. No matter what we do or how we define things, we can’t crack this nut. Why?
The answer is simple but important: there is simply zero information in the system about what an “edge” even is. Remember that the neighborhood computations “wrap” around so the grid appears to be an infinite plane. The edges are obvious to us when we draw the grid, but completely invisible to the organisms living inside it.
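A quick Python sketch makes the wrap-around concrete. The function name is made up and this is not the project's Neighborhood class, just the standard modulo trick for a toroidal grid.

```python
def von_neumann_neighbors(grid, x, y):
    """Return the (north, east, south, west) neighbors of cell (x, y),
    wrapping around the grid so there is no such thing as an edge."""
    height, width = len(grid), len(grid[0])
    return (
        grid[(y - 1) % height][x],  # north
        grid[y][(x + 1) % width],   # east
        grid[(y + 1) % height][x],  # south
        grid[y][(x - 1) % width],   # west
    )


grid = [
    [0, 1, 2],
    [3, 4, 5],
    [6, 7, 8],
]
print(von_neumann_neighbors(grid, 0, 0))  # (6, 1, 3, 2)
print(von_neumann_neighbors(grid, 1, 1))  # (1, 5, 7, 3)
```

The corner cell gets four neighbors just like the center cell does, so nothing in its inputs says "you are on the rim."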
And you can’t “learn” something that you can’t perceive. It’s impossible, like asking a completely blind person to raise their hand when the lights come on. You can be mad about it, but it is what it is. This is surprisingly easy to forget: evolved organisms are so good at finding subtle and non-obvious patterns that we just assume they’re omniscient. Nope.
OK, so let’s add an “edge” sense to our organisms by defining a new “relative” type in the Neighborhood class. When we include this new sense in our neighborhood, magic happens (here):

It’s a simple example, and perhaps not that shocking — by providing the boolean “edge” value, we enable the organism to effectively keep two sets of rules: one for the edges (turn them black) and one for everything else (turn them white).
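Stripped to its essence, the "two sets of rules" idea looks something like the Python sketch below. It is deliberately oversimplified and hypothetical: real organisms combine the edge flag with their neighbor values, whereas this toy rule table looks at the edge flag alone. The names `is_edge`, `step` and `frame_rule` are mine, and 1 is black.

```python
def is_edge(x, y, width, height):
    """The new boolean sense: does this cell sit on the rim?"""
    return x in (0, width - 1) or y in (0, height - 1)


def step(grid, rule):
    """Apply a rule table keyed only on the edge sense (a toy case)."""
    height, width = len(grid), len(grid[0])
    return [
        [rule[is_edge(x, y, width, height)] for x in range(width)]
        for y in range(height)
    ]


# One rule for the edges (turn black), one for everything else (turn white).
frame_rule = {True: 1, False: 0}

start = [[0, 1, 1, 0], [1, 0, 1, 1], [0, 0, 1, 0], [1, 1, 0, 1]]
for row in step(start, frame_rule):
    print(row)
```

Whatever the starting grid looks like, one step produces a black rim around a white interior, which is exactly the information the organisms were missing before.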
But still, it’s cool. Just for fun, here’s a slightly less obvious example. By adding senses for which half of the grid a point is in (North/South, East/West), we can easily learn rules that expect different content in each quadrant (details here):

OK, that’s enough of a random walk for now. I could do this stuff forever, and each new lesson really does say something about evolution and learning in the real world. I hope I’ve put in enough eye candy to keep you entertained along the way, but even if I didn’t — it was good for me.
Wait, just one more! I’ve been trying to teach some organisms how to split the grid diagonally, which proves to be a tough challenge. My best run so far is 5,000 cycles to get to a pretty consistent 0.95 fitness … but it don’t look great, folks. It feels like it has the right idea, but can’t settle into place (e.g., check out the lower-left quadrant here). Any ideas?
