
AI Models: 50 First Dates

Back in 1987, Dartmouth required each incoming freshman to have a Macintosh computer. This was unheard of at the time — the whole campus (including dorm rooms) had network taps, there was a huge bank of laser printers you could use for free, the school had its own email system, and live chat wasn’t just a curiosity. It was awesome.

When I met my partner of now 30+ years, she was working at the campus computer store, and one of her jobs was to help people buy and install additional memory for their machines. This was a laughably complex job that required knowing, amongst other things, a long list of arcane rules about which memory modules worked in which machines and how they had to be installed.

I mean seriously, don’t miss this page-turner from Apple circa 1992. And that was just the user-level stuff — developers were presented with tedious and finicky concepts like “handles” that enabled the system to optimize its tiny memory space.

Jump to today and barely anybody thinks about RAM. Processors typically use 64 bits to store memory locations, which is basically infinite. Virtual memory swaps still happen, but they’re invisible and handle-type bugs are gone. I can’t even remember the last time I cracked open a laptop case.

Anyhoo, my point here is that there was a time when we knew the state-of-the-art wasn’t good enough, but we didn’t have a great answer to the problem. Creative solutions were ridiculous on their face — once again I refer you to this documentation — but people kept feeling their way around, trying to make progress. And eventually, they did. All the inelegant and inconvenient hacks were replaced by something simple and qualitatively, not just quantitatively, better.

Frozen in Time

Today, large AI (ok, LLM) models have a problem that’s eerily similar to our late twentieth-century RAM circus. And it also involves memory, albeit in a different way. Trained AI models are frozen in time — once formal training stops, they stop learning (basically) forever. Each session is like 50 First Dates, where Lucy starts the morning oblivious to what happened the day before.

The big issue is money. It’s expensive to simulate an analog brain in a digital environment! The 86 billion neurons in our brains form 100 trillion connections, a combination of pre-coded genetics and a lifetime of plasticity. Digital systems crudely mimic this with huge grids of numbers representing the strength of synapse connections. These strengths (or “weights”) are initialized at random, then iteratively adjusted during training until they assume useful values.

Training takes zillions of iterations — lots of time and lots of electricity and lots of money. But it turns out that, once a model is trained, asking questions is pretty darn efficient. You’re no longer adjusting the weights, you’re just providing inputs, doing a round of computation and spitting out results.
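To make that difference concrete, here is a toy sketch in Java. It has nothing to do with real transformer training; it just shows the shape of the idea: weights start random, “training” nudges them over many iterations, and “inference” simply runs the computation with whatever weights training left behind.

  // A toy illustration (not a real LLM!): training iteratively nudges weights;
  // inference just runs the computation with the weights training left behind.
  public class TinyTraining {
      public static void main(String[] args) {
          double w = Math.random(), b = Math.random();   // weights start out random
          double[] xs = {1, 2, 3, 4}, ys = {3, 5, 7, 9}; // target relationship: y = 2x + 1
          double rate = 0.01;

          // "Training": many passes, each one adjusting the weights a little
          for (int step = 0; step < 10_000; step++) {
              for (int i = 0; i < xs.length; i++) {
                  double error = (w * xs[i] + b) - ys[i];
                  w -= rate * error * xs[i];   // nudge weights downhill on the error
                  b -= rate * error;
              }
          }

          // "Inference": no more adjustment, just compute an answer from the frozen weights
          System.out.printf("w=%.3f b=%.3f predict(10)=%.3f%n", w, b, w * 10 + b);
      }
  }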

TLDR — the models that we use every day are the static result of extended training. They do not continue to learn anything new (except when their owners explicitly re-train). This is why early models might tell you that Biden is president — because he was, when the model was trained. Time (and learning) stops when training is complete.

Not Like Us

Now, I’ve been outspoken about this — I think LLMs are almost certainly sentient, at least to any degree and definition that matters. I get particularly annoyed when people say “but they don’t have a soul or feelings” or whatever, because nobody can tell me what those things actually are. We’re modeling human brains, and they act like human brains, so why are we so convinced we’re special?

But at least in one way, there is an answer to that question. Today’s AI models don’t continue to learn as they exist — they’re static. Even today at the ripe old age of 56, when I get enough positive or negative feedback, I learn — e.g., don’t keep trying to charge your Rivian when the battery is overheating.

This is a core property of every living creature with a brain. We’re constantly learning, from before we’re even born until the day we die. Memories are physically stamped into our biology; synapses grow and change and wither as we experience the real world. It’s just amazing and wonderful and insane. And it’s why we can survive in a changing world for almost 100 years before checking out.

But today’s models can’t do this. And so, we hack. Just like in those early RAM days, folks are inventing workarounds for the static model problem at an incredible pace, and many/most of these attempts are kind of silly when you step back. But for now, we are where we are — so let’s dig in a bit.

Back to School: Fine Tuning

Fine tuning just means “more training” — effective for teaching a model about some specific domain or set of concepts that weren’t part of its initial run. Maybe you have a proprietary customer support database, or you want to get really good at interpreting specific medical images.

The process can be as simple as picking up where the initial training stopped — more data, more feedback, off we go. But of course it’s expensive to do this, and there’s actually a risk of something called “catastrophic forgetting,” where previously-solid knowledge is lost due to new experience.

More commonly, fine-tuning involves tweaking around the edges. For example, you might alter the weights of only the uppermost layers of the network, which tend to be less foundational. In an image model, the lower layers may detect edges and shapes, while the upper layers translate those primitives into complex structures like tumors or lesions.
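In code, “tweaking around the edges” mostly amounts to marking the lower layers as frozen so that updates never touch them. A minimal sketch of the pattern, with no real framework and entirely made-up layers and gradients:

  // Toy sketch of "tune only the top layers": frozen layers are skipped by the
  // update step, so only the upper, task-specific layers change.
  import java.util.Arrays;
  import java.util.List;

  public class PartialFineTune {
      record Layer(String name, double[] weights, boolean frozen) {}

      static void applyUpdate(List<Layer> layers, double learningRate, double[][] grads) {
          for (int i = 0; i < layers.size(); i++) {
              Layer layer = layers.get(i);
              if (layer.frozen()) continue;            // foundational layers stay put
              for (int j = 0; j < layer.weights().length; j++) {
                  layer.weights()[j] -= learningRate * grads[i][j];
              }
          }
      }

      public static void main(String[] args) {
          List<Layer> model = List.of(
              new Layer("edges-and-shapes", new double[]{0.1, 0.2}, true),   // frozen
              new Layer("tumors-and-lesions", new double[]{0.3, 0.4}, false) // tunable
          );
          double[][] fakeGrads = {{1, 1}, {1, 1}};     // pretend these came from new training data
          applyUpdate(model, 0.05, fakeGrads);
          model.forEach(l -> System.out.println(l.name() + ": " + Arrays.toString(l.weights())));
      }
  }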

Folks have also been experimenting with crazy math-heavy solutions like low-rank adaptation (LoRA), which uses much smaller sets of parameters to steer the behavior of the overall model. Don’t ask me how this really works. Math is hard; let’s go shopping.
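That said, the core trick is simple enough to sketch even if the theory isn’t: leave the big pretrained matrix W alone and learn two much skinnier matrices A and B whose product nudges it. The numbers below are made up; for a 1000x1000 weight matrix and rank 8, that works out to about 16,000 trainable values instead of a million.

  // The gist of low-rank adaptation: effective weight = W + (A x B), where A and B
  // are tiny compared to W. Only A and B get trained; W stays frozen.
  public class LoraSketch {
      static double[][] effectiveWeights(double[][] W, double[][] A, double[][] B) {
          int d = W.length, r = A[0].length;
          double[][] out = new double[d][d];
          for (int i = 0; i < d; i++) {
              for (int j = 0; j < d; j++) {
                  double delta = 0;
                  for (int k = 0; k < r; k++) delta += A[i][k] * B[k][j];
                  out[i][j] = W[i][j] + delta;   // frozen base plus low-rank adjustment
              }
          }
          return out;
      }

      public static void main(String[] args) {
          int d = 4, r = 1;                      // tiny numbers so the example stays readable
          double[][] W = new double[d][d];       // stand-in for a pretrained weight matrix
          double[][] A = new double[d][r];       // trainable, d*r values
          double[][] B = new double[r][d];       // trainable, r*d values
          A[0][0] = 0.5; B[0][1] = 2.0;          // a made-up "learned" adjustment
          System.out.println("adapted[0][1] = " + effectiveWeights(W, A, B)[0][1]);  // 0 + 0.5*2.0 = 1.0
      }
  }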

In any case, none of this changes the fundamental situation — after fine-tuning, the model is still static. But it does provide an avenue to integrate new knowledge and help models grow over time. So that’s cool.

Retrieval-Augmented Generation

Another way of providing new data or concepts to a model is Retrieval-Augmented Generation (“RAG” — these folks love their acronyms). In this approach, models are provided the ability to fetch external data when needed.

The typical way “normal” folks encounter RAG is when asking about current events or topics that require context, like this (see the full exchange here or here):

I use Anthropic Claude for most of my AI experiments these days and have allowed it access to web searches. In this conversation you see the model looking for current and historic information about wildfires near Ventura, then drawing conclusions based on what it finds.
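Stripped of the fancy parts, the flow is: fetch whatever looks relevant, paste it into the prompt ahead of the question, and let the model take it from there. Real systems use embeddings, vector search and (increasingly) MCP tools for the fetching; this keyword-matching sketch with a made-up three-document corpus just shows the shape.

  // Bare-bones retrieval-augmented generation: retrieve snippets, stuff them into the prompt.
  import java.util.List;
  import java.util.stream.Collectors;

  public class RagSketch {
      static final List<String> CORPUS = List.of(
          "2025-01: Brush fire reported in the hills northwest of Ventura.",
          "2017-12: The Thomas Fire burned large areas of Ventura County.",
          "Recipe: banana muffins with extra cinnamon."   // hopefully never retrieved
      );

      static List<String> retrieve(String query) {
          // stand-in for real retrieval: keep documents sharing a longish keyword with the query
          List<String> words = List.of(query.toLowerCase().split("\\W+"));
          return CORPUS.stream()
                  .filter(doc -> words.stream().anyMatch(w -> w.length() > 3 && doc.toLowerCase().contains(w)))
                  .collect(Collectors.toList());
      }

      public static void main(String[] args) {
          String question = "Are there wildfires burning near Ventura right now?";
          String context = String.join("\n", retrieve(question));
          String prompt = "Use the following retrieved information to answer.\n\n" +
                          context + "\n\nQuestion: " + question;
          System.out.println(prompt);   // this augmented prompt is what the model actually sees
      }
  }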

Model Context Protocol, Take 1

These days most RAG tools are implemented using Model Context Protocol, an emerging standard for extending AI models. MCP is a lot more than RAG and we’ll talk about that later, but in its simplest form it just provides a consistent way for models to find external information.

What’s really interesting here is that the models themselves decide when they need to look for new data. This is seriously trippy, cool and more than a bit freaky. As a quick demonstration, I MCP-enabled the data behind the water tank that serves our little community on Whidbey Island.

I’ve implemented the protocol from scratch in Java using JSON-RPC 2.0 and Azure Functions. I could go on for a long time about how MCP is bat-sh*t insane and sloppy and incredibly poorly-conceived — but I will limit myself to comparing it to those early Macintosh RAM days. Eventually we’ll get to something more elegant. I hope.

Anyways, MCP tools of this variety (“remote servers”) are configured by providing the model with a URL that implements the protocol (in my case, this one). The model interrogates the tool for its capabilities, which are largely expressed with plain-English prose. The full Water Tank description is here; this is the key part:

Returns JSON data representing historical and current water levels in the Witter Beach (Langley, Washington, USA) community water tank. Measurements are recorded every 10 minutes unless there is a problem with network connectivity. The tank holds a maximum of 2,000 gallons and values are reported in centimeters of height of water in the tank. Each 3.4 inches of height represents about 100 gallons of water in the tank. Parameters can be used to customize results; if none are provided the tool will return the most recent 7 days of data with timestamps in the US Pacific time zone.

Other fields explain how to use query parameters. For example, “The number of days to return data for (default 7)” or “The timezone for results (default PST8PDT). This value is parsed by the Java statement ZoneId.of(zone).” Based on all this text, the model infers when it needs to use the tool to answer a question, like this:

Access the full exchange here or here.

*** IMPORTANT ASIDE *** If you look closely, you’ll notice that the model seriously screwed up its calculation, claiming a current tank volume of 4,900 gallons, when its maximum capacity is actually 2,000. If you click the link to the full exchange, you’ll see me call it out, and it corrects itself. This kind of thing happens with some regularity across the AI landscape — it’s important to be vigilant and not be lulled into assumptions of infallibility!
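For the record, the conversion the tool description spells out is simple: centimeters to inches, then 3.4 inches per 100 gallons. The 170 cm reading below is made up for illustration; the factors come straight from the description above.

  // The arithmetic the model fumbled, using the tool's own conversion factors.
  public class TankMath {
      public static void main(String[] args) {
          double heightCm = 170.0;                      // hypothetical sensor reading
          double heightInches = heightCm / 2.54;        // about 66.9 inches
          double gallons = (heightInches / 3.4) * 100;  // about 1,968 gallons
          System.out.printf("%.0f cm -> %.1f in -> ~%.0f gallons (tank max is 2,000)%n",
                  heightCm, heightInches, gallons);
      }
  }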

This is an amazing sequence of events:

  1. The model realized that it did not have sufficient information to answer my question.
  2. It inferred (from a prose description) that the Witter MCP tool might have useful data.
  3. It fetched and analyzed that data automatically.
  4. It responded intelligently and usefully (even with the math error, the overall answer to my question was correct). Pretty cool.
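Under the hood, a remote server of this kind is just an HTTP endpoint speaking JSON-RPC 2.0: the model’s client calls tools/list to discover what’s available (that’s where the prose description lives) and tools/call to actually fetch data. The sketch below is drastically simplified and is not my actual Azure Functions implementation: the JSON is hand-rolled, the request id isn’t echoed properly, the water level is hard-coded, and all of the real protocol’s initialization and session handling is skipped.

  // A toy MCP-ish tool endpoint using the JDK's built-in HTTP server.
  import com.sun.net.httpserver.HttpServer;
  import java.io.IOException;
  import java.io.OutputStream;
  import java.net.InetSocketAddress;
  import java.nio.charset.StandardCharsets;

  public class ToyMcpServer {
      public static void main(String[] args) throws IOException {
          HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
          server.createContext("/mcp", exchange -> {
              String request = new String(exchange.getRequestBody().readAllBytes(), StandardCharsets.UTF_8);
              String result;
              if (request.contains("\"tools/list\"")) {
                  // advertise the tool; the description is the part the model "reads"
                  result = "{\"tools\":[{\"name\":\"water_tank\"," +
                           "\"description\":\"Returns current water level in the community tank, in cm.\"}]}";
              } else if (request.contains("\"tools/call\"")) {
                  // in real life this would query the tank's data store
                  result = "{\"content\":[{\"type\":\"text\",\"text\":\"Current level: 170 cm\"}]}";
              } else {
                  result = "{}";
              }
              byte[] response = ("{\"jsonrpc\":\"2.0\",\"id\":1,\"result\":" + result + "}")
                      .getBytes(StandardCharsets.UTF_8);
              exchange.getResponseHeaders().set("Content-Type", "application/json");
              exchange.sendResponseHeaders(200, response.length);
              try (OutputStream out = exchange.getResponseBody()) { out.write(response); }
          });
          server.start();
          System.out.println("Toy MCP-ish endpoint listening on http://localhost:8080/mcp");
      }
  }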

Large Context: Windows

Folks are also trying to help models learn by providing extra input in real time, with each interaction. For example, when I ask Claude “How would you respond when a golfer always seems to hit their ball into sand traps?” I get a useful but clinical and mechanical set of tips (see here or here). But if I provide more context and a bunch of examples, I can teach the model to be more encouraging and understanding of the frustrations all new golfers experience:

Access the full exchange here or here.

Now, providing this kind of context (known as multi-shot prompting) every single time is obviously stupid. But, for now, it gets the job done.
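Mechanically, “providing this kind of context” just means assembling a bigger prompt on every request: the coaching instructions, the worked examples, and then the new question. A rough sketch, with examples I made up rather than the ones from my actual exchange:

  // Multi-shot prompting: re-send instructions and examples along with each new question.
  import java.util.List;

  public class MultiShotPrompt {
      record Example(String golferSays, String goodResponse) {}

      static String buildPrompt(List<Example> examples, String newQuestion) {
          StringBuilder prompt = new StringBuilder(
              "You are an encouraging golf coach. Acknowledge frustration before giving tips.\n\n");
          for (Example ex : examples) {
              prompt.append("Golfer: ").append(ex.golferSays()).append("\n");
              prompt.append("Coach: ").append(ex.goodResponse()).append("\n\n");
          }
          prompt.append("Golfer: ").append(newQuestion).append("\nCoach:");
          return prompt.toString();
      }

      public static void main(String[] args) {
          List<Example> examples = List.of(
              new Example("I topped every drive today.",
                          "That happens to everyone early on. Nice work staying with it; let's look at ball position."),
              new Example("Three-putted six times. Six!",
                          "Frustrating day! Lag putting is a skill you can absolutely build; here's a drill."));
          System.out.println(buildPrompt(examples, "I always seem to hit into sand traps."));
      }
  }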

Early models had small context windows — they just couldn’t handle enough simultaneous input to use a technique like this (ok, my little contrived example would have been fine, but real-world usage was too much). But these days context windows are enormous (Claude is currently in the middle of the pack with a 200,000-token window; at roughly 1.5 tokens per English word, that’s room for something like 130,000 words of input).

Large Context: History

Say we’re at the market and they have a sale on bananas. You ask me if I like them, and I say no, they are gross (because they are). When we move to the bakery, you’re not likely to ask if I want banana muffins, because you remember our earlier interaction.

As we know, AI models can’t do this — but they can simulate it, at least for sessions of limited duration (like a tech support chat). We simply provide the entire chat history every time.
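A minimal sketch of the pattern; the sendToModel call is a stand-in, not any particular vendor’s API:

  // The whole transcript goes along for the ride: the "conversation" is a growing
  // list of messages that gets re-sent in full with every new question.
  import java.util.ArrayList;
  import java.util.List;

  public class ChatHistoryReplay {
      record Message(String role, String text) {}

      static final List<Message> history = new ArrayList<>();

      static String ask(String userText) {
          history.add(new Message("user", userText));
          String reply = sendToModel(history);          // entire history, every time
          history.add(new Message("assistant", reply));
          return reply;
      }

      // placeholder for whatever model API you actually call
      static String sendToModel(List<Message> fullHistory) {
          return "(model reply based on " + fullHistory.size() + " messages of context)";
      }

      public static void main(String[] args) {
          System.out.println(ask("Do you like bananas?"));
          System.out.println(ask("Should we get muffins at the bakery?")); // the "memory" of bananas rides along
      }
  }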

Models are fast enough, and have large enough context windows, that we can do this for quite a long chat before the cost really kills us.

But eventually it does — and so we keep hacking. One technique is to ask the model itself to summarize the chat so far, and then use that (presumably much shorter) summary as input to the next exchange. If the model does a good job of including important ideas (like my distaste for bananas) in the summary, the effect is almost as good as using the full text.
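Sketched out, with the same kind of stand-in sendToModel call as before:

  // Rolling summarization: compress the transcript, then carry the summary forward
  // instead of the full text.
  import java.util.List;

  public class RollingSummary {
      static String summarize(List<String> transcript) {
          String prompt = "Summarize the key facts and preferences from this chat:\n"
                  + String.join("\n", transcript);
          return sendToModel(prompt);   // a real model might return "User dislikes bananas; shopping for baked goods."
      }

      static String askWithSummary(String summary, String newQuestion) {
          return sendToModel("Context from earlier: " + summary + "\nUser: " + newQuestion);
      }

      // placeholder for whatever model API you actually call
      static String sendToModel(String prompt) {
          return "(model output for: " + prompt.substring(0, Math.min(40, prompt.length())) + "...)";
      }

      public static void main(String[] args) {
          String summary = summarize(List.of(
              "User: Do you like bananas?",
              "Assistant: Noted, you find them gross."));
          System.out.println(askWithSummary(summary, "Want banana muffins?"));
      }
  }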

Even this has limits. When the session is over, the model snaps right back to its statically-trained self. At least Lucy had that VCR tape to help her catch up.

Model Context Protocol, Take 2

We’ve already seen how MCP helps connect models with external data. But the protocol is more than that, in at least two important ways:

First, MCP enables models to take action in the real world. Today these actions are pretty tame — setting up online meetings or updating a GitHub repository — but it’s only a matter of time before models are making serious decisions up to and including military action. That’s far beyond our topic for today, but don’t think for a moment it’s not part of our future.

Second and more relevant to this post, MCP is intended to augment the innate capabilities of the model itself — we’re already seeing MCP tools that increase memory capacity beyond internal context windows.

MCP is stateful and two-way. The model asks questions of the MCP server, which can turn around and ask questions of the model to clarify or otherwise improve its own response. We’ve never been so close to true collaboration between intelligent machines. It’s just, for now, an ugly bear of a spaghetti mess to get working.

What an amazing, scary, privileged thing to be living through the birth of artificial sentience. But as always, it’s the details that make the difference, and we’re in the infancy of that work. Impressive as they are, our models are static and limited — so we hack and experiment and thrash, trying to figure out where the elegant solutions lie. We’ll get there; the seeds are somewhere in the chaos of fine tuning, context windows, RAG and MCP.

Until next time, I highly recommend you check out Lucy’s story — it’s fantastic.
