To be clear, the title here is tongue-in-cheek. Real “research” involves carefully-designed and bias-controlled experiments, and there ain’t none of that below. My intended point is just that we’re all capable of digging deeper in ways that simply weren’t possible before the advent of LLMs. Arming yourself with these tools is one way to fight the bullsh*t that is pushed at us every single hour of every single day.
A few days ago the Algorithm-capital-A pushed me a video about Bass Pro Shops and how they game tax breaks by building fake “museums” in their stores. Turns out that while the shock video version exaggerates the scope of the con, it’s basically true. Nice!
Anyways, what started as a casual attempt to test the veracity of this story ended up as something much more interesting. Yes kids, it’s another AI-positive story, this one hidden behind some observations on the American economy.
Subsidy Tracker
One of the articles about Bass included a link to Subsidy Tracker, a site that combs through public records to identify federal, state and local subsidies by company. This is really messy data; we’re lucky there are non-profits making it usable.
Somehow I wandered from Bass over to the airline industry, where I found a ton of very recent federal grants: millions of dollars every month. Digging into these led me to the Essential Air Service program, and that started me down today’s rabbit hole. Bear with me for a second.
Essential Air Service
See, back in 1978 Jimmy Carter — yes, JIMMY CARTER — signed the Airline Deregulation Act, hoping to decrease fares and increase service by rolling back a bunch of controls on fares and routes. But the bill’s authors realized that without some new intervention, a deregulated airline industry would immediately drop service to smaller, less profitable locations like, say, my college home airport in Lebanon, NH.
They addressed this by creating the EAS and its list of “Essential Air Service Communities.” Airlines are paid real cash money by the federal government to provide regular service to these communities, to the tune of more than half a billion dollars in 2024. For example, Cape Air was paid $5.2M to ensure 54 people a day could fly one-way to or from West Leb. On Cape Air’s nine-seat planes that’s roughly six legs a day, which works out to about $2,400 per leg, even if they fly the plane empty!
And you know what? This is fine. Actually, it’s great. We, as a society, decided that we cared about keeping our rural communities connected to the rest of the country by passenger air. We also recognized that free market dynamics would not deliver this outcome, because the societal “cost” of not having service was borne outside of the immediate commercial players.
Of course there are risks to this. Collective actions are complicated and always subject to bias and graft — they’re never “optimal.” Our protections are mandated transparency, civil education and a free press. The EAS probably needs some tweaks, but on balance it seems like a pretty good call.
Like it or not, this kind of market-socialism hybrid has been our model pretty much forever — and increasingly so as we’ve become more interdependent through the industrial and information ages.
OK, Cool, Right?
Not so fast, Milton. A huge fraction of our country, possibly a majority, simply does not understand this long-standing reality. The Reds have spent decades, starting with talk radio in the ’80s and culminating with MAGA today, telling people that we live in a perfectly free market economy, and that perfect freedom is the primary reason for the success of our nation. It’s a two-part strategy:
- Emphatically label “bad” collective societal action as “communist.” (health care, minimum wage, food and unemployment benefits, UBI, …)
- Ignore, bury and obfuscate the “good” action so the public doesn’t notice the hypocrisy. (corporate subsidies, military adventures, incumbent-benefitting pork, …)
The EAS is a great example of this. By definition the vast majority of EAS communities are in rural areas, places that likely supported Trump in the last election. But I’m pretty sure that if you asked residents in those communities whether the government was paying to fly empty planes to and from their homes, they’d say (1) no way, and/but (2) we don’t want to give up our airport.
Ask a Simple Question
At this point in the story, I realized I should check my own bias. I mean, of course rural voters went for Trump, but it’s possible that EAS communities were somehow an outlier. So I started poking around for some data that would help me answer that question.
Little asks like this seem so simple! But as anybody who has ever tried to report on real-world data can tell you (say, for example, the DOGE wizards that “concluded” millions of dead people were drawing social security) it’s actually super-hard. First you have to find data — and for a lot of questions, that just doesn’t exist (see my comment at the top about real research), or it’s in an awkward or inconvenient form for analysis. In this case, however, it was pretty easy:
- The Dept of Transportation publishes a current list of EAS communities. It’s a PDF, but that’s easy to extract into a CSV file with columns for city and state.
- The Harvard Dataverse, another great resource that I hope survives our current funding climate, publishes county-level election data (file citation).
Progress! Often all you need from here is a little basic Excel magic (see here for some tips on that). Unfortunately for us, we hit our first stumbling block: election data is reported at the county level, while the EAS communities are cities. Mapping between those will take a little more data, but luckily that’s available too, compiled from government sources and released under a Creative Commons license: simplemaps US Zip Codes database.
Extract city, state and county columns from this file, match up the city/state with the EAS data, walk that through county to the election data, and Bob’s Your Uncle!
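For the curious, here’s roughly what that plumbing looks like as a Node.js script. To be clear, this is a minimal sketch for illustration, not the code Claude produced; the column names (city, state_id, county_name, state_po, candidatevotes, and so on) are my assumptions about the three files and would need to be checked against the real headers, and it skips over the data quirks covered later in the post.

```javascript
const fs = require("fs");

// Naive delimited-file reader; a real script would use a proper CSV parser
// that handles quoted fields and embedded commas.
function readTable(path, sep) {
  const [head, ...rows] = fs.readFileSync(path, "utf8").trim().split("\n");
  const cols = head.split(sep);
  return rows.map(r =>
    Object.fromEntries(r.split(sep).map((v, i) => [cols[i], v]))
  );
}

// Normalize casing/spacing/punctuation so the joins have a chance of matching.
const norm = s => (s || "").toLowerCase().replace(/[^a-z]/g, "");

// 1. city/state -> set of county/state keys, via the zip-code table
const countiesByCity = new Map();
for (const z of readTable("uszips.csv", ",")) {
  const key = norm(z.city) + "|" + norm(z.state_id);
  if (!countiesByCity.has(key)) countiesByCity.set(key, new Set());
  countiesByCity.get(key).add(norm(z.county_name) + "|" + norm(z.state_id));
}

// 2. 2024 presidential votes per candidate, keyed by county/state
//    (this naive sum glosses over the "mode" quirks discussed later)
const votesByCounty = new Map();
for (const r of readTable("countypres_2000-2024.csv", ",")) {
  if (r.year !== "2024" || r.office !== "US PRESIDENT") continue;
  const key = norm(r.county_name) + "|" + norm(r.state_po);
  const tally = votesByCounty.get(key) || {};
  tally[r.candidate] = (tally[r.candidate] || 0) + Number(r.candidatevotes);
  votesByCounty.set(key, tally);
}

// 3. Walk each EAS community through both joins and pick a winner.
const out = ["city,state,winner"];
for (const c of readTable("eas.tsv", "\t")) {
  const counties = countiesByCity.get(norm(c.city) + "|" + norm(c.state)) || [];
  const combined = {};
  for (const countyKey of counties) {
    for (const [cand, n] of Object.entries(votesByCounty.get(countyKey) || {})) {
      combined[cand] = (combined[cand] || 0) + n;
    }
  }
  const best = Object.entries(combined).sort((a, b) => b[1] - a[1])[0];
  out.push(`${c.city},${c.state},${best ? best[0] : "unknown"}`);
}
fs.writeFileSync("eas_winners.csv", out.join("\n"));
```

Even this toy version shows the shape of the problem: two joins, some normalization, and a tally.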
Finally, the AI Part
Well sure, it’s pretty simple in theory. But most of the country doesn’t have the skills to actually write this code. I mean, I’ve spent a career doing this sort of thing, but even so I’m not likely to invest the effort on a random weekend news-scrolling curiosity.
This is where foundational AI models can really change the game for everyone. It’s not without pitfalls, but take a look at what Claude Code was able to do with this prompt:
I’d like to generate a csv file that shows how each county that is considered an eligible community in the Essential Air Service program voted for president in 2024. Please use node and javascript for this script.
Data on EAS eligible communities is in the file eas.tsv. Data that translates city/state to county is in the file uszips.csv. Data that contains county-level presidential elections results is in the file countypres_2000-2024.csv.
You’ll need to read each city/state combination out of eas.tsv, then use uszips.csv to translate that into one or more county/state combinations.
With this information, look up the 2024 election results for those counties, sum up the votes if there are multiple counties, and output a row with the name of the candidate that received the most votes.
If you are unable to translate a city/state to county/state, or if that county/state is not found in the presidential election results, use “unknown” as the name of the winning candidate.
The output should have three columns: the original city/state from the EAS data and then the name of the winning candidate.
Please double-check your work and do not take shortcuts such as estimation or extrapolation. I want to be sure that the data you output represents direct matches only — if the data isn’t clear just say “unknown” and that’s ok.
I put a lot of detail in that prompt because (a) I’d already done the work to figure out data sources; and (b) I wanted to be very clear that the model should be conservative. First try: Winner-Winner-Chicken-Dinner!
More than Mechanical
A machine that writes code to crosswalk a bunch of files is pretty neat, opening up a deeper level of analysis to huge swaths of the population. But it gets really cool when you look under the covers. Review the entire conversation for yourself using this link.
The model wrote code, tested it, and iterated a bunch of times to discover and account for unique quirks in the data. It was a lot! Again, this will sound very familiar to anyone who has tried to do even moderately complex cross-source data analysis (a rough sketch of these fixes follows the list):
- One file had full state names while the other had abbreviations. Create a lookup table.
- The election file’s “mode” column is inconsistent. Most counties use “TOTAL VOTES” for totals, but some leave it blank, some use variants like “TOTAL VOTES CAST”, and others have no total rows at all, so totals have to be built by summing the other modes. Normalize the values and create logic that picks the most representative rows.
- Some city names were slightly different across files. E.g., “Hot Springs” vs “Hot Springs National Park.” Use partial matching to address.
- Spacing and casing differences. Strip spaces and lowercase everything before matching.
- Additional differences in punctuation and abbreviation. Use a normalization table.
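Just to make those fixes concrete, here’s a toy sketch of the same ideas. It is not the model’s actual code; the state table and matching heuristics are deliberately truncated and simplified.

```javascript
// Toy versions of the matching fixes above (truncated and simplified).
const STATE_ABBR = { "new hampshire": "NH", "arkansas": "AR", "montana": "MT" }; // ...and the rest

// Strip punctuation, collapse whitespace, lowercase.
const norm = s => s.toLowerCase().replace(/[.,'’\-]/g, " ").replace(/\s+/g, " ").trim();

// Accept either a full state name or an existing abbreviation.
const toAbbr = s => STATE_ABBR[norm(s)] || s.trim().toUpperCase();

// Partial matching: "Hot Springs" should line up with "Hot Springs National Park".
const sameCity = (a, b) => {
  const [x, y] = [norm(a), norm(b)];
  return x === y || x.startsWith(y) || y.startsWith(x);
};

console.log(toAbbr("New Hampshire"));                              // "NH"
console.log(sameCity("Hot Springs", "Hot Springs National Park")); // true
```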
All of these were found without further prompting or intervention. And as the cherry on top, the model even realized that the two Puerto Rican EAS communities weren’t in the election data because Puerto Ricans can’t vote for president.
Of course, given the state of LLMs today I still wouldn’t just trust the output without reviewing the code and doing some spot checks. In this case at least, I did that, and it passed with flying colors.
TLDR, my assumption about Trump voters is backed up by the data. Not earth shattering perhaps, but anything that makes the world a little more fact-based is a Very Good Thing. And most importantly, thanks to LLMs, this kind of research is available to all of us at any time. People love to talk about “brain rot” from AI — but we do that with every innovation. Gen X peeps, remember the uproar about calculators (55378008)? Use it well and it is transformational.
Anyways, if you’re starting your online screed with “I haven’t checked but I bet….” well, shame on you.
OK, but what about Cost and Energy?
It’s very popular to dismiss AI solutions due to their allegedly egregious energy use. The work I did here used 54,116 “tokens”, where a token is a unit of work kind of like a word but not quite. There isn’t a ton of data out there as to how much energy is used during inference, but a broad range between 0.001 and 0.01 watt-hours per 1,000 tokens is cited pretty regularly.
Double that to cover infrastructure costs like cooling, split the range down the middle, and we get a crazy-rough estimate of about 0.6 Wh for the work in this post. That’s about the same as running two Google searches, or running a 10W light bulb for three and a half minutes. To me, this is a shockingly efficient use of energy, even if our guess is off by two or three times.
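If you want to check that arithmetic yourself, it fits in a few lines (same assumptions as above: the published per-token range, a 2x multiplier for overhead, and my token count):

```javascript
// Back-of-the-envelope energy estimate using the assumptions in this post.
const tokens = 54116;                         // tokens used for this analysis
const whPer1kLow = 0.001, whPer1kHigh = 0.01; // commonly cited inference range
const overhead = 2;                           // double to cover cooling and other infrastructure

const midWhPer1k = (whPer1kLow + whPer1kHigh) / 2;          // 0.0055 Wh per 1,000 tokens
const estimateWh = (tokens / 1000) * midWhPer1k * overhead; // ~0.6 Wh total

const bulbMinutes = (estimateWh / 10) * 60;                 // minutes of a 10 W bulb
console.log(estimateWh.toFixed(2), bulbMinutes.toFixed(1)); // "0.60 3.6"
```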
Ah you say, you can’t just look at inference — model training costs are astronomical. And that is true! But production models typically remain in use for around six to eighteen months before being superseded. Over that timeframe a model will be used for many billions of inferences; training costs quickly amortize to basically zero.
And none of this considers the innovation curve that is already happening to push costs down. Just as with traditional computing power, market forces (ha, get it?) are going to do their thing. This isn’t to say we shouldn’t be worried about AI in general — there’s a ton that could go wrong. But energy use isn’t going to be the problem.
OK, as usual I’ve gone way longer on this than anyone is going to read. But it’s endlessly fascinating to be here during this moment of innovation. It’s just unfortunate that it happens to overlap with existential threats to our American experiment. That part sucks.
