“Doing my own research”

To be clear, the title here is tongue-in-cheek. Real “research” involves carefully-designed and bias-controlled experiments, and there ain’t none of that below. My intended point is just that we’re all capable of digging deeper in ways that simply weren’t possible before the advent of LLMs. Arming yourself with these tools is one way to fight the bullsh*t that is pushed at us every single hour of every single day.

A few days ago the Algorithm-capital-A pushed me a video about Bass Pro Shops and how they scam tax discounts by creating fake “museums” in their stores. Turns out that while the shock video version exaggerates the scope of the con, it’s basically true. Nice!

Anyways, what started as a casual attempt to test the veracity of this story ended up as something much more interesting. Yes kids, it’s another AI-positive story, this one hidden behind some observations on the American economy.

Subsidy Tracker

One of the articles about Bass included a link to Subsidy Tracker, a site that combs through public records to identify federal, state and local subsidies by company. This is really messy data; we’re lucky there are non-profits making it usable.  

Somehow I wandered from Bass over to the airline industry, where I found a ton of very recent federal grants — millions of dollars every month. Digging into these led me to the Essential Air Service program, and that started me down today’s rabbit hole. Bear with me for a second.

Essential Air Service

See, back in 1978 Jimmy Carter — yes, JIMMY CARTER — signed the Airline Deregulation Act, hoping to decrease fares and increase service by rolling back a bunch of controls on fares and routes. But the bill’s authors realized that without some new intervention, a deregulated airline industry would immediately drop service to smaller, less profitable locations like, say, my college home airport in Lebanon, NH.

They addressed this by creating the EAS and its list of “Essential Air Service Communities.” Airlines are paid real cash money by the federal government to provide regular service to these communities — to the tune of more than half a billion dollars in 2024. For example, Cape Air was paid $5.2M to ensure 54 people a day could fly one-way to or from West Leb. That’s about $2,400 per leg, even if they fly the plane empty!

And you know what? This is fine. Actually, it’s great. We, as a society, decided that we cared about maintaining integration of our rural communities with the rest of the country via passenger air. We also recognized that free market dynamics would not deliver this outcome, because the societal “cost” of not having service was borne outside of the immediate commercial players.  

Of course there are risks to this. Collective actions are complicated and always subject to bias and graft — they’re never “optimal.” Our protections are mandated transparency, civil education and a free press. The EAS probably needs some tweaks, but on balance it seems like a pretty good call.

Like it or not, this kind of market-socialism hybrid has been our model pretty much forever — and increasingly so as we’ve become more interdependent through the industrial and information ages.

OK, Cool, Right?

Not so fast, Milton. A huge, possibly majority fraction of our country simply does not understand this long-standing reality. The Reds have spent decades — starting with talk radio in the 80s and culminating with MAGA today — telling people that we live in a perfectly free market economy, and that perfect freedom is the primary reason for the success of our nation. It’s a two-part strategy:

  1. Emphatically label “bad” collective societal action as “communist.” (health care, minimum wage, food and unemployment benefits, UBI, …)
  2. Ignore, bury and obfuscate the “good” action so the public doesn’t notice the hypocrisy. (corporate subsidies, military adventures, incumbent-benefitting pork, …)

The EAS is a great example of this. By definition the vast majority of EAS communities are in rural areas — places that likely supported Trump in the last election. But I’m pretty sure that if you asked residents in those communities if the government was paying to fly empty planes to and from their homes, they’d say (1) no way, and/but (2) we don’t want to give up our airport.

Ask a Simple Question

At this point in the story, I realized I should check my own bias. I mean, of course rural voters went for Trump, but it’s possible that EAS communities were somehow an outlier. So I started poking around for some data that would help me answer that question.

Little asks like this seem so simple! But as anybody who has ever tried to report on real-world data can tell you (say, for example, the DOGE wizards that “concluded” millions of dead people were drawing social security) it’s actually super-hard. First you have to find data — and for a lot of questions, that just doesn’t exist (see my comment at the top about real research), or it’s in an awkward or inconvenient form for analysis. In this case, however, it was pretty easy:

  1. The Dept of Transportation publishes a current list of EAS communities. It’s a PDF, but that’s easy to extract into a CSV file with columns for city and state.
  2. The Harvard Dataverse, another great resource that I hope survives our current funding climate, publishes county-level election data (file citation).

Progress! Often all you need from here is a little basic Excel magic (see here for some tips on that). Unfortunately for us, we hit our first stumbling block: election data is reported at the county level, while the EAS communities are cities. Mapping between those will take a little more data, but luckily that’s available too, compiled from government sources and released under a Creative Commons license: simplemaps US Zip Codes database.

Extract city, state and county columns from this file, match up the city/state with the EAS data, walk that through the county to the election data, and Bob’s Your Uncle!
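The crosswalk itself really is just a couple of joins. Here’s a toy sketch in the Node/JavaScript the prompt asks for, with made-up in-memory rows standing in for the three files (the real column names differ, and the real script sums votes across multiple counties):

```javascript
// Toy rows standing in for eas.tsv, uszips.csv and the election results file.
const eas = [{ city: "Lebanon", state: "NH" }];
const zips = [{ city: "Lebanon", state: "NH", county: "Grafton" }];
const results = [{ county: "Grafton", state: "NH", winner: "Harris" }];

function winnerFor({ city, state }) {
  // Step 1: city/state -> one or more county/state combinations
  const counties = zips.filter(z => z.city === city && z.state === state);
  if (counties.length === 0) return "unknown";
  // Step 2: county/state -> winning candidate (first match here;
  // the real script sums votes across all matched counties)
  for (const c of counties) {
    const r = results.find(x => x.county === c.county && x.state === c.state);
    if (r) return r.winner;
  }
  return "unknown";
}

// One output row per EAS community: city, state, winner
const rows = eas.map(e => [e.city, e.state, winnerFor(e)].join(","));
console.log(rows.join("\n"));
```

Conceptually that’s the whole job — everything else is cleanup.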

Finally, the AI Part

Well sure, it’s pretty simple in theory. But most of the country doesn’t have the skills to actually write this code. I mean, I’ve spent a career doing this sort of thing, but even so I’m not likely to invest the effort on a random weekend news-scrolling curiosity.

This is where foundational AI models can really change the game for everyone. It’s not without pitfalls, but take a look at what Claude Code was able to do with this prompt:

I’d like to generate a csv file that shows how each county that is considered an eligible community in the Essential Air Service program voted for president in 2024. Please use node and javascript for this script.

Data on EAS eligible communities is in the file eas.tsv. Data that translates city/state to county is in the file uszips.csv. Data that contains county-level presidential elections results is in the file countypres_2000-2024.csv.

You’ll need to read each city/state combination out of eas.tsv, then use uszips.csv to translate that into one or more county/state combinations.

With this information, look up the 2024 election results for those counties, sum up the votes if there are multiple counties, and output a row with the name of the candidate that received the most votes.

If you are unable to translate a city/state to county/state, or if that county/state is not found in the presidential election results, use “unknown” as the name of the winning candidate.

The output should have three columns: the original city/state from the EAS data and then the name of the winning candidate.

Please double-check your work and do not take shortcuts such as estimation or extrapolation. I want to be sure that the data you output represents direct matches only — if the data isn’t clear just say “unknown” and that’s ok.

I put a lot of detail in that prompt because (a) I’d already done the work to figure out data sources; and (b) I wanted to be very clear that the model should be conservative. First try: Winner-Winner-Chicken-Dinner!

More than Mechanical

A machine that writes code to crosswalk a bunch of files is pretty neat, opening up a deeper level of analysis to huge swaths of the population. But it gets really cool when you look under the covers. Review the entire conversation for yourself using this link.

The model wrote code, tested it, and iterated a bunch of times to discover and account for unique quirks in the data. It was a lot! Again, this will sound very familiar to anyone who has tried to do even moderately complex cross-source data analysis:

  1. One file had full state names while the other had abbreviations. Create a lookup table.
  2. The “mode” column is inconsistent. Most counties use “TOTAL VOTES” to represent totals, but some counties leave this blank, others use other terms like “TOTAL VOTES CAST” and others don’t have total rows at all, so they need to be created by summing other modes. Normalize the values and create an algorithm that picks the most representative rows.
  3. Some city names were slightly different across files. E.g., “Hot Springs” vs “Hot Springs National Park.” Use partial matching to address.
  4. Spacing and casing differences. Strip spaces and lowercase everything before matching.
  5. Additional differences in punctuation and abbreviation. Use a normalization table.
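The string-matching fixes in that list boil down to a couple of small functions. This is my own minimal reconstruction, not the model’s actual code (see the linked conversation for that):

```javascript
// Lowercase, strip punctuation, and collapse whitespace before comparing.
function normalize(name) {
  return name
    .toLowerCase()
    .replace(/[.,']/g, "")   // punctuation differences
    .replace(/\s+/g, " ")    // spacing differences
    .trim();
}

// Exact match, or a word-boundary prefix match for cases like
// "Hot Springs" vs "Hot Springs National Park".
function citiesMatch(a, b) {
  const x = normalize(a);
  const y = normalize(b);
  return x === y || x.startsWith(y + " ") || y.startsWith(x + " ");
}
```

Each rule is dumb on its own; stacked together they resolve most of the mismatches.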

All of these were found without further prompting or intervention. And as the cherry on top, the model even realized that the two Puerto Rican EAS communities weren’t in the election data because Puerto Ricans can’t vote for president.

Of course, given the state of LLMs today I still wouldn’t just trust the output without reviewing the code and doing some spot checks. In this case at least, I did that, and it passed with flying colors.

TLDR, my assumption about Trump voters is backed up by the data. Not earth shattering perhaps, but anything that makes the world a little more fact-based is a Very Good Thing. And most importantly, thanks to LLMs, this kind of research is available to all of us at any time. People love to talk about “brain rot” from AI — but we do that with every innovation. Gen X peeps, remember the uproar about calculators (55378008)? Use it well and it is transformational.

Anyways, if you’re starting your online screed with “I haven’t checked but I bet….” well, shame on you.

OK, but what about Cost and Energy?

It’s very popular to dismiss AI solutions due to their allegedly egregious energy use. The work I did here used 54,116 “tokens” — where a token is a unit of work kind of like a word but not quite. There isn’t a ton of data out there as to how much energy is used during inference, but a broad range between .001 and .01 Watt-hours per 1,000 tokens is cited pretty regularly.

Double that to cover infrastructure costs like cooling, split it down the middle and we can make a crazy rough estimate of .54Wh for the work in this post. That’s about the same as running two Google searches, or running a 10W light bulb for three and a half minutes. To me, this is a shockingly efficient use of energy, even if our guess is off by two or three times.
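For the skeptics, the arithmetic is easy to reproduce. Depending on exactly where you round, “double and split the middle” lands somewhere between .5 and .6 Wh — same ballpark either way:

```javascript
// Back-of-envelope: 54,116 tokens at a cited range of
// .001 to .01 Wh per 1,000 tokens, doubled for cooling/infrastructure.
const tokens = 54116;
const low = 0.001;                 // Wh per 1k tokens, low end of cited range
const high = 0.01;                 // Wh per 1k tokens, high end
const mid = (low + high) / 2;      // split the range down the middle
const estimateWh = (tokens / 1000) * mid * 2; // x2 for infrastructure
console.log(estimateWh.toFixed(2) + " Wh");   // roughly 0.6 Wh
```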

Ah you say, you can’t just look at inference — model training costs are astronomical. And that is true! But production models typically remain in use for around six to eighteen months before being superseded. Over that timeframe a model will be used for many billions of inferences; training costs quickly amortize to basically zero.

And none of this considers the innovation curve that is already happening to push costs down. Just as with traditional computing power, market forces (ha, get it?) are going to do their thing. This isn’t to say we shouldn’t be worried about AI in general — there’s a ton that could go wrong. But energy use isn’t going to be the problem.

OK, as usual I’ve gone way longer on this than anyone is going to read. But it’s endlessly fascinating to be here during this moment of innovation. It’s just unfortunate that it happens to overlap with existential threats to our American experiment. That part sucks.

Be the one who knows (Excel edition)

The utter haplessness of so many business leaders amazes me. Somehow we’ve created a culture in which the more “important” you are, the less you actually do for yourself. Folks talk about “staying above the details” or “focusing on the big picture” like that’s a good thing, but for my money it’s just code for lazy and/or ignorant. And honestly, at least in my experience, at the end of the day it’s usually easier to find answers yourself than it is to explain to somebody else what you need. Not always — but often enough.

Nowhere is this more true than with respect to data. Solid decisions want information — what’s driving customer behavior, service outages, cost trends, morale problems, schedule delays, merger options, marketing opportunities? Every knot is easier to untangle when you understand its data context. Do not underestimate the influence you can have just by being the one who knows stuff.

And it’s a lot easier than you think. I love my SQL, but there’s a ton of knowledge hiding behind just a few simple clicks with Excel (or Google sheets or whatever, sure. We all know who really grabbed the torch from Lotus and Multiplan and made the spreadsheet the powerhouse it is today). Seriously, if you spend ten minutes with me here, it’ll pay back just about every day until you retire and sit around spewing entropy into the universe.

Use these links to VIEW or DOWNLOAD the final version of the Excel workbook

0. Tables, rows and columns

“Tables” of data are everywhere — so ubiquitous that we just take them for granted. But 2D grids are really quite remarkable and can help us reason about just about anything. Each row of the table represents an instance of “something,” and each column represents a “feature” of that something. For example, I just randomly grabbed this screenshot of my song queue from the Spotify app:

The “something” in this table is a song. Each of the five columns represents a distinct feature of the song:

  • Position in the queue
  • Thumbnail image
  • Song title and Artist name
  • Album name
  • Duration of the song

Now, before we can do anything interesting with data like this, it needs to be in a format that Excel understands. We’ll look at that in a second, but for now just marvel at how many of these tables you use every day. Weather forecasts, credit card activity, Explorer (or Finder) windows on your laptop, contact lists, invoices, nutrition labels… even your grocery list is a table, albeit with only one column. Rows and rows of data, each described by distinct feature columns. Nerdvana.

1. Get the data

The hardest part about asking your own questions is usually just getting the data. In enterprise settings folks are always trying to “control the message,” so you end up with pretty but mostly static reports and charts that answer the questions they think you want to ask. But every reporting system has an option to export raw data to Excel, or more likely to a “CSV” file. CSV is just a fancy text file in which each line is a row and each column is separated by a comma (CSV == “comma-separated values”). As I’ve talked about before, it’s the workhorse of the modern age — efficient and universally accepted.
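If you’re curious just how simple CSV really is, here’s a toy parse in JavaScript (well-behaved files only — real-world CSVs with quoted, embedded commas need a proper parser):

```javascript
// A CSV is just lines of text: first line names the columns,
// each following line is one row of comma-separated values.
const csv = "date,kwh,cost\n2024-01-05,980,212.50\n2024-02-05,870,190.10";

const [header, ...lines] = csv.split("\n");
const cols = header.split(",");
const rows = lines.map(line => {
  const cells = line.split(",");
  // Build an object keyed by column name, one per row
  return Object.fromEntries(cols.map((c, i) => [c, cells[i]]));
});
console.log(rows[0]); // first data row as { date, kwh, cost }
```

That simplicity is exactly why every tool on earth can read it.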

When your question involves more public stuff — economic or census data, stock reports, weather, health, etc. — you can almost always find it in CSV format thanks to https://data.gov and its regional counterparts. The Obama-era “open data” initiatives created by folks like Aneesh Chopra, Todd Park, Greg Downing et al were really quite revolutionary; it’s a shame more people aren’t aware of them. I miss the heydays of Health Datapalooza!

For the purpose of this article, let’s ask the question — over the past few years, what’s happened to the cost of energy at my house in Bellevue, WA? Everybody talks about these costs being out of control, but the truth is we haven’t really felt it at home. So what’s up?

First let’s get the data for my house. Thanks again to those Obama years, I can get started by just logging into my Puget Sound Energy account and hitting the “Green Button” hiding at the bottom of the usage page. The download gives me three years of usage data in two files, one for electric and one for gas. Let’s double-click electric and see what happens:

Not bad! But not ideal for analysis either. Here are the first few things I always do to clean things up:

  1. Remove the title rows at the top (Name/Address/etc.) … everything in Excel is easier if your sheet is a uniform table with a single header row. Click and drag over the row numbers one through five to highlight those rows, then right-click and “delete” them.
  2. On the View ribbon, choose “Freeze Panes” and select “Freeze Top Row.” Now when you scroll around in the document, you’ll always be able to see the column headers.
  3. Hit Control-A to select the whole sheet, and then on the “Data” ribbon choose “Filter.” This adds little dropdown controls on each column which we’ll check out later.
  4. With the whole sheet still selected, double-click the vertical line between column letters A and B. This auto-adjusts all of the column widths so that you can see everything in each cell (e.g., you see start and end dates rather than a bunch of number symbols).

Much better!

First Quick Hits

There is a huge payoff just for the little bit of effort we’ve put in so far. First click on the column label F to select all of the cost data. At the bottom of the window, Excel computes some quick statistics for you — e.g., my average monthly bill for the period was $198.88. Click on the “Start Date” dropdown and unselect the years 2021 and 2022, and the average (for 2023 and 2024) actually drops to $196.89.

To look at the trend another way, click the “Start Date” dropdown again and make sure “(Select All)” is checked. Then on the “Insert” ribbon, choose “Recommended Charts” and then “Line Chart.” OK, it’s clear that winters cost more, but is there an overall trend? Click on the chart, then the “+” button on the right, then check “Trendline.” Boom! While our quick average check above hides it, there is a slow upward progression over time. Not enough to freak out about, but a hint.

Formulas

We haven’t looked at gas data yet. Also, I’d really like to go back further than 2021. The Green Button data doesn’t have this, but I can view charts into the past, and transcribing usage and cost data from those is pretty easy. This is another “be the one who knows” lesson — entering data manually feels like it takes forever, so most people avoid it. But the reality was just twenty minutes — time well spent in return for insights others will miss. Get ‘er done.

Next I combined the gas and energy data into a single table, and reorganized the columns a bit. Most interestingly, I added two formulas that normalize out seasonal usage variability.

Excel formulas are a bit of black magic, but with a few patterns you can get a lot of value for a little investment. The basic ideas are:

  • A cell whose contents starts with “=” is a formula — a computed value.
  • Formula inputs are typically other cells, referenced by column letter and row number: “A1” is the top left cell.
  • Standard mathematical operators (+, -, *, /) can be used in a formula. For example “=B3/2” resolves to the value in cell B3 divided by 2.
  • Formulas can also reference functions. For example, “=LEFT(B3,4)” returns the first 4 letters in the value in cell B3.
  • Some functions reference a range of cells, referenced by the top-left and bottom-right cells to include: “=AVERAGE(B3:C10)” computes the average value of cells in columns B and C from rows 3 through 10.

The values in column D (“Electric $/KwH”) were set up like this:

  1. Click on cell D2 and enter the formula “=C2/B2” — dividing the total month’s cost by the number of kilowatt hours used.
  2. With cell D2 selected, scroll all the way to the bottom of the table and shift-click on cell D97.
  3. Hit Control-D to “fill down” the formula from D2 into all of the selected rows.
  4. Click the “$” button in the middle of the “Home” ribbon to format all of these values as dollars.

Here’s the magic — somehow, every formula you just created is “correct” — that is, column D in row 50 references columns B and C from row 50 as well. How did that happen? It turns out that when you copy a formula, the input values are automatically updated so that they are in the same relative — not absolute — position as in the original formula. When we typed “=C2/B2” in cell D2, we were really saying “divide the cell one column to the left by the cell two columns to the left.”
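For the programmers in the room, the relative-reference behavior can be sketched in a few lines of JavaScript. This is a simplification of what fill-down does (it ignores absolute $-references, which Excel deliberately leaves alone):

```javascript
// A copied formula stores its inputs relative to its own cell, so filling
// down just bumps every row number by the distance moved.
function fillDown(formula, fromRow, toRow) {
  const delta = toRow - fromRow;
  // Rewrite each relative reference like "C2" or "B2"
  return formula.replace(/([A-Z]+)(\d+)/g,
    (_, col, row) => col + (Number(row) + delta));
}

console.log(fillDown("=C2/B2", 2, 50)); // "=C50/B50"
```

Same idea as the “one column to the left” framing above — just written out explicitly.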

Excel can do all kinds of “fills” — but the basic Control-D “fill down” is by far the most useful; you’ll find yourself there again and again. The exact same steps were used to compute the cost of natural gas per CCF (100 cubic feet, a standard measure for residential use) in column G, using the formula “=F2/E2” and filling down.

Line graphs and trendlines for these new normalized values look pretty similar to what we’ve seen before. There’s a late 2023 spike in gas, but that’s come down and otherwise looks like a very slow, steady upwards trend — not great, but not something to panic about either.

Add some reference data

Washington gets most of its electricity from hydropower, so there are some trends we’re protected from. Let’s see how my home data compares to the rest of the country, courtesy of https://data.gov. A quick search for “electric and gas costs monthly” led me directly to the US Dept of Energy’s “Total Energy” page, which had exactly what we’re looking for.

As always, my first step was to apply the same “freeze/filter/width” steps we used earlier to get familiar with the data. Starting with electric, it looks pretty good — one row for each monthly average, and values in the same KWH units as my home data. Sweet! We could just start charting things right here and eyeball the relationship to usage data. Not a bad idea at all, but let’s go one better and try to correlate the trends more closely. In thinking about that, two issues pop out:

  1. The file contains more than just Residential prices; the dropdown for “Description” also shows values for Commercial, Industrial, etc. We’ll need to make sure we use only the ones that represent household usage.
  2. Date values are in YYYYMM format. My usage data is keyed on the specific day that starts my billing cycle. Correlating these is going to take a bit of work.

While you can reference data between spreadsheet files, it’s a huge pain and almost always breaks. Much better is to add multiple “sheets” inside of your file, which are managed with the little tabs and controls at the bottom-left of the window (each file starts with a single “sheet” named “Sheet1”). Use Control-A / Control-C in the electric data to copy it all, then in your usage file (1) Click the “+” sign at the bottom-left to create a new tab; (2) Control-V to paste in your data; then (3) right-click the tab for your new sheet, choose “Rename” and call it “us-elec.” You could just leave the default names, but things are easy to keep track of with something mnemonic.

Combine the data with VLOOKUP

The next trick is to create a column in our table that represents the corresponding US average, so we can easily chart them together. This is going to require the biggest guns of our adventure, primarily a function called “VLOOKUP” (yes I know about XLOOKUP but I’m old). I’m going to try to balance explaining this all with keeping our collective sanity — for deeper dives there are a ton of fantastic books and tutorials out there. Again, the good news is that 99% of the time a few patterns will get you where you want to be.

The basic idea here is a formula that “looks up” data in another sheet using a per-row key that connects the values. In our workbook, that looks like this:

First, we need to set up a key value that matches the reference data (YYYYMM). Insert an empty column into our usage data by right-clicking on the Column B header, choosing “Insert,” and naming the column “YYYYMM.” Enter this formula into cell B2, using a bit of simple math to end up with the format we need:

=(YEAR(A2)*100)+MONTH(A2)

Select the column from B2 to the bottom and use Control-D to fill down. Progress! Now we need to use this column to “find” the correct value in the us-elec sheet. To make that happen, insert a column after “Electric $/KwH” and (don’t freak out) add this function in cell F2:

=VLOOKUP(B2,'us-elec'!$B$2:$C$639, 2, FALSE)/100

The first parameter to VLOOKUP is the key value — our computed value in B2. The next one is, let’s be honest, a bit of a hot mess. What we’re trying to do is to identify the “range” of cells that contains our target data. It breaks down like this:

  • ‘us-elec’! indicates that the data is on the tab called “us-elec”.
  • Cells B2:C639 on that tab contain the residential energy data we care about. The first column on this range must contain the key values.
  • The $ signs tell Excel that this range is “absolute” — that is, when we fill down, DON’T adjust the range like we saw earlier.

The third parameter “2” indicates that our target value (the cost data) is in the second column of the range. And finally, the last parameter FALSE says that matches must be “exact.” Don’t worry about this one — you’re pretty much always just going to use FALSE.

Notice that we divide the found value by 100 — the reference data is in cents, not dollars. Careful with stuff like this — doing the wrong thing can make you look pretty stupid!
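If it helps to see VLOOKUP’s exact-match mode stripped of Excel’s syntax, here’s the same behavior in JavaScript (the numbers are made up, not real us-elec data):

```javascript
// VLOOKUP(key, range, col, FALSE): find the row whose FIRST column equals
// the key, then return the value in the requested (1-based) column.
const usElec = [
  [202301, 15.47],  // [YYYYMM, cents per KWH] -- made-up sample values
  [202302, 15.85],
];

function vlookup(key, range, col) {
  const row = range.find(r => r[0] === key); // exact match on first column
  return row ? row[col - 1] : "#N/A";        // Excel's "not found" marker
}

console.log(vlookup(202301, usElec, 2) / 100); // cents -> dollars
```

Note the same divide-by-100 at the end — unit conversion is part of the lookup either way.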

Fill down this formula to the bottom of column F (you’re starting to get good at this). Control-click the column headers A, E and F, then choose “Recommended Charts” and “Line Chart” on the “Insert” tab:

Interesting — while we definitely have cheaper electricity than the rest of the country, the trend looks pretty darn similar. But check out what happens when we do the same work for natural gas!

What’s with those summer spikes? It turns out that the national average is dominated by deregulated markets in Texas and a few other states — it’s easy to understand why folks get annoyed when prices fluctuate so dramatically. But even smoothing that out with a trendline, the average is rising faster nationally than it is in Washington. There’s always more to love about the PNW: a diverse energy portfolio, proximity to supply, a relatively mild climate, and the big bad WUTC.

One More Trick: (Basic) Pivot Tables

Pivot tables will mess with your mind. They’re awesome, but also pretty fundamentally incomprehensible. At least to me. But that doesn’t mean we can’t use them to our advantage! A few simple patterns make it a breeze to summarize and compare data across dimensions. Let’s check that out.

For most of this exercise we’ve been focused on price-per-unit data, and that’s shown to have been pretty stable. But in the very first chart, we saw that my total bills fluctuated seasonally, which makes intuitive sense — Washington winters are going to require more energy than Washington summers. But some winters are colder and some summers hotter; can we see that in the data too?

To make this easier, we’ll add two more columns to our main sheet; one for the month and one for the year. These are simple formula columns; use the same patterns we’ve already walked through. Now Control-A to select all of the data, and then click “Pivot Table” on the “Insert” tab and accept the defaults. This will create a new sheet — it looks familiar, but rather than editing cells directly we’re going to use the “PivotTable Fields” control on the right:

  1. Drag the “Month” field to the “Rows” box.
  2. Drag the “Year” field to the “Columns” box.
  3. Drag the “Gas Usage (CCF)” field to the “Values” box.
  4. Back on the “Insert” ribbon, choose “Recommended Charts” and “Line”.

Woo hoo — this view really highlights seasonal usage variability:

But once again, it’s pretty stable from year-to-year. The only interesting bit I see is in 2020 and 2021, when total usage was a bit lower because we spent most of those years out on Whidbey Island. But even that gets lost in the muddle of overlapping years in the chart.
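If pivot tables mess with your mind too, it may help to see what they’re doing in code. The grouping above — months as rows, years as columns, summed usage as values — is just a nested accumulation (toy CCF numbers, not my real data):

```javascript
// Toy usage rows; the pivot groups by (month, year) and sums the value.
const usage = [
  { month: 1, year: 2023, ccf: 90 },
  { month: 1, year: 2024, ccf: 85 },
  { month: 1, year: 2023, ccf: 10 },
];

const pivot = {};
for (const { month, year, ccf } of usage) {
  pivot[month] ??= {};                                  // one row per month
  pivot[month][year] = (pivot[month][year] || 0) + ccf; // one column per year
}
console.log(pivot); // months as keys, years nested inside, CCF summed
```

Rows, columns, values — that’s the whole trick; Excel just wraps it in drag-and-drop.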

So what do we know?

None of this analysis looks at oil or gasoline prices, and those have had their own trajectories. But I can say with some data-backed confidence:

  • Overall energy use in my home has remained relatively stable over the last eight years.
  • Like most other goods I buy, energy costs have risen slowly, but not dramatically.
  • Washington State has some key factors that keep these energy prices lower and more stable than in many other parts of the country.
  • A number of specific incidents (in particular COVID and the inflationary period in 2022/2023) are clearly visible in the data.

Most importantly, I have a solid “feel” for how things have moved around over the years. I am well-armed in a conversation to rebut extreme or cherry-picked claims. And from a career perspective, I’m going to stand out as a knowledgeable and — assuming I don’t act like an a**hole — valuable member of the team making decisions.

Just do it. Find or ask for the data, open it up in Excel or your favorite spreadsheet, and (as the tech bros say) “f*ck around and find out.” It takes way less effort than you think and will pay back far more than you expect. Remember: you don’t need to be a huge data wonk, you just need to know a few key patterns. “Open; Control-A; Freeze; Filter; Widths; Recommended Charts” gets you 80% of the way there. You’re welcome.