Every business is a process shop — a tangled mess of human and automated activities that work together to produce something folks are willing to pay for. And even as AI starts to handle specific jobs and tasks, that inherent complexity doesn’t go away.
Enterprise software is the glue that keeps the machine running, and if you’ve ever been on the hook for it working correctly, you’ve implemented some kind of monitoring solution. Log processors, web pingers, process monitors, “on call” scheduling — there’s an entire industry of software that just watches other software (and people, and AI) to make sure all is well.
And yet, we still get surprised by catastrophic failure — the backup that we thought was happening every night; the SSL certificate we swore was on auto-renewal; the battery-operated door lock that failed-open over the weekend; the ETL job configured to run under that retired guy’s account.
So what do we do? Monitor the monitors, of course. But what if they fail? One of my favorite old saws is turtles all the way down — it seems like there’s no bottom to this stack! That’s why, everywhere I’ve ever been, I’ve built something like backstop. Even in my retired life, it’s an essential tool.
Backstop
The idea behind backstop is to have one authoritative, affirmative check on your world, generally once per day (I run mine about 4am). Backstop is the heartbeat of your enterprise, showing up on schedule with a single, consolidated, explicit, proactive look at everything that matters.
One person needs to expect the backstop email every morning. If it doesn’t appear, silence is not golden: find out why. If it shows errors or warnings, find out why. If anything looks funny, track it down. (Honestly, this person should be your CTO or CIO — nothing else gives better intuition for “how it’s going,” and that awareness is gold.)
This doesn’t obviate the need for any of your other monitoring and alerts — they are more timely and more detailed. Backstop is an assurance that the machine is working and that nothing is falling through the cracks. A good backstop has four critical properties that need careful attention:
1. Bulletproof
The most important feature of a good backstop is that it finishes and reports. Every exception needs to be caught; every hung request needs to time out. Each metric you’re measuring needs to be checked independently — failure of one check cannot stop evaluation of another.
This is easy to get wrong, especially because you’re likely to be relying on a bunch of third party libraries to monitor proprietary services and apps. That’s why the human element is absolutely critical. If the backstop email doesn’t show up on schedule, a real person needs to notice and they need to fix it.
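To make that concrete, here’s a sketch of what a bulletproof check loop looks like — plain Java, not the actual backstop classes; every name here is illustrative. Each check runs inside its own try/catch with a hard timeout, so a throwing or hanging check can never silence the rest:

```java
import java.util.*;
import java.util.concurrent.*;

// Illustrative "bulletproof" loop: every check produces exactly one
// result line, no matter what it throws or how long it hangs.
public class BulletproofLoop {

    // Run every check; catch exceptions and enforce a per-check timeout
    // so one bad check can't stop evaluation of the others.
    static List<String> runAll(Map<String, Callable<String>> checks, long timeoutSec) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<String> results = new ArrayList<>();
        try {
            for (Map.Entry<String, Callable<String>> e : checks.entrySet()) {
                Future<String> f = pool.submit(e.getValue());
                try {
                    results.add("OK," + e.getKey() + "," + f.get(timeoutSec, TimeUnit.SECONDS));
                } catch (TimeoutException te) {
                    f.cancel(true); // interrupt the hung check
                    results.add("ERROR," + e.getKey() + ",timed out");
                } catch (Exception ex) {
                    results.add("ERROR," + e.getKey() + "," + ex.getMessage());
                }
            }
        } finally {
            pool.shutdownNow();
        }
        return results;
    }

    public static void main(String[] args) {
        Map<String, Callable<String>> checks = new LinkedHashMap<>();
        checks.put("good", () -> "all well");
        checks.put("throws", () -> { throw new RuntimeException("boom"); });
        checks.put("hangs", () -> { Thread.sleep(60_000); return "never"; });
        runAll(checks, 2).forEach(System.out::println);
    }
}
```

The point isn’t the plumbing, it’s the guarantee: every check yields exactly one line in the report.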
2. Complete
Asking a human to check in on dozens (or hundreds) of independent subsystems is untenable; a backstop fixes this by creating a single tip of the spear. For that to work, it needs to be a complete look at your environment.
Your best friend in this endeavor will be something like ProcessResource — a type of checker that can run an arbitrary sub-process. While I’m the first to advocate for limiting dependencies, reality is complex and so is your environment. You surely rely on some system that only has a Node or Python client library, and another that has its own native client, etc. In the backstop use case, completeness is more important than consistency, so hold your nose and script away.
It’s also important to evolve “completeness” over time. That means adding support for new systems, of course, but also catching up on old ones. Unless you suck at your job, most outages reveal new failure modes — adding new checks to your backstop should be a routine part of your post-mortem process.
3. Clean
Nothing spikes my blood pressure quite like somebody saying “oh that happens all the time, we just ignore it.” It’s not just lazy, it’s corrosive — not only will your “real” alerts get lost in the noise, but the “ignorable” ones are almost always worse under the covers than you think.
You have to be able to see what’s wrong. If you’ve accidentally coded a bad metric, change or remove it. If something is time-bound — e.g., you’ve already set a plan to fix it on a specific date in the future — implement a pause that wakes up if that date is missed. But under no circumstances can you allow errors and warnings to persist over time. Please trust me on this.
4. Actionable
Last — every resource you track should include a link that gets you to the right place to investigate and learn more. By definition a backstop problem is an exception, which means a disruption to your carefully-curated calendar of stupid meetings. It’s imperative that you can dive in quickly and figure out what’s up.
This link can be a lot of things — a more detailed look at the resource itself; a pre-filled form to open a trouble ticket; a diagnosis cookbook on a wiki; whatever works. But especially if you have a junior or specialized engineer looking at the backstop error list, knowing where to start can make a huge difference.
Important Metrics
Age / Activity

This is probably the most important backstop metric, because it’s the one that is most often missed by traditional monitors. Some process (or a monitor!) just stops working, but we don’t notice until it’s too late, because silence seems golden.
These failures also tend to create the worst headaches, because they cause damage over time. Backups that don’t get done, key indicators missing critical inputs, that kind of thing.
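A minimal sketch of an age check — not backstop code, just the idea, with made-up paths and thresholds: ask whether a fresh artifact actually exists, rather than whether some job claimed to run:

```java
import java.io.IOException;
import java.nio.file.*;
import java.time.*;

// Illustrative age/activity check: flag a backup file that is
// missing or older than the expected cadence.
public class AgeCheck {

    // Returns null if the file was modified within maxAge,
    // else a human-readable problem description.
    static String checkAge(Path file, Duration maxAge, Instant now) throws IOException {
        if (!Files.exists(file)) return "missing: " + file;
        Instant modified = Files.getLastModifiedTime(file).toInstant();
        Duration age = Duration.between(modified, now);
        if (age.compareTo(maxAge) > 0) {
            return "stale: " + file + " is " + age.toHours() + " hours old";
        }
        return null; // all is well — no Status, in backstop terms
    }

    public static void main(String[] args) throws IOException {
        // Demo against a file we just created, so it reads as fresh
        Path tmp = Files.createTempFile("backup", ".tar.gz");
        String result = checkAge(tmp, Duration.ofHours(26), Instant.now());
        System.out.println(result == null ? "OK" : result);
        Files.delete(tmp);
    }
}
```

The 26-hour threshold (rather than 24) leaves slack for a nightly job that runs a little late — the kind of tolerance every age check needs.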
Trigger Dates

A bunch of processes happen on what I call the “slow clock” — stuff you have to do every quarter, every year or even every few years. In my retired life these are things like renewing my driver’s license or cleaning the air filter in my furnace. In enterprises they’re more like audits, disaster recovery exercises, and domain renewals. Calendars help with these, but slow clock reminders can get lost amongst daily meetings and more immediate events.
Levels

Things rarely fall apart overnight — they slowly degrade, unnoticed, over time. Smoke alarm batteries are a great example; so is the water level in our community storage tank.
When these alert at night or in the middle of the day, busy humans tend to ignore them (“I’ll get to that later”). But as a backstop metric, they become visible in the right context — at the right time, together with other outstanding issues.
Availability

This is the OG monitoring classic: is the web server responding? And if you’re fancy, can you perform basic tasks like login or search? These aren’t usually the most important backstops, but they can be useful checks, especially for lesser-used services that otherwise are ignored until the moment they become critical.
My Backstops, aka Code is All That Matters
I’ve written my own backstop harness because, well, I get to choose. I actually don’t know of a commercial or open source tool that really does this job, but there probably is one. Mine is written in Java; it’s free to use and modify on GitHub. If you’ve got a system with git, Java and Maven installed you can try it out like this:
git clone https://github.com/seanno/shutdownhook.git
cd shutdownhook/toolbox
mvn clean package install
cd ../backstop
mvn clean package
java -cp target/backstop-1.0-SNAPSHOT.jar \
com.shutdownhook.backstop.App \
config-demo.json PRINT
You’ll see a bit of log output but then most importantly a couple of lines like this:
OK,Google,,2138 ms response
OK,Proof of Life,,I ran, therefore I am.
The “demo” configuration file contains two resources: one that simply reports back “OK” and one that checks availability of https://google.com. The “PRINT” argument tells the app to just output to console rather than sending an email.
What’s Going On Here
The code is pretty simple, and purposefully so — its job is to be rock-solid and always, always, send an email at the end. Plus, we want to collect as much useful information as we can, so failures in one resource can’t impact the others.
Configuration starts with a list of “resources”, each defined by a name, url, java class name and map of class-specific parameters. A resource class must implement the Checker interface, doing whatever it needs to and returning results as zero or more Status objects, where zero means all is well. Checkers also have access to a convenience object offering common services like web requests and JSON management.
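For illustration, a configuration shaped like that description might look something like this — the actual schema lives in config-demo.json in the repo, and these field names are my paraphrase of the structure, not gospel:

```json
{
  "resources": [
    {
      "name": "Google",
      "url": "https://google.com",
      "className": "com.shutdownhook.backstop.WebResource",
      "params": { "timeoutSeconds": 30 }
    },
    {
      "name": "Proof of Life",
      "className": "com.shutdownhook.backstop.DescartesResource",
      "params": { }
    }
  ]
}
```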
In the normal case, the entrypoint in Backstop.java just: (1) uses an Executor pool to tell the checkers to do their thing; (2) collects and sorts the Status responses into a single list with the worst offenders at the top; and (3) uses Azure to send an HTML email with the results.
Again, you’ll notice a ton of defensive code throughout — Backstop is a special snowflake.
Not counting my favorite existential DescartesResource, so far I’ve implemented five resource checker types for my personal use:
- ProcessResource just runs an external process. It’s a great catchall for all of the weirdo devices and services out there. Currently I’m using this one to monitor my Ring cameras and alarms, as well as the LoRaWAN sensor monitoring our community water tank on Whidbey.
- TempStickResource keeps an eye on the sensor inside our Garage freezer.
- TempestResource monitors my home weather station.
- WebResource takes care of web availability — the old standby.
- TriggerResource is for those slow clock events; more on this below.
TriggerResource
This resource type reads “slow clock” events out of a Google Spreadsheet and alerts when deadlines are approaching or past. The best way to get a sense of this is to look at a few items from my household triggers:

My dog Copper needs his flea and tick pill once a month and we always used to forget. I’m secretary of our community HOA on Whidbey and that means some paperwork every year. My beloved electric boat has old-school batteries that need topping off once in a while, and my license is going to expire next year.
The trigger resource code simply loads rows from the spreadsheet and checks to see whether each due date is past (ERROR) or upcoming within an optional warning period (WARNING).
While many of these events recur, the sheet isn’t smart about that. Once a row “fires,” the only way to turn it off is to edit the spreadsheet (using the link from the backstop email) and either change the “Due Date” to the next occurrence or add a “Snooze Until” date.
Snooze is useful for things like my license — I set up my appointment for next month, so until then there’s no reason to pollute my backstop list. As simple as this is, I find it pretty transformational. Adulting is chock full of stupid things you’re supposed to remember — maybe you’ll get a reminder or maybe not. A backstop trigger list is the perfect security blanket.
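The due-date logic described above can be sketched in a few lines — field names here are my guess at the spreadsheet columns, not the actual TriggerResource schema:

```java
import java.time.LocalDate;

// Illustrative trigger evaluation: a row fires as an ERROR once its
// due date is past, as a WARNING inside an optional warning window,
// and stays quiet while snoozed.
public class TriggerCheck {

    enum Level { OK, WARNING, ERROR }

    static Level evaluate(LocalDate dueDate, int warningDays,
                          LocalDate snoozeUntil, LocalDate today) {
        // A "Snooze Until" date in the future silences the row entirely
        if (snoozeUntil != null && today.isBefore(snoozeUntil)) return Level.OK;
        if (today.isAfter(dueDate)) return Level.ERROR;
        if (!today.isBefore(dueDate.minusDays(warningDays))) return Level.WARNING;
        return Level.OK;
    }

    public static void main(String[] args) {
        LocalDate due = LocalDate.of(2025, 6, 1);
        System.out.println(evaluate(due, 14, null, LocalDate.of(2025, 5, 1)));  // OK
        System.out.println(evaluate(due, 14, null, LocalDate.of(2025, 5, 25))); // WARNING
        System.out.println(evaluate(due, 14, null, LocalDate.of(2025, 6, 2)));  // ERROR
        System.out.println(evaluate(due, 14, LocalDate.of(2025, 7, 1),
                                    LocalDate.of(2025, 6, 2)));                 // OK (snoozed)
    }
}
```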
Sending Email
I’ve chosen to use Azure Communication Services to send the backstop email. SMTP used to be so easy — but that was before spam and phishing and all the other nasties that took advantage of its simplicity. These days, reliably sending email that doesn’t land right in the Junk folder is a big hassle. Azure makes this pretty easy, and it’s dirt cheap — less than a dollar a year for once-a-day emails!
I don’t love the dependency, but it seems like the right balance.
Deployment and Logistics
The “last mile” for backstop is deciding where it should run and how it should be triggered. It is not a resource-intensive operation, so the old school option isn’t a bad one: dedicate a single small server or VM to the job, triggered with cron once a day. Sorted!
But this simplicity does come with a big downside — patching that server and keeping it up to date. In an enterprise you may already have good infrastructure for this, and if so go for it. But in my world, servers left on their own tend to decay over time.
I’ve tried to avoid this by using a couple of Azure services to do the job for me. The first I like a lot — the script docker-build.sh creates a container that runs in Azure Container Instances without a dedicated server. The container does its thing and then shuts down, so it’s also dirt cheap, just pennies a month.
That leaves just the cron part — something has to trigger the container to run every morning. I’m pretty surprised this isn’t just part of ACI, but it’s not. The solution I landed on is a timer-based Azure Function. My function uses a cron-style schedule to run each morning, scripting a start to the proper container.
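For reference, a timer-triggered Azure Function declares its schedule as a six-field NCRONTAB expression in function.json; a fragment like this one (the binding name is arbitrary, and my actual schedule may differ) would fire at 4am UTC every day:

```json
{
  "bindings": [
    {
      "name": "dailyTimer",
      "type": "timerTrigger",
      "direction": "in",
      "schedule": "0 0 4 * * *"
    }
  ]
}
```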
This was a bear to get right. I’m not going to let myself spiral into yet another rant about how poor the Azure developer experience can be — just know it is rubbish. You know who really helped out here? My good friend Claude; way better than any Azure help resource I could find. Whew.
There’s Always Another Resource
I have a pretty long list of resources I’m planning to add to my backstop:
- FLO whole-home water shutoff
- Various GE appliances, in particular for rinse-aid in the dishwasher (finally we’re getting down to the real problems)
- Tesla Powerwall and Enphase panels/inverters
- More shutdownhook demo apps
- The Rivian!
- Electric, water and gas usage
- … and on and on …
Our lives and our enterprises are pretty complicated — and every new piece of smart technology that seems (and is) so great carries its own tax. Servers, services, accounts, batteries, it adds up. To keep things humming you really do need a backstop. A single tip of the spear from which all of the mess can be corralled and observed. I hope you’ll give it a try — with my code or your own. Until next time!