Turtles all the way down

Every business is a process shop — a tangled mess of human and automated activities that work together to produce something folks are willing to pay for. And even as AI starts to handle specific jobs and tasks, that inherent complexity doesn’t go away.

Enterprise software is the glue that keeps the machine running, and if you’ve ever been on the hook for it working correctly, you’ve implemented some kind of monitoring solution. Log processors, web pingers, process monitors, “on call” scheduling — there’s an entire industry of software that just watches other software (and people, and AI) to make sure all is well.

And yet, we still get surprised by catastrophic failure — the backup that we thought was happening every night; the SSL certificate we swore was on auto-renewal; the battery-operated door lock that failed open over the weekend; the ETL job configured to run under that retired guy’s account.

So what do we do? Monitor the monitors, of course. But what if they fail? One of my favorite old saws is turtles all the way down — it seems like there’s no bottom to this stack! That’s why, everywhere I’ve ever been, I’ve built something like backstop. Even in my retired life, it’s an essential tool.

Backstop

The idea behind backstop is to have one authoritative, affirmative check on your world, generally once per day (I run mine about 4am). Backstop is the heartbeat of your enterprise, showing up on schedule with a single, consolidated, explicit, proactive look at everything that matters.

One person needs to expect the backstop email every morning. If it doesn’t appear, silence is not golden: find out why. If it shows errors or warnings, find out why. If anything looks funny, track it down. (Honestly, this person should be your CTO or CIO — nothing else gives better intuition for “how it’s going,” and that awareness is gold.)

This doesn’t obviate the need for any of your other monitoring and alerts — they are more timely and more detailed. Backstop is an assurance that the machine is working and that nothing is falling through the cracks. A good backstop has four critical properties that need careful attention:

1. Bulletproof

The most important feature of a good backstop is that it finishes and reports. Every exception needs to be caught; every hung request needs to time out. Each metric you’re measuring needs to be checked independently — failure of one check cannot stop evaluation of another.

This is easy to get wrong, especially because you’re likely to be relying on a bunch of third-party libraries to monitor proprietary services and apps. That’s why the human element is absolutely critical. If the backstop email doesn’t show up on schedule, a real person needs to notice, and they need to fix it.
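To make the independence requirement concrete, here is a minimal sketch (not the actual backstop code, and with hypothetical names) of running each check in its own task with a hard timeout, so one hung or throwing check can never take down the rest:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

// Sketch: each check runs in its own task; exceptions and timeouts are
// converted into result strings so the loop always reaches the next check.
public class IsolatedChecks {

    public static Map<String, String> runAll(Map<String, Supplier<String>> checks,
                                             long timeoutSeconds) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        Map<String, String> results = new LinkedHashMap<>();
        try {
            for (Map.Entry<String, Supplier<String>> e : checks.entrySet()) {
                Future<String> f = pool.submit(() -> e.getValue().get());
                try {
                    // waits at most timeoutSeconds; a hung check becomes an ERROR
                    results.put(e.getKey(), f.get(timeoutSeconds, TimeUnit.SECONDS));
                } catch (TimeoutException te) {
                    f.cancel(true);
                    results.put(e.getKey(), "ERROR: timed out");
                } catch (Exception ex) {
                    // a throwing check becomes an ERROR, never a crash
                    results.put(e.getKey(), "ERROR: " + ex.getMessage());
                }
            }
        } finally {
            pool.shutdownNow();
        }
        return results;
    }
}
```

The point is the shape, not the details: every path through the loop produces a result, so the final report always gets assembled and sent.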

2. Complete

Asking a human to check in on dozens (hundreds) of independent subsystems is untenable; a backstop fixes this by creating a single tip of the spear. For that to work, it needs to be a complete look at your environment.

Your best friend in this endeavor will be something like ProcessResource — a type of checker that can run an arbitrary sub-process. While I’m the first to advocate for limiting dependencies, reality is complex and so is your environment. You surely rely on some system that only has a Node or Python client library, another that has its own native client, etc. etc. In the backstop use case, completeness is more important than consistency, so hold your nose and script away.
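As a sketch of the idea (the real ProcessResource in the repo may look different, and this assumes a POSIX shell is available), here is a checker that runs an arbitrary command and turns its exit code into a status:

```java
import java.io.IOException;
import java.util.concurrent.TimeUnit;

// Sketch of a ProcessResource-style checker: run any command, wait with a
// timeout, and map the exit code to a status line. The convention assumed
// here is the usual one: exit code zero means all is well.
public class ProcessCheck {

    public static String run(String[] command, long timeoutSeconds) {
        try {
            Process p = new ProcessBuilder(command)
                .redirectErrorStream(true)   // fold stderr into stdout
                .start();
            if (!p.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
                p.destroyForcibly();
                return "ERROR: timed out";
            }
            int exit = p.exitValue();
            return (exit == 0) ? "OK" : "ERROR: exit code " + exit;
        } catch (IOException | InterruptedException e) {
            return "ERROR: " + e.getMessage();
        }
    }
}
```

With something like this in place, any system you can poke from a shell script becomes backstop-able, whatever language its client library happens to be in.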

It’s also important to evolve “completeness” over time. That means adding support for new systems, of course, but also catching up on old ones. Unless you suck at your job, most outages reveal new failure modes — adding new checks to your backstop should be a routine part of your post-mortem process.

3. Clean

Nothing spikes my blood pressure quite like somebody saying “oh that happens all the time, we just ignore it.” It’s not just lazy, it’s corrosive — not only will your “real” alerts get lost in the noise, but the “ignorable” ones are almost always worse under the covers than you think.

You have to be able to see what’s wrong. If you’ve accidentally coded a bad metric, change or remove it. If something is time-bound — e.g., you’ve already set a plan to fix it on a specific date in the future — implement a pause that wakes up if that date is missed. But under no circumstances can you allow errors and warnings to persist over time. Please trust me on this.

4. Actionable

Last — every resource you track should include a link that gets you to the right place to investigate and learn more. By definition a backstop problem is an exception, which means a disruption to your carefully-curated calendar of stupid meetings. It’s imperative that you can dive in quickly and figure out what’s up.

This link can be a lot of things — a more detailed look at the resource itself; a pre-filled form to open a trouble ticket; a diagnosis cookbook on a wiki; whatever works. But especially if you have a junior or specialized engineer looking at the backstop error list, knowing where to start can make a huge difference.

Important Metrics

Age / Activity

This is probably the most important backstop metric, because it’s the one that is most often missed by traditional monitors. Some process (or a monitor!) just stops working, but we don’t notice until it’s too late, because silence seems golden.

These failures also tend to create the worst headaches, because they cause damage over time. Backups that don’t get done, key indicators missing critical inputs, that kind of thing.
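A minimal sketch of such a check, with hypothetical names: given the timestamp of the last successful run (backup, ETL job, whatever), flag it when it goes stale rather than waiting for someone to notice the silence:

```java
import java.time.Duration;
import java.time.Instant;

// Sketch of an age/activity check: affirmative proof that something ran
// recently, instead of the absence of an alarm.
public class AgeCheck {

    // "now" is passed in rather than read from the clock, to keep this testable
    public static String check(Instant lastActivity, Duration maxAge, Instant now) {
        Duration age = Duration.between(lastActivity, now);
        if (age.compareTo(maxAge) > 0) {
            return "ERROR: last activity " + age.toHours() + " hours ago";
        }
        return "OK";
    }
}
```

The source of `lastActivity` varies — the mtime of the newest backup file, the max timestamp in a log table, whatever proves the work actually happened.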

Trigger Dates

A bunch of processes happen on what I call the “slow clock” — stuff you have to do every quarter, every year or even every few years. In my retired life these are things like renewing my driver’s license or cleaning the air filter in my furnace. In enterprises they’re more like audits, disaster recovery exercises, and domain renewals. Calendars help with these, but slow clock reminders can get lost amongst daily meetings and more immediate events.

Levels

Things rarely fall apart overnight — they slowly degrade, unnoticed, over time. Smoke alarm batteries are a great example; so is the water level in our community storage tank.

When these alert at night or in the middle of the day, busy humans tend to ignore them (“I’ll get to that later”). But as a backstop metric, they become visible in the right context — at the right time, together with other outstanding issues.

Availability

This is the OG monitoring classic: is the web server responding? And if you’re fancy, can you perform basic tasks like login or search? These aren’t usually the most important backstops, but they can be useful checks, especially for lesser-used services that otherwise are ignored until the moment they become critical.

My Backstops, aka Code is All That Matters

I’ve written my own backstop harness because, well, I get to choose. I actually don’t know of a commercial or open source tool that really does this job, but there probably is one. Mine is written in Java; it’s free to use and modify on GitHub. If you’ve got a system with git, Java and Maven installed you can try it out like this:

git clone https://github.com/seanno/shutdownhook.git
cd shutdownhook/toolbox
mvn clean package install
cd ../backstop
mvn clean package
java -cp target/backstop-1.0-SNAPSHOT.jar \
    com.shutdownhook.backstop.App \
    config-demo.json PRINT

You’ll see a bit of log output but then most importantly a couple of lines like this:

OK,Google,,2138 ms response
OK,Proof of Life,,I ran, therefore I am.

The “demo” configuration file contains two resources: one that simply reports back “OK” and one that checks availability of https://google.com. The “PRINT” argument tells the app to just output to console rather than sending an email.

What’s Going On Here

The code is pretty simple, and purposefully so — its job is to be rock-solid and always, always, send an email at the end. Plus, we want to collect as much useful information as we can, so failures in one resource can’t impact the others.

Configuration starts with a list of “resources”, each defined by a name, url, java class name and map of class-specific parameters. A resource class must implement the Checker interface, doing whatever it needs to and returning results as zero or more Status objects, where zero means all is well. Checkers also have access to a convenience object offering common services like web requests and JSON management.
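Paraphrasing that plugin model in code (the actual interface and class names in the repo may differ), a checker might look like this:

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Sketch of the plugin model described above, with illustrative names.
public class CheckerSketch {

    public enum Level { OK, WARNING, ERROR }

    public static class Status {
        public final Level level;
        public final String message;
        public Status(Level level, String message) {
            this.level = level;
            this.message = message;
        }
    }

    // A resource class implements this, returning zero or more Status
    // objects; zero statuses means all is well.
    public interface Checker {
        List<Status> check(Map<String, String> params) throws Exception;
    }

    // The existential demo resource: it succeeds by virtue of running at all.
    public static class DescartesChecker implements Checker {
        @Override
        public List<Status> check(Map<String, String> params) {
            return Collections.emptyList(); // I ran, therefore I am.
        }
    }
}
```

The harness stays generic this way: configuration names the class, reflection instantiates it, and the per-resource parameter map carries everything checker-specific.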

In the normal case, the entrypoint in Backstop.java just: (1) uses an Executor pool to tell the checkers to do their things; (2) collects and sorts the Status responses into a single list with the worst offenders at the top; and (3) uses Azure to send an HTML email with the results.
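Step (2) is just a severity sort. A sketch with illustrative names, assuming CSV-style status lines like the demo output above:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sketch: sort collected status lines so the worst offenders land at the
// top of the email. Assumes lines like "WARNING,Backup,,2 days old" with
// the severity level in the first field.
public class WorstFirst {

    public enum Level { ERROR, WARNING, OK } // ordinal doubles as severity rank

    public static List<String> sort(List<String> lines) {
        List<String> sorted = new ArrayList<>(lines);
        sorted.sort(Comparator.comparingInt(
            (String line) -> Level.valueOf(line.split(",", 2)[0]).ordinal()));
        return sorted;
    }
}
```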

Again, you’ll notice a ton of defensive code throughout — Backstop is a special snowflake.

Not counting my favorite existential DescartesResource, so far I’ve implemented five resource checker types for my personal use:

TriggerResource

This resource type reads “slow clock” events out of a Google Spreadsheet and alerts when deadlines are approaching or past. The best way to get a sense of this is to look at a few items from my household triggers:

My dog Copper needs his flea and tick pill once a month and we always used to forget. I’m secretary of our community HOA on Whidbey and that means some paperwork every year. My beloved electric boat has old-school batteries that need topping off once in a while, and my license is going to expire next year.

The trigger resource code simply loads up rows of triggers like these from a spreadsheet and checks to see if each due date is past (ERROR) or upcoming within an optional WARNING period.

While many of these are recurring, the sheet isn’t smart about that. Once a row “fires”, the only ways to turn it off are to edit the spreadsheet (using the link from the backstop email) and either change the “Due Date” to the next occurrence or add a “Snooze Until” date.
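The evaluation boils down to a few date comparisons. Here is a sketch with hypothetical names: ERROR when the due date is past, WARNING inside an optional warning window, and quiet while a “Snooze Until” date is still in the future:

```java
import java.time.LocalDate;

// Sketch of trigger evaluation as described above; illustrative names only.
public class TriggerEval {

    public static String evaluate(LocalDate dueDate, int warnDays,
                                  LocalDate snoozeUntil, LocalDate today) {
        if (snoozeUntil != null && today.isBefore(snoozeUntil)) {
            return "OK"; // snoozed; nothing to see here
        }
        if (today.isAfter(dueDate)) {
            return "ERROR: due " + dueDate;
        }
        if (!today.isBefore(dueDate.minusDays(warnDays))) {
            return "WARNING: due " + dueDate;
        }
        return "OK";
    }
}
```

Dead simple on purpose — all the cleverness lives in the spreadsheet, which anyone in the household (or the enterprise) can edit.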

Snooze is useful for things like my license — I set up my appointment for next month, so until then there’s no reason to pollute my backstop list. As simple as this is, I find it pretty transformational. Adulting is chock full of stupid things you’re supposed to remember — maybe you’ll get a reminder or maybe not. A backstop trigger list is the perfect security blanket.

Sending Email

I’ve chosen to use Azure Communication Services to send the backstop email. SMTP used to be so easy — but that was before spam and phishing and all the other nasties that took advantage of its simplicity. These days, reliably sending email that doesn’t land right in the Junk folder is a big hassle. Azure makes this pretty easy, and it’s dirt cheap — less than a dollar a year for once-a-day emails!

I don’t love the dependency, but it seems like the right balance.

Deployment and Logistics

The “last mile” for backstop is deciding where it should run and how it should be triggered. It is not a resource-intensive operation, so the old school option isn’t a bad one: dedicate a single small server or VM to the job, triggered with cron once a day. Sorted!

But this simplicity does come with a big downside — patching that server and keeping it up to date. In an enterprise you may already have good infrastructure for this, and if so go for it. But in my world, servers left on their own tend to decay over time.

I’ve tried to avoid this by using a couple of Azure services to do the job for me. The first I like a lot — the script docker-build.sh creates a container that runs in Azure Container Instances without a dedicated server. The container does its thing and then shuts down, so it’s also dirt cheap, just pennies a month.

That leaves just the cron part — something has to trigger the container to run every morning. I’m pretty surprised this isn’t just part of ACI, but it’s not. The solution I landed on is a timer-based Azure Function. My function uses a cron-style schedule to run each morning, scripting a start to the proper container.

This was a bear to get right. I’m not going to let myself spiral into yet another rant about how poor the Azure developer experience can be — just know it is rubbish. You know who really helped out here? My good friend Claude; way better than any Azure help resource I could find. Whew.

There’s Always Another Resource

I have a pretty long list of resources I’m planning to add to my backstop:

  • FLO whole-home water shutoff
  • Various GE appliances, in particular for rinse-aid in the dishwasher (finally we’re getting down to the real problems)
  • Tesla Powerwall and Enphase panels/inverters
  • More shutdownhook demo apps
  • The Rivian!
  • Electric, water and gas usage
  • … and on and on …

Our lives and our enterprises are pretty complicated — and every new piece of smart technology that seems (and is) so great carries its own tax. Servers, services, accounts, batteries, it adds up. To keep things humming you really do need a backstop. A single tip of the spear from which all of the mess can be corralled and observed. I hope you’ll give it a try — with my code or your own. Until next time!

Zero to Launch

I want to be clear up front that I’m not a “methodology” guy. Whatever the hype, software methodology is inevitably either (a) full employment for consultants; (b) an ego trip for somebody who did something good one time under specific circumstances and loves to brag about it; or (c) both. I’ve built software for decades and the basics haven’t changed, not even once.

  • Make decisions.
  • Break big things down into little things.
  • Write everything down.
  • Use a bug database and source control.
  • Integrate often.

With that said, the rest of this might sound a little bit like software methodology. You have been warned!

Crossing the Ocean

I spend a bit of time these days mentoring folks — usually new startup CTOs that are figuring out how to go from nothing to a working v1 product. “Zero to Launch” is a unique, intense time in the life of a company, and getting through it requires unique (sometimes intense) behaviors. In all cases the task at hand is fundamentally underspecified — you’re committing to build something without actually knowing what it is. In bounded time. With limited resources. Who even does that? Startup CTOs, baby.

Like an ocean crossing, getting from zero to launch is a long journey that requires confidence, faith and discipline. There are few natural landmarks along the way, but there are patterns — the journey breaks down into three surprisingly clear and consistent phases:

  1. What are we building anyways?
  2. Holy crap this is way bigger than we thought!
  3. Will this death march ever end?

Hopefully, one reason you have the job is that you know how to code fast and well — CTOs that can’t (or don’t) code drive me up the wall. And you’d better hire great people. But you’re going to need more than just those chops to get to launch. Each phase needs a different set of skills and behaviors; let’s dig in.

What are we building anyways?

You likely have two constituencies telling you what your software needs to do: (1) non-technical co-founders that see a market opportunity; and (2) users or potential users that want you to help them accomplish something. Each of these perspectives is essential, and you’d probably fail without them. But don’t be fooled — they are not going to give you clear requirements. They just aren’t. They think they are, but they’re wrong.

The first mistake you can make here is getting into a chicken-and-egg battle. Your partners ask for a schedule, you say you can’t do that without full requirements, they say they already did that, you point out the gaps, they glaze over, repeat, run out of money, everyone goes home. Don’t do that.

Instead, just understand and accept that it is up to you to decide what the product does. And further, that you’ll be wrong and folks will (often gleefully) point that out, and you’re just going to have to suck it up. This is why you hire program managers, because synthesizing a ton of vague input into clarity is their core competency — but it’s still on you to break ties and make judgment calls with incomplete information.

And I’m not just talking about invisible, technical decisions. I’m talking about stuff like (all real examples):

  • Does this feature need “undo” capability? If so how deep?
  • Do we need to build UX for this or can we just have the users upload a spreadsheet?
  • Can we cut support for Internet Explorer? (ok finally everyone agrees on that one)
  • What data needs to be present before a job can be submitted?
  • Does this list require paging? Filtering? Searching?

You get the idea. This can be hard even for the most egocentric among us, because really, what do we know about [insert product category here]? Even in a domain we know well, it’s a little bold. But there are two realities that, in almost every case, make it the best strategy:

  1. Nobody knows these answers! I mean sure, do all the research you can, listen, and don’t be stupid. But at the end of the day, until your product is live in the wild, many of these are going to be guesses. Asking your users or CEO to make the guess is just an indirection that wastes time. Take whatever input you can, make a call, consider how you’ll recover if (when) it turns out you were wrong, and move on.
  2. Normal people just aren’t wired to think about error or edge cases. For better or worse, it’s on you and your team to figure out what can go wrong and how to react. This is usually an issue of data and workflow — how can you repair something that has become corrupt? “Normal” people deal with these problems with ad-hoc manual intervention, which is a recipe for software disaster.

For this to work, you need to be obsessively transparent about what you’re building. Write down everything, and make sure all of your stakeholders have access to the documents. Build wireframes and clickthrough demos. Integrate early and often, and make sure everybody knows where the latest build is running and how they can try it. This isn’t a CYA move; that’s a losing game anyways. It’s about trying to make things real and concrete as early as possible, because people are really good at complaining about things they can actually see, touch and use. You’re going to get a ton of feedback once the product is live — anything you can pull forward before launch is gold. Do this even when it seems embarrassingly early. Seriously.

Transparency also gives people confidence that you’re making progress. As they say, code talks — a live, running, integrated test site is what it is. No magic, no hand-waving. It either works or it doesn’t; it has this feature or it doesn’t; it meets the need or it doesn’t. Seeing the product grow more complete day by day is incredibly motivating. Your job is to will it into existence. This is a key but often unstated startup CTO skill — you need to believe, and help others believe, during this phase.

Holy crap this is way bigger than we thought!

Once you’ve gotten over the first hump and folks have something to look at, things really start to heat up. Missing features become obvious. “Simple” tasks start to look a lot less simple. It can get overwhelming pretty quickly. And that’s just the beginning. Over on the business side of things, your colleagues are talking to potential customers and trying to close sales. Suddenly they desperately need new bells and whistles (sometimes even whole products) that were never on the table before. Everything needs to be customizable and you need to integrate with every other technology in the market. Sales people never say “no” and they carry a big stick: “Customers will never buy if they don’t get [insert one-off feature here].”

Herein we discover another problem with normal people: they have a really hard time distilling N similar instances (i.e., potential customers) into a single coherent set of features. And frankly, they don’t really have much incentive to care. But it’s your job to build one product that works for many customers, not the other way around.

During this phase, your team is going to get really stressed out, as every solved problem just seems to add three new ones on the pile. They’re going to want to cut, cut, cut — setting clear boundaries that give them a chance to succeed. This is an absolutely sane reaction to requirement chaos, but it’s on you to keep your team from becoming a “no” machine.

A useful measure of technical success is how often you are able to (responsibly) say “yes” to your stakeholders. But saying “yes” doesn’t mean you just do whatever random thing you’re told. It means that you’re able to tease out the real ask that’s hiding inside the request, and have created the right conditions to do that. It’s very rare that somebody asks you to do something truly stupid or unnecessary. Normal people just can’t articulate the need in a way that makes software sense. And why should they? That’s your job.

During this phase, you have to be a mediator, investigator, translator and therapist. Try to be present at every feature review, so you can hear what the business folks and users say first-hand. If you can’t be there, schedule a fast follow-up with your team to discuss any new asks while they’re still fresh. Never blind-forward requests to your team. Propose simpler alternatives and ask why they won’t (or will) work. Use a cascading decision tree:

  1. What is the real ask? If you’re getting an indirect request through sales, ask them to replay the original conversation exactly — what words were used? If it’s coming from users, ask them to walk you through how they think the feature should work, click by click. Ask what they do now. What do other similar products do? Try to find other folks making the same ask — how do they word it?
  2. Do we need to do anything? Sometimes new asks are just a misunderstanding about what the product already does. As they say in Hamilton, “most disputes die and no one shoots.”
  3. Do we need to do something right now? Beyond just schedule, there are good reasons to delay features to “vNext” — you’ll know more once you’re live. Do we really need this for launch, or can it wait? One caveat here — be careful of people who want to be agreeable. I remember one company in particular where the users would say “it’s ok, we don’t need that,” but then go on to develop elaborate self-defeating workarounds on their own. It took a while to get everyone on the same page there!
  4. Can we stage the feature over time? This is often the best place for things to end up. Break the request down into (at least) two parts: something simpler and easier for launch and a vNext plan for the rest. You’ll learn a ton, and very (very) often the first version turns out to be more than good enough. Just don’t blow off the vNext plan — talk it out on the whiteboard so you don’t have to rebuild from scratch or undo a bunch of work.
  5. Is there something else we can swap for? Sometimes yes, sometimes no. And don’t turn stakeholder conversations into horse trading arguments. But costs are costs, and if you can remove or delay something else, it makes launch that much closer. Again, you’re always learning, and there’s no honor in “staying the course” if it turns out to be wrong. Be smart.

This phase is all about managing up, down and sideways. Things will get hot sometimes, and people will be frustrated. Reinforce with your stakeholders that you’re not just saying “no” — you’re trying to figure out how to say “yes.” Remind your team that you understand the quantity-quality-time dilemma and that if there’s a fall to be taken, it’s on you not them. And tell your CEO it’s going to be OK … she’ll need to hear it!

Will this death march ever end?

You might notice that, so far, I haven’t mentioned “metrics” even once. That’s because they’re pretty much useless in the early stages of a product. Sorry. Products start out with one huge issue in the database: “build v1.” That becomes two, then four, and suddenly you’re in an exponential Heather Locklear shampoo commercial. New features come and go every day. Some are visible and quantifiable, but many are not. You are standing in for metrics at first — your gut and your experience. Read up on proton pump inhibitors, my friend.

But as you get closer to launch, this balance shifts. Requirement changes slow down, and issues tend to look more like bugs or tasks — which tends to make them similar in scope and therefore more comparable. There’s some real comfort in this — “when the bug count is zero, we’re ready to launch” actually means something when you can measure and start to predict a downward trend.

But things get worse before they get better, and sometimes it feels like that downward shift will never happen. This is when the most grotty bugs show up — tiny miscommunications that blow up during integration, key technology choices that don’t stand up under pressure, missing functionality discovered at the last minute. Difficult repros and marathon debugging sessions suck up endless time and energy.

The worst are the bug pumps, features that just seem to be a bundle of special-cases and regressions. I’ve talked about my personal challenge with these before — because special-cases and regressions are exactly the symptoms of poor architecture. Very quickly, I start to question the fundamentals and begin redesigning in my head. And, sometimes, that’s what it takes. But just as often during this phase, you’re simply discovering that parts of your product really are just complicated. It’s important to give new features a little time to “cook” so they can settle out before starting over. Easy to say, tough to do!

During this home stretch, you need to be a cheerleader, mom and grandpa (please excuse the stereotypes, they’re obviously flawed but useful). A cheerleader because you’re finding every shred of progress and celebrating it. A mom because you’re taking care of your team, whatever they need. Food, tools and resources, executive air cover, companionship, music — whatever. And a grandpa because you’re a calming presence that understands the long view — this will end; it’s worth it; I’ve been there.

I can’t promise your company will succeed — history says it probably won’t. But I can promise that if you throw yourself into these roles, understand where you are in the process, stay focused, hire well and work your butt off, you’ve got a really good chance of launching something awesome. I’m not a religious guy, but I believe what makes humans special is the things we build and create — and great software counts. Go for it, and let me know if I can help.

Write Code

Back in the early 00s I spent some time doing technical due diligence for a couple of VC firms, and I’ve dabbled at it over the years since. A useful assessment covers a ton of ground in not very much time, so it’s pretty important to get to the good stuff quickly. When it comes to evaluating technical leaders / co-founders / CTOs, I’ve found that one question performs better than any other: “How much code do you write these days?”

Giving away the punch line up front, I believe that every great CTO regularly writes production code. By phrasing the question as “how much,” I telegraph the assumption that of course they write at least some. A totally reasonable answer may be a lot or a little, depending on the stage of the company and makeup of the overall team — but it’s never zero.

You talking to me?

I understand that this is not a commonly-held expectation. For some reason, people seem to equate technical growth with getting farther away from code. They say things like “I set direction and need to avoid getting bogged down by details,” or “I’m just too busy managing the team to write code.” I find this incomprehensible. Details make or break success in our business — you simply cannot lead smart people living those details if you don’t honestly understand and share their experiences.

I ask the same question when interviewing and mentoring folks at every level of technical leadership. And look, I’m not saying that all that other stuff isn’t really important — of course it is. A CTO that hides in their office coding all day is not doing their job either. I just don’t see that very often.

Wait, you say: I used to write code, I do understand. No you don’t, for two reasons. First, technology moves incredibly fast — core skills like encapsulation and abstraction are evergreen, but everyday patterns and rhythms evolve constantly. Containers, microservices, CI/CD, serverless computing, ML as a service, etc. etc. … you cannot judge their fitness for your purpose unless you really grok how they work, and that depth only comes from hands-on experience. Sales pitches and YouTube videos don’t count. Second, we quickly forget the realities of delivering code for much the same reason that the moms among us forget the painful realities of delivering babies — our brains soften the hard edges of memory over time. (Note this dynamic isn’t all bad; it explains why there are little brothers and sisters in the world.)

Code is Life!

Keeping your head in the details is reason enough to dedicate a few hours a week, but code is the gift that keeps on giving:

1. You’ll ship a bug. It’s more humbling than you may remember, will help make you more empathetic, and it’s a powerful message to your team. Everybody makes mistakes. Suck it up, take personal responsibility, and work hard until the bug is squished. Also, very little brings a technical team together like making fun of the boss — show them you can take it with grace.

2. You’ll be better at assessing difficulty and challenging estimates. Everything sounds easy when you ignore the details. Everything looks hard when it’s your job to get it done. A CTO is always trying to strike a balance between trusting their team and challenging them. When you can converse about the details of a feature, you’re in a better position to do that.

3. You’ll establish better 1:1 relationships with your team. Your code can’t exist in a vacuum; it will interact with other features and components. Working out those interfaces, debugging cross-system problems — those are opportunities to work in real time with folks across the team, especially those at “lower” levels you might not run into in daily leads meetings and such. Sure you can fabricate interaction with skip level meetings, team events, etc. — fine stuff, but not nearly as sticky as time in the trenches. These are also priceless teaching moments for young engineers.

4. You’ll feel a richer sense of self. Wow does that sound pompous. But how many of us in managerial roles drive home on Friday night wondering what the hell we actually DID all week? We went to meetings, “fought fires” (ugh I hate that phrase), sent email, looked at budgets, had lunch. All of it is important, but it’s impossible to explain to our kids and grandkids, and often even to ourselves. This old guy has come to believe that we are creative beings that need to create things to be happy. Code may be virtual, but it’s most definitely a creative work product. It does stuff. You can re-read it and relive all the tradeoffs, mistakes and dead ends that led to the final result. I don’t think it’s just me — this matters.

Fired Up? Ready to Go!

One last thought before you sudo apt install emacs (of course) and start writing — a word of caution. It’s important that you code; it’s just as important that you pick the right code for your company and team. If you’re really just starting up, go hard and lay as much track as you can — but that’s obvious. If your team is ten or fifteen or so, you probably can still own a key component and do a good job. But much beyond that, stay out of the critical path. Pick something a little out of the way, that isn’t time-critical or going to grind the business to a halt if you don’t get it done when you thought. There are a couple of reasons for this, pretty basic but I’ll call them out anyways:

  • You are going to be distracted by the other parts of your job. That’s just reality — you can’t deliver with the consistency of a dedicated engineer.
  • You are no longer the primary owner of the codebase overall — your leads and senior devs are. Don’t be walking on their lawn uninvited.

One great way for a midsize CTO to spend their coding time starts with log trolling. Pull out production logs and see what is going on — for 100% sure you’re going to find some weird errors in there that happen regularly, but don’t seem to be obviously tied to outages or other error conditions. Track those logs down in source and figure out what they really mean and what must be happening. Sometimes it really will be nothing, in which case you can do everyone a favor and fix the log spam. But many times you’ll find that something truly bad is happening, just not quite above the radar. Maybe it’s a canary for a slowly-growing problem, or evidence that your end users are having a terrible experience and just assume you suck rather than report it.

Another option is to scan the backlog for feature asks that would really help, but always get postponed. Maybe your configuration system doesn’t hot-reload, or a download format isn’t escaping fields correctly, or users really want SMS alerts rather than just email. Look for things that are relatively self-contained and easily tested, so that you don’t over-burden other team members who are already busy doing their own jobs.

Or whatever — you know your codebase better than I do. Just make sure you dig into the details AND find a way to do it without driving your team (too) crazy. I’m pretty sure you’ll have fun, and I know for absolute sure that you’ll be a better CTO. Woot!