Want to really see what a developer is made of? Watch them debug a tough production issue. One that doesn’t repro on their machine. That has some multi-variate cause related to timing or concurrency. That shows up unpredictably, usually late at night. That isn’t in “their code.”
Seriously — if I could somehow make that part of my recruiting funnel, I would never make a bad hire. The colleagues I choose to work with again and again are the ones I’ve sat with for hours, driven by a sense of personal responsibility (and maybe a little ego) to find that f*cking bug.
Of course the best bug is one that never makes it off your machine, and there are tons of useful practices for maximizing that outcome. Past that, most issues you find in production are still pretty run-of-the-mill … typically some forgotten legacy edge case or unanticipated user behavior. These bugs show up, repro on test, get fixed, and you just feel a little stupid.
But then there are the good ones. And I’d be lying if I said I didn’t live just a little for them. It can be supremely frustrating and unpleasant in real time, but the rush of hard-won success never gets old. Bizarre though it may seem, I still get revved up remembering this one from Adaptive days: https://importimmunity.wordpress.com/2014/12/08/the-definition-of-insanity-works/.
The Nine Rules
Every time I think I should package up what I’ve learned about production debugging, I remember that somebody has pretty much already done it, and better than I could. Go buy David Agans’ book Debugging and check out his site debuggingrules.com. His nine rules are solid gold, and you will be a better engineer if you learn them.
Really the only thing I would add to David’s list is KEEP LOOKING. New (and bad) developers almost always give up too quickly: “there’s nothing more to learn here.” When you’re stuck, start over, even if it feels like you’re doing the same thing. The answer is there somewhere: in the code, in your logs, in your heap dumps, in your data. Our brains need time and repetition to see subtle patterns. Look again, and no, it is not a bug in the operating system.
Developers always live under pressure to build more features, faster. Almost inevitably, this leads to technical debt — shortcuts that get the job done but accumulate badness as support burden or a lack of future flexibility. Good technical leaders find time for their teams to pay down debt; the best technical leaders create systems that minimize it. One of the best ways to do that is to invest in infrastructure that makes it easy to write live-debuggable code.
First (and typically not forgotten) is logging. No matter where code lives or runs, it should be braindead easy for developers to output time-stamped logs that are persistent, centralized and searchable. Code reviews should check for adequate logging — not just “I am here” but parameter and variable state, exceptions and stack traces, anything that will help with forensics later.
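To make the difference between “I am here” and forensic logging concrete, here is a minimal sketch using the JDK’s built-in java.util.logging. The class, method, and field names (OrderProcessor, splitQuantity, orderId) are hypothetical and exist only for illustration; your own logging framework will have its own equivalents.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical example class; the names are illustrative only.
class OrderProcessor {
    private static final Logger LOG = Logger.getLogger(OrderProcessor.class.getName());

    // Splits a total evenly across shipments, logging inputs, result, and failures.
    static int splitQuantity(String orderId, int totalUnits, int shipments) {
        // Log the parameter state, not just the fact that we got here.
        LOG.info(() -> "splitQuantity start orderId=" + orderId
                + " totalUnits=" + totalUnits + " shipments=" + shipments);
        try {
            int perShipment = totalUnits / shipments;
            LOG.info(() -> "splitQuantity done orderId=" + orderId
                    + " perShipment=" + perShipment);
            return perShipment;
        } catch (ArithmeticException e) {
            // Passing the exception itself puts the full stack trace in the log.
            LOG.log(Level.SEVERE, "splitQuantity failed orderId=" + orderId
                    + " totalUnits=" + totalUnits + " shipments=" + shipments, e);
            throw e;
        }
    }
}
```

Calling OrderProcessor.splitQuantity("A-2", 10, 0) still throws, but the log now records which order and which inputs caused it, along with the stack trace.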
You should also prepare for logs to get huge by rolling and archiving them up front. Nobody should limit what they log over space concerns, at least not without a serious conversation. More is almost always better in the log run. See what I did there?
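Rolling is usually a configuration detail rather than real work. As a rough sketch, the JDK’s own FileHandler can rotate by size; the helper name attachRollingHandler and the sizes you pass in are my assumptions, and shipping old generations off to centralized storage would still be a separate job.

```java
import java.io.IOException;
import java.util.logging.FileHandler;
import java.util.logging.Logger;
import java.util.logging.SimpleFormatter;

// Hypothetical helper; the name and parameters are illustrative only.
class RollingLogSetup {
    // Attaches a size-based rotating file handler to the given logger.
    static FileHandler attachRollingHandler(Logger logger, String pattern,
                                            int maxBytesPerFile, int fileCount)
            throws IOException {
        // "%g" in the pattern becomes the generation number (0 is the newest file).
        // The final argument appends to existing files instead of truncating them.
        FileHandler handler = new FileHandler(pattern, maxBytesPerFile, fileCount, true);
        handler.setFormatter(new SimpleFormatter()); // timestamped, human-readable records
        logger.addHandler(handler);
        return handler;
    }
}
```

Something like attachRollingHandler(logger, "/var/log/myapp/app.%g.log", 50_000_000, 20) would keep roughly a gigabyte of recent history on disk (the path and numbers are made up).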
The other infrastructure you should build is what I lovingly refer to as “Debug World” — a set of pages or endpoints that provide information about the internals of your system. There are always key values that provide insight into system health and problems — but if exposing them is hard, it just doesn’t happen. Build in security, format, and access up front so that developers don’t need to think or worry. It’s great if these endpoints are machine-readable as well, because they can provide fantastic input for smart production alerts.
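Here is a minimal sketch of what a Debug World registry might look like, using the HTTP server that ships with the JDK. Everything named here (DebugWorld, expose, the /debug path) is hypothetical. Binding to loopback only is a stand-in for real access control, and the JSON escaping is deliberately naive (quotes and backslashes only).

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.function.Supplier;
import java.util.stream.Collectors;

// Hypothetical registry of internal values, served as machine-readable JSON.
class DebugWorld {
    // Sorted map so the output order is stable and easy to diff.
    private static final Map<String, Supplier<Object>> VALUES = new ConcurrentSkipListMap<>();

    // One line for a developer to expose a value; it is read lazily on each request.
    static void expose(String name, Supplier<Object> value) {
        VALUES.put(name, value);
    }

    // Naive quoting: escapes only backslashes and double quotes.
    private static String quote(Object o) {
        String s = String.valueOf(o).replace("\\", "\\\\").replace("\"", "\\\"");
        return "\"" + s + "\"";
    }

    // Renders every registered value as a flat JSON object of strings.
    static String renderJson() {
        return VALUES.entrySet().stream()
                .map(e -> quote(e.getKey()) + ": " + quote(e.getValue().get()))
                .collect(Collectors.joining(", ", "{", "}"));
    }

    // Serves the values at /debug on the loopback interface only.
    static HttpServer start(int port) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", port), 0);
        server.createContext("/debug", exchange -> {
            byte[] body = renderJson().getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().set("Content-Type", "application/json");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
        return server;
    }
}
```

The point of the design is that exposing a value costs one line, for example DebugWorld.expose("queueDepth", () -> queue.size()), where queue is whatever structure you care about. Because the output is JSON, the same endpoint can feed an alerting system.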
Looking into the Beast
Despite all the rules, all the log trolling, all the forethought and persistence — sometimes it really helps to be able to peek into running code in ways you didn’t expect up front. And what is great for doing that? Debuggers!
Debuggers? On production? The heck you say. Our industry has evolved a religion about preventing direct access to production systems. It’s an OK risk mitigation strategy, but (IMNSHO) it’s all downside when it comes to fixing problems and giving engineers a visceral understanding of how their code runs.
Production Access is not a Crime — but it can be dangerous. So while I am all for hooking up interactive debuggers in production when things get desperate, it is nice to also have an in-between that does the trick without taking away all the guardrails. Thanks to the JVM and Java Debug Interface, we can do exactly that — and it’s super cool. Check out my next post for some code that might actually make your life a lot more pleasant.
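In the meantime, and purely as an illustration rather than the code from that post, here is a sketch of what attaching to a running JVM through the Java Debug Interface can look like. It assumes the target was started with the JDWP agent listening on a socket (for example -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005). The class and method names (JdiPeek, printThreads) are hypothetical.

```java
import com.sun.jdi.Bootstrap;
import com.sun.jdi.ThreadReference;
import com.sun.jdi.VirtualMachine;
import com.sun.jdi.connect.AttachingConnector;
import com.sun.jdi.connect.Connector;
import com.sun.jdi.connect.IllegalConnectorArgumentsException;
import java.io.IOException;
import java.util.Map;

// Hypothetical helper that attaches to a JVM, lists its threads, and detaches.
class JdiPeek {
    // Finds the standard socket-attach connector provided by the JDK.
    static AttachingConnector socketConnector() {
        for (AttachingConnector c : Bootstrap.virtualMachineManager().attachingConnectors()) {
            if ("com.sun.jdi.SocketAttach".equals(c.name())) {
                return c;
            }
        }
        throw new IllegalStateException("no socket attach connector available");
    }

    // Attaches to host:port, prints each thread's name and status code, then detaches.
    static void printThreads(String host, int port)
            throws IOException, IllegalConnectorArgumentsException {
        AttachingConnector connector = socketConnector();
        Map<String, Connector.Argument> args = connector.defaultArguments();
        args.get("hostname").setValue(host);
        args.get("port").setValue(Integer.toString(port));

        VirtualMachine vm = connector.attach(args);
        try {
            for (ThreadReference thread : vm.allThreads()) {
                System.out.println(thread.name() + " status=" + thread.status());
            }
        } finally {
            vm.dispose(); // detach and leave the target running
        }
    }
}
```

Nothing here suspends the target or sets a breakpoint, which is roughly the spirit of an in-between: you get a look at the running process without stopping it.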