Dirty secret — I love hunting production bugs. Tracing code with a notebook full of variable state, correlating distributed logs in time, trolling diffs, building and discarding theories — it’s nerd heaven. Best of all, the stories about what went wrong and why are almost always beautiful tragicomedies that make us better engineers and keep our egos in check.
Tools of the Trade
There are some amazing devops services out there these days. I’m a particular fan of code-aware ones like New Relic — seeing their on-demand JVM profiling feature for the first time was a religious experience. And workhorses like Sumo Logic do a great job consolidating distributed logs into a unified picture. In the cloud, even just the basic management stuff being released by Amazon and Microsoft is pretty impressive.
But at least for me, there’s still no substitute for getting onto the machine. You can poke around for suspicious files in /tmp, see how processes react to stimuli in truly real-time, and just get a sense of reality that is hard to see through levels of indirection.
The beauty of working with the JVM is that you can really, really look at what’s going on inside. Not many folks have internalized the power of the JPDA and how easy it is to leverage using the high-level Java Debug Interface. So let’s build something useful with it and see.
jdb is a great tool, but it can be a little cryptic and basically has no guardrails. Debuggers built into IDEs are easier to use, but can be similarly sketchy to hook up in production. And none of them are made to answer really custom questions, like — show me the arguments for method X, but only when method Y is on the stack and when static variable Z is true. This kind of thing is quite common in production bug scenarios, where pinpointing a repro can be challenging.
That’s the purpose of “JDB Jr”, the code for which you can find and steal on GitHub. It’s an extremely lightweight app (compile with javac *.java, no external dependencies required) that connects to the debugging socket for a JVM, executes whatever code you want, and gets out of Dodge. I’ve added a few “actions” mostly as reference for the common JDI objects (run java JdbJr to see a list), but it’s easy to add your own, which is the point.
Also in that repository is a super-simple app (TimeBot.java) that makes a handy debugee for demonstration purposes. It just listens on a raw socket and sends the current time down the wire a few times at two-second intervals. To see full variable goodness, compile it with
JDI and JdbJr Basics
Most of what you want to do with JDI starts with a VirtualMachine object, which you acquire using a Connector to link up with the target JVM. There are many flavors here, but since we’re going to be debugging an app that is already live, we’ve chosen the socket-based “attaching” connector. This code is in JdbJr.java: it sets up the connection; calls an abstract “whatever” method to do, well, whatever; and registers a shutdown hook so that we properly clean things up.
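To make the connector dance concrete, here is a minimal sketch of attaching over a socket. It is not the code from JdbJr.java; the class name AttachSketch and its helper methods are made up for illustration, and the host and port are assumed to match the debugging flags on the target JVM.

```java
import com.sun.jdi.Bootstrap;
import com.sun.jdi.VMDisconnectedException;
import com.sun.jdi.VirtualMachine;
import com.sun.jdi.connect.AttachingConnector;
import com.sun.jdi.connect.Connector;
import com.sun.jdi.connect.IllegalConnectorArgumentsException;
import java.io.IOException;
import java.util.Map;

// Hypothetical illustration class; not part of the JdbJr repository.
class AttachSketch {

    // Find the socket-based attaching connector that ships with the JDK.
    static AttachingConnector findSocketAttach() {
        for (AttachingConnector c : Bootstrap.virtualMachineManager().attachingConnectors()) {
            if ("com.sun.jdi.SocketAttach".equals(c.name())) return c;
        }
        throw new IllegalStateException("SocketAttach connector not available");
    }

    // Fill in the connector's default arguments with the target host and port.
    static Map<String, Connector.Argument> buildArgs(AttachingConnector c, String host, int port) {
        Map<String, Connector.Argument> args = c.defaultArguments();
        args.get("hostname").setValue(host);
        args.get("port").setValue(Integer.toString(port));
        return args;
    }

    public static void main(String[] argv) throws IOException, IllegalConnectorArgumentsException {
        AttachingConnector c = findSocketAttach();
        // Assumes a target JVM is listening on localhost:12345.
        VirtualMachine vm = c.attach(buildArgs(c, "localhost", 12345));

        // Detach on exit so the target is never left waiting on us.
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            try {
                vm.dispose();
            } catch (VMDisconnectedException gone) {
                // Target already exited; nothing left to clean up.
            }
        }));

        System.out.println("Attached to " + vm.name() + " " + vm.version());
    }
}
```

Doing the cleanup in a shutdown hook means the detach still happens if you Ctrl-C out of a long-running watch.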
With a VirtualMachine in hand, it’s pretty simple to just party at will on the debugee. Some of those things require suspending the target VM (“breaking into the debugger”) but there are plenty of opportunities for voyeurism that don’t even require that. Some simple examples (all of these take advantage of common display helpers in
- ClassListAction just lists every loaded class in the debugee. This is mostly for demonstration purposes, but combined with grep it can be handy to verify fully-qualified names, especially with nested classes or lambdas that use hard-to-remember conventions.
- ThreadListAction lists all threads in the system with their current stacks. It’s basically the same as the jstack utility; super-helpful for tracking down hangs.
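As a rough sketch of what those two actions boil down to (this is not the repository code, and InspectSketch is a made-up name), listing classes needs no suspension at all, while reading stack frames needs the threads suspended for a moment:

```java
import com.sun.jdi.IncompatibleThreadStateException;
import com.sun.jdi.Location;
import com.sun.jdi.ReferenceType;
import com.sun.jdi.StackFrame;
import com.sun.jdi.ThreadReference;
import com.sun.jdi.VirtualMachine;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration class; not part of the JdbJr repository.
class InspectSketch {

    // Names of all loaded classes, sorted. No suspension required.
    static List<String> loadedClassNames(VirtualMachine vm) {
        List<String> names = new ArrayList<>();
        for (ReferenceType type : vm.allClasses()) {
            names.add(type.name());
        }
        Collections.sort(names);
        return names;
    }

    // One line per thread and per frame, similar in spirit to jstack.
    static List<String> threadDump(VirtualMachine vm) throws IncompatibleThreadStateException {
        List<String> lines = new ArrayList<>();
        vm.suspend(); // frames() only works on suspended threads
        try {
            for (ThreadReference thread : vm.allThreads()) {
                lines.add("\"" + thread.name() + "\"");
                for (StackFrame frame : thread.frames()) {
                    Location loc = frame.location();
                    lines.add("    at " + loc.declaringType().name() + "."
                            + loc.method().name() + ":" + loc.lineNumber());
                }
            }
        } finally {
            vm.resume(); // always let the target run again
        }
        return lines;
    }
}
```

Both methods take a VirtualMachine obtained from an attaching connector. The try/finally around the dump is the important part: whatever goes wrong on our side, the target gets resumed.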
All of these do their work and then exit quickly, so even the ones that require suspension aren’t likely to cause any noticeable performance degradation. That’s one of the key benefits to this approach: getting in and out quickly and automatically keeps production impact to a minimum.
Monitoring Events over Time
The other use for a tool like this is to catch state during rare events. Most good production bugs are hard to reproduce — it can take weeks to get good data fighting through the addlogging-prop-wait-oops-morelogging-prop-wait cycle.
WatchMethodsAction demonstrates how you can trap these live and without disturbing your production code. There are a ton of debugee Events that debuggers can register to observe. In this code we register for the MethodEntryEvent using the MethodEntryRequest to filter by class name (which can contain a wildcard “*” at the start or end) and optionally by method within the class.
Once registration is set up, we just loop, pulling events off of the queue. They come in batches, and do suspend the debugee during processing, so you’ll want to get in and out as fast as you can. WatchMethodsAction just outputs the stack trace and argument values to provide context.
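The register-then-loop pattern looks roughly like the sketch below. It is a simplified stand-in for WatchMethodsAction rather than a copy of it; the class name WatchSketch and the helper methodMatches are made up, and only the event thread is suspended here to keep the footprint small.

```java
import com.sun.jdi.IncompatibleThreadStateException;
import com.sun.jdi.VMDisconnectedException;
import com.sun.jdi.VirtualMachine;
import com.sun.jdi.event.Event;
import com.sun.jdi.event.EventSet;
import com.sun.jdi.event.MethodEntryEvent;
import com.sun.jdi.event.VMDeathEvent;
import com.sun.jdi.event.VMDisconnectEvent;
import com.sun.jdi.request.EventRequest;
import com.sun.jdi.request.MethodEntryRequest;

// Hypothetical illustration class; not part of the JdbJr repository.
class WatchSketch {

    // A null filter means "every method in the matched classes".
    static boolean methodMatches(String actualName, String wantedName) {
        return wantedName == null || wantedName.equals(actualName);
    }

    static void watch(VirtualMachine vm, String classPattern, String methodName)
            throws InterruptedException, IncompatibleThreadStateException {
        MethodEntryRequest request = vm.eventRequestManager().createMethodEntryRequest();
        request.addClassFilter(classPattern); // e.g. "TimeBot" or "com.example.*"
        request.setSuspendPolicy(EventRequest.SUSPEND_EVENT_THREAD);
        request.enable();

        while (true) {
            EventSet events = vm.eventQueue().remove(); // blocks until a batch arrives
            try {
                for (Event event : events) {
                    if (event instanceof VMDeathEvent || event instanceof VMDisconnectEvent) {
                        return; // target is going away; stop watching
                    }
                    if (event instanceof MethodEntryEvent) {
                        MethodEntryEvent entry = (MethodEntryEvent) event;
                        if (methodMatches(entry.method().name(), methodName)) {
                            // The event thread is suspended, so its top frame is readable.
                            System.out.println(entry.method() + " args="
                                    + entry.thread().frame(0).getArgumentValues());
                        }
                    }
                }
            } finally {
                try {
                    events.resume(); // let the suspended thread run again
                } catch (VMDisconnectedException gone) {
                    // Target exited while we were processing; nothing to resume.
                }
            }
        }
    }
}
```

The class filter is applied inside the target JVM, so only matching classes generate events at all. The method-name check happens on our side, which is exactly where more elaborate conditions (what else is on the stack, what a static holds) would go.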
This is where magic can happen, once you really get that this is all just code. Your watchpoints and breakpoints can be as complicated and specific as you need them to be. Just a few off the top of my head:
- Identify which logged-in user is causing a particular exception
- Report detailed memory stats before and after executing methods in a class
- Identify when key methods are taking longer than expected
- Get stack traces based on random scraps of data like strings sent to String.format
Maybe go really wild and change debugee state using the ObjectReference.setValue method (setValue is also available in other contexts). This can be great for patching known issues, e.g., use Events to track when a cache has become poisoned and reset it in place.
And the biggest gun of all? Run code, baby. ObjectReference.invokeMethod is a bit of a beast, but lets you do things like, for example, force-close hung sockets (near and dear to Azure developers using Java). Some hacks are so ugly they are just beautiful.
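Here is a hedged sketch of both tricks. Nothing in it comes from the repository: PatchSketch and its two helpers are made-up names, and the sketch assumes you already hold an ObjectReference (say, from frame.thisObject() while handling an event) plus, for the invoke case, the thread that the event suspended.

```java
import com.sun.jdi.ClassNotLoadedException;
import com.sun.jdi.Field;
import com.sun.jdi.IncompatibleThreadStateException;
import com.sun.jdi.InvalidTypeException;
import com.sun.jdi.InvocationException;
import com.sun.jdi.Method;
import com.sun.jdi.ObjectReference;
import com.sun.jdi.ThreadReference;
import com.sun.jdi.Value;
import com.sun.jdi.VirtualMachine;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration class; not part of the JdbJr repository.
class PatchSketch {

    // Overwrite a boolean instance field on a live object in the target JVM.
    static void setBooleanField(VirtualMachine vm, ObjectReference obj,
                                String fieldName, boolean value)
            throws InvalidTypeException, ClassNotLoadedException {
        Field field = obj.referenceType().fieldByName(fieldName);
        if (field == null) {
            throw new IllegalArgumentException("no field named " + fieldName);
        }
        obj.setValue(field, vm.mirrorOf(value));
    }

    // Call a zero-argument method (think close()) on a live object.
    // The thread must have been suspended by an event in the target JVM.
    static Value callNoArgMethod(ObjectReference obj, ThreadReference eventThread,
                                 String methodName)
            throws InvalidTypeException, ClassNotLoadedException,
                   IncompatibleThreadStateException, InvocationException {
        List<Method> candidates = obj.referenceType().methodsByName(methodName);
        for (Method method : candidates) {
            if (method.argumentTypeNames().isEmpty()) {
                // Run only the event thread so the rest of the target stays put.
                return obj.invokeMethod(eventThread, method, Collections.emptyList(),
                        ObjectReference.INVOKE_SINGLE_THREADED);
            }
        }
        throw new IllegalArgumentException("no zero-argument method named " + methodName);
    }
}
```

INVOKE_SINGLE_THREADED keeps other threads suspended during the call, which is tidy but can deadlock if the invoked method needs a lock held by one of them. Choose it (or not) with the specific method in mind.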
The Obvious Catch(es)
This is all so easy that it seems like there must be a catch or two — and of course there are. The smallest is that, in order to see local variable context, the debugee needs to be compiled with the “-g” flag that adds more complete metadata to the class files. This isn’t really an issue, but it does make your files a little bigger.
More problematic, the debugee must be started with debugging enabled, for example:
java -Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=12345 TimeBot
These flags tell the JVM to listen on a socket for incoming debugger connections. You can’t enable debugging for a process once it’s live — so if you didn’t think ahead to run this way, at least your first repro of a production bug will have to go unobserved (buck up, there’s evidence in your logs too).
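If you want a process to report whether it was started this way, one option is to inspect its own startup arguments. The sketch below is an illustration only: JdwpCheck is a made-up name, and it recognizes both the older -Xrunjdwp spelling shown above and the newer -agentlib:jdwp=... spelling that takes the same options.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

// Hypothetical illustration class; not part of the JdbJr repository.
class JdwpCheck {

    // True if any JVM argument loads the JDWP agent, in either spelling.
    static boolean debuggingEnabled(List<String> jvmArgs) {
        for (String arg : jvmArgs) {
            if (arg.startsWith("-Xrunjdwp:") || arg.startsWith("-agentlib:jdwp")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // The arguments this JVM was actually launched with.
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println(debuggingEnabled(jvmArgs)
                ? "debugger socket requested at startup"
                : "not debuggable; a restart is needed to attach");
    }
}
```

Logging this once at startup makes it obvious, during an incident, whether attaching is even an option.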
Unless you, um, just run with debugging enabled all the time. Doing this means you’re always ready to attach and poke around. And hey, why not? Two reasons to consider:
1. Performance. Despite a ton of improvements in recent JVMs that minimize the cost, running in debug mode still disables some code optimizations, and that can mean non-trivial performance degradation. My take: it’s probably ok for you to ignore this one. CPUs are remarkably fast and getting faster all the time; it is the rare enterprise application that even makes its CPUs sweat. The performance hit is real, though, so be thoughtful about your code and whether it matters.
2. Security. I doubt that many security officers are still reading this post; the very concept of open debugging sockets on production has likely already caused heart failure or at least serious palpitations. But just in case, I will acknowledge this one as deadly serious. There is no security on that socket; anybody who can find a route there basically has total control of your JVM.
This is especially bad if you’re using a version of Java before JDK 9, which bound its debugging socket to all interfaces. At least with JDK 9+, the default is to only bind on localhost, which dramatically reduces your exposure (if you need to, you can open this back up by providing a hostname or wildcard in the address parameter).
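A quick way to audit launch scripts for this is to classify the address option. The sketch below is an assumption-laden illustration, not part of JDB Jr: JdwpBindCheck and its Exposure enum are made-up names, and it only understands the port, host:port, and *:port forms.

```java
// Hypothetical illustration class; not part of the JdbJr repository.
class JdwpBindCheck {

    enum Exposure { DEFAULT_FOR_JDK, LOOPBACK_ONLY, ALL_INTERFACES, SPECIFIC_HOST }

    // Takes the comma-separated JDWP option string, e.g.
    // "transport=dt_socket,server=y,suspend=n,address=12345".
    static Exposure classify(String jdwpOptions) {
        for (String option : jdwpOptions.split(",")) {
            if (!option.startsWith("address=")) continue;

            String address = option.substring("address=".length());
            int colon = address.lastIndexOf(':');
            if (colon <= 0) {
                // Port only: the JDK decides (localhost on JDK 9+, everything before that).
                return Exposure.DEFAULT_FOR_JDK;
            }
            String host = address.substring(0, colon);
            if (host.equals("*") || host.equals("0.0.0.0")) return Exposure.ALL_INTERFACES;
            if (host.equals("localhost") || host.equals("127.0.0.1")) return Exposure.LOOPBACK_ONLY;
            return Exposure.SPECIFIC_HOST;
        }
        return Exposure.DEFAULT_FOR_JDK; // no address given at all
    }
}
```

Anything that comes back as ALL_INTERFACES, or as DEFAULT_FOR_JDK on a pre-9 runtime, deserves a hard look before it goes anywhere near production.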
What to do here? For older versions of Java, it’s probably scary enough to disqualify an always-on debugging approach. But for newer JVMs, with a competent network administrator, it can work ok: just bind to localhost and use firewall rules to prevent access by default. When you want to connect, flip off the firewall and you’re good to go. If somebody has rooted your system enough to turn off the firewall and access the port from the local machine, you’re screwed anyway.
You know you want to try this
Rules of thumb can be helpful. But blindly following them because they are rules makes for a boring and often suboptimal existence. Custom debugging on production with JDI is an incredibly powerful technique that should be in the toolbox of every great engineer. Be thoughtful and use it wisely, but use it for sure. Until next time!
Find all the code for this post here: https://github.com/seanno/shutdownhook/tree/main/jdbjr.