Clean Stops make Happy Ops

Sean Nolan

5 years ago

Whether it’s your web server or some background task, you’ve been there. One day you signal the process to stop and it just … doesn’t. Eventually you force kill it and everything seems ok. A few rounds later, your production folks get tired of the dance and write a script to automate the force kill, so everybody forgets about it and everything still seems ok.

Except you know it’s not really ok. Whatever thread is refusing to quit is in the middle of doing something. Killing it with prejudice just cuts that off at the knees, leaving the operation — and the data related to it — in an unknown state. You keep discovering this in little weird ways: “the database says that file was sent for processing, but it never happened;” “for some reason the lock on that record never got released;” “we never got the email notification for that account.”

Classic support creep stuff. It eats at developer productivity, as more and more of every day is spent hand-patching “weird” data issues that nobody really understands. My favorite part is when team leads start saying that they need more staff because of the support burden they made for themselves.

There are many root causes of support creep, but bad shutdown hygiene almost always plays a role. It’s worth a little infrastructure work to get it right.

Simple but Rude

Not many folks work directly with Threads anymore, which makes sense given the ubiquity of built-ins like ExecutorService and CompletableFutures and libraries like Akka. But all of these build upon lower-level constructs and inherit their weirdnesses, so let’s start there.

Cleanup1.java is about the simplest threaded app you could imagine. It uses a background thread to compute increasingly large triangular numbers, which is just an arbitrary computation that we’re using to eat up CPU cycles. The “synchronized” keyword protects against concurrent data access; simple but effective.

This code doesn’t even make an attempt to stop cleanly — in fact I had to comment out the code I want to run at the end (lines 26-27) since it’s unreachable. Rude, but a useful starting point.

Our first important quirk: if you run the app with a parameter “nd” (java Cleanup1 nd) and then type x + enter, the app never quits! The issue here is the “daemon” flag on line 11. This flag is typically set on background threads and it’s a little good, but mostly bad. The JVM exits as soon as only daemon threads remain, so with “nd” the non-daemon worker keeps the process alive forever. With the flag set the app does quit, but only because the JVM happily force-quits daemons on shutdown, wherever they happen to be in their work. This shutdown behavior is insane and just invites developers to do the wrong thing. Boo.
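Here is a minimal sketch of that quirk. It is not the original Cleanup1.java, so the line numbers mentioned above won’t match, and the class name TriangularWorker plus its start and snapshot helpers are made up for illustration.

```java
import java.io.IOException;

class TriangularWorker implements Runnable {
    private long n = 0;
    private long sum = 0;

    // synchronized keeps n and sum consistent for readers on other threads.
    private synchronized void step() {
        n++;
        sum += n; // long arithmetic wraps silently on overflow; fine for a CPU burner
    }

    synchronized long[] snapshot() {
        return new long[] { n, sum };
    }

    @Override
    public void run() {
        while (true) { // rude: there is no way to ask this loop to stop
            step();
        }
    }

    static Thread start(TriangularWorker worker, boolean daemon) {
        Thread t = new Thread(worker, "triangular-worker");
        t.setDaemon(daemon); // must be called before start()
        t.start();
        return t;
    }

    public static void main(String[] args) throws IOException {
        boolean daemon = !(args.length > 0 && args[0].equals("nd"));
        TriangularWorker worker = new TriangularWorker();
        start(worker, daemon);

        int c;
        while ((c = System.in.read()) != 'x' && c != -1) {
            // wait for "x" (or end of input)
        }
        long[] s = worker.snapshot();
        System.out.println("T(" + s[0] + ") = " + s[1]);
        // main returns here. With a daemon worker the JVM exits and the worker
        // is abandoned mid-loop; with "nd" the worker keeps the JVM alive.
    }
}
```

Either way the worker never gets a chance to finish what it was doing, which is the real problem.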

Pardon the Interruption

The typical native way to tell a thread to stop what it’s doing is Thread.interrupt. Targeted threads receive this message by regularly checking Thread.isInterrupted and/or catching InterruptedException. Each app needs to decide on an acceptable interval for “regularly checking” — more or less, the amount of time you’re asking the ops team to wait for shutdown to complete. InterruptedException is thrown by methods like Thread.sleep or Future.get that spend lots of time waiting around.

Cleanup2.java demonstrates this pattern. In the main thread we call worker.interrupt to signal shutdown and then worker.join to wait for it to complete. For the worker, we’ve replaced our forever loop with a call to isInterrupted and added exception handling. When you exit the app, you’ll now see a “Final Report” indicating a clean shutdown.
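A rough sketch of that shape, written from the description above rather than copied from Cleanup2.java. The class name InterruptibleWorker and the finalReport accessor are assumptions for illustration.

```java
import java.io.IOException;

class InterruptibleWorker implements Runnable {
    private volatile String finalReport; // written once by the worker on its way out

    String finalReport() {
        return finalReport;
    }

    @Override
    public void run() {
        long n = 0;
        long sum = 0;
        try {
            // The flag check bounds how long shutdown can take.
            while (!Thread.currentThread().isInterrupted()) {
                n++;
                sum += n;
                if (n % 1_000_000 == 0) {
                    Thread.sleep(1); // blocking call: throws if we are interrupted
                }
            }
        } catch (InterruptedException e) {
            // The throw cleared the interrupt status; set it again for callers.
            Thread.currentThread().interrupt();
        } finally {
            finalReport = "Final Report: T(" + n + ") = " + sum;
            System.out.println(finalReport);
        }
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        Thread worker = new Thread(new InterruptibleWorker(), "worker");
        worker.start();

        int c;
        while ((c = System.in.read()) != 'x' && c != -1) {
            // wait for "x" (or end of input)
        }
        worker.interrupt(); // signal shutdown
        worker.join();      // wait for the worker to finish cleanly
    }
}
```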

However, you may notice some hints that suggest we aren’t totally out of the woods. In particular, why are we calling interrupt in the exception handler? It turns out that throwing the exception (and in some cases just checking the flag, as the static Thread.interrupted does) will clear the thread’s interrupt status. Why? Honestly I really haven’t ever heard a good answer (feel free to nerd-splain it to me if you like). The “best practice” is to set the flag again by calling interrupt whenever code further up the stack still needs to see it; it’s just weird.
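If you want to see the flag behavior for yourself, this small sketch pokes at it on the current thread. The class InterruptFlagDemo and its observe method are hypothetical names; the Thread calls are the standard ones.

```java
class InterruptFlagDemo {
    // Returns what each check observed, in order.
    static boolean[] observe() {
        Thread self = Thread.currentThread();

        self.interrupt();
        boolean seenByInstanceCheck = self.isInterrupted(); // reads the flag, leaves it set
        boolean seenByStaticCheck = Thread.interrupted();   // reads the flag AND clears it
        boolean stillSetAfterStatic = self.isInterrupted();

        self.interrupt();
        boolean sleepThrew = false;
        try {
            Thread.sleep(10); // flag is already set, so this throws
        } catch (InterruptedException e) {
            sleepThrew = true;
        }
        boolean stillSetAfterThrow = self.isInterrupted(); // cleared by the throw

        return new boolean[] {
            seenByInstanceCheck, seenByStaticCheck, stillSetAfterStatic,
            sleepThrew, stillSetAfterThrow
        };
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(observe()));
        // prints [true, true, false, true, false]
    }
}
```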

Also, since this app runs attached to a terminal, you can still trigger bad behavior by closing it with control-C. You can be extra-safe by adding a shutdown hook (ha!) or registering a signal handler, but these all have issues and don’t catch hard kills (SIGKILL) anyways.
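For the shutdown hook option, a minimal sketch might look like the following. The helper name install and the bounded wait are my assumptions, and as noted above this does nothing for a SIGKILL.

```java
class ShutdownHookDemo {
    // Registers a hook that interrupts the worker and waits a bounded time for it.
    static Thread install(Thread worker, long patienceMillis) {
        Thread hook = new Thread(() -> {
            worker.interrupt();
            try {
                worker.join(patienceMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            if (worker.isAlive()) {
                System.err.println("Worker " + worker.getName() + " ignored shutdown");
            }
        }, "shutdown-hook");
        Runtime.getRuntime().addShutdownHook(hook); // runs on normal exit and on control-C
        return hook;
    }

    public static void main(String[] args) throws InterruptedException {
        Thread worker = new Thread(() -> {
            try {
                Thread.sleep(Long.MAX_VALUE);
            } catch (InterruptedException e) {
                System.out.println("Final Report: interrupted, exiting cleanly");
            }
        }, "worker");
        worker.start();
        install(worker, 5_000);
        worker.join(); // press control-C to see the hook at work
    }
}
```

Keep hooks short: the JVM is already on its way out while they run, and a hook that blocks forever just trades one hung shutdown for another.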

The biggest issue with the interrupt pattern is that many common functions just ignore it — especially those involved in IO, which is quite often the whole reason for background processing in the first place. You can (and should!) try to be responsible by using interruptible streams, setting short timeouts and picking libraries like Apache HttpClient that can be aborted. But at the end of the day, it’s almost a given that some part of your stack will invoke blocking, non-interruptible scallywags.

Living in the Real World

You go to war with the army you have (hat tip Rummy). Solid shutdown code optimizes for the best possible outcomes using these three steps in order; you can see them in action in Cleanup3.java. You can simulate best-case management of increasingly intransigent workers by passing an integer argument to this version (java Cleanup3 ARG):

0 = BEST: Ask nicely and keep it in the family. By defining your own mechanism to signal shutdown, you completely control the semantics and state changes, and being explicit leaves less to the imagination of future developers. If a thread shuts down this way, woo hoo!

1 = GOOD: Use native interruption to trigger exceptions. This helps with calls into third-party code, and is a good backup approach if your own developers get a little forgetful.

2 = BETTER THAN NOTHING: Make a scene. Sometimes there’s just nothing you can do. In this case, use logging and every other means at your disposal to make sure it gets noticed! The biggest enemy in this fight is invisibility — it is far cheaper to fix the root cause of shutdown failure once than to spend developer time cleaning up data messes forever.

There are a lot of edge cases here, and trying to provide a simple interface for developers while catching them all is a bit tricky. For example, we’re eating InterruptedExceptions in the waitForStop method, which you could argue is our own bad behavior. But it allows us to give workers the best possible chance for a graceful exit — it’s the same reason that we provide the Exception parameter to the cleanup method.

You’ll find that this three-part pattern is useful in many situations. On the front end we write the happy path (BEST), and at the back we implement damage control (BETTER THAN NOTHING) — we can’t do the right thing, but we can at least make sure it gets noticed. Those “bookends” represent a complete end-to-end solution; in between and over time we can add alternatives (GOOD) to improve real-world performance.
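Applied to thread shutdown, the three steps might be sketched like this. It is not Cleanup3.java: the StoppableWorker class, its Outcome enum and the patienceMillis parameter are all names invented for the example.

```java
import java.util.function.BooleanSupplier;
import java.util.function.Consumer;

class StoppableWorker {
    enum Outcome { BEST, GOOD, BETTER_THAN_NOTHING }

    private volatile boolean stopRequested = false;
    private final Thread thread;

    // The body receives a supplier it should poll to learn that a stop was requested.
    StoppableWorker(String name, Consumer<BooleanSupplier> body) {
        this.thread = new Thread(() -> body.accept(() -> stopRequested), name);
    }

    void start() {
        thread.start();
    }

    Outcome stop(long patienceMillis) throws InterruptedException {
        if (patienceMillis <= 0) {
            // join(0) would wait forever, which defeats the purpose
            throw new IllegalArgumentException("patienceMillis must be positive");
        }

        // BEST: ask nicely using our own flag.
        stopRequested = true;
        thread.join(patienceMillis);
        if (!thread.isAlive()) {
            return Outcome.BEST;
        }

        // GOOD: fall back to native interruption.
        thread.interrupt();
        thread.join(patienceMillis);
        if (!thread.isAlive()) {
            return Outcome.GOOD;
        }

        // BETTER THAN NOTHING: make a scene, including where the thread is stuck.
        System.err.println("SHUTDOWN FAILURE: " + thread.getName() + " would not stop");
        for (StackTraceElement frame : thread.getStackTrace()) {
            System.err.println("\tat " + frame);
        }
        return Outcome.BETTER_THAN_NOTHING;
    }
}
```

In a real system the last branch would go to your logging and alerting stack rather than System.err; the point is only that it has to be loud.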

Tasks are Special

Most folks these days don’t use threads directly. Instead, they use something like Akka or native Executors which abstract away thread pool management. Discrete units of work are submitted to the service, wait for resources, do their thing, and optionally return results through a Future or Future-like structure.

A lot of upside here. The only downside is that, like many layered abstractions, they invite bad behavior at the edges. ExecutorService has a shutdown sequence similar to what we’ve been working with already: first call shutdown to tell the service to reject any new tasks and then awaitTermination to finish up existing ones. If things get stuck, the shutdownNow method invokes a more persuasive kill on the tasks.

What does “persuasive” mean? Turns out it’s exactly the same thread termination mechanisms that we’ve been talking about all along! So unless you can guarantee that your tasks are going to exit in an acceptable window, you have to write that same termination logic anyways.
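The usual sequence looks roughly like this sketch. ExecutorStop and its stop helper are hypothetical names, and the task in main only exits because it honors the interrupt that shutdownNow delivers.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class ExecutorStop {
    // Returns true if the pool terminated, false if some task ignored interruption.
    static boolean stop(ExecutorService pool, long patienceMillis) throws InterruptedException {
        pool.shutdown(); // reject new tasks; queued and running tasks carry on
        if (pool.awaitTermination(patienceMillis, TimeUnit.MILLISECONDS)) {
            return true;
        }

        // Interrupts the running tasks. The returned list of never-started
        // tasks deserves handling too; it is dropped here to keep the sketch short.
        pool.shutdownNow();
        if (pool.awaitTermination(patienceMillis, TimeUnit.MILLISECONDS)) {
            return true;
        }

        System.err.println("SHUTDOWN FAILURE: executor tasks ignored interruption");
        return false;
    }

    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        pool.execute(() -> {
            try {
                Thread.sleep(60_000); // stands in for long-running work
            } catch (InterruptedException e) {
                System.out.println("Task saw the interrupt and is exiting");
            }
        });
        System.out.println("Terminated: " + stop(pool, 200));
    }
}
```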

Task-based systems introduce another, more subtle issue: what happens to tasks that were queued but never got a chance to start? If you look at the documentation for shutdownNow, you’ll see that it returns that very list, but even official sample code usually ignores it. Shameful! The code that submitted those tasks almost certainly assumed they’d get done, and may even block for completion. Don’t leave them hanging!
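One way to avoid leaving them hanging, sketched under an assumption: with the standard ThreadPoolExecutor, tasks handed over via submit come back from shutdownNow as the same Future objects the callers are holding, so cancelling them releases anyone blocked in get. QueuedTaskRescue and cancelQueued are made-up names, and other executor implementations may return something else, hence the instanceof check.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

class QueuedTaskRescue {
    // Stops the pool and cancels every task that never started.
    // Returns how many queued tasks were cancelled.
    static int cancelQueued(ExecutorService pool) {
        List<Runnable> neverStarted = pool.shutdownNow();
        int cancelled = 0;
        for (Runnable task : neverStarted) {
            if (task instanceof Future) {
                // Callers blocked in get() now receive a CancellationException
                // instead of waiting forever.
                if (((Future<?>) task).cancel(false)) {
                    cancelled++;
                }
            } else {
                // Submitted via execute(), so nobody holds a Future: make noise instead.
                System.err.println("Dropped queued task: " + task);
            }
        }
        return cancelled;
    }
}
```

Cancelling is the bare minimum. If the work matters, this is also the place to persist or re-queue it somewhere that will outlive the process.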

Be Persistent

Honestly, by the time you’re in this deep, you probably want to look at a more serious messaging system. In-process parallelization is excellent for IO management and to speed up certain algorithms, but it’s a lousy choice for most other things. It’s lousy for the very reasons that have us working so hard on clean shutdowns — to reap the benefits, the submitting code needs to be confident its tasks will not just disappear into the ether.

These days it’s hard to beat cloud-based systems like AWS SQS or Azure Storage Queues. They provide great assurances for both senders and receivers: once a message is accepted, it will be delivered, and every attempt will be made to get it successfully processed. That’s not to say developers on either side of the equation can’t still mess it up, but at least there’s solid ground to build on.

A distributed system also lets you better separate work on different processes and different machines. Should your web server really be processing credit card transactions, even in the background? No way.

Going even further, the real endgame here is an enterprise-wide publish/subscribe messaging bus. Well-designed pubsub is transformative for distributed systems, because it turns tightly-bound transactions (“I just accepted an order, charge the credit card now”) into composable events (“In case anyone cares, I just accepted an order”). In this model, any actor can listen for relevant events — not just the credit card processor, but the inventory-forecasting system and the data warehouse too.

It can be tough to make this happen, especially in the frantic early days of a startup; doing it well takes a reasonable amount of planning that can seem like overkill. Trust me, I know. But that’s something to dive into another day. For now … learn from my mistakes and just get it done!

Code from this post: Cleanup1.java, Cleanup2.java, Cleanup3.java
