Graceful Shutdown

Use case

Imagine a web server. It’s handling many HTTP connections in parallel. The connections may have some kind of timeout: If there’s nothing coming from the client for a minute, the server shuts the connection down to prevent resource wastage and DoS attacks.

When the server itself is being shut down, it stops accepting new connections and gives existing connections 10 seconds to cleanly shut down. After 10 seconds it forcefully cancels any remaining connections and exits.

The problem

Let’s start with a single connection. To prevent DoS, it should always specify a deadline when doing a blocking operation:

data = recv(deadline = now() + 60);
if(data == TIMEOUT) handle_connection_deadline();

So far so good. Now let’s imagine that the server is shutting down. It sends a message to all the connections saying “I want to terminate in 10 seconds, please try to do a clean shutdown!”

That makes life harder for the connection. Now it has to deal with two different deadlines:

real_deadline = min(connection_deadline, server_deadline);
data = recv(deadline = real_deadline);
if(data == TIMEOUT) {
    if(now() > server_deadline) handle_server_deadline();
    else handle_connection_deadline();
}
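
For illustration, here is a minimal Python sketch of the same two-deadline bookkeeping. The function names (effective_deadline, classify_timeout) are made up for this example and do not come from any library; times are plain numbers of seconds.

```python
def effective_deadline(connection_deadline, server_deadline=None):
    """Return the deadline a blocking call should actually use."""
    if server_deadline is None:
        return connection_deadline
    return min(connection_deadline, server_deadline)


def classify_timeout(now, connection_deadline, server_deadline=None):
    """Decide which deadline fired once a blocking call has timed out."""
    if server_deadline is not None and now >= server_deadline:
        return "server"
    return "connection"


# The connection would time out at t=60, but the server wants to exit at t=10.
deadline = effective_deadline(60, server_deadline=10)
which = classify_timeout(now=10, connection_deadline=60, server_deadline=10)
```

Every blocking call in the connection would have to go through both helpers, which is exactly the repetition complained about here.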

What sucks about the above pattern is that it has to be done for every single blocking operation in the connection thread.

But it gets worse.

What if instead of two levels (server and connection) there are three levels? A launches B, which in turn launches C. If B is already in the process of gracefully terminating C and gets a graceful termination request from A – with a different deadline – what is it going to do? Will it pass the new deadline to C? Will C listen for new graceful termination requests even though it’s already in the process of gracefully terminating? Will it do min() of three values instead of two? And what if there are four levels? What if there are more?

In short, graceful termination doesn’t compose and it may even break encapsulation: Each level would have to know about all the levels above it.

Modest proposal

I’ve been banging my head against this problem for a year or so until I came up with something that actually works.

The problem is that if you try to add more syntactic machinery to deal with the problem (soft-cancelation vs. hard-cancelation or somesuch) the semantics tend to get complex and the simplicity and elegance of the entire structured concurrency model just drowns in the complexity.

There’s one crucial observation to be made before we can get out of the mess. Namely, to quote Leo Tolstoy, all hard cancelations are alike; each soft cancelation happens in its own way.

Compare a thread that writes to disk and a thread that handles a WebSocket connection. If asked to hard-cancel, both would simply exit. If asked to shut down gracefully though, the former will try to flush the buffers to the disk. The latter will try to do a terminal handshake with the peer.

To put it differently, hard cancelation can be fully handled by the language runtime. Soft cancelation, on the other hand, always requires some application-specific manual work.

And once we accept the fact that soft-cancelation is mostly an application issue, the solution is not hard to see.

The child thread will get a graceful termination request from the parent, but the request won’t contain any deadline. It would be a simple signal saying “please stop doing the normal work and switch to the termination phase”. The child will then, for example, stop writing logs to a file and it will try to flush the buffers instead.

Note how simple the control flow of the child is and how it requires no complex deadline gymnastics.
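
A minimal sketch of such a child, written with Python’s asyncio rather than libdill. The STOP marker and the log_writer name are assumptions made for this example; the stop request travels on the same channel as the data and carries no deadline.

```python
import asyncio

STOP = object()  # hypothetical in-band "please shut down" marker


async def log_writer(channel, disk):
    """Buffer log lines until asked to stop, then flush them to 'disk'."""
    buffer = []
    while True:
        item = await channel.get()
        if item is STOP:
            break  # leave the normal phase, enter the termination phase
        buffer.append(item)
    # Termination phase: flush what was buffered. No deadline logic here.
    disk.extend(buffer)


async def main():
    channel = asyncio.Queue()
    disk = []  # stands in for a real file
    child = asyncio.create_task(log_writer(channel, disk))
    await channel.put("line 1")
    await channel.put("line 2")
    await channel.put(STOP)  # the graceful termination request
    await child
    return disk


flushed = asyncio.run(main())
```

The child never looks at a clock; whether anyone is willing to wait for the flush is decided elsewhere.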

Now, let’s have a look at the parent thread.

It sends the graceful termination request to the child (or children), then it waits for either a.) child terminating b.) its own deadline expiring. In the former case the graceful shutdown was successful and we can move on. In the latter case the graceful shutdown wasn’t successful and the child thread is still running. We have to hard-cancel it. The code would look like this:

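channel_to_child = channel();
child = scope.launch(child_body(channel_to_child));
send(channel_to_child, "STOP");  // graceful termination request, no deadline attached
res = scope.wait(deadline = 10s);
if(res == TIMEOUT || res == CANCELED) {
    scope.cancel();  // grace period is over (or we were hard-canceled): hard-cancel the child
}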

In the code above I am being verbose, to make it clear what’s happening. In reality, the last 4 lines can be combined in a single function, e.g. “scope.cancel(deadline = 10s)”.
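
As a rough illustration of that combined function, here is a sketch in Python’s asyncio. The name cancel_with_grace and the request_stop callback are assumptions for this example, not part of libdill or any other library.

```python
import asyncio


async def cancel_with_grace(task, request_stop, grace):
    """Ask `task` to stop, wait up to `grace` seconds, then hard-cancel it.

    Returns True if the task finished on its own, False if it had to be
    hard-cancelled.
    """
    request_stop()  # the graceful termination request carries no deadline
    try:
        done, _pending = await asyncio.wait({task}, timeout=grace)
    except asyncio.CancelledError:
        task.cancel()  # we were hard-cancelled ourselves: give up on grace
        raise
    if task in done:
        return True
    task.cancel()  # grace period is over: hard cancel
    try:
        await task
    except asyncio.CancelledError:
        pass
    return False


async def polite_child(stop):
    await stop.wait()
    await asyncio.sleep(0.01)  # short cleanup, well within the grace period


async def stubborn_child(stop):
    await stop.wait()
    await asyncio.sleep(60)  # cleanup that takes far too long


async def main():
    results = []
    for body in (polite_child, stubborn_child):
        stop = asyncio.Event()
        task = asyncio.create_task(body(stop))
        results.append(await cancel_with_grace(task, stop.set, grace=0.1))
    return results


results = asyncio.run(main())
```

Only the parent knows the number 0.1; the children just react to the stop request.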

Please note how the construct composes. Each thread cares only about its own deadlines. All it has to know about the outer world is that it can get a graceful termination request from outside. Whether there are 5 or 10 nested levels of cancelation scopes above it, it doesn’t care. Also note how hard cancelation (res == CANCELED) cancels any ongoing soft cancelation attempts.

I’d love to hear what other people have to say on the topic!

Real quick, while I’m thinking about a more in-depth response, I just want to check: have you read Timeouts and cancellation for humans? It’s exactly about the problems you bring up in the beginning, of composing multiple levels of deadline together etc.

Yes, I did some time ago. I don’t recall anything about graceful shutdown though. I’m going to bed now, will re-read it tomorrow.

I feel like this is another place where C’s limits are really getting in the way :-/ Does every parent need to have careful error-handling code to propagate cancellation from grandparents to grandchildren? Seems pretty awkward…

Can you show how your example works if you extend it to handle the case where this code block may itself receive a soft cancellation request from the outside world?

Does every parent need to have careful error-handling code to propagate cancellation from grandparents to grandchildren?

That’s an artifact of C. Of course, in Python this would be automatic.

Anyway, I’ve written a long post about the topic:

Don’t forget to add it to the resources page :-).

So the post helped me realize one point where we’ve been talking past each other! Trio has always had your bundle_cancel(..., timeout) operation – it’s baked into the cancel scopes design. For example, here’s a cut-and-paste of your first “trivial case” for reference:

// From the blog post; hypothetical future libdill
int main(void) {
    socket_t s = create_connected_socket();
    bundle_t b = bundle();
    bundle_go(b, worker(s));
    bundle_cancel(b, 10);
    return 0;
}

And here’s the same thing using Trio:

# Current Trio
async def main():
    s = await create_connected_socket()
    async with trio.open_nursery() as nursery:
        nursery.start_soon(worker, s)
        await trio.sleep(60)
        nursery.cancel_scope.deadline = trio.current_time() + 10

(In real life you might make it deadline = min(deadline, new_deadline), or use your own cancel scope instead of the implicit one attached to nurseries, but you get the idea.)

The reasons we’ve been considering adding more than that are:

  • While graceful cancellation and hard cancellation are different in their effects, it still feels intuitive that they might be able to share a “targeting” mechanism – if I’ve gone to the trouble to wrap something in a cancel scope so I can cancel it, then can I re-use that to soft-cancel the same code?

  • If graceful cancellation is left up to the application code, then every third-party library has to provide some explicit mechanism for signalling graceful cancellation (or else be lazy and simply leave this feature out). One of the things I noted in Timeouts and cancellation for humans is that empirically, people absolutely fail to do this consistently even for critical functionality like hard cancellation; so, why should we expect them to do it for graceful cancellation?

  • We don’t want to implement recv_from_socket_or_channel :slight_smile:. Trio’s standard recv_from_socket already has support for exiting early in response to an external message (i.e., someone calling cancel() on some enclosing cancel scope). We should re-use this instead of inventing a second way to do it.

    This point doesn’t necessarily require changing things in Trio’s core machinery; you could build recv_from_socket_or_graceful_cancel on top of what we already have, by using cancel scopes and plumbing things together manually. (That’s the idea behind this comment.) But it might be simpler/more natural if it were built-in.

  • After we cancel some code, Trio normally forbids all blocking operations while unwinding. It’s been argued that it would be useful to allow some limited amount of blocking in cleanup code, limited by some grace period. This is the point I’m most dubious about, because of the same issues you raised in your post – the details of graceful unwinding are so application specific that I’m worried we can’t define a semantics for this that’s coherent and general enough to be useful. But maybe we can?

Together these points might be enough to tip us over the edge to building something in, since it is such a common need. But like your post argues, it actually is technically possible to handle everything cleanly with what we have now, so we’re taking a good long time to think things over :slight_smile:
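
To make the “plumbing things together manually” option concrete, here is a small sketch using Python’s asyncio instead of Trio. The function name recv_or_graceful_cancel and the use of an asyncio.Event as the graceful-cancel signal are assumptions for this example, not an existing API.

```python
import asyncio


async def recv_or_graceful_cancel(channel, stop):
    """Return the next item from `channel`, or None if `stop` fires first."""
    recv = asyncio.ensure_future(channel.get())
    stopped = asyncio.ensure_future(stop.wait())
    try:
        await asyncio.wait({recv, stopped}, return_when=asyncio.FIRST_COMPLETED)
        if recv.done():
            return recv.result()  # prefer delivering data if both are ready
        return None
    finally:
        # Whichever waiter lost the race must not be left dangling.
        recv.cancel()
        stopped.cancel()


async def main():
    channel = asyncio.Queue()
    stop = asyncio.Event()
    await channel.put("hello")
    first = await recv_or_graceful_cancel(channel, stop)
    stop.set()  # graceful cancel requested while the channel is empty
    second = await recv_or_graceful_cancel(channel, stop)
    return first, second


first, second = asyncio.run(main())
```

Every call site that should react to a graceful cancel has to opt in by calling the wrapper, which is the manual cost a built-in mechanism would remove.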

Aha! I hadn’t realized that. And I did look (although briefly) at the cancel scopes. Maybe the problem is that the construct doesn’t feel like graceful shutdown. That overwriting of the deadline looks more like tweaking some kind of knob. Anyway, that may be just my C background speaking.

Hm, one of the points I was trying to make in the article was that they are nothing alike. Graceful shutdown feels like cancellation because it’s often paired with a timeout. But don’t get confused. It’s a separate thing that can be combined with the existing hard-cancellation mechanism. It can also be used alone if you don’t care about abandoning the graceful shutdown after a finite time.

I guess, people are not doing hard cancellation because it sucks so much that it is, in practice, not doable. Luckily, we already have a solution for that.

And no, I would expect most people to implement graceful termination. But the point I was trying to make was that I don’t care that much. If library X doesn’t implement graceful termination, so be it, it won’t be shut down cleanly. But that failure doesn’t propagate out of the library. The mechanism of graceful termination of the application would still work and the rest of the application would shut down cleanly. In other words, it’s a local failure, not the toxic thing that propagates through an entire codebase like goto does.

Can you point me to the docs/code? Doing search for recv_from_socket on Trio’s doc page doesn’t return anything.

This is something that really interests me. It seems to me that it requires getting good atomicity properties from a network socket so that a signal can’t interrupt the socket in the middle of receiving a message. Not easy to do, unless, of course, you want to wrap each socket in a dedicated coroutine.

Yes, that’s the reasonable option for hard cancellation.

I am pretty sure about this one: All blocking operations should be allowed during graceful shutdown. In the end, graceful shutdown is just part of the application logic. It may be kind of separate in the programmer’s mind, but from the point of view of the language there’s nothing special about it.

That, of course, changes when graceful shutdown is hard-cancelled. From that point on, all blocking operations should be forbidden. It’s hard cancellation after all.

I feel it may be a question of education, really. Show programmers a few examples of graceful shutdown done right and they’ll get the gist.

Enforcing good habits via language design seems to be almost impossible in this particular case.

Anyway, if you come up with something, I’d be curious to see it.

Right. In Go, the way they handle cancellation is literally “pass a channel around to everyone to carry the cancel signal”, just like your graceful cancellation proposal. And they’ve done a huge amount of work to make this ergonomic, and plumb it through the ecosystem everywhere, but… you still can’t use it to wake up basic socket calls like accept or recv. (And those are the two calls you need to interrupt to do graceful shutdown of an HTTP/1.1 server.) So based on their experience, I’m guessing your graceful cancellation system will see pretty limited use. Maybe that’s still the best possible outcome once you take into account the limitations of other approaches, though.

Oh yeah, I was using your name for the function :-). Our core Stream API uses the same semantics as BSD recv(2), i.e., it allows short reads – see receive_some. So that makes atomicity much easier. (Though there are still some corner-cases where receive_some isn’t atomic wrt cancellation. For example, cancelling SSLStream.receive_some can corrupt the stream’s internal state, if you get really unlucky and it’s in the middle of a TLS renegotiation. Fortunately renegotiation is mostly deprecated.)

In practice, after receive_some is cancelled, usually the next thing you do is close the stream anyway. Either your whole routine is being cancelled from outside, or in the graceful cancellation case you’ve specifically selected a receive_some that you know you want to break out of, so you get a chance to think about the consequences of any lack-of-atomicity when opting-in.

Ah, now I see what we are talking about here.

So, I haven’t made one of my assumptions explicit. Namely, that most coroutines don’t need graceful shutdown. I am looking at a codebase I am working on at my job now. It’s in Go, so there are a lot of goroutines, but graceful shutdown is rarely needed, definitely below 10% of cases. (For example, a coroutine that handles WebSocket connections may need GS, but the coroutine running the accept loop does not.)

If you color the coroutines that need GS in the call tree red, it looks more or less like this:


Now, the thing is that unlike with Golang’s contexts, the white circles don’t have to do any additional work to support GS. If they support hard cancellation they are good to go.

So, in the end, I would imagine authors of libraries that require GS to actually implement it (if you are implementing WebSocket library, you want to handle CLOSE somehow, anyway, so you may as well do an actual GS) and everyone else to ignore it. The entire system would still work.

One part where the language may help, I think, is to propagate the GS period down the call tree automatically. I.e. if the coroutine is GS-unaware and doesn’t specify GS period when shutting down its children, the GS period specified by its parent will be used.
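
A toy model of that inheritance rule in Python. The Node class, the coroutine names and the numbers are invented for this example; a grace of None stands for a GS-unaware coroutine.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    name: str
    grace: Optional[float] = None  # None means "GS-unaware"
    children: List["Node"] = field(default_factory=list)


def effective_grace(node, inherited, out):
    """Fill `out` with the GS period each node applies to its children."""
    period = node.grace if node.grace is not None else inherited
    out[node.name] = period
    for child in node.children:
        effective_grace(child, period, out)
    return out


tree = Node("main", grace=10, children=[
    Node("accept_loop", children=[      # GS-unaware: inherits 10
        Node("ws_handler", grace=3),    # GS-aware: picks its own period
        Node("http_handler"),           # GS-unaware: inherits 10
    ]),
])
periods = effective_grace(tree, inherited=0, out={})
```

Here accept_loop never mentions a GS period, yet it ends up using the 10 seconds chosen by main.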

Ah, that’s a pity. I thought you might have come up with something reasonable. Making socket API actually usable (i.e. no short reads) is my second big goal with libdill (or, really, the first one). But it has nothing to do with structured concurrency, so let’s drop the discussion.

OK wait now I’m confused! I would have said that an accept loop is the canonical example of a routine that needs GS. Like, the most famous example of graceful shutdown is what Apache/nginx/etc. do, right? And for them it means:

  • Shut down the accept loops (→ cancel any pending accept calls, and stop issuing new ones)
  • Stop accepting new requests on existing connections (→ for all connections that are waiting for a new request to arrive, cancel the pending recv call and close the connection)
  • Wait for active requests to complete, possibly with some timeout

You might also want to pass the GS signal on to the actual request handlers, in case any of them are implementing a “long poll” protocol or similar. (This is another example where automatic GS propagation could be useful – in many server frameworks, there are multiple layers of third-party libraries between the top-level application and the request handler.)
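
The three steps can be simulated in a few lines of Python’s asyncio. This is only a sketch: the tasks are stand-ins that sleep instead of touching real sockets, and the split into acceptors, idle and active lists is an assumption made for the example.

```python
import asyncio


async def accept_loop():
    await asyncio.sleep(3600)  # stands in for a blocking accept()


async def idle_connection():
    await asyncio.sleep(3600)  # stands in for recv() waiting for a request


async def active_request(duration):
    await asyncio.sleep(duration)  # stands in for real request handling


async def graceful_shutdown(acceptors, idle, active, timeout):
    # Steps 1 and 2: stop accepting, and close connections that are only
    # waiting for a new request. For these, GS is just a hard cancel.
    for task in [*acceptors, *idle]:
        task.cancel()
    await asyncio.gather(*acceptors, *idle, return_exceptions=True)
    # Step 3: wait for in-flight requests, but only up to `timeout` seconds.
    if not active:
        return 0, 0
    done, pending = await asyncio.wait(active, timeout=timeout)
    for task in pending:
        task.cancel()  # grace period over: hard cancel
    await asyncio.gather(*pending, return_exceptions=True)
    return len(done), len(pending)


async def main():
    acceptors = [asyncio.create_task(accept_loop())]
    idle = [asyncio.create_task(idle_connection()) for _ in range(2)]
    active = [asyncio.create_task(active_request(d)) for d in (0.01, 60)]
    await asyncio.sleep(0)  # let every task start running
    return await graceful_shutdown(acceptors, idle, active, timeout=0.1)


finished, killed = asyncio.run(main())
```

The fast request completes within the grace period and the slow one is hard-cancelled once it runs out.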

TBH I also don’t understand what you mean about websockets. We model websockets as objects that you can send and receive messages on, but they don’t own any tasks. And the RFC says that the response to a CLOSE frame is always to immediately send a CLOSE frame back, which doesn’t seem like it needs any coordination with tasks?

FWIW, I’ve thought about it pretty hard and I’m pretty sure that short-reads are the best-of-all-possible-primitives here, since they’re the only thing that lets you easily and efficiently implement all the alternatives on top. Agreed that this is an unrelated topic though :slight_smile:

Yeah, sounds like we are talking past each other. Let’s write some code. Here’s an accept loop:

coroutine void accept_loop(listener) {
    b = bundle();
    while(1) {
        s = accept(listener);
        if(s == ECANCELED) break;  // hard-canceled: stop accepting
        bundle_go(b, connection_handler(s));
    }
    bundle_cancel(b, 10);  // give the handlers 10 seconds, then hard-cancel them
}

As can be seen, the coroutine accepts no GS signal. It just implements hard cancellation (assuming it’s the “hard cancel with timeout” as per my proposal). Of course, the interesting question is where it got the number 10 from. I think we both agree that it would be nice if that number could be somehow inherited from the parent.

Sorry, I’ve been imprecise there. What I meant was “user-defined coroutine that handles a WS connection”. Here the user may want to get an explicit GS signal and initiate the terminal WS handshake (omitting hard cancellation parts for brevity):

coroutine void ws_handler(s, ch) {
    while(1) {
        message_t msg;
        recv(ch, &msg);
        if(msg == "STOP") break;
        send(s, &msg);
    }
    // GS begin.
    // GS end.
}

Btw, we should really draw this on a whiteboard. Aren’t you traveling to Europe any time soon? I am unfortunately not going to get to the US in the foreseeable future.

Wait a second. I think I see what you are alluding to now. The accept loop would have to explicitly pass GS signal to WS coroutines, because otherwise they would just happily run until they are hard-cancelled. Let me think about this some more.

OK, I see, sure, for the accept loop, hard cancellation and GS are the same. But say I have a signal handler up at the top of my call tree that wants to GS my whole program. How does it find all the accept loops in order to send them a hard cancellation?

This is what I was trying to get at in a previous message… Imagine we could write something like:

with upgrade_graceful_cancel_to_hard_cancel:
    while True:
        conn = listener.accept()
        nursery.start_soon(handler, conn)

Now our signal handler can send a GS to the whole program, and each accept loop upgrades incoming GS into a hard cancel. The point is: now the signal handler doesn’t have to know about the accept loops, or vice-versa.
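
One way such an “upgrade” block could be approximated today, sketched with Python’s asyncio rather than Trio. The context manager below is hypothetical: it assumes the graceful signal is an asyncio.Event that the signal handler sets, and it turns that event into a cancellation of the enclosed block only.

```python
import asyncio
from contextlib import asynccontextmanager


@asynccontextmanager
async def upgrade_graceful_cancel_to_hard_cancel(graceful):
    """Hard-cancel the enclosed block as soon as `graceful` is set."""
    me = asyncio.current_task()
    upgraded = False

    async def watcher():
        nonlocal upgraded
        await graceful.wait()
        upgraded = True
        me.cancel()

    watch = asyncio.create_task(watcher())
    try:
        yield
    except asyncio.CancelledError:
        if not upgraded:
            raise  # a hard cancel from outside: let it propagate
        if hasattr(me, "uncancel"):
            me.uncancel()  # Python 3.11+: retract our own cancel request
    finally:
        watch.cancel()


async def accept_loop(graceful, accepted):
    async with upgrade_graceful_cancel_to_hard_cancel(graceful):
        while True:
            await asyncio.sleep(0.01)  # stands in for listener.accept()
            accepted.append("conn")
    return "stopped accepting"


async def main():
    graceful = asyncio.Event()
    accepted = []
    loop_task = asyncio.create_task(accept_loop(graceful, accepted))
    await asyncio.sleep(0.05)
    graceful.set()  # the signal handler's only job
    return await loop_task, len(accepted) > 0


status, got_connections = asyncio.run(main())
```

The accept loop leaves its while loop without ever checking the event itself, and the code after the block still runs normally.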

This is also an issue, but if we have a mechanism for the signal handler to talk to the accept loop, then I guess we can re-use it to let the signal handler talk to the individual WS coroutines too?

Unfortunately no :frowning:

Ok, you are totally right. Back to the drawing board :frowning:

One additional problem: HC signal can’t be used as a substitute for GS signal in GS-unaware coroutines because HC signal causes all blocking functions in the target coroutine to return ECANCELED. Therefore, the targeted coroutine wouldn’t be able to do any meaningful work.

Still, I think the design goal of making white nodes in the graph posted above just work, without any additional GS-related code, is worth pursuing.

Let me think out loud.

bundle_cancel() could send a GS signal to the children immediately, then wait for the timeout period, then send HC signal.

The problem with that is that GS signal is application specific (see in-band vs. out-of-band discussion in my blog post). But bundle_cancel() is part of the language and knows nothing about the application logic. Thus, it can’t send the GS signal.

So let’s look at it from a different angle. What if there were a kind of poor-man’s-GS-signal that would be sent automatically by bundle_cancel() but, at the same time, there would be a way to overload it in the GS-aware coroutines?

That would mean that the parent of a GS-aware coroutine would have to GS it in a correct way (automatic GS signal vs. manual one). That sucks, but the good news is that the grandparent wouldn’t have to know. From this perspective it looks like GS signal could be a local contract between the parent and the child. It wouldn’t have any effect on the code beyond those two coroutines.

Grrr. I’ve tried to prototype it and even that wouldn’t be sufficient.

The problem is that GS-unaware nodes wouldn’t propagate GS signal to their children.

If, on the other hand, the language wanted to do that automatically, it would have to be aware of all the parent-child relationships. And while that is a relatively reasonable expectation, libdill, at the moment, doesn’t have that info.

And even if it worked, there’s still the use case of GS-aware coroutine postponing GS of its children. This may not be the most common use case but I can imagine it could be needed. How would it accomplish that given that language runtime would propagate the signal behind the scenes?

In Trio, we have the same issue with hard-cancel. (Which the language propagates automatically, behind the scenes, taking advantage of its knowledge of parent-child relationships.) We solve it with a construct called a “shield” – you can write:

with trio.CancelScope(shield=True):
    # Code in here can't "see" cancellations coming from outside
    # until the shield is removed. Cancellations that originate
    # *inside* the shield still work, though.
    ...

I guess in libdill’s case, you could do something like set_current_coroutine_shield(SOFT) (vs NONE vs HARD)? Or make it an attribute on the bundle or something?

Yes, I’ve seen that.

Still, the entire concept of behind the scenes distribution of the signal feels askew.

Let’s say we have the following call tree (same color scheme as before):


Say A enters GS mode. Suddenly, C starts doing GS while B is unaware of it. For B it looks like C closed all by itself, without being asked to do so. Which is, kind of, like a violation of the contract between B and C.

All in all, it feels like reintroducing goto under a different name.

Thinking about it some more, the claim about composability from my blog post still stands, although in a somewhat limited manner.

Namely, if there is an uninterrupted chain of GS-aware components between the initiator of the shutdown (A) and the target coroutine (C), the target would gracefully shut down:


Most importantly, the GS process wouldn’t disrupt cancellation of non-GS-aware coroutines (D, F) in any way.

However, a GS-capable coroutine (E) owned by a parent which doesn’t care about GS (D) is just going to be HC-ed.

But maybe that’s the reasonable thing to do?
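
The rule can be written down as a tiny Python model. The tree shape below is an assumption reconstructed from the letters used in the text (A, B and C form a GS-aware chain, D and F are GS-unaware, E is GS-capable but owned by D); the Coroutine class exists only for this example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Coroutine:
    name: str
    gs_aware: bool
    children: List["Coroutine"] = field(default_factory=list)


def shutdown_outcome(node, gs_reaches_me=True, out=None):
    """Map each coroutine to 'graceful' or 'hard'.

    A coroutine is shut down gracefully only if it is GS-aware itself and
    every ancestor up to the initiator forwarded the GS signal.
    """
    out = {} if out is None else out
    graceful = gs_reaches_me and node.gs_aware
    out[node.name] = "graceful" if graceful else "hard"
    for child in node.children:
        shutdown_outcome(child, graceful, out)
    return out


# Hypothetical tree: A -> B -> C are GS-aware; D and F are not; E is
# GS-capable but owned by D.
tree = Coroutine("A", True, [
    Coroutine("B", True, [Coroutine("C", True)]),
    Coroutine("D", False, [Coroutine("E", True)]),
    Coroutine("F", False),
])
outcome = shutdown_outcome(tree)
```

In this model E ends up hard-cancelled purely because of D, even though E itself knows how to shut down gracefully.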

Sorry to jump in, but I had some thoughts.

If B is not GS aware, what is B’s expected behavior when C exits (without an exception)?

  • B doesn’t know C exited. This is fine since HC will clean up B anyway.
  • B exits as soon as all of its children exit. B eventually exits which is what we wanted.
  • B detects that C exited and decides to spawn C again. C gets no work because hopefully we’ve stopped accepting new work in B’s ancestors. Both B and C get HC and exit.
  • B detects C exited and treats it like an error state, raising an exception or exiting with an error. This doesn’t seem like a resilient program unless that error is handled before it brings down the entire process. Hopefully the program author sees this crash in practice and adds appropriate handling in an ancestor during GS? I guess this is more of an escalation from GS to HC while the stack unwinds up to some parent, potentially hurting GS across the tree.

The last case is problematic for GS. It sucks that a single leaf in the tree could cause HC behavior across many other nodes, but that’s the point of structured concurrency - at least they exit and have an opportunity to clean up some resources.

Honestly the last case feels like a bug in B since the absence of an error from C on exit probably shouldn’t propagate out as an error - the author of B doesn’t really understand the contract with C.

If GS is an application construct, it has to be passable to every layer of your program. Every library needs to implement support for it and they all need to follow some standard conventions (or you need to add extra layers to convert between them). If D does not follow that convention, E will never get the signal and will get HC.

I like this suggestion:

I would expect every checkpoint within the graceful cancel scope to act as though the scope was cancelled. GracefulCancelScopes created after the parent cancel scope has entered graceful cancel will immediately raise a cancellation on the first checkpoint.

It would be nice if every cancel scope could have a graceful_cancel() method that would propagate like regular cancel() but only be raised within GracefulCancelScopes. That way you can control what parts of your program are gracefully shutting down.

Or maybe every cancel scope has a graceful property that can be set to True. If you are in a scope where graceful = True (which is also inherited from its parent scope), any checkpoint you hit will raise a cancel.

async def producer(queue):
    while True:
        with trio.GracefulCancelScope() as cancel_scope:
            # or set cancel_scope.graceful = True
            message = await conn.recv()
        if cancel_scope.cancelled_caught:
            break  # graceful cancel arrived while waiting for a message
        await queue.put(message)  # don't want to cancel this on GS
    # this will succeed on GS but will raise on HC
    await queue.put(STOP_SENTINEL)
    await conn.send_close()

async def consumer(queue):
    while True:
        item = await queue.get()
        if item is STOP_SENTINEL:
            await queue.put(STOP_SENTINEL)  # make sure the other consumers get the message
            break
        # do stuff with item

async def server(queue):
    # A case where you want to make sure the consumers drain the queue on GS
    async with trio.open_nursery() as nursery:
        nursery.start_soon(producer, queue)
        for i in range(5):
            nursery.start_soon(consumer, queue)
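
The sentinel hand-off between producer and consumers does not depend on the proposed GracefulCancelScope, so it can be tried out on its own. Here is a runnable sketch using Python’s asyncio instead of Trio; the fixed list of messages stands in for the conn.recv() loop and is an assumption of this example.

```python
import asyncio

STOP_SENTINEL = object()


async def producer(queue, messages):
    for message in messages:  # stands in for the conn.recv() loop
        await queue.put(message)
    await queue.put(STOP_SENTINEL)  # graceful stop: tell consumers to drain


async def consumer(queue, handled):
    while True:
        item = await queue.get()
        if item is STOP_SENTINEL:
            await queue.put(STOP_SENTINEL)  # pass the message on to the others
            break
        handled.append(item)


async def server():
    queue = asyncio.Queue()
    handled = []
    tasks = [asyncio.create_task(producer(queue, range(20)))]
    tasks += [asyncio.create_task(consumer(queue, handled)) for _ in range(5)]
    await asyncio.gather(*tasks)
    return sorted(handled)


handled = asyncio.run(server())
```

Because the queue is FIFO, every message queued before the sentinel is handled before any consumer exits.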

You should also be able to set graceful = False to opt a scope out of graceful cancellation. It is a little like shield except that you can opt in again later.

async def first():
    with trio.GracefulCancelScope():
        # any checkpoint in second will cause a cancellation on graceful_cancel()
        await second()

async def second():
    with trio.CancelScope() as cancel_scope:
        # if this does not shield from the graceful scope above
        # then any checkpoint in third() will cause a cancellation on graceful_cancel()
        # use cancel_scope.graceful = False to disable
        await third()

async def third():
    # since this is graceful cancel aware,
    # it should be explicit about blocking graceful cancel
    with trio.CancelScope(graceful=False):
        with trio.GracefulCancelScope():
            await trio.sleep(10)  # simulate a good graceful stopping point
        await trio.sleep(100)  # simulate work you don't want to gracefully stop

In the interest of keeping things simple I think new cancel scopes should inherit the graceful bit. To intentionally handle graceful cancel within your code, you should block at the entrypoint when entering code that supports graceful cancel. From above, server becomes:

async def server(queue):
    # A case where you want to make sure the consumers drain the queue on GS
    async with trio.open_nursery() as nursery:
        nursery.cancel_scope.graceful = False  # protect children since this is GS aware
        nursery.start_soon(producer, queue)
        for i in range(5):
            nursery.start_soon(consumer, queue)

# callers of server
async def worker():
    with trio.GracefulCancelScope():
        # if the nursery cancel scope within server does not disable the graceful bit
        # the consumers would just die instead of gracefully exiting
        await server(queue)

async def main():
    async with trio.open_nursery() as nursery:
        ...

This keeps things simple for the runtime and libraries. If your code has no concept of graceful cancellation, the cancellation scope above you will determine if you are cancelled on graceful cancellation or not.

# if conn.recv() makes a new cancel scope, it must inherit the graceful bit
# otherwise any caller can't opt in to graceful cancellation when calling this coroutine
async def recv(self):
    with trio.move_on_after(TIMEOUT):
        # read from somewhere
        ...

In this way, the graceful bit is different from shield in that a graceful cancel scope should be cancelled if any scope above it calls graceful_cancel, even if there is a graceful = False scope in between.
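
One possible reading of these rules, as a toy Python model. The Scope class is invented for this example and is not Trio’s API: a scope inherits the graceful bit from its parent unless it sets one explicitly, and a graceful scope is cancelled whenever any ancestor has called graceful_cancel().

```python
class Scope:
    """Toy model of the proposed 'graceful bit'; this is not Trio's API."""

    def __init__(self, parent=None, graceful=None):
        self.parent = parent
        self._graceful = graceful  # None means "inherit from parent"
        self.graceful_cancel_called = False

    @property
    def graceful(self):
        if self._graceful is not None:
            return self._graceful
        return self.parent.graceful if self.parent else False

    def graceful_cancel(self):
        self.graceful_cancel_called = True

    def checkpoint_raises(self):
        """Would a checkpoint directly inside this scope be cancelled?"""
        if not self.graceful:
            return False
        scope = self
        while scope is not None:
            if scope.graceful_cancel_called:
                return True
            scope = scope.parent
        return False


root = Scope()                                 # e.g. the top-level nursery
outer = Scope(parent=root, graceful=True)      # a GracefulCancelScope
blocked = Scope(parent=outer, graceful=False)  # opts its body out
plain = Scope(parent=blocked)                  # inherits graceful=False
inner = Scope(parent=plain, graceful=True)     # opts back in

root.graceful_cancel()
result = {
    "outer": outer.checkpoint_raises(),
    "blocked": blocked.checkpoint_raises(),
    "plain": plain.checkpoint_raises(),
    "inner": inner.checkpoint_raises(),
}
```

The inner scope is cancelled even though a graceful = False scope sits between it and the scope that requested the shutdown, which is where the bit differs from shield.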

I am not sure I follow here. If B treated C exiting without an error as an error then there would be no way for a scope to finish without throwing an exception, no?

Yes, that’s what I’ve been saying: If B doesn’t care about shutting down its children gracefully they will just be HC’ed. What you’ve asked for is what you get.

However, I believe people will be split about this and specifically, that they would be split along language lines. People with C and Golang background will favour the proposition above. People coming from higher level languages would want the GS signal to propagate down the tree automatically.

In fact, we may be replaying the entire exceptions vs. error codes discussion here under a different name.

One can think of GS propagation as a kind of exception that travels down the stack instead of up the stack.

And Python/Java/C++ people would use the classic pro-exception arguments: Programmers are lazy and careless and unless the exception is propagated automatically, they will never propagate it by hand.

And C/Golang people would resort to the well-known anti-exception arguments: Exceptions are creating a second, implicit workflow in the program. A workflow that is rarely used (and often not even properly thought about) and the unforeseen interactions between the two workflows will result in subtle, hard to catch bugs.

And maybe that’s not a problem. Each language would implement the GS signal propagation in a way that best aligns with its philosophy.

All of that being said, the balance of concerns is tilted a little bit differently for error propagation and for GS signal propagation: If an error is not handled, that’s a bad™ thing and can corrupt the program in really bad ways. If GS is not handled, the worst that could happen is that the coroutine will be HC’ed. Which is also bad, but maybe a little less bad. You draw your conclusions yourself.

I won’t comment on the proposal because I am not very familiar with how Trio works, but you’ll still have to deal with the problems outlined in my blog post and in this thread: GS signal should be propagated in band. But the language doesn’t know what “in-band” means. Delivery of GS signal should not be disruptive (e.g. throwing an exception) – if it leaves an object in inconsistent state, the application code handling the GS would probably fail. Etc.