Graceful Shutdown

#3

Yes, I did some time ago. I don’t recall anything about graceful shutdown though. I’m going to bed now, will re-read it tomorrow.

Structured Concurrency Kickoff
#4

I feel like this is another place where C’s limits are really getting in the way :-/ Does every parent need to have careful error-handling code to propagate cancellation from grandparents to grandchildren? Seems pretty awkward…

Can you show how your example works if you extend it to handle the case where this code block may itself receive a soft cancellation request from the outside world?

#5

Does every parent need to have careful error-handling code to propagate cancellation from grandparents to grandchildren?

That’s an artifact of C. Of course, in Python this would be automatic.

Anyway, I’ve written a long post about the topic:

http://250bpm.com/blog:146

#6

Don’t forget to add it to the resources page :-).

So the post helped me realize one point where we’ve been talking past each other! Trio has always had your bundle_cancel(..., timeout) operation – it’s baked into the cancel scopes design. For example, here’s a cut-and-paste of your first “trivial case” for reference:

// From the blog post; hypothetical future libdill
int main(void) {
    socket_t s = create_connected_socket();
    bundle_t b = bundle();
    bundle_go(b, worker(s));
    sleep(60);
    bundle_cancel(b, 10);
    return 0;
}

And here’s the same thing using Trio:

# Current Trio
async def main():
    s = await create_connected_socket()
    async with trio.open_nursery() as nursery:
        nursery.start_soon(worker, s)
        await trio.sleep(60)
        nursery.cancel_scope.deadline = trio.current_time() + 10

(In real life you might make it deadline = min(deadline, new_deadline), or use your own cancel scope instead of the implicit one attached to nurseries, but you get the idea.)
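For readers without Trio or libdill at hand, the "give the children a grace period, then hard-cancel whatever is left" behavior can be sketched with the standard library's asyncio. This is only an illustration under assumptions: `cancel_with_grace` and `worker` are hypothetical names, and plain asyncio tasks stand in for the children of a bundle or nursery. It is not how either library implements the operation.

```python
import asyncio


async def cancel_with_grace(tasks, grace):
    """Wait up to `grace` seconds for tasks to finish, then hard-cancel the rest."""
    done, pending = await asyncio.wait(tasks, timeout=grace)
    for task in pending:
        task.cancel()
    # Let the cancelled tasks unwind; their CancelledError is collected, not raised.
    await asyncio.gather(*pending, return_exceptions=True)
    return len(done), len(pending)


async def worker(duration):
    # Hypothetical worker: just sleeps for `duration` seconds.
    await asyncio.sleep(duration)


async def demo():
    tasks = [
        asyncio.create_task(worker(0.01)),  # finishes within the grace period
        asyncio.create_task(worker(10)),    # still running when the grace period ends
    ]
    return await cancel_with_grace(tasks, grace=0.1)


# (tasks that finished on their own, tasks that had to be hard-cancelled)
result = asyncio.run(demo())
```

The first worker finishes by itself, while the second is cancelled once the grace period runs out, which is the same outcome as moving the deadline on the nursery's cancel scope.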

The reasons we’ve been considering adding more than that are:

  • While graceful cancellation and hard cancellation are different in their effects, it still feels intuitive that they might be able to share a “targeting” mechanism – if I’ve gone to the trouble to wrap something in a cancel scope so I can cancel it, then can I re-use that to soft-cancel the same code?

  • If graceful cancellation is left up to the application code, then every third-party library has to provide some explicit mechanism for signalling graceful cancellation (or else be lazy and simply leave this feature out). One of the things I noted in Timeouts and cancellation for humans is that empirically, people absolutely fail to do this consistently even for critical functionality like hard cancellation; so, why should we expect them to do it for graceful cancellation?

  • We don’t want to implement recv_from_socket_or_channel :-). Trio’s standard recv_from_socket already has support for exiting early in response to an external message (i.e., someone calling cancel() on some enclosing cancel scope). We should re-use this instead of inventing a second way to do it.

    This point doesn’t necessarily require changing things in Trio’s core machinery; you could build recv_from_socket_or_graceful_cancel on top of what we already have, by using cancel scopes and plumbing things together manually. (That’s the idea behind this comment.) But it might be simpler/more natural if it were built-in.

  • After we cancel some code, Trio normally forbids all blocking operations while unwinding. It’s been argued that it would be useful to allow some limited amount of blocking in cleanup code, limited by some grace period. This is the point I’m most dubious about, because of the same issues you raised in your post – the details of graceful unwinding are so application specific that I’m worried we can’t define a semantics for this that’s coherent and general enough to be useful. But maybe we can?

Together these points might be enough to tip us over the edge to building something in, since it is such a common need. But like your post argues, it actually is technically possible to handle everything cleanly with what we have now, so we’re taking a good long time to think things over :-)
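The "plumb things together manually" option from the third bullet can be sketched roughly like this. The sketch uses the standard library's asyncio rather than Trio, so every name in it is an assumption: `recv_or_graceful_stop` is a hypothetical helper, an `asyncio.Queue` stands in for the socket, and an `asyncio.Event` stands in for the graceful-cancel request.

```python
import asyncio


async def recv_or_graceful_stop(queue, stop_event):
    """Return ("message", item) or ("stop", None), whichever comes first."""
    recv = asyncio.create_task(queue.get())
    stop = asyncio.create_task(stop_event.wait())
    done, pending = await asyncio.wait(
        {recv, stop}, return_when=asyncio.FIRST_COMPLETED
    )
    # Cancel the loser so that no task keeps running in the background.
    for task in pending:
        task.cancel()
    await asyncio.gather(*pending, return_exceptions=True)
    # Prefer a message that has already arrived over the stop request.
    if recv in done:
        return ("message", recv.result())
    return ("stop", None)


async def demo():
    queue = asyncio.Queue()
    stop_event = asyncio.Event()
    await queue.put("hello")
    first = await recv_or_graceful_stop(queue, stop_event)
    stop_event.set()  # someone asks for a graceful stop
    second = await recv_or_graceful_stop(queue, stop_event)
    return first, second


results = asyncio.run(demo())
```

Only the receive that is explicitly wrapped this way reacts to the graceful request; every other blocking call in the same task keeps running until a hard cancel arrives.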

#7

Aha! I hadn’t realized that. And I did look (although briefly) at the cancel scopes. Maybe the problem is that the construct doesn’t feel like graceful shutdown. That overwriting of the deadline looks more like tweaking some kind of knob. Anyway, that may be just my C background speaking.

Hm, one of the points I was trying to make in the article was that they are nothing alike. Graceful shutdown feels like cancellation because it’s often paired with a timeout. But don’t get confused. It’s a separate thing that can be combined with the existing hard-cancellation mechanism. It can also be used alone if you don’t care about abandoning the graceful shutdown after a finite time.

I guess, people are not doing hard cancellation because it sucks so much that it is, in practice, not doable. Luckily, we already have a solution for that.

And no, I would expect most people to implement graceful termination. But the point I was trying to make was that I don’t care that much. If library X doesn’t implement graceful termination, so be it, it won’t be shut down cleanly. But that failure doesn’t propagate out of the library. The mechanism of graceful termination of the application would still work and the rest of the application would shut down cleanly. In other words, it’s a local failure, not the toxic thing that propagates through an entire codebase like goto does.

Can you point me to the docs/code? Doing search for recv_from_socket on Trio’s doc page doesn’t return anything.

This is something that really interests me. It seems to me that this requires getting good atomicity properties from a network socket so that a signal can’t interrupt the socket in the middle of receiving a message. Not easy to do, unless, of course, you want to wrap each socket in a dedicated coroutine.

Yes, that’s the reasonable option for hard cancellation.

I am pretty sure about this one: All blocking operations should be allowed during graceful shutdown. In the end, graceful shutdown is just part of the application logic. It may be kind of separate in the programmer’s mind, but from the point of view of the language there’s nothing special about it.

That, of course, changes when graceful shutdown is hard-cancelled. From that point on, all blocking operations should be forbidden. It’s hard cancellation after all.

I feel it may be a question of education, really. Show programmers a few examples of graceful shutdown done right and they’ll get the gist.

Enforcing good habits via language design seems to be almost impossible in this particular case.

Anyway, if you come up with something, I’d be curious to see it.

#8

Right. In Go, the way they handle cancellation is literally “pass a channel around to everyone to carry the cancel signal”, just like your graceful cancellation proposal. And they’ve done a huge amount of work to make this ergonomic, and plumb it through the ecosystem everywhere, but… you still can’t use it to wake up basic socket calls like accept or recv. (And those are the two calls you need to interrupt to do graceful shutdown of an HTTP/1.1 server.) So based on their experience, I’m guessing your graceful cancellation system will see pretty limited use. Maybe that’s still the best possible outcome once you take into account the limitations of other approaches, though.

Oh yeah, I was using your name for the function :-). Our core Stream API uses the same semantics as BSD recv(2), i.e., it allows short reads – see trio.abc.ReceiveStream.receive_some. So that makes atomicity much easier. (Though there are still some corner-cases where receive_some isn’t atomic wrt cancellation. For example, cancelling SSLStream.receive_some can corrupt the stream’s internal state, if you get really unlucky and it’s in the middle of a TLS renegotiation. Fortunately renegotiation is mostly deprecated.)

In practice, after receive_some is cancelled, usually the next thing you do is close the stream anyway. Either your whole routine is being cancelled from outside, or in the graceful cancellation case you’ve specifically selected a receive_some that you know you want to break out of, so you get a chance to think about the consequences of any lack-of-atomicity when opting-in.

#9

Ah, now I see what we are talking about here.

So, I haven’t made one of my assumptions explicit. Namely, that most coroutines don’t need graceful shutdown. I am looking at a codebase I am working on at my job now; it’s in Go, so there are a lot of goroutines, but graceful shutdown is rarely needed, definitely below 10% of cases. (For example, a coroutine that handles WebSocket connections may need GS, but the coroutine doing the accept loop does not.)

If you color the coroutines that need GS in the call tree red, it looks more or less like this:

[Figure gs3: call tree with the coroutines that need GS colored red]

Now, the thing is that unlike with Golang’s contexts, the white circles don’t have to do any additional work to support GS. If they support hard cancellation they are good to go.

So, in the end, I would imagine authors of libraries that require GS to actually implement it (if you are implementing WebSocket library, you want to handle CLOSE somehow, anyway, so you may as well do an actual GS) and everyone else to ignore it. The entire system would still work.

One part where the language may help, I think, is to propagate the GS period down the call tree automatically. I.e. if the coroutine is GS-unaware and doesn’t specify GS period when shutting down its children, the GS period specified by its parent will be used.
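A rough model of that inheritance rule, written in plain Python rather than libdill's C. The `Coroutine` class, its fields and the `default` fallback are all hypothetical; the only thing taken from the paragraph above is the rule that a coroutine which specifies no GS period uses the one specified by its parent.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Coroutine:
    name: str
    parent: Optional["Coroutine"] = None
    grace_period: Optional[float] = None  # None means "not specified"


def effective_grace_period(node, default=0.0):
    """Walk up the call tree until some ancestor specifies a grace period."""
    current = node
    while current is not None:
        if current.grace_period is not None:
            return current.grace_period
        current = current.parent
    return default  # assumption: nobody specified one, so cancel immediately


root = Coroutine("main", grace_period=10.0)
accept_loop = Coroutine("accept_loop", parent=root)  # GS-unaware, specifies nothing
handler = Coroutine("ws_handler", parent=accept_loop, grace_period=2.0)

periods = [effective_grace_period(n) for n in (root, accept_loop, handler)]
```

Here the GS-unaware `accept_loop` ends up with the 10 seconds chosen by `main`, while the handler keeps the 2 seconds it asked for explicitly.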

Ah, that’s a pity. I thought you might have come up with something reasonable. Making the socket API actually usable (i.e. no short reads) is my second big goal with libdill (or, really, the first one). But it has nothing to do with structured concurrency, so let’s drop the discussion.

#10

OK wait now I’m confused! I would have said that an accept loop is the canonical example of a routine that needs GS. Like, the most famous example of graceful shutdown is what Apache/nginx/etc. do, right? And for them it means:

  • Shut down the accept loops (→ cancel any pending accept calls, and stop issuing new ones)
  • Stop accepting new requests on existing connections (→ for all connections that are waiting for a new request to arrive, cancel the pending recv call and close the connection)
  • Wait for active requests to complete, possibly with some timeout

You might also want to pass the GS signal on to the actual request handlers, in case any of them are implementing a “long poll” protocol or similar. (This is another example where automatic GS propagation could be useful – in many server frameworks, there are multiple layers of third-party libraries between the top-level application and the request handler.)

TBH I also don’t understand what you mean about websockets. We model websockets as objects that you can send and receive messages on, but they don’t own any tasks. And the RFC says that the response to a CLOSE frame is always to immediately send a CLOSE frame back, which doesn’t seem like it needs any coordination with tasks?

FWIW, I’ve thought about it pretty hard and I’m pretty sure that short-reads are the best-of-all-possible-primitives here, since they’re the only thing that lets you easily and efficiently implement all the alternatives on top. Agreed that this is an unrelated topic though :-)
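As a small illustration of building "the alternatives on top" of short reads, here is a sketch of an exact-length read layered over a short-read primitive. `FakeStream` is a hypothetical stand-in that only mimics the short-read behavior of a receive_some-style method, `receive_exactly` is an assumed helper name, and asyncio is used just to drive the coroutine.

```python
import asyncio


class FakeStream:
    """Hypothetical stream with short-read semantics, fed from canned chunks."""

    def __init__(self, chunks):
        self._chunks = list(chunks)

    async def receive_some(self, max_bytes):
        await asyncio.sleep(0)  # yield to the event loop, like a real receive would
        if not self._chunks:
            return b""  # end of stream
        chunk = self._chunks.pop(0)
        if len(chunk) > max_bytes:
            # Hand out only what was asked for and keep the rest for later.
            self._chunks.insert(0, chunk[max_bytes:])
            chunk = chunk[:max_bytes]
        return chunk


async def receive_exactly(stream, n):
    """Keep issuing short reads until exactly n bytes have arrived."""
    buf = bytearray()
    while len(buf) < n:
        chunk = await stream.receive_some(n - len(buf))
        if not chunk:
            raise EOFError("stream closed before %d bytes arrived" % n)
        buf += chunk
    return bytes(buf)


data = asyncio.run(receive_exactly(FakeStream([b"he", b"llo wor", b"ld"]), 8))
```

The loop is also where the atomicity question shows up: if it is cancelled partway through, the bytes already collected in `buf` are lost unless the caller arranges to keep them.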

#11

Yeah, sounds like we are talking past each other. Let’s write some code. Here’s an accept loop:

coroutine void accept_loop(listener) { 
    b = bundle();
    while(1) {
        s = accept(listener);
        if(s == ECANCELED) break;
        bundle_go(b, connection_handler(s));
    }
    bundle_cancel(b, 10);
}

As can be seen, the coroutine accepts no GS signal. It just implements hard cancellation (assuming it’s the “hard cancel with timeout” as per my proposal). Of course, the interesting question is where it got the number 10 from. I think we both agree that it would be nice if that number could be somehow inherited from the parent.

Sorry, I’ve been imprecise there. What I meant was “user-defined coroutine that handles a WS connection”. Here the user may want to get an explicit GS signal and initiate the terminal WS handshake (omitting hard cancellation parts for brevity):

coroutine void ws_handler(s, ch) {
    while(1) {
        message_t msg;
        recv(ch, &msg);
        if(msg == "STOP") break;
        send(s, &msg);
    }
    // GS begin.
    ws_do_terminal_handshake(s);
    // GS end.
}

Btw, we should really draw this on a whiteboard. Aren’t you traveling to Europe any time soon? I am unfortunately not going to get to the US in the foreseeable future.

#12

Wait a second. I think I see what you are alluding to now. The accept loop would have to explicitly pass GS signal to WS coroutines, because otherwise they would just happily run until they are hard-cancelled. Let me think about this some more.

#13

OK, I see, sure, for the accept loop, hard cancellation and GS are the same. But say I have a signal handler up at the top of my call tree that wants to GS my whole program. How does it find all the accept loops in order to send them a hard cancellation?

This is what I was trying to get at in a previous message… Imagine we could write something like:

with upgrade_graceful_cancel_to_hard_cancel:
    while True:
        conn = listener.accept()
        nursery.start_soon(handler, conn)

Now our signal handler can send a GS to the whole program, and each accept loop upgrades incoming GS into a hard cancel. The point is: now the signal handler doesn’t have to know about the accept loops, or vice-versa.

This is also an issue, but if we have a mechanism for the signal handler to talk to the accept loop, then I guess we can re-use it to let the signal handler talk to the individual WS coroutines too?

Unfortunately no :-(

#14

Ok, you are totally right. Back to the drawing board :-(

One additional problem: HC signal can’t be used as a substitute for GS signal in GS-unaware coroutines because HC signal causes all blocking functions in the target coroutine to return ECANCELED. Therefore, the targeted coroutine wouldn’t be able to do any meaningful work.

Still, I think the design goal of making white nodes in the graph posted above just work, without any additional GS-related code, is worth pursuing.

Let me think out loud.

bundle_cancel() could send a GS signal to the children immediately, then wait for the timeout period, then send HC signal.

The problem with that is that GS signal is application specific (see in-band vs. out-of-band discussion in my blog post). But bundle_cancel() is part of the language and knows nothing about the application logic. Thus, it can’t send the GS signal.

So let’s look at it from a different angle. What if there were a kind of poor-man’s-GS-signal that would be sent automatically by bundle_cancel() but, at the same time, there would be a way to overload it in the GS-aware coroutines?

That would mean that the parent of a GS-aware coroutine would have to GS it in a correct way (automatic GS signal vs. manual one). That sucks, but the good news is that the grandparent wouldn’t have to know. From this perspective it looks like GS signal could be a local contract between the parent and the child. It wouldn’t have any effect on the code beyond those two coroutines.

#15

Grrr. I’ve tried to prototype it and even that wouldn’t be sufficient.

The problem is that GS-unaware nodes wouldn’t propagate GS signal to their children.

If, on the other hand, the language wanted to do that automatically, it would have to be aware of all the parent-child relationships. And while that is a relatively reasonable expectation, libdill, at the moment, doesn’t have that info.

And even if it worked, there’s still the use case of GS-aware coroutine postponing GS of its children. This may not be the most common use case but I can imagine it could be needed. How would it accomplish that given that language runtime would propagate the signal behind the scenes?

#16

In Trio, we have the same issue with hard-cancel. (Which the language propagates automatically, behind the scenes, taking advantage of its knowledge of parent-child relationships.) We solve it with a construct called a “shield” – you can write:

with trio.open_cancel_scope(shield=True):
    # Code in here can't "see" cancellations coming from outside
    # until the shield is removed. Cancellations that originate
    # *inside* the shield still work, though.
    ...

I guess in libdill’s case, you could do something like set_current_coroutine_shield(SOFT) (vs NONE vs HARD)? Or make it an attribute on the bundle or something?

#17

Yes, I’ve seen that.

Still, the entire concept of behind the scenes distribution of the signal feels askew.

Let’s say we have the following call tree (same color scheme as before):

[Figure gs4: call tree with coroutines A, B and C, same color scheme as before]

Say A enters GS mode. Suddenly, C starts doing GS while B is unaware of it. To B it looks like C closed all by itself, without being asked to do so. Which is, kind of, a violation of the contract between B and C.

All in all, it feels like reintroducing goto under a different name.

#18

Thinking about it some more, the claim about composability from my blog post still stands, although in a somewhat limited manner.

Namely, if there is an uninterrupted chain of GS-aware components between the initiator of the shutdown (A) and the target coroutine (C), the target would gracefully shut down:

[Figure gs5: call tree with the shutdown initiator A, the target C, and coroutines D, E and F]

Most importantly, the GS process wouldn’t disrupt cancellation of non-GS-aware coroutines (D, F) in any way.

However, a GS-capable coroutine (E) owned by a parent which doesn’t care about GS (D) is just going to be HC-ed.

But maybe that’s the reasonable thing to do?
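The rule can be written down as a small tree walk. The sketch below is plain Python and makes several assumptions: the `Node` class and `shutdown_modes` function are hypothetical, and the exact shape of the tree (A owning B, D and F; B owning C; D owning E) is a guess that is merely consistent with the description above.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    gs_aware: bool
    children: list = field(default_factory=list)


def shutdown_modes(node, chain_intact=True, out=None):
    """Map each coroutine name to "graceful" or "hard".

    A coroutine shuts down gracefully only if it is GS-aware and every
    ancestor between it and the initiator is GS-aware too.
    """
    if out is None:
        out = {}
    reachable = chain_intact and node.gs_aware
    out[node.name] = "graceful" if reachable else "hard"
    for child in node.children:
        shutdown_modes(child, reachable, out)
    return out


# Assumed tree: A -> B -> C is an unbroken GS-aware chain; D and F are GS-unaware;
# E is GS-capable but owned by D.
tree = Node("A", True, [
    Node("B", True, [Node("C", True)]),
    Node("D", False, [Node("E", True)]),
    Node("F", False),
])

modes = shutdown_modes(tree)
```

C is reached through the unbroken chain, while E is hard-cancelled even though it could have shut down gracefully.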

#19

Sorry to jump in, but I had some thoughts.

If B is not GS aware, what is B’s expected behavior when C exits (without an exception)?

  • B doesn’t know C exited. This is fine since HC will clean up B anyway.
  • B exits as soon as all of its children exit. B eventually exits which is what we wanted.
  • B detects that C exited and decides to spawn C again. C gets no work because hopefully we’ve stopped accepting new work in B’s ancestors. Both B and C get HC and exit.
  • B detects C exited and treats it like an error state, raising an exception or exiting with an error. This doesn’t seem like a resilient program unless that error is handled before it brings down the entire process. Hopefully the program author sees this crash in practice and adds appropriate handling in an ancestor during GS? I guess this is more of an escalation from GS to HC while the stack unwinds up to some parent, potentially hurting GS across the tree.

The last case is problematic for GS. It sucks that a single leaf in the tree could cause HC behavior across many other nodes, but that’s the point of structured concurrency - at least they exit and have an opportunity to clean up some resources.

Honestly the last case feels like a bug in B since the absence of an error from C on exit probably shouldn’t propagate out as an error - the author of B doesn’t really understand the contract with C.

If GS is an application construct, it has to be passable to every layer of your program. Every library needs to implement support for it and they all need to follow some standard conventions (or you need to add extra layers to convert between them). If D does not follow that convention, E will never get the signal and will get HC.

I like this suggestion:

I would expect every checkpoint within the graceful cancel scope to act as though the scope was cancelled. GracefulCancelScopes created after the parent cancel scope has entered graceful cancel will immediately raise a cancellation on the first checkpoint.

It would be nice if every cancel scope could have a graceful_cancel() method that would propagate like regular cancel() but only be raised within GracefulCancelScopes. That way you can control what parts of your program are gracefully shutting down.

Or maybe every cancel scope has a graceful property that can be set to True. If you are in a scope where graceful = True (which is also inherited from its parent scope), any checkpoint you hit will raise a cancel.

async def producer(queue):
    # `conn` is assumed to be an open connection defined elsewhere
    try:
        while True:
            with trio.GracefulCancelScope() as cancel_scope:
                # or set cancel_scope.graceful = True
                message = await conn.recv()
            if cancel_scope.cancelled_caught:
                break
            await queue.put(message)  # don't want to cancel this on GS
    finally:
        # this will succeed on GS but will raise on HC
        await queue.put(STOP_SENTINEL)
        await conn.send_close()

async def consumer(queue):
    while True:
        item = await queue.get()
        if item is STOP_SENTINEL:
            await queue.put(STOP_SENTINEL)  # make sure the other consumers get the message
            break
        # do stuff with item

async def server(queue):
    # A case where you want to make sure the consumers drain the queue on GS
    async with trio.open_nursery() as nursery:
        nursery.start_soon(producer, queue)
        for i in range(5):
            nursery.start_soon(consumer, queue)

You should also be able to set graceful = False to disable this behavior. It is a little like shield except that you can opt in again later.

async def first():
    with trio.GracefulCancelScope():
        # any checkpoint in second will cause a cancellation on graceful_cancel()
        await second()

async def second():
    with trio.CancelScope() as cancel_scope:
        # if this does not shield from the graceful scope above
        # then any checkpoint in third() will cause a cancellation on graceful_cancel()
        # use cancel_scope.graceful = False to disable
        await third()

async def third():
    # since this is graceful cancel aware,
    # it should be explicit about blocking graceful cancel
    with trio.CancelScope(graceful=False):
        with trio.GracefulCancelScope():
            await trio.sleep(10)  # simulate a good graceful stopping point
        await trio.sleep(100)  # simulate work you don't want to gracefully stop

In the interest of keeping things simple I think new cancel scopes should inherit the graceful bit. To intentionally handle graceful cancel within your code, you should block at the entrypoint when entering code that supports graceful cancel. From above, server becomes:

async def server(queue):
    # A case where you want to make sure the consumers drain the queue on GS
    async with trio.open_nursery() as nursery:
        nursery.cancel_scope.graceful = False  # protect children since this is GS aware
        nursery.start_soon(producer, queue)
        for i in range(5):
            nursery.start_soon(consumer, queue)

# caller of server
async def worker():
    with trio.GracefulCancelScope():
        # if the nursery cancel scope within server does not disable the graceful bit
        # the consumers would just die instead of gracefully exiting
        await server(queue)

async def main():
    async with trio.open_nursery() as nursery:
        nursery.start_soon(worker)
        await trio.sleep(10)
        nursery.graceful_cancel(10)

This keeps things simple for the runtime and libraries. If your code has no concept of graceful cancellation, the cancellation scope above you will determine if you are cancelled on graceful cancellation or not.

# if conn.recv() makes a new cancel scope, it must inherit the graceful bit
# otherwise any caller can't opt in to graceful cancellation when calling this coroutine
async def recv(self):
    with trio.move_on_after(TIMEOUT):
        # read from somewhere
        ...

In this way, the graceful bit is different from shield in that a graceful cancel scope should be cancelled if any scope above it calls graceful_cancel, even if there is a graceful = False scope in between.
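To pin down the semantics I have in mind, here is a toy model in plain Python with no Trio involved. The `Scope` class and its method names are hypothetical, and the assumption that a scope without any graceful ancestor defaults to non-graceful is mine.

```python
class Scope:
    """Toy model of a cancel scope carrying an inheritable graceful bit."""

    def __init__(self, parent=None, graceful=None):
        self.parent = parent
        self.graceful = graceful  # True, False, or None (inherit from parent)
        self.graceful_cancel_called = False

    def graceful_cancel(self):
        self.graceful_cancel_called = True

    def effective_graceful(self):
        scope = self
        while scope is not None:
            if scope.graceful is not None:
                return scope.graceful
            scope = scope.parent
        return False  # assumption: the outermost default is "not graceful"

    def graceful_cancel_requested(self):
        # Unlike shield, a graceful=False scope in between does not block this walk.
        scope = self
        while scope is not None:
            if scope.graceful_cancel_called:
                return True
            scope = scope.parent
        return False

    def checkpoint_raises(self):
        return self.effective_graceful() and self.graceful_cancel_requested()


main = Scope()                           # nursery scope in main()
first = Scope(main, graceful=True)       # first()
second = Scope(first)                    # second(): inherits the bit
blocked = Scope(second, graceful=False)  # third(): opts out ...
inner = Scope(blocked, graceful=True)    # ... then opts back in

scopes = [main, first, second, blocked, inner]
before = [s.checkpoint_raises() for s in scopes]
main.graceful_cancel()
after = [s.checkpoint_raises() for s in scopes]
```

After the graceful cancel, checkpoints in `first`, `second` and `inner` would raise, while `main` and `blocked` keep running until a hard cancel arrives.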

#20

I am not sure I follow here. If B treated C exiting without an error as an error then there would be no way for a scope to finish without throwing an exception, no?

Yes, that’s what I’ve been saying: If B doesn’t care about shutting down its children gracefully they will just be HC’ed. What you’ve asked for is what you get.

However, I believe people will be split about this and specifically, that they would be split along language lines. People with C and Golang background will favour the proposition above. People coming from higher level languages would want the GS signal to propagate down the tree automatically.

In fact, we may be replaying the entire exceptions vs. error codes discussion here under a different name.

One can think of GS propagation as a kind of exception that travels down the stack instead of up the stack.

And Python/Java/C++ people would use the classic pro-exception arguments: Programmers are lazy and careless and unless the exception is propagated automatically, they will never propagate it by hand.

And C/Golang people would resort to the well-known anti-exception arguments: Exceptions are creating a second, implicit workflow in the program. A workflow that is rarely used (and often not even properly thought about) and the unforeseen interactions between the two workflows will result in subtle, hard to catch bugs.

And maybe that’s not a problem. Each language would implement the GS signal propagation in a way that best aligns with its philosophy.

All of that being said, the balance of concerns is tilted a little bit differently for error propagation and for GS signal propagation: If an error is not handled, that’s a bad™ thing and can corrupt the program in really bad ways. If GS is not handled, the worst that could happen is that the coroutine will be HC’ed. Which is also bad, but maybe a little less bad. Draw your own conclusions.

I won’t comment on the proposal because I am not very familiar with how Trio works, but you’ll still have to deal with the problems outlined in my blog post and in this thread: GS signal should be propagated in band. But the language doesn’t know what “in-band” means. Delivery of GS signal should not be disruptive (e.g. throwing an exception) – if it leaves an object in inconsistent state, the application code handling the GS would probably fail. Etc.

#21

Yes, exactly. The scope would, in effect, get HC because an exception ends the scope.

I think there is a compromise between in-band and out-of-band. In my examples with trio, one would opt-in to getting an early exception at points in the code where it is safe to stop work (accept loops, waiting on a recv).

In C, you had the suggestion of recv_from_socket_or_channel where you may want to interrupt a suspend on a socket when you get an in-band message from a channel. You explicitly opted-in to getting notified early, and if you get a message from the channel you know that your socket is still in a good state.

Whether that is signaled with an exception or the return of an explicit state is more of a language flavor question than whether the runtime should offer this kind of functionality - where you can opt-in to the early return of a suspended function so you can perform graceful cleanup.

I think the runtime should offer those early return mechanisms rather than libraries plumbing the signal throughout their code.

In your example from the blog:

coroutine void nested_worker(message_t msg) {
    // process the message here
}

coroutine void worker(socket_t s, channel_t ch) {
    bundle_t b = bundle();
    while(1) {
        message_t msg;
        int rc = recv_from_socket_or_channel(s, ch, &msg);
        if(rc == ECANCELED) goto hard_cancellation;
        if(rc == FROM_CHANNEL) goto graceful_shutdown;
        if(rc == FROM_SOCKET) {
            bundle_go(b, nested_worker(msg));
        }
        rc = send(s, "Hello, world!");
        if(rc == ECANCELED) goto hard_cancellation;
    }
graceful_shutdown:
    rc = bundle_cancel(b, 20); // cancel the nested workers with 20 second grace period
    if(rc == ECANCELED) return;
    return;
hard_cancellation:
    rc = bundle_cancel(b, 0); // cancel the nested worker immediately
    if(rc == ECANCELED) return;
    return;
}

int main(void) {
    socket_t s = create_connected_socket();
    channel_t ch = channel();
    bundle_t b = bundle();
    bundle_go(b, worker(s, ch));
    sleep(60);
    send(ch, "STOP"); // ask for graceful shutdown
    bundle_cancel(b, 10); // give it at most 10 seconds to finish
    return 0;
}

What I think should happen in worker instead is that you always do graceful shutdown if you receive ECANCELED from recv_from_socket_or_channel. If this is a hard cancel the nested_worker will receive ECANCELED at any checkpoint in that code anyway so bundle_cancel(b, 0) is unnecessary.

The question is how do you opt recv_from_socket_or_channel in to early cancellation and make sure send is not opted in, because you don’t want to stop sending during early cancellation.

coroutine void nested_worker(message_t msg) {
    // process the message here
}

coroutine void worker(socket_t s, channel_t ch) {
    bundle_t b = bundle();
    while(1) {
        message_t msg;
        int rc = recv_from_socket_early_cancel(s, &msg);
        if(rc == ECANCELED) goto shutdown;
        if(rc == FROM_SOCKET) {
            bundle_go(b, nested_worker(msg));
        }
        // if you wanted to stop here too use send_early_cancel(s, "Hello, world!") instead
        rc = send(s, "Hello, world!");
        if(rc == ECANCELED) goto shutdown;
    }
shutdown:
    // cancel the nested workers with 20 second grace period
    // if this is a hard cancel, workers will hard cancel themselves anyway
    rc = bundle_cancel(b, 20);
    if(rc == ECANCELED) return;
    return;
}

int main(void) {
    socket_t s = create_connected_socket();
    channel_t ch = channel();
    bundle_t b = bundle();
    bundle_go(b, worker(s, ch));
    sleep(60);
    // this will make all early cancel checkpoints return ECANCELED
    // after 10 seconds all checkpoints will return ECANCELED
    bundle_cancel(b, 10); // give it at most 10 seconds to finish
    return 0;
}

I’m not sure if you really want to make a separate _early_cancel function for every coroutine. Maybe a way to signal the runtime or bundle that you’re going into an early cancel state? I’m not familiar with libdill so you’ll have to forgive me.

coroutine void worker(socket_t s, channel_t ch) {
    bundle_t b = bundle();
    while(1) {
        message_t msg;
        ENTER_EARLY_CANCEL();
        int rc = recv(s, &msg);
        EXIT_EARLY_CANCEL();
        if(rc == ECANCELED) goto shutdown;
        if(rc == FROM_SOCKET) {
            bundle_go(b, nested_worker(msg));
        }
        rc = send(s, "Hello, world!");
        if(rc == ECANCELED) goto shutdown;
    }
shutdown:
    // cancel the nested workers with 20 second grace period
    // if this is a hard cancel, workers will hard cancel themselves anyway
    rc = bundle_cancel(b, 20);
    if(rc == ECANCELED) return;
    return;
}

Now the real challenge is can you be in a consistent state if a parent calls ENTER_EARLY_CANCEL and some child or grandchild coroutine now gets ECANCELED where they normally expect only hard cancel? If they aren’t graceful aware, they’re going to clean up whatever they can and return back to you. Maybe it isn’t safe to assume that things are consistent anymore and you should just exit (it was in the middle of protocol negotiations on the socket), but then again I’d argue you shouldn’t call ENTER_EARLY_CANCEL unless you know the code you’re calling handles it properly.

I would probably suggest that you should always try to be in a consistent state even in the hard cancel case. Don’t immediately goto shutdown, try to finish the handshake. In the HC case every checkpoint will immediately return ECANCELED so the only thing you lose by continuing is potentially expensive CPU operations (e.g. calculating a shared secret). When you should not be cancelled early, you set up guards ENTER_NO_EARLY_CANCEL/EXIT_NO_EARLY_CANCEL around your critical sections.
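A toy model of those guards, in Python rather than C so it can be run directly. Everything here is hypothetical: the `Runtime` class, its `early_cancel` context manager (standing in for the ENTER/EXIT macro pairs) and the string return values of `checkpoint` are only meant to show which checkpoints would see ECANCELED.

```python
from contextlib import contextmanager


class Runtime:
    def __init__(self):
        self.hard_cancelled = False
        self.graceful_requested = False
        self._early_cancel = [False]  # stack of guards; the innermost one wins

    @contextmanager
    def early_cancel(self, enabled=True):
        self._early_cancel.append(enabled)
        try:
            yield
        finally:
            self._early_cancel.pop()

    def checkpoint(self):
        if self.hard_cancelled:
            return "ECANCELED"  # hard cancel hits every checkpoint
        if self.graceful_requested and self._early_cancel[-1]:
            return "ECANCELED"  # graceful cancel hits only opted-in checkpoints
        return "OK"


rt = Runtime()
rt.graceful_requested = True
log = []
with rt.early_cancel():                   # like ENTER_EARLY_CANCEL
    log.append(rt.checkpoint())           # a recv we are happy to abandon
    with rt.early_cancel(enabled=False):  # like ENTER_NO_EARLY_CANCEL
        log.append(rt.checkpoint())       # critical section, e.g. a handshake
log.append(rt.checkpoint())               # outside any guard, e.g. the send
rt.hard_cancelled = True
log.append(rt.checkpoint())               # hard cancel overrides all guards
```

Only the first checkpoint is interrupted by the graceful request; the guarded handshake and the unguarded send continue, and everything stops once the hard cancel arrives.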

#22

I implemented an example of this graceful cancellation behavior in trio: https://github.com/python-trio/trio/pull/941.

Here is a working example of the behavior modified from the examples earlier in the thread.

import trio
import logging
from functools import wraps

logging.basicConfig(format='%(asctime)s %(message)s', level=logging.DEBUG)


def log_entry_exit(func):
    @wraps(func)
    async def wrapper(*args, **kwargs):
        logging.info('entering %s', func.__name__)
        try:
            return await func(*args, **kwargs)
        finally:
            logging.info('exiting %s', func.__name__)
    return wrapper


@log_entry_exit
async def first():
    with trio.CancelScope(graceful=True):
        # any checkpoint in second will cause a cancellation on graceful_cancel()
        await second()


@log_entry_exit
async def second():
    """
    There are 4 variations worth exploring here:

    Cancel scope where shield is True and graceful is False. In this
    example third() will be protected from a graceful cancel in main(),
    even though there is a graceful cancel scope in third().

    We remove the shield after returning from third() so that
    the cancel from main() will kill us instead of waiting for
    100 seconds.

        with trio.CancelScope(shield=True, graceful=False) as cancel_scope:
            await third()
            logging.info('turning off shield')
            cancel_scope.shield = False
            logging.info('second sleep for 100')
            await trio.sleep(100)  # will exit after about 3 seconds
            logging.info('done second sleep for 100')


    Cancel scope where shield is True and graceful is True. This
    example is similar to the above, except that `sleep(100)`
    will be immediately cancelled because the shield has been
    removed and this is a graceful cancel scope.

    If graceful is None, the behavior will be the same because it is
    inherited from the parent scope in first().

        with trio.CancelScope(shield=True, graceful=True) as cancel_scope:
            await third()
            logging.info('turning off shield')
            cancel_scope.shield = False
            logging.info('second sleep for 100')
            await trio.sleep(100)  # will exit immediately
            logging.info('done second sleep for 100')


    Cancel scope where graceful is False. In this example the graceful
    scope in third() will end after 3 seconds, followed by a 10 second
    wait in third() and then second() will sleep for 10 seconds before
    being cancelled by main().

        with trio.CancelScope(graceful=False):
            await third()
            logging.info('second sleep for 100')
            await trio.sleep(100)  # will exit after 10 seconds
            logging.info('done second sleep for 100')


    Cancel scope where graceful is True or None. This example is similar
    to the previous example, except that sleep(100) will exit immediately.

        with trio.CancelScope(graceful=True):
            await third()
            logging.info('second sleep for 100')
            await trio.sleep(100)  # will exit immediately
            logging.info('done second sleep for 100')

    """
    with trio.CancelScope(graceful=False):
        await third()
        logging.info('second sleep for 100')
        await trio.sleep(100)  # will exit after 10 seconds
        logging.info('done second sleep for 100')


@log_entry_exit
async def third():
    # since this is graceful cancel aware,
    # it should be explicit about blocking graceful cancels from outside
    with trio.CancelScope(graceful=False):
        with trio.CancelScope(graceful=True):
            logging.info('third in graceful scope sleep for 10')
            await trio.sleep(10)  # simulate a good graceful stopping point
        logging.info('third exited the graceful cancel scope')
        logging.info('third ungraceful sleep for 10')
        await trio.sleep(10)  # simulate work you don't want to gracefully stop
        logging.info('third finished ungraceful sleep')


@log_entry_exit
async def accept_loop():
    while True:
        with trio.CancelScope(graceful=True) as cancel_scope:
            # simulate accept
            logging.info('accept_loop sleep 2 - simulate accept')
            await trio.sleep(2)

        if cancel_scope.cancelled_caught:
            logging.info('accept_loop accept cancelled, break loop')

            # do cleanup behavior for accept cancellation
            # this will immediately exit for hard cancel
            # e.g. send close to a network peer
            logging.info('accept_loop sleep 1 - simulate graceful close')
            await trio.sleep(1)

            break

        # simulate handling request
        logging.info('accept_loop sleep 5 - simulate handling request')
        await trio.sleep(5)

    # general cleanup could go here


@log_entry_exit
async def main():
    async with trio.open_nursery() as nursery:
        logging.info('nursery start_soon accept_loop')
        nursery.start_soon(accept_loop)
        logging.info('nursery start_soon first')
        nursery.start_soon(first)
        logging.info('nursery sleeping for 3')
        await trio.sleep(3)
        logging.info('nursery calling graceful_cancel')
        nursery.cancel_scope.graceful_cancel(20)


if __name__ == '__main__':
    trio.run(main)

Example output

2019-02-19 17:14:47,435 entering main
2019-02-19 17:14:47,436 nursery start_soon accept_loop
2019-02-19 17:14:47,436 nursery start_soon first
2019-02-19 17:14:47,436 nursery sleeping for 3
2019-02-19 17:14:47,436 entering accept_loop
2019-02-19 17:14:47,436 accept_loop sleep 2 - simulate accept
2019-02-19 17:14:47,436 entering first
2019-02-19 17:14:47,437 entering second
2019-02-19 17:14:47,437 entering third
2019-02-19 17:14:47,437 third in graceful scope sleep for 10
2019-02-19 17:14:49,441 accept_loop sleep 5 - simulate handling request
2019-02-19 17:14:50,442 nursery calling graceful_cancel
2019-02-19 17:14:50,442 third exited the graceful cancel scope
2019-02-19 17:14:50,443 third ungraceful sleep for 10
2019-02-19 17:14:55,445 accept_loop sleep 2 - simulate accept
2019-02-19 17:14:55,446 accept_loop accept cancelled, break loop
2019-02-19 17:14:55,446 accept_loop sleep 1 - simulate graceful close
2019-02-19 17:14:56,447 exiting accept_loop
2019-02-19 17:15:00,445 third finished ungraceful sleep
2019-02-19 17:15:00,445 exiting third
2019-02-19 17:15:00,445 second sleep for 100
2019-02-19 17:15:10,444 exiting second
2019-02-19 17:15:10,444 exiting first
2019-02-19 17:15:10,444 exiting main
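One way to read the output above is that each checkpoint looks at its enclosing scopes, innermost first, and the first scope with an explicit graceful value decides whether a graceful cancel is delivered there. The sketch below is a toy model of that rule only, inferred from the docstring and the log; it is not the code from the pull request, it ignores shield entirely, and both the function name effective_graceful and the default of False for code outside any graceful scope are assumptions.

```python
def effective_graceful(scope_stack, default=False):
    """Return whether a checkpoint under these scopes reacts to a graceful cancel.

    scope_stack lists the `graceful` value of each enclosing scope,
    outermost first; None means "inherit from the parent scope".
    """
    for graceful in reversed(scope_stack):   # look at the innermost scope first
        if graceful is not None:
            return graceful
    return default                           # no scope expressed a preference


# Scope stacks taken from the example above (outermost first).
second_sleep = [True, False]                 # sleep(100) in second()
third_graceful = [True, False, False, True]  # first sleep(10) in third()
third_ungraceful = [True, False, False]      # second sleep(10) in third()
inherited = [True, None]                     # graceful=None nested inside first()

print(effective_graceful(second_sleep))      # False
print(effective_graceful(third_graceful))    # True
print(effective_graceful(third_ungraceful))  # False
print(effective_graceful(inherited))         # True
```

Under this reading, the first sleep in third() ends as soon as graceful_cancel is called, while the second sleep in third() and the sleep(100) in second() run until they finish or the 20 second hard deadline arrives, which matches the timestamps in the log.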

Edit: Fixed a bug where setting the graceful attribute to False in a finally block caused the CancelScope to fail to catch the Cancelled exception in __exit__.

Example:

@log_entry_exit
async def accept_loop():
    try:
        with trio.CancelScope() as cancel_scope:
            try:
                while True:
                    logging.info('accept_loop setting graceful to True')
                    cancel_scope.graceful = True

                    # simulate accept
                    logging.info('accept_loop sleep 2 - simulate accept')
                    await trio.sleep(2)

                    logging.info('accept_loop setting graceful to False')
                    cancel_scope.graceful = False

                    # simulate handling request
                    logging.info('accept_loop sleep 5 - simulate handling request')
                    await trio.sleep(5)
            finally:
                # do cleanup behavior for any kind of cancellation.
                # Be careful: modifying cancel_scope.graceful here (as the next
                # line does) lets the Cancelled exception escape this scope
                cancel_scope.graceful = False
                # e.g. send close to a network peer
                logging.info('accept_loop sleep 1 - simulate close')
                await trio.sleep(1)
    except BaseException as e:
        logging.error(cancel_scope.cancelled_caught, exc_info=e)

I have changed the implementation to not allow modifications of the graceful property. This limits some of the possible usage patterns but is necessary for correctness.
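As an illustration of what "not allowing modifications" can look like in plain Python, here is a small sketch of a flag that is fixed at construction and exposed through a read-only property. ScopeOptions and try_to_flip are hypothetical names for this toy; it does not show how the pull request actually enforces the restriction.

```python
class ScopeOptions:
    """Toy stand-in for a scope whose `graceful` flag is fixed at creation."""

    def __init__(self, graceful=None):
        self._graceful = graceful

    @property
    def graceful(self):
        # Read-only: there is deliberately no setter.
        return self._graceful


def try_to_flip(scope):
    """Attempt the mutation from the example above; report whether it was allowed."""
    try:
        scope.graceful = False
    except AttributeError:
        return False
    return True
```

With the flag frozen like this, code that wants different behavior for different phases has to open a new nested scope per phase, as the first accept_loop in this post does, rather than toggling one long-lived scope.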