Graceful Shutdown

Yes, exactly. The scope would, in effect, get a hard cancel (HC), because an exception ends the scope.

I think there is a compromise between in-band and out-of-band. In my examples with trio, one would opt in to getting an early exception at points in the code where it is safe to stop work (accept loops, waiting on a recv).

In C, you had the suggestion of recv_from_socket_or_channel, where you may want to interrupt a suspend on a socket when you get an in-band message from a channel. You explicitly opted in to getting notified early, and if you get a message from the channel you know that your socket is still in a good state.

Whether that is signaled with an exception or by returning an explicit state is more a question of language flavor. The more important question is whether the runtime should offer this kind of functionality: a way to opt in to the early return of a suspended function so you can perform graceful cleanup.

I think the runtime should offer those early return mechanisms rather than libraries plumbing the signal throughout their code.
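To make the "explicit state" flavor concrete, here is a minimal sketch in Python. It uses the standard library's asyncio rather than trio or libdill, purely so it runs without extra dependencies. recv_or_shutdown, Source, and demo are names I made up for illustration; they are not part of any of the libraries discussed here.

```python
import asyncio
import enum


class Source(enum.Enum):
    FROM_QUEUE = "queue"        # a message arrived
    FROM_SHUTDOWN = "shutdown"  # graceful stop was requested first


async def recv_or_shutdown(queue, shutdown):
    """Wait for a message or a shutdown request, whichever comes first."""
    get_task = asyncio.ensure_future(queue.get())
    stop_task = asyncio.ensure_future(shutdown.wait())
    try:
        await asyncio.wait({get_task, stop_task},
                           return_when=asyncio.FIRST_COMPLETED)
        # Prefer a message that already arrived over the stop request,
        # so nothing that was received gets dropped.
        if get_task.done():
            return Source.FROM_QUEUE, get_task.result()
        return Source.FROM_SHUTDOWN, None
    finally:
        # Cancel whichever wait is still pending (no-op for finished tasks).
        get_task.cancel()
        stop_task.cancel()


async def demo():
    queue, shutdown = asyncio.Queue(), asyncio.Event()
    await queue.put("hello")
    first = await recv_or_shutdown(queue, shutdown)   # message wins
    shutdown.set()
    second = await recv_or_shutdown(queue, shutdown)  # only the stop request is left
    return first, second


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The caller only ever sees the stop request at this one call, which is the opt-in property I care about: everything else in the worker keeps running until it reaches a point like this.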

In your example from the blog:

coroutine void nested_worker(message_t msg) {
    // process the message here
}

coroutine void worker(socket_t s, channel_t ch) {
    bundle_t b = bundle();
    while(1) {
        message_t msg;
        int rc = recv_from_socket_or_channel(s, ch, &msg);
        if(rc == ECANCELED) goto hard_cancellation;
        if(rc == FROM_CHANNEL) goto graceful_shutdown;
        if(rc == FROM_SOCKET) {
            bundle_go(b, nested_worker(msg));
        }
        rc = send(s, "Hello, world!");
        if(rc == ECANCELED) goto hard_cancellation;
    }
graceful_shutdown:
    rc = bundle_cancel(b, 20); // cancel the nested workers with 20 second grace period
    if(rc == ECANCELED) return;
    return;
hard_cancellation:
    rc = bundle_cancel(b, 0); // cancel the nested worker immediately
    if(rc == ECANCELED) return;
    return;
}

int main(void) {
    socket_t s = create_connected_socket();
    channel_t ch = channel();
    bundle_t b = bundle();
    bundle_go(b, worker(s, ch));
    sleep(60);
    send(ch, "STOP"); // ask for graceful shutdown
    bundle_cancel(b, 10); // give it at most 10 seconds to finish
    return 0;
}

What I think should happen in worker instead is that you always do graceful shutdown if you receive ECANCELED from recv_from_socket_or_channel. If this is a hard cancel, nested_worker will receive ECANCELED at every checkpoint in that code anyway, so bundle_cancel(b, 0) is unnecessary.
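For readers who don't know libdill, this is roughly what I assume "cancel with a grace period" means, sketched with Python's asyncio. cancel_with_grace is a hypothetical helper, not libdill's bundle_cancel. One difference worth labeling: asyncio does not forward a cancellation of the caller to its child tasks, so the sketch does that by hand, whereas in a structured runtime the children would hit ECANCELED at their own checkpoints.

```python
import asyncio


async def cancel_with_grace(tasks, grace):
    """Give tasks up to `grace` seconds to finish, then cancel the rest.

    Returns the number of tasks that had to be cancelled.
    """
    tasks = list(tasks)
    if not tasks:
        return 0
    try:
        _done, pending = await asyncio.wait(tasks, timeout=grace)
    except asyncio.CancelledError:
        # The caller itself was hard cancelled: stop the children right away.
        for task in tasks:
            task.cancel()
        raise
    for task in pending:
        task.cancel()
    # Wait for the cancelled tasks to unwind before returning.
    await asyncio.gather(*pending, return_exceptions=True)
    return len(pending)


async def demo():
    quick = asyncio.ensure_future(asyncio.sleep(0.01))  # finishes within the grace period
    slow = asyncio.ensure_future(asyncio.sleep(10))     # does not
    forced = await cancel_with_grace([quick, slow], grace=0.1)
    return forced, quick.cancelled(), slow.cancelled()


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

With a single helper like this there is only one shutdown path in the worker, which is the simplification I am arguing for.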

The question is how do you opt recv_from_socket_or_channel in to early cancellation and make sure send is not opted in, because you don’t want to stop sending during early cancellation.

coroutine void nested_worker(message_t msg) {
    // process the message here
}

coroutine void worker(socket_t s, channel_t ch) {
    bundle_t b = bundle();
    int rc; // declared outside the loop so it is still in scope after the shutdown label
    while(1) {
        message_t msg;
        rc = recv_from_socket_early_cancel(s, &msg);
        if(rc == ECANCELED) goto shutdown;
        if(rc == FROM_SOCKET) {
            bundle_go(b, nested_worker(msg));
        }
        // if you wanted to stop here too use send_early_cancel(s, "Hello, world!") instead
        rc = send(s, "Hello, world!");
        if(rc == ECANCELED) goto shutdown;
    }
shutdown:
    // cancel the nested workers with 20 second grace period
    // if this is a hard cancel, workers will hard cancel themselves anyway
    rc = bundle_cancel(b, 20);
    if(rc == ECANCELED) return;
    return;
}

int main(void) {
    socket_t s = create_connected_socket();
    channel_t ch = channel();
    bundle_t b = bundle();
    bundle_go(b, worker(s, ch));
    sleep(60);
    // this will make all early cancel checkpoints return ECANCELED
    // after 10 seconds all checkpoints will return ECANCELED
    bundle_cancel(b, 10); // give it at most 10 seconds to finish
    return 0;
}
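The two comments in main describe a two-phase rule: early-cancel checkpoints fail as soon as cancellation is requested, and every checkpoint fails once the grace period is over. Here is a small Python model of that rule. TwoPhaseCancel and its methods are hypothetical names, and time is passed in explicitly so the behavior is easy to check; it is a sketch of the semantics I have in mind, not of how libdill would implement it.

```python
import math


class TwoPhaseCancel:
    """Hypothetical model: graceful request now, hard cancel after a grace period."""

    def __init__(self):
        self.requested_at = math.inf   # when cancellation was requested
        self.hard_deadline = math.inf  # when every checkpoint starts failing

    def cancel(self, now, grace):
        # Repeated requests can only move the deadlines earlier.
        self.requested_at = min(self.requested_at, now)
        self.hard_deadline = min(self.hard_deadline, now + grace)

    def checkpoint_cancelled(self, now, early=False):
        """Would a checkpoint reached at time `now` report ECANCELED?"""
        if now >= self.hard_deadline:
            return True  # hard cancel hits every checkpoint
        return early and now >= self.requested_at


state = TwoPhaseCancel()
state.cancel(now=60, grace=10)  # like bundle_cancel(b, 10) after sleep(60)

if __name__ == "__main__":
    print(state.checkpoint_cancelled(61, early=True))   # opted-in recv
    print(state.checkpoint_cancelled(61, early=False))  # plain send
    print(state.checkpoint_cancelled(70, early=False))  # grace period over
```

In the worker above, recv_from_socket_early_cancel corresponds to early=True and the plain send to early=False.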

I’m not sure if you really want to make a separate _early_cancel function for every coroutine. Maybe a way to signal the runtime or bundle that you’re going into an early cancel state? I’m not familiar with libdill so you’ll have to forgive me.

coroutine void worker(socket_t s, channel_t ch) {
    bundle_t b = bundle();
    int rc; // declared outside the loop so it is still in scope after the shutdown label
    while(1) {
        message_t msg;
        ENTER_EARLY_CANCEL();
        rc = recv(s, &msg);
        EXIT_EARLY_CANCEL();
        if(rc == ECANCELED) goto shutdown;
        if(rc == FROM_SOCKET) {
            bundle_go(b, nested_worker(msg));
        }
        rc = send(s, "Hello, world!");
        if(rc == ECANCELED) goto shutdown;
    }
shutdown:
    // cancel the nested workers with 20 second grace period
    // if this is a hard cancel, workers will hard cancel themselves anyway
    rc = bundle_cancel(b, 20);
    if(rc == ECANCELED) return;
    return;
}

Now the real challenge: can you stay in a consistent state if a parent calls ENTER_EARLY_CANCEL and some child or grandchild coroutine gets ECANCELED where it would normally expect only a hard cancel? If they aren’t graceful-aware, they’re going to clean up whatever they can and return to you. Maybe it isn’t safe to assume that things are still consistent (the child may have been in the middle of protocol negotiations on the socket) and you should just exit. Then again, I’d argue you shouldn’t call ENTER_EARLY_CANCEL unless you know the code you’re calling handles it properly.

I would probably suggest that you should always try to be in a consistent state even in the hard cancel case. Don’t immediately goto shutdown; try to finish the handshake. In the HC case every checkpoint will immediately return ECANCELED, so the only thing you lose by continuing is potentially expensive CPU operations (e.g. calculating a shared secret). Where you should not be cancelled early, you set up ENTER_NO_EARLY_CANCEL/EXIT_NO_EARLY_CANCEL guards around your critical sections.
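In Python the ENTER/EXIT pairs would naturally become context managers. A rough sketch of how the nesting could behave, with the innermost region winning, might look like this. All names here (early_cancel, no_early_cancel, early_cancel_allowed) are made up for illustration, and the flag only records whether early cancellation is permitted; delivering the cancellation would still be the runtime's job.

```python
import contextvars
from contextlib import contextmanager

# Hypothetical per-task flag: may the current code be cancelled early?
_early_cancel_ok = contextvars.ContextVar("early_cancel_ok", default=False)


@contextmanager
def early_cancel(allowed=True):
    """Mark a region as a safe (or unsafe) place to stop early."""
    token = _early_cancel_ok.set(allowed)
    try:
        yield
    finally:
        _early_cancel_ok.reset(token)  # restore the enclosing setting


def no_early_cancel():
    """Guard for a critical section, e.g. a protocol handshake."""
    return early_cancel(False)


def early_cancel_allowed():
    return _early_cancel_ok.get()


def trace():
    seen = [early_cancel_allowed()]              # default: hard cancel only
    with early_cancel():
        seen.append(early_cancel_allowed())      # e.g. blocked in recv
        with no_early_cancel():
            seen.append(early_cancel_allowed())  # mid-handshake
        seen.append(early_cancel_allowed())      # guard released
    seen.append(early_cancel_allowed())
    return seen


if __name__ == "__main__":
    print(trace())
```

Using a context variable rather than a global means each task carries its own setting, so a guard in one coroutine does not leak into a sibling.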

I implemented an example of this graceful cancellation behavior in trio: https://github.com/python-trio/trio/pull/941.

Here is a working example of the behavior, modified from the examples earlier in the thread.

import trio
import logging
from functools import wraps

logging.basicConfig(format='%(asctime)s %(message)s', level=logging.DEBUG)


def log_entry_exit(func):
    @wraps(func)
    async def wrapper(*args, **kwargs):
        logging.info('entering %s', func.__name__)
        try:
            return await func(*args, **kwargs)
        finally:
            logging.info('exiting %s', func.__name__)
    return wrapper


@log_entry_exit
async def first():
    with trio.CancelScope(graceful=True):
        # any checkpoint in second will cause a cancellation on graceful_cancel()
        await second()


@log_entry_exit
async def second():
    """
    There are 4 variations worth exploring here:

    Cancel scope where shield is True and graceful is False. In this
    example third() will be protected from a graceful cancel in main(),
    even though there is a graceful cancel scope in third().

    We remove the shield after returning from third() so that
    the cancel from main() will kill us instead of waiting for
    100 seconds.

        with trio.CancelScope(shield=True, graceful=False) as cancel_scope:
            await third()
            logging.info('turning off shield')
            cancel_scope.shield = False
            logging.info('second sleep for 100')
            await trio.sleep(100)  # will exit after about 3 seconds
            logging.info('done second sleep for 100')


    Cancel scope where shield is True and graceful is True. This
    example is similar to the above, except that `sleep(100)`
    will be immediately cancelled because the shield has been
    removed and this is a graceful cancel scope.

    If graceful is None, the behavior will be the same because it is
    inherited from the parent scope in first().

        with trio.CancelScope(shield=True, graceful=True) as cancel_scope:
            await third()
            logging.info('turning off shield')
            cancel_scope.shield = False
            logging.info('second sleep for 100')
            await trio.sleep(100)  # will exit immediately
            logging.info('done second sleep for 100')


    Cancel scope where graceful is False. In this example the graceful
    scope in third() will end after 3 seconds, followed by a 10 second
    wait in third() and then second() will sleep for 10 seconds before
    being cancelled by main().

        with trio.CancelScope(graceful=False):
            await third()
            logging.info('second sleep for 100')
            await trio.sleep(100)  # will exit after 10 seconds
            logging.info('done second sleep for 100')


    Cancel scope where graceful is True or None. This example is similar
    to the previous example, except that sleep(100) will exit immediately.

        with trio.CancelScope(graceful=True):
            await third()
            logging.info('second sleep for 100')
            await trio.sleep(100)  # will exit immediately
            logging.info('done second sleep for 100')

    """
    with trio.CancelScope(graceful=False):
        await third()
        logging.info('second sleep for 100')
        await trio.sleep(100)  # will exit after 10 seconds
        logging.info('done second sleep for 100')


@log_entry_exit
async def third():
    # since this is graceful cancel aware,
    # it should be explicit about blocking graceful cancels from outside
    with trio.CancelScope(graceful=False):
        with trio.CancelScope(graceful=True):
            logging.info('third in graceful scope sleep for 10')
            await trio.sleep(10)  # simulate a good graceful stopping point
        logging.info('third exited the graceful cancel scope')
        logging.info('third ungraceful sleep for 10')
        await trio.sleep(10)  # simulate work you don't want to gracefully stop
        logging.info('third finished ungraceful sleep')


@log_entry_exit
async def accept_loop():
    while True:
        with trio.CancelScope(graceful=True) as cancel_scope:
            # simulate accept
            logging.info('accept_loop sleep 2 - simulate accept')
            await trio.sleep(2)

        if cancel_scope.cancelled_caught:
            logging.info('accept_loop accept cancelled, break loop')

            # do cleanup behavior for accept cancellation
            # this will immediately exit for hard cancel
            # e.g. send close to a network peer
            logging.info('accept_loop sleep 1 - simulate graceful close')
            await trio.sleep(1)

            break

        # simulate handling request
        logging.info('accept_loop sleep 5 - simulate handling request')
        await trio.sleep(5)

    # general cleanup could go here


@log_entry_exit
async def main():
    async with trio.open_nursery() as nursery:
        logging.info('nursery start_soon accept_loop')
        nursery.start_soon(accept_loop)
        logging.info('nursery start_soon first')
        nursery.start_soon(first)
        logging.info('nursery sleeping for 3')
        await trio.sleep(3)
        logging.info('nursery calling graceful_cancel')
        nursery.cancel_scope.graceful_cancel(20)


if __name__ == '__main__':
    trio.run(main)

Example output

2019-02-19 17:14:47,435 entering main
2019-02-19 17:14:47,436 nursery start_soon accept_loop
2019-02-19 17:14:47,436 nursery start_soon first
2019-02-19 17:14:47,436 nursery sleeping for 3
2019-02-19 17:14:47,436 entering accept_loop
2019-02-19 17:14:47,436 accept_loop sleep 2 - simulate accept
2019-02-19 17:14:47,436 entering first
2019-02-19 17:14:47,437 entering second
2019-02-19 17:14:47,437 entering third
2019-02-19 17:14:47,437 third in graceful scope sleep for 10
2019-02-19 17:14:49,441 accept_loop sleep 5 - simulate handling request
2019-02-19 17:14:50,442 nursery calling graceful_cancel
2019-02-19 17:14:50,442 third exited the graceful cancel scope
2019-02-19 17:14:50,443 third ungraceful sleep for 10
2019-02-19 17:14:55,445 accept_loop sleep 2 - simulate accept
2019-02-19 17:14:55,446 accept_loop accept cancelled, break loop
2019-02-19 17:14:55,446 accept_loop sleep 1 - simulate graceful close
2019-02-19 17:14:56,447 exiting accept_loop
2019-02-19 17:15:00,445 third finished ungraceful sleep
2019-02-19 17:15:00,445 exiting third
2019-02-19 17:15:00,445 second sleep for 100
2019-02-19 17:15:10,444 exiting second
2019-02-19 17:15:10,444 exiting first
2019-02-19 17:15:10,444 exiting main
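To read the log more easily, it helps to express each event as seconds elapsed since graceful_cancel(20) was called. The snippet below does that arithmetic on timestamps copied from the output above; LOG and seconds_after_cancel are just names for this illustration.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S,%f"

# Timestamps copied from the example output above.
LOG = {
    "graceful_cancel called": "2019-02-19 17:14:50,442",
    "third exited graceful scope": "2019-02-19 17:14:50,442",
    "accept_loop accept cancelled": "2019-02-19 17:14:55,446",
    "exiting accept_loop": "2019-02-19 17:14:56,447",
    "third finished ungraceful sleep": "2019-02-19 17:15:00,445",
    "exiting second": "2019-02-19 17:15:10,444",
}


def seconds_after_cancel(log=LOG):
    """Whole seconds between the graceful_cancel call and each event."""
    start = datetime.strptime(log["graceful_cancel called"], FMT)
    return {
        event: round((datetime.strptime(stamp, FMT) - start).total_seconds())
        for event, stamp in log.items()
    }


if __name__ == "__main__":
    for event, delta in seconds_after_cancel().items():
        print(f"{delta:>3}s  {event}")
```

The offsets come out as 0, 0, 5, 6, 10 and 20 seconds. third leaves its graceful scope immediately, accept_loop notices once it is back in its graceful accept scope, and second is only stopped at 20 seconds, when the grace period runs out and the cancel becomes a hard one.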

Edit: Fixed a bug where setting the graceful attribute to False in a finally block caused the CancelScope to fail to catch the Cancelled exception in __exit__.

Example:

@log_entry_exit
async def accept_loop():
    try:
        with trio.CancelScope() as cancel_scope:
            try:
                while True:
                    logging.info('accept_loop setting graceful to True')
                    cancel_scope.graceful = True

                    # simulate accept
                    logging.info('accept_loop sleep 2 - simulate accept')
                    await trio.sleep(2)

                    logging.info('accept_loop setting graceful to False')
                    cancel_scope.graceful = False

                    # simulate handling request
                    logging.info('accept_loop sleep 5 - simulate handling request')
                    await trio.sleep(5)
            finally:
                # be careful not to modify cancel_scope.graceful here otherwise
                # the Cancelled exception will escape this scope
                # do cleanup behavior for any kind of cancellation
                cancel_scope.graceful = False
                # e.g. send close to a network peer
                logging.info('accept_loop sleep 1 - simulate close')
                await trio.sleep(1)
    except BaseException as e:
        logging.error(cancel_scope.cancelled_caught, exc_info=e)

I have changed the implementation to not allow modifications of the graceful property. This limits some of the possible usage patterns but is necessary for correctness.
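One simple way to get that restriction in Python is a property without a setter, so the flag is fixed when the scope is created. This is only an illustration of the idea with a stand-in class; Scope and try_to_flip are hypothetical names, and I am not claiming this is how the PR enforces it.

```python
class Scope:
    """Hypothetical stand-in for a cancel scope with a fixed `graceful` flag."""

    def __init__(self, graceful=None):
        self._graceful = graceful

    @property
    def graceful(self):
        # No setter is defined, so assigning to `graceful` raises AttributeError.
        return self._graceful


def try_to_flip(scope):
    """Attempt the assignment that the accept_loop example does in `finally`."""
    try:
        scope.graceful = False
    except AttributeError:
        return "rejected"
    return "changed"


if __name__ == "__main__":
    scope = Scope(graceful=True)
    print(try_to_flip(scope), scope.graceful)
```

With the flag immutable, the pattern from the accept_loop example above has to be written as nested scopes, one per graceful region, instead of toggling a single scope.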