Structured Concurrency Kickoff

#1

Hi everybody!

There are structured-concurrency-related efforts going on for different programming languages but the entire effort is kind of scattered, without people being aware of each other and speaking to each other.

If we had a common forum, we could share the use cases, the problems, the ideas and the solutions. Each of us, irrespective of which language they are working with, could benefit from this common pool of knowledge.

First, I wanted to create a cross-language mailing list to bring everyone together. I’ve even written a kick-off email. But then I realized that Nathaniel beat me by a few days by creating this forum. So, what follows is my introduction to the topic.

What’s structured concurrency?

Structured concurrency is an extension of the structured programming paradigm (goto considered harmful etc.) into the domain of concurrent programming. In very broad terms, the main idea is that the physical layout of the program (on screen) should correspond to its execution flow (in time). When you violate that principle, as Dijkstra argues, you’ll get ugly spaghetti code.

For structured concurrency in particular, it means that the lifetime of a thread (I am using the term broadly to mean anything between a process and a coroutine) is bound to a particular syntactic construct, typically a scope or a code block.

To give the simplest possible example, a thread may be automatically canceled when a block is exited:

    {
        ...
        go foo()
        ...
    } // foo gets canceled here
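
For readers who want something they can run, here is a minimal sketch of the same idea. It uses Python's standard asyncio rather than any of the libraries discussed in this thread, and the names `thread_scope`, `go` and `foo` are made up for illustration. It is not how any particular library implements it.

```python
import asyncio
from contextlib import asynccontextmanager


@asynccontextmanager
async def thread_scope():
    """Hypothetical helper: tasks started through `go` die with the block."""
    tasks = []

    def go(coro):
        task = asyncio.create_task(coro)
        tasks.append(task)
        return task

    try:
        yield go
    finally:
        # Leaving the block cancels whatever is still running...
        for task in tasks:
            task.cancel()
        # ...and waits until every cancellation has actually been processed.
        await asyncio.gather(*tasks, return_exceptions=True)


async def foo(log):
    try:
        log.append("foo started")
        await asyncio.sleep(3600)  # pretend to work for a long time
    except asyncio.CancelledError:
        log.append("foo canceled")
        raise


async def main():
    log = []
    async with thread_scope() as go:
        go(foo(log))
        await asyncio.sleep(0.01)  # give foo a chance to start
    # foo has already been canceled by the time we get here
    log.append("block exited")
    return log


log = asyncio.run(main())
```

The point of waiting inside `finally` is that the block does not merely request cancelation; it does not finish until `foo` is really gone.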

There are different ways to explain why structured concurrency is desirable. The simplest one is to note that threads, as we know them today, don’t provide even the slightest encapsulation guarantees: You call a function and once it returns you believe it is done. But unbeknownst to you it has launched a thread in the background which is still running and causing mischief. This problem is particularly bad because of its transitive nature. To find out whether a function launches a thread it’s not sufficient to examine the code of the function. You have to examine the code of every single dependency, every dependency of a dependency and so on.

This line of thought was exhaustively explored in Nathaniel Smith’s blog post “Notes on structured concurrency, or: Go statement considered harmful”:

https://vorpus.org/blog/notes-on-structured-concurrency-or-go-statement-considered-harmful

(If you are not familiar with the topic and you want to read just one article on structured concurrency, this is your best choice.)

However, there are different ways to think about the problem. Specifically, the above approach focuses heavily on the language user’s point of view and doesn’t give many hints to the language implementer.

As an alternative, I find the metaphor of a “call tree” (as opposed to “call stack”) to provide a much sharper focus on what structured concurrency actually means on the technical level: In the same way that you wouldn’t consider unwinding a stack frame before all its child frames have exited, a thread shouldn’t exit before all of its child threads are finished.
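
A tiny sketch of the call-tree rule, again with plain asyncio and made-up names (`parent`, `child`): the parent does not return until both of its children have finished, regardless of which child finishes first.

```python
import asyncio


async def child(name, delay, events):
    await asyncio.sleep(delay)
    events.append(f"{name} exited")


async def parent(events):
    # The parent's "frame" stays alive until every child has finished,
    # just as a stack frame outlives the frames of the functions it calls.
    children = [
        asyncio.create_task(child("child-a", 0.05, events)),
        asyncio.create_task(child("child-b", 0.01, events)),
    ]
    await asyncio.gather(*children)
    events.append("parent exited")


events = []
asyncio.run(parent(events))
```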

I’ve explored this point of view in the “Structured Concurrency in High-level Languages” article:

http://250bpm.com/blog:124

Implementations

This list may be incomplete but I am aware of implementations in:

C: http://libdill.org/
Python: https://trio.readthedocs.io/en/latest
Kotlin: https://kotlinlang.org/docs/reference/coroutines/basics.html#structured-concurrency (native!)

What have we learned so far?

  1. Thread lifetimes should be bound to scopes. Kind of.

While the example above looks nice and neat, it doesn’t really work. Consider the use case of a webserver accepting connections:

{
    l = listen(addr)
    while(...) {
        c = accept(l)
        go connection_handler(c)
    }
}

If the lifetime of the connection handler were bound to the innermost scope (the scope of the while loop), as happens to be the case with regular variables, each connection would be canceled when the while loop rolls over, i.e. immediately. Instead, we want the handlers to be canceled when the outer scope is exited.

That seems to imply we need two different kinds of scopes. Libdill calls these special scopes that cancel threads “bundles”. Trio calls them “nurseries”. In Kotlin it’s just plain “scopes”. It would be great if we could unify the terminology. Anyway, in this memo I’ll just call them “thread scopes”.

Here’s an example in Python:

    async with trio.open_nursery() as n:
        while ... :
            ...
            n.start_soon(connection_handler, c)
            ...
    # The handlers are canceled here.

  1. Thread scopes don’t necessarily live on the stack.

It’s kind of like with traditional variables. In most cases we want them to live on the stack (and be scoped accordingly) but once in a while one needs to do a dynamic allocation and thus escape the strict scoping rules.

Consider an example of an object that represents a network connection. It may contain a thread that sends keepalives. A function may want to create the object and pass it back to its caller. With simple data types (int) you can do that by copying the data to the caller’s stack frame. With threads this is not a good idea. Copying the stack of a running thread elsewhere is dangerous, breaks pointers to things on the stack etc.

Ideally, there would be a way to create the thread scope on the heap, so that it doesn’t get deallocated when the local scope is exited:

Example in C:

    connection *new_connection(void) {
        connection *c = malloc(sizeof(connection));
        c->sock = connect(...);
        c->bundle = bundle();
        bundle_go(c->bundle, send_keepalives(c));
        return c;
    }

    void free_connection(connection *c) {
        close(c->bundle); /* The keepalive thread is canceled here. */
        close(c->sock);
        free(c);
    }
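
The same shape can be sketched in Python with plain asyncio. This is only an illustration: the `Connection` class, its `keepalives_sent` counter and the timings are invented, and a counter increment stands in for real socket I/O.

```python
import asyncio


class Connection:
    """Hypothetical connection object that owns its keepalive task."""

    def __init__(self):
        self.keepalives_sent = 0
        # The task is stored on the object, so it outlives the function
        # that created the connection.
        self._keepalive_task = asyncio.create_task(self._send_keepalives())

    async def _send_keepalives(self):
        while True:
            await asyncio.sleep(0.01)
            self.keepalives_sent += 1  # stand-in for writing to a socket

    async def close(self):
        # Closing the object is what ends the task's lifetime.
        self._keepalive_task.cancel()
        await asyncio.gather(self._keepalive_task, return_exceptions=True)


def new_connection():
    return Connection()  # the keepalive task survives this return


async def main():
    conn = new_connection()
    await asyncio.sleep(0.05)
    await conn.close()
    sent_at_close = conn.keepalives_sent
    await asyncio.sleep(0.03)  # nothing should be sent after close()
    return sent_at_close, conn.keepalives_sent, conn._keepalive_task.cancelled()


sent_at_close, sent_later, cancelled = asyncio.run(main())
```

Note that nothing forces the caller to call `close()`, which is exactly the escape from strict scoping described above.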

What are the open questions?

  1. The elephant in the room with structured concurrency, of course, is thread cancelation. With no thread cancelation there’s no structured concurrency. Period.

In theory, simple EINTR-like cancelation and a lack of asynchronous interrupts/signals is a sufficient foundation for building structured concurrency. In practice, this raises an entire host of questions: Can we guarantee such semantics with actual threads (as opposed to coroutines)? What about processes? What would be the exact semantics of turning async events into EINTR-like exceptions? What are the possible cancelation points? And doesn’t all that actually turn preemptively scheduled threads into half-cooperatively scheduled Frankenstein monsters? And is that even desirable?

  1. How to handle errors?

The promise of consistent handling of errors in a concurrent environment is a big part of the appeal of structured concurrency. But it’s not that easy.

Once again, Nathaniel Smith wrote an entire article on the topic:

https://vorpus.org/blog/control-c-handling-in-python-and-trio/

His claim is that structured concurrency is the only way to make Ctrl+C handling work the way users naively expect it to work.

And it’s easy to see why: If the exceptions are propagated up the call tree, freely crossing the thread boundaries, the KeyboardInterrupt exception is eventually going to reach the top of the call tree (the main function) and exit cleanly.

However, there’s a hidden race condition involved: Imagine two threads in a scope, one gets KeyboardInterrupt and the other one raises an unrelated exception. If the latter exception bubbles up to the parent thread faster it will cancel the thread that’s processing KeyboardInterrupt and Ctrl+C would get lost.

There are some rather deep philosophical questions about the nature of errors here which I am not going to get into now. But the topic is definitely worth discussing.

  1. Who owns the thread scope?

Is the thread scope owned by some kind of external entity or is the ownership shared between all the threads in the scope?

The latter case means that each thread in the scope can close it, thus canceling its siblings. Is that a good idea?

I’ve written about the question here: http://250bpm.com/blog:139

  1. Timeouts.

How exactly are we going to handle timeouts? The most obvious way is the golang-context way where you just associate a deadline with a scope.

But the thing is much more complicated than that. A scope applies to all the contained threads (e.g. to all connections handled by a web server) but you may want to have different deadlines for each thread. For example you may want to close a connection that hasn’t completed the initial handshake in 10 seconds.

Then there are grace periods. (“Close the server but give all connections 1 second to shut down cleanly.”) When the grace periods enter the picture the entire topic becomes even more complex.

I’ve tried to enumerate all use cases, but I am in no way sure I’ve covered everything:

http://libdill.org/structured-concurrency.html#what-are-the-use-cases

  1. Typed scopes.

Should all threads running in a scope be of the same type? This is a weird theoretical question, but one that I find interesting. In a strongly typed language one could make sure that one thread scope would hold a single kind of thread (e.g. user connections as opposed to admin connections).

    b = bundle<connection_handler>();
    bundle_go(b, connection_handler()); // ok
    bundle_go(b, database_maintenence()); // error!

Done this way it would give programmers a way to reason about logical groups of threads as atomic entities and maybe even associate specific behaviours with each group (cancelation policy, deadlines etc.)
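
Python has no static type system that could enforce this, so the following is only a rough runtime stand-in for the idea, with a made-up `TypedBundle` class built on asyncio. In a strongly typed language the second `go` call would be a compile-time error instead of an exception.

```python
import asyncio


class TypedBundle:
    """Hypothetical thread scope that accepts only one kind of worker."""

    def __init__(self, worker_fn):
        self._worker_fn = worker_fn
        self._tasks = []

    def go(self, worker_fn, *args):
        # Runtime check standing in for what a type checker would do.
        if worker_fn is not self._worker_fn:
            raise TypeError(f"this bundle only runs {self._worker_fn.__name__}")
        self._tasks.append(asyncio.create_task(worker_fn(*args)))

    async def close(self):
        # Cancel the whole group as one unit.
        for task in self._tasks:
            task.cancel()
        await asyncio.gather(*self._tasks, return_exceptions=True)


async def connection_handler():
    await asyncio.sleep(3600)


async def database_maintenance():
    await asyncio.sleep(3600)


async def main():
    b = TypedBundle(connection_handler)
    b.go(connection_handler)  # ok
    try:
        b.go(database_maintenance)  # rejected: wrong kind of thread
        rejected = False
    except TypeError:
        rejected = True
    await b.close()
    return rejected


rejected = asyncio.run(main())
```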

#2

Thanks for your flexibility here, and this awesome kick-off post!

I want to say – I think the forum has some nice features, and it was easy to set up this way since we maintain the forum for Trio anyway, but if people feel uncomfortable with using a “project branded” space like this then please speak up. And to be 100% clear, the intention is that the “Structured concurrency” category here is totally open to any project, not specific to any one in particular.

So Trio is actually pretty fundamentalist here: it simply does not have any way to create a task without a nursery, or a nursery without binding it to a stack (using one of Python’s with blocks). Partly this is necessary because Trio is fundamentalist about never letting exceptions be accidentally thrown away, and if we allowed “heap-allocated” nurseries, then we would inevitably end up in cases where a task crashed but we had nowhere to re-raise the error. And partly it’s a tactical decision – when in doubt, do the more restrictive thing; even if it doesn’t work out, then you’ll at least learn something :slight_smile:

That means you have to structure code like this differently; in particular, you have to bind the lifetime of your object to some stack frame. It could potentially be some higher-up parent stack frame, but it has to be some stack frame. (This also answers the question you asked earlier, about why Trio allows nursery objects to be passed around, and whether that’s an anti-pattern – it’s certainly not something you want to do if you can avoid it, but it does make it possible to handle cases like these without relaxing our principles.)

There’s more discussion here:

Interestingly this part seems to be working out OK so far. When I wrote that post I used websockets as a hypothetical example, but we now have a real websocket library that uses this approach, and I haven’t seen anyone complain. A key part of this is that Python with blocks are extensible/composable, so the trio-websocket library defines its own with block to open a websocket, and that with block manages the nursery internally:

async with open_websocket_url('wss://echo.websocket.org') as ws:
    await ws.send_message('hello world!')

But you cannot do ws = open_websocket_url(...), because of the reasons you say.

Yeah, Trio has things easy here, since it’s an async / cooperatively-scheduled framework. This has two advantages:

  • Cancellation is one of those things that’s easy to add to a system if you do it at the beginning, but almost impossible to retrofit in later, since you end up having to basically audit all existing user code. Since we’re an async framework, we have to rewrite all the I/O routines anyway to route through our event loop, which makes it easy to add ubiquitous+uniform cancellation semantics at the same time.

  • We already have a mechanism to make the schedule points visible (async/await syntax). So we can re-use it to make cancellation points visible too.

There are other ways to make cancellation points visible to users – for example, if you have a type system to track what kinds of errors can happen (like in the Rust/Swift/Go style), then you can probably re-use this to keep track of cancellation too. But I don’t have any brilliant ideas on how to retrofit legacy libraries to make them tolerate cancellation.

Small point of clarification: I really mean, it’s the only way to make it work the way Python users naively expect it to work. Because Python made a decision long ago that control-C injects a magic KeyboardInterrupt exception, and Python users all learn this long before they start learning about concurrency. There are definite trade-offs to this design choice, and I don’t know if I’d repeat it or not in another language. But given that Trio is aimed at Python users, we wanted to keep things familiar.

Yeah, this is an interesting issue! There’s nothing really specific about KeyboardInterrupt here – you can have the exact same issue with any two exceptions that race against each other in concurrent tasks. And we’ve found that people do get bitten by this in practice: example 1, example 2, example 3.

There is an analogous thing that can happen with regular sequential code: if an exception handler crashes, then the resulting exception will tend to preempt the original exception. E.g. here:

try:
    raise ValueError("bad value")
finally:
    file_handle.clsoe()

…you won’t get a ValueError, you’ll get an AttributeError complaining that file_handle has no method named clsoe. (Of course more static languages would catch this specific error at compile time, but I’m sure you can think of other situations where cleanup handlers might crash.)

Python 3 has a neat way to help debug cases like this, called implicit exception chaining: when our AttributeError preempts the ValueError, the original ValueError gets attached to the AttributeError as its “context”. Then when the traceback is printed, Python shows you both exceptions, and how they relate:

Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
ValueError: bad value

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 4, in <module>
AttributeError: '_io.TextIOWrapper' object has no attribute 'clsoe'
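
The chaining is also visible programmatically through the `__context__` attribute. Here is a small self-contained sketch; the function name `close_with_typo` is invented, and a bare `object()` stands in for the file handle.

```python
def close_with_typo():
    try:
        raise ValueError("bad value")
    finally:
        # Deliberate typo, mirroring the example above: this raises
        # AttributeError while the ValueError is still in flight.
        object().clsoe()


try:
    close_with_typo()
except AttributeError as exc:
    winner = exc                 # the exception that actually propagated
    preempted = exc.__context__  # the ValueError it preempted
```

So the original ValueError is not lost, it just rides along on the exception that replaced it.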

So we have a plan to extend this system to handle cross-task exception preemption as well. In your example, the unrelated exception would “win”, but we’ll make a note in the traceback information that at the point where it propagated past this other task on the call tree, then it preempted a KeyboardInterrupt. (This is part of a larger redesign of how we represent cross-task exceptions; search that issue for __context__ to read about this part specifically.)

Yeah, IMO a timeout system needs to make it easy to apply timeouts to arbitrary operations – so the “targeting” system needs to be more fine-grained than a bundle/nursery or a task. And you need to handle nesting (because otherwise you lose encapsulation, and it makes Dijkstra sad.) This is the reasoning behind Trio’s “cancel scope” system – the connection establishment code can put a 10 second timeout on the handshake code without needing to know about any other timeouts that might be in effect. I assume you’ve seen this article before, but for those who haven’t, it goes into much more detail: https://vorpus.org/blog/timeouts-and-cancellation-for-humans/
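
To illustrate just the nesting property (not Trio's cancel scope API), here is a sketch using asyncio's `wait_for` from the standard library. The function names and the timeout values are invented; the handshake simply sleeps to simulate a peer that never answers.

```python
import asyncio


async def handshake():
    await asyncio.sleep(3600)  # a peer that never answers


async def establish_connection():
    # The connection code sets its own short deadline without knowing
    # whether the caller has imposed one as well.
    try:
        await asyncio.wait_for(handshake(), timeout=0.05)
    except asyncio.TimeoutError:
        return "handshake timed out"
    return "connected"


async def main():
    # The caller wraps the whole operation in a longer, unrelated deadline.
    return await asyncio.wait_for(establish_connection(), timeout=5)


outcome = asyncio.run(main())
```

The inner timeout fires and is handled locally, and the outer deadline never has to know about it.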

We have a long discussion of graceful shutdown here: https://github.com/python-trio/trio/issues/147

I think there really is value in having a “common vocabulary” for graceful shutdown, because otherwise it’s very difficult to build a complex application that might embed a third-party webserver etc., and still coordinate shutdown. We’re thinking about things like adding a “soft cancelled” state to cancel scopes, but haven’t settled on any one design for certain.

split this topic #3

A post was split to a new topic: Structured concurrency in Rust

split this topic #4

A post was split to a new topic: The ParaSail language

split this topic #5

A post was split to a new topic: Zio (Scala library)

#6

I split out some posts into new topics. In my experience with Discourse forums so far, splitting into topics like this seems to make things easier to follow, versus having different discussions intermingled with each other. It’s pretty easy to split topics like this though, so you don’t have to worry too much; if a discussion gets tangled up we can fix it afterwards :-).

#7

So Trio is actually pretty fundamentalist here: it simply does not have any way to create a task without a nursery, or a nursery without binding it to a stack (using one of Python’s with blocks).

I think we are both approaching the same problem here from different sides. Namely: How to spawn a thread that lives in the scope defined by your parent?

void foo() {
    bar()
}

void bar() {
    go(quux()); // this should be canceled when foo exits, but how?
}

Here’s Trio’s solution:

async def foo():
    async with trio.open_nursery() as n:
        bar(n)

def bar(n):
    n.start_soon(quux)

And here’s how you would do the same thing in libdill:

void foo() {
    int b = bar();
    close(b);
}

int bar() {
    int b = bundle();
    bundle_go(b, quux());
    return b;
}

Now, both are functionally equivalent, but look at the ownership of the scope (nursery, bundle). In the former case it’s first owned by foo and then by foo and bar in parallel. In the latter case there’s always exactly one owner. First it’s bar, then it passes it to foo but by that time bar is already not running. The swap of ownership is atomic.

Shared ownership results in some weird scenarios:

async def foo():
    async with trio.open_nursery() as n:
        n.start_soon(quux)
        bar(n)

def bar(n):
    n.cancel_scope.cancel()

Note how bar cancels quux which it should, as a naive person would argue, not even be aware of. In other words, it looks like a violation of encapsulation.

#8

Yeah, this is an interesting issue! There’s nothing really specific about KeyboardInterrupt here – you can have the exact same issue with any two exceptions that race against each other in concurrent tasks.

There actually may be something specific to KeyboardInterrupt here (or to a broader set of exceptions that KeyboardInterrupt is part of).

Consider an I/O error. If it happens after the thread has been asked to exit, who cares? From the user’s perspective, the thread is already dead and asking a program to report network outages that occurred after it has been shut down doesn’t sound like a reasonable expectation.

Not so for KeyboardInterrupt though. If it’s dropped, it will cause the entire program to misbehave. Maybe there’s a “native” scope for each exception? Like that KeyboardInterrupt should always go directly to the main thread?

#9

Well, sure, if you pass a nursery across an encapsulation boundary, then you have explicitly chosen to violate encapsulation :slight_smile:. You can also write the same thing in libdill:

void foo() {
    int b = bundle();
    bundle_go(b, quux());
    bar(b);
}

void bar(int b) {
    close(b);
}

I think the gap here is that Python really doesn’t use “passing ownership” as an idiom. The closest analogue to “ownership” in Python is to use a with block on something. So if bar wants to create something, while enforcing that its caller takes ownership, then the way you do that in Python is define bar in such a way that you have to use it in a with block:

async def foo():
    async with bar():  # this can create a nursery
        ...
    # this dedent closes the nursery

And if we did things the way you suggest, with heap-allocated nursery objects whose ownership could be passed arbitrarily between functions, then we would lose many of Trio’s key advantages – no more automatic exception propagation (how do we know whether the exception should be sent to foo or bar?), no more using cancel scopes to delimit a cancellable operation (how do we tell which nurseries are inside the cancel scope if they’re just integers that can be moved around arbitrarily?).

OK, that’s fair, yeah. Since a KeyboardInterrupt in particular can end up in totally arbitrary parts of the program, that means it can potentially end up in parts that aren’t properly set up to handle an exception like it.

I think this is an example of a pretty general trade-off. In a “serious” program, you probably want control-C (and also SIGTERM) to trigger some kind of controlled shutdown. Maybe a graceful shutdown, definitely some kind of cancellation and unwinding. The standard Python thing of just materializing a KeyboardInterrupt at some arbitrary location cannot be made fully safe and predictable, in lots of ways. What if you were in the middle of a critical section? Trio tries its best to make it at least reliable enough to be useful in practice, but like, there’s no possible way you can write a test that an exception that can happen after any instruction is always handled correctly. This is just inherent in the idea of tossing an exception into an arbitrary place and crossing your fingers. So we have ways for serious programs to catch the control-C in a controlled way and then cancel everything, etc.

So KeyboardInterrupt is broken! …except. What if you have a buggy program? For example, one where there’s a task caught in an infinite loop and ignoring cancellation? In that case a controlled shutdown is impossible, and throwing in a KeyboardInterrupt grenade is likely to work pretty well. Or what if you have a quick script whose author never spent any time at all thinking about control-C or controlled shutdown? (And of course there’s a lot of overlap between the “quick script” and “buggy program” cases :-).) KeyboardInterrupt is theoretically wrong, but in practice it handles these cases pretty well.

#10

Slightly off-topic here, but yes in general this observation is correct
that Python’s way of handling Ctrl-C mostly works for small programs.
Larger programs with multiple threads and a mainloop should implement a
signal handler to catch it and shut down orderly without raising an
exception (usually signalling the mainloop).

So considering the KeyboardInterrupt exception as a way to make your
application shut-down orderly in a structured-concurrency environment
may be the wrong thing to try to achieve. Instead the signal should be
caught and the application can then cancel all
threads/nurseries/bundles/pools that it wants to cancel in order to
terminate cleanly.

#11

Both of Trio’s strategies for handling control-C really, really benefit from structured concurrency, though. By default, we do the usual Python thing of raising KeyboardInterrupt, and it works as well as it ever does. (I.e., not 100% reliably in theory, but basically just fine in practice.) This relies on our ability to propagate exceptions properly, which we get from structured concurrency. The KeyboardInterrupt arrives in some arbitrary task, and then as it propagates out it automatically cancels everything else, runs finally blocks etc. Or, if you register a signal handler, then structured concurrency makes it easy to cancel everything and shut down in an orderly way – call root_cancel_scope.cancel() and the whole program unwinds itself.

By comparison, Trio’s competitors like asyncio don’t have any useful default behavior – in fact a common reaction to control-C is for KeyboardInterrupt to be raised inside the mainloop’s guts, which corrupts its state and makes everything crash hard :frowning:. Or, if you do register a signal handler, it’s very difficult to figure out what all your different tasks/callbacks/etc. are doing so you can shut them all down in an orderly way.

There is one frustrating limitation with Trio though: if you have a program that uses trio.run_in_worker_thread(...) to call into some code that blocks for an indefinite period, then the program tends to freeze when you hit control-C :-(. The reason is that we have no generic way to cancel threads, and Trio relies on cancellation to unwind the program after control-C. This seems to be basically unsolvable in the general case, though you can do things like manually make the thread check for cancellation, and hopefully as Trio’s ecosystem grows then people will have less need to call into legacy blocking libraries like this.

#12

Fair enough. Both C and Python have their own shortcomings and special considerations which muddle the thinking about the problem. Let’s rather think of some kind of ideal language that has SC baked in:

thread_scope {
     go foo(); // lifetime is automatically bound to the enclosing scope
} // foo gets canceled here

A function can be an implicit thread scope:

void bar() {
     go foo(); // lifetime is automatically bound to the lifetime of bar
} // foo gets canceled here

Now, that’s nice and easy. It makes it hard to shoot yourself in the foot. My ideal language should definitely support that kind of thing. But the original observation in this point was that it is not sufficient. And the example given was the socket object with a thread inside that’s returned from a function. We need a different syntax for that.

In the end I feel like there’s a case for having two different constructs.

This is hard to express in Python though, given that it (I guess) allocates everything on the heap and uses GC to take care of lifetimes.

#13

I think Floris may be getting at the same point as I did: KeyboardInterrupt is special. If we handle it just like any other exception we end up with the weird corner cases where it is not respected (see my scenario in the original post).

Once you start thinking of it as something special some solutions pop up. For example, one could make KeyboardInterrupt “level-triggered”, i.e. once user presses Ctrl+C, every blocking function from that point on will end immediately with KeyboardInterrupt. But that doesn’t work if you want to give the server a grace period to shut down after Ctrl+C.

So, my thinking was: Can’t we route the Ctrl+C event always to the main thread?

It’s easy to implement in the language and it has the desirable properties – it never gets overshadowed by an exception from a sibling thread, given that main has no sibling threads.

The problem is that it looks weird and special. But in fact, it is not. Consider how network interrupts are handled. They are captured by the OS and routed to the socket that cares about that particular connection. And if we are doing interrupt routing anyway, why not simply route Ctrl+C to the main thread?

#14

As for the common vocabulary, that’s what I’ve tried to do, but as already said, I am not sure my list is exhaustive.

On the technical level, I would advise against making “soft cancel” part of the language. I’ve tried that and it resulted in a lot of complexity and even more importantly it made the entire “structured” thing much less obvious and intuitive.

Why not simply say that graceful shutdown is to be handled by the user? The user can open a channel to the thread, send it a “shutdown in 10 secs” message, then wait for 10 secs and cancel it by exiting the scope. The thread, in turn, would now know that it’s supposed to exit in 10 secs, but it doesn’t have to be too paranoid about it: If it doesn’t comply it will be canceled by the parent anyway.
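
Roughly what I have in mind, sketched with plain asyncio so that it can be run. An `asyncio.Event` plays the role of the channel, and the worker names and the (shortened) grace period are invented for the example.

```python
import asyncio


async def worker(shutdown_requested, log):
    try:
        while not shutdown_requested.is_set():
            await asyncio.sleep(0.01)  # normal work
        log.append("worker: shut down cleanly")
    except asyncio.CancelledError:
        log.append("worker: hard-canceled")
        raise


async def stubborn_worker(shutdown_requested, log):
    try:
        await asyncio.sleep(3600)  # never looks at the shutdown request
    except asyncio.CancelledError:
        log.append("stubborn: hard-canceled")
        raise


async def main(grace_period=0.1):
    log = []
    shutdown_requested = asyncio.Event()  # plays the role of the channel
    tasks = [
        asyncio.create_task(worker(shutdown_requested, log)),
        asyncio.create_task(stubborn_worker(shutdown_requested, log)),
    ]
    await asyncio.sleep(0.02)
    shutdown_requested.set()  # "please exit within the grace period"
    await asyncio.wait(tasks, timeout=grace_period)
    for task in tasks:  # whoever is still running gets canceled
        task.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)
    return log


log = asyncio.run(main())
```

The cooperative worker exits on its own within the grace period; the one that ignores the request is canceled by the parent when the period runs out.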

EDIT: Actually, this is a problem I’ve banged my head against for a year or so before I realized how to solve it. I’ll write a separate post about it.

#15

I actually like the reified lifetimes that nurseries give you. I just posted some of the reasons why in the ZIO thread. And of course you also need them for objects that encapsulate a nursery. E.g. in Trio you write:

async with open_websocket("https://...") as ws:
    message = await ws.receive()
    await ws.send("reply")

Here the async with open_websocket internally opens a nursery, whose scope extends over the open_websocket's block. The ws object internally holds a reference to this nursery, so it’s helpful that it is an object :-). But code inside the async with block doesn’t have any way to access this nursery directly. (For example, it can’t spawn new tasks into it.) So reified lifetime objects are really useful for encapsulation and abstraction.

Also note that you can construct a reified nursery from an implicit nursery, though it’s kind of awkward:

# In a made-up language with `go` statements that are implicitly scoped to the
# surrounding function
def pseudo_nursery_manager(nursery):
    while True:
        thunk = nursery.receive()
        go thunk

Now whenever I want a nursery object that I can pass around, I write:

def uses_pseudo_nursery():
    nursery = open_channel()
    go pseudo_nursery_manager(nursery)
    # These both get to spawn tasks into my pseudo-nursery by
    # sending on the channel
    foo(nursery)
    bar(nursery)

Doing things this way is clunky and awkward, but you haven’t actually stopped people from shooting themselves in the foot if they really try :-).

Anyway… disallowing “dynamic” / “heap-allocated” nurseries actually works pretty well for us. And you have to admit: heap-allocated nurseries are a wishy-washy compromise that lets unstructured control-flow leak into your language. Have the courage of your convictions :wink:

I think the key thing you need to make Trio’s approach practical is to have a language that lets users define their own “block types”. So like in Python, anyone can invent a new kind of with block. And that’s what allows open_nursery to be encapsulated inside some user-defined function like open_websocket – they just have to make their function a with block.

In many modern languages, the way you would do this is instead to use some kind of closure/block syntax. Like in JS you’d probably make the primitive

withNursery(nursery => {
    // ... code that uses nursery ...
})

And then for the websocket, you’d do:

withWebsocket(url, ws => {
    // ... code that uses ws ...
})

There are similar idiomatic features in Ruby, Swift, Rust, etc. I think this is the key feature that C is missing, that’s making your life difficult.

This is actually what Trio does when it gets a Ctrl+C at an awkward time where it can’t deliver it immediately – it sets a flag, and then uses its cancellation system to inject the KeyboardInterrupt into the main task at the next available opportunity. But, this isn’t for the reason you suggest :-). Trio does it this way because it needs to deliver it to some task, and the main task is guaranteed to always be there, so it’s a convenient choice. But it doesn’t help with the issue you’re thinking of: in Trio the main task is not really special with respect to exception handling, and the KeyboardInterrupt could still get lost, if one of the main task’s children crashes while the KeyboardInterrupt is propagating.

Oh sorry, when I said “common vocabulary”, I meant, a generic way for all Trio apps to talk about it – so if my app has an embedded HTTP server, an embedded websocket server, and something else, and they’re all written by different third parties, then it’s very helpful if there’s a standard uniform way to say “All right all of you, do a graceful shutdown”.

I’d like to hear more about this! I was thinking it seemed like a pretty small and natural extension, that fits naturally with the “structured thing”; since we already have a way to deliver a cancellation at a branch of the task tree, extend that mechanism to deliver soft-cancellations as well.

We’ve experimented with “user space” implementations of this. But a channel is pretty awkward here. Take our accept loop:

while True:
    conn = await listener.accept()
    nursery.start_soon(handler, conn)

The accept call might block indefinitely. But when the graceful shutdown is requested, we want the accept call to exit immediately (while any handlers are allowed to keep running of course). So if we use a channel for this, then it means we need some kind of accept-a-socket-or-else-receive-from-a-channel operation, which is really difficult. (For regular OS sockets it’s possible, if you go all concurrent ML, but that’s a whole pile of complexity that you don’t need just for this use case, and it doesn’t necessarily work for cases where listener is more complicated than a bare OS socket.)

Instead, we’d write:

while True:
    with cancel_if_graceful_shutdown_requested as cancel_scope:
        conn = await listener.accept()
        nursery.start_soon(handler, conn)
    if cancel_scope.was_cancelled:
        # Graceful shutdown requested
        break

Looking forward to it :slight_smile:

split this topic #16

A post was split to a new topic: Project Loom – lightweight concurrency for the JVM

split this topic #17

2 posts were split to a new topic: Thread locals and dynamic scoping

#19

Uh. I looked at how KeyboardInterrupt works in Python and it doesn’t look like it can be intercepted and routed to the main thread. That makes the entire discussion moot. Other languages may try this approach though.

(An interesting insight here is that this cannot be done without structured concurrency because unless the program is structured there’s no concept of main thread – all threads are equal and therefore there’s no obvious candidate to handle Ctrl+C events.)

EDIT: After even more investigation, it seems that it’s possible to install a custom interrupt handler in Python:

signal.signal(signal.SIGINT, signal_handler)

So maybe the signal can be re-routed to the main thread after all.
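
A minimal sketch of that first step, using only the standard signal module. The flag name `interrupted` is invented, the Ctrl+C is simulated with `signal.raise_signal` (Python 3.8+), and the code assumes it runs in the main thread, since that is the only place Python allows handlers to be installed.

```python
import signal

interrupted = False


def signal_handler(signum, frame):
    # Record the request instead of letting Python raise KeyboardInterrupt
    # at some arbitrary point in the program.
    global interrupted
    interrupted = True


previous_handler = signal.signal(signal.SIGINT, signal_handler)
try:
    signal.raise_signal(signal.SIGINT)  # simulate the user pressing Ctrl+C
finally:
    # Put back whatever handler was there before (normally Python's default).
    signal.signal(signal.SIGINT, previous_handler or signal.SIG_DFL)
```

What the program then does with the flag (cancel the main thread, start a grace period, and so on) is the part that still needs to be designed.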

And, oh, maybe the exception thrown in the main thread should be Cancelled rather than KeyboardInterrupt. That would make the main thread the same as all other threads: It can be canceled only by its “parent”, which, in this case, is the user pressing Ctrl+C.

#20

Terminology is a problem and “scope” is already overused.

Yes, that really sucks. We should come up with something less generic than “scope” or “bundle” and less arbitrary than “nursery”. Also, being a core concept the name should be two or fewer syllables long.

After spending some time browsing the thesaurus I’ve found “twine”. It represents the concept well (it’s a collection of threads), it’s 1 syllable long and most importantly, the name is not already taken. Would that work for other people here?

There are a few other concepts such as nesting and a “non-cancellable” scope to shield fibers from cancellation during recovery, cleanup or critical operations. This may be relevant to some of the discussion here.

I’ve made a separate post about this topic here (Graceful Shutdown) but I think that what I wrote is not easy to grasp. I’ll try to write a blog post about the topic or something.

#21

I just split off a few more threads:

Feel free to edit titles if I got them wrong.