Structured concurrency in Rust

Rust’s rayon library is something I use pretty frequently: it provides data-race-free scoped threads and parallel iterators, and it can serve as a nice building block for more complex approaches. Its simplest application is that, just by pulling in the library, users of iterators can switch my_collection.iter().map(... to my_collection.par_iter().map(... and have the work the iterator performs distributed to a threadpool that can access referenced data from the calling scope without allocating etc… I think the rayon::scope function may be particularly interesting as a building block for people interested in more complex structured concurrency approaches.
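To make that concrete, here’s a minimal sketch of the switch (assuming rayon is added as a dependency; the collection and closure are just placeholders):

use rayon::prelude::*;

fn main() {
    let my_collection: Vec<i32> = (0..1_000).collect();

    // Sequential version:
    let squares: Vec<i32> = my_collection.iter().map(|x| x * x).collect();

    // Parallel version: same shape, but the work is distributed across
    // rayon's threadpool, borrowing `my_collection` from the calling
    // scope without any extra allocation or copying of the input.
    let par_squares: Vec<i32> = my_collection.par_iter().map(|x| x * x).collect();

    assert_eq!(squares, par_squares);
}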


Can you use it in use cases where you don’t know the number of threads in advance? For example, what would a listen/accept loop look like?

Crossbeam also has a scope, very similar to rayon: https://docs.rs/crossbeam/0.7.1/crossbeam/fn.scope.html

They do have one important difference from a Trio-style nursery: if one of the threads panics, then the program just… keeps going until all the other threads finish. (And let’s hope the other threads don’t block waiting for the crashed thread to do something!) The problem is that since Rust doesn’t have any way to cancel a thread, there’s really nothing else it can do. Aleksey Kladov wrote a post about this: Exceptions vs Structured concurrency (discussion on r/rust)

I think it’s interesting to see all the different ways that people have ended up at these kinds of constructs, though. For Trio, I was just super frustrated about lost exceptions and designed nurseries to solve that problem, and then realized that I’d accidentally solved some other problems too. For Rust, the type system straight up forces you to find some way to tell it “this thread doesn’t live longer than this stack-allocated data”; otherwise it won’t let you pass the stack-allocated data into the thread. So it kind of forces you to invent `rayon::scope`. And then @qwwdfsad told me Kotlin came at it a different way again: https://twitter.com/qwwdfsad/status/1040898810816471041

The other interesting thing Rust is working on these days is the “futures” system. I don’t understand it very well, or know how that fits in with all this. My impression, though, is that it’s quite interesting and different in critical ways from the classic style of futures that I’m used to from python/javascript/C#. It did feel more complicated than it needed to be and just… weird about how it expressed things? But that might just be my bias from having my own way of doing things in mind, and not having internalized how Rust’s type system and idioms change things.


Wild Aleksey Kladov appears

And let’s hope the other threads don’t block waiting for the crashed thread to do something!

This actually is not a problem in Rust most of the time. If you wait for another thread, you typically block on something like channel.recv() or mutex.lock(). If the thread holding the mutex/sending side of the channel panics, the resource is destroyed with RAII and poisoned, so the blocking thread unblocks with a panic as well. So this is a happy case actually, where cancellation is propagated.
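Here’s a minimal demonstration of that propagation with std’s Mutex (the names are just for illustration):

use std::sync::{Arc, Mutex};
use std::thread;

fn main() {
    let data = Arc::new(Mutex::new(0));
    let data2 = Arc::clone(&data);

    let worker = thread::spawn(move || {
        let _guard = data2.lock().unwrap();
        panic!("worker crashed while holding the lock");
    });
    let _ = worker.join(); // the worker panicked...

    // ...so the mutex is poisoned: lock() now returns Err(PoisonError),
    // and the usual .unwrap() would propagate the failure as a panic here too.
    match data.lock() {
        Ok(_) => println!("lock acquired"),
        Err(poisoned) => println!("lock poisoned: {}", poisoned),
    }
}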

The bad case happens when a thread blocks on something else. Like, if you do a blocking read, you are basically stuck.

:wave: Hello! Welcome!

And oh, that’s interesting about the poisoning. We’ve actually been debating whether to add poisoning to Trio’s channels. The downside is that it adds some non-trivial complexity to the implementation, especially if you want to support multiple implementations of the channel API, and especially if you want some of those implementations to use existing protocols like WebSocket. But, of course, we can get away with skipping it because if we get stuck, we have a general cancellation operation we can use to unstick ourselves. I hadn’t thought before about how poisoning could be a kind of object-specific substitute for cancellation.

We’ve actually been debating whether to add poisoning to Trio’s channels.

You might find these discussions somewhat useful then:

Lessons to be taken from channels in Go? · Issue #39 · crossbeam-rs/crossbeam-channel · GitHub
How to know if all the receivers has been dropped? · Issue #61 · crossbeam-rs/crossbeam-channel · GitHub
Panic on `send` in channels by default? · Issue #314 · crossbeam-rs/crossbeam · GitHub

One interesting conclusion from that discussion was that if you don’t implement poisoning of some sort, you end up in a bad situation when thread A tries to send messages to a (dead) thread B: that leads to deadlock (if the channel’s buffer is finite) or OOM (if the channel is unbounded), which seem to be worse outcomes than something like a PoisonedException.
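For what it’s worth, std’s mpsc channels already report the dead peer at the send site, which is exactly the behavior that avoids that trap (a tiny sketch):

use std::sync::mpsc;

fn main() {
    let (tx, rx) = mpsc::channel();
    drop(rx); // the receiving "thread B" dies: its endpoint is dropped

    // Because the dead peer is reported, send() fails fast instead of
    // deadlocking (bounded channel) or accumulating messages forever
    // (unbounded channel).
    assert!(tx.send(42).is_err());
}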

Oh right… actually I was confused :-). Trio’s channels do have a closed state, which is tracked separately for each endpoint. Closing the send-side causes channel.receive to signal “end of channel”, and closing the receive-side causes channel.send to raise BrokenResourceError. This is modeled after how the common byte-stream APIs like TCP work. The debate we’re having about poisoning is actually about whether the sender should have a second way to close the channel that indicates an un-graceful shutdown, so the receiver doesn’t get a nice “end of channel” but rather something like a BrokenResourceError.

I guess the underlying statement stands, that poisoning is less crucial to avoiding deadlock when you have cancellation, and unexpected thread crashes automatically trigger cancellation of related threads.

(I had a thought about poisoning here: https://twitter.com/glaebhoerl/status/1072078852795625472)

In its basics, the futures system actually looks pretty close to the Trio concepts.

A task, represented as a struct/enum implementing Future, is logically a tree of nested structs/enums implementing Future that represent various active sub-tasks. This structure naturally corresponds to the cancellation/concurrency scope stack/tree. These futures are poll-driven: when a wakeup comes, the main future is polled, and it polls the currently active sub-futures (or, in some cases, just the one that is ready to do work) in turn.
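For readers unfamiliar with the poll model, here’s a minimal hand-written leaf future, purely illustrative: it returns Pending on the first poll and Ready on the next:

use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll};

struct YieldOnce {
    yielded: bool,
}

impl Future for YieldOnce {
    type Output = ();

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        if self.yielded {
            Poll::Ready(())
        } else {
            self.yielded = true;
            // Ask the executor to poll us again; a real leaf future would
            // instead stash the waker and call it when its I/O completes.
            cx.waker().wake_by_ref();
            Poll::Pending
        }
    }
}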

A unit of cancellation here is a subtree that corresponds to some Future. For example, if we have something like

future::select(do_some_work(), timeout(Duration::from_secs(5)))

it is represented as a structure with two child futures. When any one of them completes, the other one is dropped and its Drop::drop implementation (finalizer) is called. The finalizer is synchronous, so normally it should not do any more async work. Due to the poll-based logic there’s no need to unwind the stack as the cancellation happens outside the future that should be cancelled. This can happen at any point where the child future is not running (yields control to the parent future / executor).
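A runnable version of that snippet, assuming the futures and futures-timer crates (futures-timer stands in for whatever timer the executor provides):

use std::time::Duration;

use futures::executor::block_on;
use futures::future::{self, Either};
use futures_timer::Delay;

async fn do_some_work() -> u32 {
    // Stand-in for real work; deliberately slower than the timeout below.
    Delay::new(Duration::from_secs(10)).await;
    42
}

fn main() {
    block_on(async {
        // Box::pin makes both futures Unpin, as future::select requires.
        let work = Box::pin(do_some_work());
        let timeout = Box::pin(Delay::new(Duration::from_secs(5)));

        match future::select(work, timeout).await {
            Either::Left((value, _timeout)) => println!("finished: {}", value),
            Either::Right(((), work)) => {
                // The timer won. Dropping `work` runs its Drop::drop and
                // guarantees it is never polled again - that drop *is*
                // the cancellation.
                drop(work);
                println!("timed out");
            }
        }
    });
}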

The description above, however, is the ideal case. tokio, the most popular executor, provides the ability to spawn independent tasks too and users (and some libraries) use that for various kinds of background workers, which do not fit the tree-based model.


Yeah, there’s a consensus model of “futures” that you see in Python Twisted or asyncio, or C# or Java or Javascript. But Rust’s futures are totally different – they’re much more like ZIO’s monadic concurrency (see Zio (Scala library)).

My personal feeling is that this adds a ton of complexity without adding much value – especially once you start adding special syntax to make futures more usable, so now you have redundant procedural-ish and monadic/functional-ish APIs for the same underlying features. Trio needs support from the language for suspending/resuming a call-stack, and then adds nurseries and cancel scopes as library features, and… that’s the whole machinery. The end result is roughly similar, but the complexity budget seems much smaller. This isn’t really related to structured concurrency per se though, it’s just a general comment :-).

Anyway, as far as structured concurrency goes, obviously tokio::spawn is inconsistent with it :-). But there’s also a more subtle thing I worry about:

Using drop for cancellation seems very worrisome to me, because it means that cancellation is unilateral and the code being cancelled cannot prevent or delay it, or report the result. This is very different from most mature systems I know, where cancellation is a three-phase process: (1) request, (2) wait, (3) report success/failure. For example: this is how Trio and ZIO work. @alanb is emphasizing the need to be able to delay cancellation in the Java Loom design:

Or here’s Russ Cox explaining part of the rationale for how Go channels work:

There almost always need to be two steps in a cancellation: a request for the cancellation and an acknowledgement that work has in fact stopped.

Well, that’s the argument from authority, but what concrete problems does it cause? Here are some fundamental cases where I’ve run into this:

  • Delegating work from async code to a thread: a very standard situation is that some async code needs to call into some legacy synchronous library, so it delegates that to a thread. Now your async code gets cancelled. What happens to the thread? You can’t cancel it synchronously, so in tokio I think the only option is to leak the thread…

  • Using Windows IOCP: this is sort of similar to the previous item – say you want to write to a file using IOCP. You call WriteFile, and this schedules the work to happen in the background (generally in some state machine running in the kernel rather than a thread per se, but it’s pretty similar from your point of view). IOCP has a powerful cancellation API, but it’s not synchronous: CancelIoEx submits a request to cancel a particular operation, but the cancellation doesn’t happen immediately, and you don’t find out until later whether the operation was actually cancelled or not. Again in tokio I think the best you can do is to submit the request and then let the operation continue in the background…

What this means in practice is: say you want to write to a file on disk without blocking the event loop. You have to use either a thread (most popular on all platforms) or WriteFile (on Windows, if you’re OK with some trade-offs that are irrelevant here). Then suppose your disk write operation is cancelled. In Trio, the write operation will either return successfully (the cancel request was too late, and ignored), or else it will return a Cancelled error (in which case the file wasn’t written). Either way, you know the state of the filesystem. In Tokio IIUC, the write operation will be abandoned to keep running in the background. It may or may not actually succeed, so the filesystem is now in an indeterminate state. And even if you read from the filesystem to find out what the state was, that doesn’t help, because the write might still be pending in the background and complete at some unknown future moment – which is a classic example of the kinds of problems structured concurrency is supposed to prevent :slight_smile:

I’m really not an expert on Rust and I have huge respect for the folks working on this stuff, so I could be missing something. But this worries me.

If I were going to give more arrogant, un-asked-for advice, I’d suggest:

  • Implement async functions as a pure coroutine-style feature, where async functions have a unique type that’s distinct from regular functions

  • Leave out await syntax, i.e., use regular function call syntax for calling async functions. None of the usual arguments for decorating every call site with await apply to Rust. The compiler should enforce that async functions can only be called by other async functions.

  • Write a library that implements Trio-style nurseries + cancel scopes on top of your new coroutines, with something vaguely like Trio’s cancellation protocol: https://trio.readthedocs.io/en/latest/reference-hazmat.html#low-level-blocking. Cancellation is reported as an error using Rust’s usual Result type. Nurseries automatically cancel their children if one panics or returns an Err, and then propagate the cancel/error.

  • Integrate the existing futures code by defining an async method on futures (like fut.wait() or similar), that translates between the two regimes (e.g. when it receives a cancellation request it drops the future).

(Kotlin did several of the things above – in particular leaving out await, and using an explicit method to translate between future-land and coroutine-land.)


I believe this is the futures-rs issue discussing the synchronous Drop question: https://github.com/rust-lang-nursery/futures-rs/issues/1278 [n.b. it’s been a while since I looked at it, and my understanding is limited]

A proposal for a Kotlin-inspired async design was discussed here (and in the linked, preceding thread): https://internals.rust-lang.org/t/explicit-future-construction-implicit-await/7344/1

(For what it’s worth, though I lack deep expertise, I also found that appealing. I also feel like it never had a realistic chance of being accepted. My – probably biased, reductive, and by now hazily remembered – sense of that situation was that (a) there is a vocal (and possibly also large) contingent of Rust users who are extremely averse to anything that smells like implicit or hidden control flow, (b) the people driving the async/await effort were not enthusiastic about the thought of shaking things up in any dramatic way (async/await was originally planned to be a part of the Rust 2018 edition), (c) they were also wary of diverging from the more popular and familiar C# model, (d) they felt that distinguishing async fn from fn() -> Future and having to explain the difference would be unjustified complication.)


Back-linked this discussion from that discussion :slight_smile: Hope it’s helpful and will not cause a flamewar

I also feel like it never had a realistic chance of being accepted. My … sense of that situation was that

This matches my perception as well. The irony is that async/await is also blocked on selecting specific concrete syntax for await :upside_down_face:

I was one of the primary proponents of the “explicit async, implicit await” system, and I still love it. But I’ve since been convinced that it doesn’t work for Rust, and I’ll share a couple of the reasons that convinced me, in hopes that maybe you can follow the reasoning.

All of that said (below), it doesn’t have much effect on cancellation. Many cases do benefit from cancelling being Drop + not polling again. I suspect we’ll end up with a Cancellable + Future or similar to support the cases that need full cancellation and so it stays “choose your abstractions”. I’ll leave that to the futures team, though.

Implicit async is great, but comes with a lot of incidental complexity. How do you actually get implicit async? Well, let’s start with "immediately await! any impl Future".

This works for simple cases, like async |foo: fn() -> impl Future| { foo() }. But what about async |foo: impl Future| { foo }? The only consistent solution I came up with is also surprising: the first usage awaits it and thus makes it move, no matter how it’s used. This also is a very implicit await point (as opposed to a fn call, which is a “normal” edge in control flow). Also, this means that just implementing a trait for a concrete type can change the control flow of a function that takes or even generates it. So that solution is basically out.

The other main part of the proposal is that async fn only be callable in an async context. But this creates more problems itself: sync code has to transition to async at some point (typically via a task pool executor), so it has to be able to work with async types to be able to do so. So let’s handwave and say Future::await bridges an impl Future back to async and calling an async fn in a sync context creates a Future. All’s well and dandy, right? Well, not really.

There’s a reason people want async fn() -> T == fn() -> impl Future<Output=T> in usage (which is broken by making async context only auto-await async fn, which we’ve already conceded). And that reason is lifetimes and startup work.

Consider async fn(&str, &str) -> T. The desugaring of that is fn(&'a str, &'b str) -> impl (Future<Output=T> + 'c) where 'a: 'c, 'b: 'c. The output automatically captures all input lifetimes because the future state machine could use them at any point. By writing out the desugared version manually (which may include an async closure), the implementor can both dispatch some work immediately/synchronously and use such dispatch to support short-lived input lifetimes that don’t need to be captured for the entire impl Future lifetime.
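A small illustration of that point, with hypothetical names: the manually desugared form can consume a short-lived borrow during the synchronous phase without capturing it in the returned future, which the plain async fn form cannot express:

use std::future::Future;

struct Config {
    retries: u32,
}

// `async fn fetch(config: &Config, url: &str) -> String` would capture
// *both* input lifetimes in its returned future. The desugared form below
// uses `config` only during the synchronous start-up phase, so its borrow
// can end as soon as the function returns:
fn fetch<'a>(config: &Config, url: &'a str) -> impl Future<Output = String> + 'a {
    // Immediate, synchronous work; we're done with `config` after this line.
    let retries = config.retries;
    async move {
        // Only `url` (lifetime 'a) and the owned `retries` live in the future.
        let _ = retries;
        format!("fetched {}", url)
    }
}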

This outright cannot be represented using async fn. If it could, you could make an argument for exposing fn() -> impl Future through a wrapper async fn { fn().await() }, but the fact is that this wrapper would both delay the synchronous work and extend the theoretically short input lifetime. You can’t even do async fn() -> impl Future because, as we’ve already discussed, this requires Future::await: the auto-awaited type has to be unnameable, so the calling code has to do a .await() anyway.

At that point, if most big library utilities will require the .await() bridging call anyway (since they will have to deal with lifetimes like this), why not make await! a first-class construct and drop all this effort to make async fn more special than a bit of syntax sugar for capturing lifetimes into an impl Future?

On top of that, Rust’s already used most of its strangeness budget. Sticking with a somewhat familiar model of async/await rather than async/async will make it less strange (even if it would help avoid false non-parallelism).

Ah, but this is a radically different proposal than the one I wrote :-).

In my proposal, remember, async is just a marker for the compiler that you want to opt-in to a coroutine / green-threads-like system – there’s no built-in connection to the Future machinery. In this approach, the only distinction between a sync function call vs. an async function call is that during an async call, the coroutine runner has the option of temporarily preempting this green-thread and letting some other green-threads run.

In Python or JS, this distinction is very important. In these languages we have what you might call “fearful concurrency” ;-). The object model is a big soup of heap-allocated mutable objects full of pointers, so preemption is terrifying! Any time you get preempted, the entire world might shift under your feet. So we have Global Interpreter Locks, and when we need concurrency we like to run it all on a single thread, and make sure all the preemption points are explicitly marked with await, so we can at least keep that scary preemption out in the open where we can see it.

But in Rust, preemption isn’t scary at all! The whole language is designed around the assumption that your code will be running on multiple threads, and that the kernel might preempt you absolutely anywhere. And in fact our coroutines will be running on multiple threads, so the kernel can preempt our code absolutely anywhere. Async function calls add a second type of preemption – now the coroutine runner can preempt us too, not just the kernel – but… from the user’s perspective, we’re going from “we can be preempted anywhere” to “we can be preempted anywhere”, so who cares? The compiler cares about the difference between an async call and a sync call, because it has to handle the stack differently. But from the user perspective, the semantics are absolutely identical in every way that matters.

And this is also a generic response to all the other concerns you raised – if my async function proposal really had some showstopper problem with lifetimes or whatever, then that would mean that regular functions in threaded Rust had the same showstopper problem. But I’m pretty sure they don’t.

So… I’ve argued that IF Rust adopted a simpler async system that was just about compiler-supported coroutines, THEN the design would work fine. But the big question is whether it’s reasonable for Rust to adopt such a system at all :slight_smile: . And that gets into what @glaebhoerl was saying:

I don’t really believe that the Rust designers are scared to invent new ideas to make concurrency work better – that does not match my observations :-). But my proposal does involve basically deprecating the entire existing futures ecosystem. It would be a soft deprecation with interoperability and a migration path, but even so, that’s an absurd thing to contemplate, especially when it’s coming from some random yahoo like me who’s never written a line of Rust.

What I can say though is: ~4 years ago, Python had built up an ecosystem around explicit Future objects, and based on this decided to commit to a C#-style async/await design, where async functions are sugar for functions-returning-Futures. And … this wasn’t exactly a mistake, because in 2015 there was no way for us to know any better. But here in 2019, AFAICT everyone involved regrets that choice. We’re spending an immense amount of effort trying to dig back out from the decision, the community may still end up essentially deprecating the futures-based system (but now with another 4 years of built-up adoption and support commitments), and if we could go back in time we would absolutely do things differently. So it’s agonizing to sit and watch Rust apparently taking the same path.

I think there’s a nasty trap here for language designers, that turns our normal instincts against us. It goes something like this:

  1. We should support lightweight/async concurrency! But how?
  2. Well, jumping straight into adding language extensions is never the right idea, so let’s try implementing it as a library first.
  3. But it turns out that to describe async programs, we need all the same control flow options as regular code and more, but we can’t use our language’s normal control structures because those are all based around the stack, so instead we invent increasingly complicated systems with like, monadic lifting and combinators and stuff, so we’re basically inventing a secondary programming language awkwardly embedded inside our regular language.
  4. Eventually it becomes clear that yeah, the pure library approach is pretty awkward. We’d better add some syntax to clean it up.
  5. But now we have tunnel vision – originally, our goal was to support lightweight/async concurrency. Now, our goal is to make the monadic system more usable. So we end up with a complex Futures layer, and then language extensions on top of that to try to rein in the complexity. And it’s terribly easy to miss that once we have language extensions, we don’t even need the Futures layer at all! It only existed in the first place to work around our language’s inability to suspend a call stack.

If you don’t have a Future type exposed to the user, then you don’t have to worry about async functions that return Futures, or about what happens when a user adds impl Future to an existing type. There is no worry about different lifetimes for the code that executes directly inside the Future-returning function vs. inside the Future itself, because there is no such distinction. (It would be like worrying about the different lifetimes for code that sets up a regular function’s stack frame, versus the code that actually runs inside the stack frame.) You don’t have to worry about what happens when dropping a Future, and how that relates to other kinds of cancellation, because there are no Futures to drop.

(Well, there is a subtlety here: obviously you will need some kind of object inside your coroutine runner that represents a partially-executed coroutine stack, that owns all the stack-allocated objects. But (a) very very few people implement their own coroutine runners so it’s fine if that part of the API has extra complexity, and maybe requires unsafe or something, (b) coroutine runners should make the guarantee that once they instantiate a coroutine, they will always drive it to completion. This makes coroutines exactly like threads: once you create a thread, the kernel guarantees that it will keep scheduling it until it completes. So everything you know about lifetimes and threads transfers over directly.)

Anyway… I hope that at least clarifies why I still think implicit await makes sense for Rust, and why I think that the idea of moving away from Futures entirely is worth thinking about, even if it seems ridiculous on first glance.


Hi everyone,

I’m the person who wrote https://github.com/rust-lang-nursery/futures-rs/issues/1278 as quoted by @glaebhoerl above.

I actually had some more thoughts on Rust’s Future model vs. alternate approaches, and eventually wanted to write an article about it. Here are my main ideas:

The most important choice for me is whether the lightweight tasks that are offered by a framework have run to completion semantics or not. With run-to-completion semantics, any given async function will run normally until a normal return point. This is equivalent to normal functions in most common programming languages. Lightweight tasks without run-to-completion semantics however can be stopped at any yield point from the outside. It’s not guaranteed that all code runs.

The distinction between those two models has some important consequences:

Support for IO completion operations

This is what is referenced in my ticket. With run-to-completion semantics, we can naturally wrap asynchronous operations that use IO completion semantics and are not synchronously cancellable. E.g. the mentioned IOCP operations require callers to keep a buffer alive until completion is signaled, and are not synchronously cancellable. This is something that isn’t naturally supported by Rust, which is somewhat sad, since we don’t have zero-cost abstractions here. Lots of asynchronous APIs (e.g. IOCP, Boost.Asio) follow this model.

There are workarounds available, e.g. only use “owned” buffers, and pass the ownership to some kind of IO manager which drives the operation to completion even if the async task (Future) had already been dropped.

I think this issue has probably not seen as much review as it should have, since most pioneers of async programming in Rust looked only at Unix environments, with readiness-based semantics (which work fairly well in Rust).

Cancellation

This might be better covered in the dedicated “Cancellation” topic. In general I think there are also two cancellation models:

  • Cooperative cancellation (which involves signalling and waiting-for-termination steps, as @njs also pointed out)
  • Hard cancellation (which means the parent task can cancel a child task without the child task doing anything)

Most environments will only support cooperative cancellation. This works fine in normal threaded environments, as well as in asynchronous environments. The hard cancellation model however only works in certain environments, e.g. in Rust’s futures model. It is important to be aware that Futures allow both models. Dropping a Future will perform a hard cancellation on the subtask. This might have fatal consequences. E.g. if a task is cancelled while half of a long message has been sent to a socket, the socket would essentially be in a garbage state afterwards.

We can however also implement cooperative cancellation in Rust. E.g. I implemented something along the lines of CancellationTokens (or Go’s Context) via an async ManualResetEvent, which allows subtasks to wait for cancellation signals from a parent task and exit at well-defined points. The code in the subtask looks something like:

use std::time::Duration;

// Assumptions for a compilable sketch: `futures` provides select!/join!/
// pin_mut!/FutureExt, `futures-intrusive` provides LocalManualResetEvent,
// and `futures-timer` stands in for the timer handle in the original.
use futures::{join, pin_mut, select, FutureExt};
use futures_intrusive::sync::LocalManualResetEvent;
use futures_timer::Delay;

async fn do_something_async() {
    // Stand-in for the subtask's normal work.
}

async fn parent() {
    let event = LocalManualResetEvent::new(false);
    let task_a = async {
        // select! requires fused, pinned futures.
        let wait_for_cancellation_future = event.wait().fuse();
        let other_future = do_something_async().fuse();
        pin_mut!(wait_for_cancellation_future, other_future);

        select! {
            _ = wait_for_cancellation_future => {
                // We got cancelled. Return gracefully.
                return;
            },
            _ = other_future => {
                // The normal operation succeeded.
            }
        }
    };
    let task_b = async {
        // Cancel the other task after 1s.
        Delay::new(Duration::from_millis(1000)).await;
        event.set();
    };
    join!(task_a, task_b);
}

So technically Rust implements a superset of cancellation compared to other approaches.
However there might be a certain bias for using the hard cancellation, since it’s so much easier to utilize.

This is also where I see the pros and cons: hard cancellation in Rust comes mostly for free and is super easy to utilize, while supporting graceful cancellation requires people to add extra code.

The downside however is that it might really be hard to reason about code where the execution of a function stops somewhere in the middle. I don’t think most people would expect that, or have thought about the corner cases enough.

The downside of cooperative cancellation is that it has to be explicitly added on each layer, especially if there is no unified framework which can help with it. This is tedious and error prone. We can e.g. see how long it took for Go to add support for Context everywhere, and where it is still lacking. Adding cancellation support in that way is also really hard. E.g. cancelling an IOCP operation might require performing a callback when cancellation is first issued. Which again requires some storage space for storing the callback, which won’t be available in a fixed size context that is also suitable for embedded programs.

Efficiency

I think the run-to-completion model unlocks various optimizations that are not possible in the always-cancellable model. E.g. I already elaborated on the IO completion operations above. Another example I encountered was in my implementation of an asynchronous ManualResetEvent in Rust, which stores the tasks to wake up as raw pointers in an intrusive wait-queue. In a run-to-completion implementation (such as the one in cppcoro) the list of waiters can just be accessed without additional synchronization, since it’s guaranteed that a task won’t disappear as long as it’s blocked on the event. In my Rust implementation a task can disappear, since the Future can be dropped while it’s waiting to get completed. That required me to add an additional Mutex.

Applicability to programming language

I think the “stopping a task from the outside” model works somewhat reasonably for Rust - mostly due to the support for destructors. With most types implementing RAII semantics, the proper cleanup actions run even on an external cancellation. E.g. any (async) mutexes that might have been locked in the task will automatically be unlocked when the task is cancelled, and other resources will be released.
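A tiny sketch of that behavior (using the futures crate’s executor and poll! macro; Guard stands in for any async mutex guard or similar resource):

use futures::executor::block_on;
use futures::future::pending;
use futures::poll;

// Stand-in for any RAII resource, e.g. an async mutex guard.
struct Guard;
impl Drop for Guard {
    fn drop(&mut self) {
        println!("resource released");
    }
}

async fn task() {
    let _guard = Guard;
    pending::<()>().await; // suspends forever; only external cancellation ends it
}

fn main() {
    block_on(async {
        let mut fut = Box::pin(task());
        // Poll once so the body runs up to the await and the guard is live.
        let _ = poll!(fut.as_mut());
        // Hard cancellation: dropping the half-finished future still runs
        // Guard::drop, printing "resource released".
        drop(fut);
    });
}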

I think this model does not work at all in a language which does not support destructors and e.g. relies on finally/defer blocks for cleanup. As far as I can see, e.g. Go, Javascript and Kotlin all implement run-to-completion semantics. I’m actually not familiar enough with Python to tell for sure.

I am not 100% sure about C++20 coroutines. From the examples that I have seen (e.g. the usage of Lewis Baker’s cppcoro library), I think it also favors run-to-completion. However there are some destroy methods for coroutines, which afaik should be callable in a suspended state.

Personal Summary

My current personal estimate is that both models work, and have some pros and cons. I think I might have preferred the run-to-completion approach for Rust too, since it seems to unlock a few other zero-cost abstractions and sometimes seems easier to understand. I also think it would have avoided requiring the weird Pin<Pointer> type, which is super hard to understand and get right if one really relies on pinned data structures.

However we can probably build good software with the current design too. The benefit of the current design is that more average users will get at least some cancellation right, since there is less work involved for it.

The question whether await is explicit or implicit is actually of secondary importance for me. If we had run-to-completion, it would probably be a mostly syntactical differentiation. Without it, an explicit await seems preferable, since it at least shows users that a method might never resume beyond that call.


A long time ago Rust used to have green threads; they were removed in 2014, about half a year before 1.0. The motivation for the runtime system removal is written down in RFC 230. In brief, the green threading system had significant overheads related to stack management (memory overhead if a full stack is allocated for each task; performance costs for stack switching, FFI problems and fighting LLVM with segmented stacks) and was difficult to embed into non-Rust applications compared to native threading. We’ve come a long way and need a new solution for async, but now we’ve got more problems.

One of them is that we (Rust users) want async to work everywhere it can work. We want it to work on tiny microcontrollers without setting aside a quarter of the total memory for a second stack. We definitely want it to work with WebAssembly and interoperate with Promises, we want it to be usable when Rust code is embedded into a non-Rust application, and so on. The big plus of the Future/async ecosystem as it is growing now is that it looks like function calls from the CPU side too. Suspending/resuming a call stack is something that’s almost trivial for an interpreted language like Python, but for a compiled language it requires its share of magic that may cause problems when you need to interoperate with other code inside a single application or in more unusual execution environments.

There are multiple crates out there that implement coroutines, but for now the futures ecosystem seems to be some orders of magnitude more popular.

I think this is basically exception safety? And is strongly reminiscent of the difficult situation in C++.

Oh, really? I thought that was more fundamental – the issue being not whether you poll() it to completion, but that it would be unsafe to move it between two of those poll()s.

Right. I guess one way to think about it is: normally in Rust there are two kinds of error-handling: panic! for contract violations and unrecoverable errors, and Result for ordinary/expected errors.

In most systems, cancellation is modeled as a normal error that async functions might return, and that the caller can then decide how to handle – log, do some special cleanup, propagate to caller, etc. In Rust, this would mean that cancellation was an Err, and then you could use the type system to track which operations are cancellable, based on their return type – if your error handling code forgets to think about cancellation then the compiler would complain about non-exhaustive match; all that stuff comes along for free.

Futures instead use a different model, where cancellation is instead a kind of quasi-panic!. It’s not exactly a panic – it doesn’t respect panic=abort, it can’t be caught with catch_unwind, it can only happen at yield points, it only unwinds to the point where the Future was dropped – but it’s similar to panics in that it’s a forced unwind that’s invisible to the type system. Or I guess you could think of it as an Err that’s invisible to the type-system, and every time you await! there’s an implicit ?, but the ? only applies to this one type? Either way, it’s unusual! [This is an example of how futures systems always seem to grow into a new language that handles common language issues like error handling and control flow in a quirkily different way from the host language.]

Now, there are definitely some benefits to the quasi-panic way of handling cancellation. The big huge problem that makes cancellation hard is that it cuts across abstraction layers – if I want to cancel this docker API call, then somehow that needs to get translated into cancelling a socket send, which is like 5 libraries down below where I’m working. This means that (a) everyone has to integrate with the cancellation system or it doesn’t work, (b) the cancellation system itself has to take responsibility for figuring out how far the cancellation should propagate; no individual layer can figure that out on its own. Most designs fail at one or both of these points. (And I think unfortunately your ManualResetEvent is doomed because of this.)

Rust’s Futures are like, one of the two designs I know of that actually solves both of these problems! It solves (a) by saying that all code in the Futures ecosystem simply must support cancellation, because it will happen whether it wants it or not. It solves (b) by using the point where a Future is dropped to “delimit” the extent of the cancellation – and there’s no way for code in between to mess this up by catching the cancellation early. This is really cool. If our only options were C#/Go-style cancellation tokens or drop-based cancellation, then you could make a pretty compelling case for drop-based cancellation.

But, those aren’t the only options :-). In my hand-wavy-fantasy-Rust, we would use cancel scopes. That article has the full details, but the basic idea is implicitly passed cancel tokens. In Rust I guess it would be something like:

let timeout = CancelScope::new(TIMEOUT);
let result = timeout.apply(async || {
    // arbitrary async code
});

This means: if TIMEOUT seconds pass, and the code inside the block is still running, then any cancellable operations start returning CancelRequestedError (waking up if necessary to do so). Like drop-based cancellation, this automatically works everywhere – no need to pass explicit Context objects around, and we integrate it with the core sleeping functions, so that every way of putting a coroutine to sleep automatically has to support being cancelled. And, like drop-based cancellation, it’s delimited – the timeout cancels all the code inside the timeout.apply block, and only that code.

Let me elaborate a bit on that last part. The way this works is, you need two different Err types: one type to represent “you’re inside an operation that’s been cancelled, you need to unwind”, which is the CancelRequestedError that can be returned by any cancellable operation. And a second type, maybe CancelSucceededError, to represent “that operation was indeed cancelled and has been unwound”. The only thing that returns this is CancelScope.apply. And CancelScope.apply’s logic is: if the block evaluates to CancelRequestedError, and this was the cancel scope that injected that error in the first place, then convert that into CancelSucceededError; otherwise, return the value unchanged.
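Here’s a sketch of just that conversion logic (CancelScope is hypothetical, per the hand-wavy-fantasy-Rust disclaimer; the two error types from the post are modeled as one enum, and the timeout/injection machinery is omitted):

#[derive(Debug, PartialEq)]
enum CancelError {
    // Plays the role of CancelRequestedError: injected into any cancellable
    // operation inside a cancelled scope — "you need to unwind outwards now".
    Requested { scope_id: u64 },
    // Plays the role of CancelSucceededError: produced only by
    // CancelScope::apply once the unwinding reaches the scope that
    // injected the request — "the operation was indeed cancelled".
    Succeeded,
}

struct CancelScope {
    id: u64,
}

impl CancelScope {
    fn apply<T>(
        &self,
        body: impl FnOnce() -> Result<T, CancelError>,
    ) -> Result<T, CancelError> {
        match body() {
            // Our own cancel request has fully unwound: convert it.
            Err(CancelError::Requested { scope_id }) if scope_id == self.id => {
                Err(CancelError::Succeeded)
            }
            // Successes, other errors, and an *outer* scope's cancellation
            // all pass through unchanged.
            other => other,
        }
    }
}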

This is the one downside I can think of compared to drop-based cancellation: you do have to teach people the difference between these two error types, and that if they get a CancelRequestedError, they should always propagate it outwards. I don’t think it’s a big deal, but it’s something.

OTOH, this approach also has several major benefits:

  • The unfortunate reality is that on common platforms like Unix and Windows, we have to be able to cope with operations that are uncancellable, or where the cancellation takes some time. Examples include delegating blocking work to threads (which is the only way to do async disk I/O on Unix), or IOCP (which is how you do everything on Windows). This approach supports these operations; drop-based cancellation can’t.

  • Cancellation is now integrated with the regular Rust error handling system, instead of an unusual “quasi-panic”.

  • You no longer need to work with explicit Future objects to cancel things. One of the core advantages of my hand-wavy-fantasy-Rust is the elimination of the explicit Future type and all the complexity that comes along with it, and this is a crucial enabler for that.

Of course for complicated problems like this it’s really hard to figure everything out without actually implementing it, so I could be missing something. But AFAICT the cancel scope approach is more powerful AND simpler to understand AND easier to implement. They’re also very cheap – I think just a few bytes per coroutine, a few bytes per CancelScope object, and in the uncancelled case a single highly-predictable branch per yield.

Right. But all those problems came from wanting to support yielding in any function, not just explicitly marked async functions. From the compiler’s perspective, turning explicitly-marked async functions into coroutines is exactly the same as turning explicitly-marked async functions into Futures – you have the same implementation techniques available, and the final code is basically the same. (Stepping a coroutine is the same thing as polling a future, etc.) The difference is in how they integrate with hand-written Futures. Sticking to pure coroutines has similar runtime and memory complexity, and much lower API complexity.

That’s all fine. Compiler-supported coroutines don’t need large stacks, they can interoperate with foreign APIs just as well as Futures, and they’re just as efficient as Futures in terms of both memory and CPU. Because they’re basically the same thing :-).

Right – if you don’t have compiler support, then coroutines are really inefficient (you need to allocate big stacks, like threads) and futures are the only viable path. If you do have compiler support, then they’re equally efficient, and coroutines are simpler and easier to work with.

That’s why async is such a trap for language designers: if you knew from the start that you were going to add syntax/compiler-support for async, then coroutines would be the obvious way to go. But generally by the time you’ve convinced yourself that you need syntax/compiler support, then you already have a large futures ecosystem, and you’ve trained yourself to think in terms of futures. It’s very difficult to step back and realize that as soon as you add async syntax, all your awesome futures APIs, which were important and tricky and crucial, suddenly become unnecessary bloat.


If we’re talking about coroutines with exactly the same limitations as Futures, the next obvious question is: what would we win by saying “There are no Futures, the return value of a coroutine call is some (magic dust) that can only be awaited or …”?

This change has some obvious drawbacks. At least it blurs the mental picture, because… for example, in

let timeout = CancelScope::new(TIMEOUT);
let result = timeout.apply(async || { /* arbitrary async code */ });

what type does the timeout.apply method receive? Obviously it has to be a coroutine, and the Rust-native way of saying “the argument must be a coroutine and the return value is its output” is

fn apply<C: Coroutine>(coroutine: C) -> C::Output;

…so now we’ve just renamed a Future into a Coroutine. I really doubt that Rust should go out of its way to make this kind of object magically unnameable, given that even basic properties like “can be sent to other threads” are represented as named traits.

So now our Coroutine trait exists, but maybe we could forbid implementing it manually at all? Technically it could be done, but that would limit the ecosystem significantly because the building blocks would all have to live in std. join and select? WebAssembly and integration with Promises? Custom leaf futures for a microcontroller, like waiting for a specific interrupt? Waiting for a POSIX signal? Now the standard library has to know about all of them. So IMO the furthest point from the current design would be an unsafe trait Coroutine that you can implement manually but whose invariants you need to uphold yourself.

Now let’s look back at what we have and think again about what we’ve just said. There’s no radical difference between compiler-supported futures and compiler-supported coroutines. What does that mean?

That means that, realistically, we don’t need to break everything. The code that exists today either builds futures with combinators, or passes them directly to an executor, or implements futures manually. The first two cases don’t actually need any changes to work with a different cancellation model. The parts that would need changes are executor implementations, combinator implementations for the combinators that can drop an incomplete future, and the manually-implemented futures (primarily the ones that actually do drop any child futures before completion), as these are the only places that influence the run-to-completion versus run-while-polled difference in semantics. The leaf futures also need to change to be taught about the “cancellation pending” flag, but they are manually-implemented too, so they are already on this list.