Zio (Scala library)

This forum is exciting to see! I have done some work on a Swift library that is loosely inspired by Scala’s Zio. I came across the Notes on Structured Concurrency at the time and have been very interested in the topic since then.

I’m looking forward to learning from the discussions on this forum. I am especially interested in developing a more firm understanding of the tradeoffs involved in design decisions such as those discussed in this thread.

Since Zio hasn’t been mentioned here yet, I’ll provide a brief overview of its design with respect to structured concurrency. It uses a concurrency model based on interruptible fibers. Scoping is supported via a supervision operator that guarantees that dangling fibers are interrupted when the supervised scope exits. In that respect it is quite similar to structured concurrency; however, it is implemented as part of a pure functional IO monad. Those who are interested can find an overview of Zio’s design here: https://m.youtube.com/watch?v=wi_vLNULh9Y.

Regarding open question #3, Zio’s forked fibers run as long as there is demand for the value they produce. Errors in a scoped fiber have no direct effect on other peer fibers in the same scope. However, handling of the error could result in interruption of other fibers or elimination of demand for their values (resulting in interruption).

On top of this foundation, Zio provides operators with more structured behavior. The par operator runs several fibers concurrently, providing a tuple of results to the continuation. When one of the fibers fails, the others are immediately cancelled and the error is propagated. The race operators are similar; however, instead of waiting for all fibers to complete, they return the result of the first fiber to complete and interrupt the others. As with par, if a fiber fails, all other fibers are interrupted and the error is propagated.
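To make the shape of these operators concrete, here is a rough sketch of how par and race might be used. The names and signatures are illustrative, following the description above rather than the exact ZIO API, and fetchProfile, fetchOrders, primary, mirror and the result types are all made up:

// Illustrative only; operator names follow the description above.
// par: run both computations in concurrent fibers and continue with a
// tuple of results; if either fails, the other is interrupted and the
// error propagates.
val both: IO[Error, (Profile, List[Order])] =
  fetchProfile(userId).par(fetchOrders(userId))

// race: continue with whichever fiber completes first and interrupt the
// loser; as with par, a failure interrupts the other fiber and propagates.
val fastest: IO[Error, Response] =
  primary.race(mirror)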

Regarding open question #4, Zio’s timeout operator is built on its race operator. A fiber races with a sleeping fiber that completes when the timeout expires. This approach allows timeouts to be specified as granularly as one wishes at any layer in the computation.
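As a sketch of that construction (again with illustrative names, not exact ZIO signatures), a timeout is just a race against a fiber that sleeps for the duration and then produces a “timed out” value:

// Hypothetical helper built the way described above: race the work
// against a sleeping fiber; whichever completes first wins and the
// loser is interrupted.
def withTimeout[E, A](work: IO[E, A], d: Duration): IO[E, Option[A]] =
  work.map(a => Some(a): Option[A])
    .race(IO.sleep(d).map(_ => None: Option[A]))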

A scheduling feature was recently added to Zio. This adds a family of primitives and combinators that together allow very sophisticated scheduling that can handle use cases such as repeated tasks, retry with jittered exponential backoff, etc. An overview of the scheduling operators appears in the second half of this talk: https://m.youtube.com/watch?v=onQSHiafAY8.
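For a taste of what that looks like (the combinator names here are my best guess at the flavor of the API shown in the talk, not checked against the library, and flakyRequest is made up), a retry policy can be built by composing small schedules:

// Hypothetical: retry a flaky request up to 5 times, with exponential
// backoff starting at 10ms and random jitter applied to each delay.
val policy = Schedule.exponential(10.milliseconds).jittered && Schedule.recurs(5)

val resilient: IO[Error, Response] = flakyRequest.retry(policy)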


(This was split off from Structured Concurrency Kickoff, so the “question 3”, “question 4” stuff is referring to @sustrik’s list in the first post there.)

Thanks for the link! I was vaguely aware of Zio but had not found the right reference to figure out how the big ideas fit together. Impressions from watching the video:

Well… there’s a tremendous amount of machinery used for static typing and the pure functional monad part (you don’t write a program, you write a program that returns a program, etc.). Of course this way of approaching things has benefits; I’m just not sure I followed all of it :-).

But if you squint past that, I think the core concurrency primitives are:

  • A very traditional go-like operator (they call it fork) that returns a “fiber handle”, and then you can use this handle object to join or interrupt the “fiber”. Conceptually I think the only way this really differs from a JS Promise is that it supports cancellation, and it’s pretty much isomorphic to a Twisted Deferred or asyncio Future (which do support cancellation), or even to a POSIX thread if we pretend POSIX thread cancellation is usable.

    • But they do go beyond these systems by thinking through cancellation in a more comprehensive way – operations can register handlers for how they handle a cancellation, you can shield specific bits of code from cancellation, etc. I saw a lot of parallels between the practical details of their cancellation implementation and the practical details of Trio’s cancellation implementation, which is reassuring that we’re both on the right track :-). Their public API is a bit inside-out from how Trio does it, but the end effect is very similar. (They tie cancellation to fibers, and then make putting an arbitrary computation into a fiber something that’s cheap and can be done retroactively; Trio makes cancellation of arbitrary computations a first-class concept, and then makes the fiber mechanism aware of it.)
  • And there are a bunch of nice higher-level tools built on top of the bare “fiber” concept, that carefully manage the fiber handles and interrupt things appropriately. For example race is implemented using the public fiber APIs, and manually keeps track of which fibers are running, handles errors from them, cancels the loser, etc. Having done all that, they end up with something whose semantics looked pretty safe and sensible to me. The talk makes a big deal about most other implementations of race not getting these details right, which sounds fair.

  • And then you have supervise, which lets you manually demarcate a chunk of code, and when you get to the end of that code, if there were any fibers that it created but then leaked, those fibers are automatically cancelled.
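Putting those three pieces together, the basic usage pattern looks roughly like this (names are illustrative rather than the exact ZIO API; background and mainWork are made-up computations):

val program =
  (for {
    fiber  <- background.fork      // go-like: start a fiber, get back a handle
    result <- mainWork             // do something else in the meantime
    _      <- fiber.interrupt      // or fiber.join to wait for its result
  } yield result).supervised       // any fibers forked inside and never joined
                                   // or interrupted are cancelled on exit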

So… this is a really interesting example to me. They seem to be worrying about the right things, and I get the feeling that you can indeed use this system to effectively solve lots of real problems. But… the one thing they’re missing is exactly the stuff from the “go statement considered harmful” article. And if they had that, I think it would make their design substantially simpler and better behaved at the same time.

There are two key differences between this approach and how Trio does it:

  • We reify an object to represent each call to supervise
  • Then we make this reified object a mandatory argument to fork

This seems like a pretty small change, but it does a lot!
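In terms of signatures, the change is roughly the following (both halves are schematic sketches, not real ZIO or Trio code):

// Schematic types only.
trait IO[+E, +A]
trait Fiber[+E, +A]
trait Supervisor

// As it stands: fork is available anywhere, so a fiber can be created
// with no record of which scope is responsible for it.
def fork[E, A](io: IO[E, A]): IO[Nothing, Fiber[E, A]] = ???

// The variant described above: supervise hands you a reified Supervisor,
// and fork demands one, so every fiber is tied to a scope that will
// interrupt it when the scope exits.
def supervise[E, A](body: Supervisor => IO[E, A]): IO[E, A] = ???
def fork[E, A](supervisor: Supervisor)(io: IO[E, A]): IO[Nothing, Fiber[E, A]] = ???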

It eliminates fiber leaks by making them inexpressible. Users can’t forget to call supervise.

It’s great that they were very clever and careful about implementing higher-level operations like race, but with my version, you don’t have to be clever and careful, because the type system forces you to handle the other fibers.

Since you can’t leak fibers, you get better local reasoning about code – if I invoke some kind of subroutine, and I don’t pass in a supervise object, then the type system guarantees that any fibers that the call spawns internally cannot outlive the call. I’ve previously called this property “respecting causality”, and it makes many common errors inexpressible. With their current design, you could get this by wrapping every subroutine invocation in a supervise, but who wants to do that?

And in terms of design complexity, you get all these benefits essentially “for free” – it’s just a very simple tweak to stuff they needed anyway, no new concepts, no rocket science. It’s so simple that in this respect, Trio actually manages to get stronger guarantees out of Python’s type system than ZIO gets out of Scala’s.

When people start learning Trio, this is a very common question: “how do I get a reference to the enclosing nursery?” They expect there to be some equivalent to ZIO’s fork that lets them put a task into the enclosing supervise, wherever it may be. So this is why we don’t have that operation :-)

On the other hand, an interesting thing about the ZIO style is that it’s probably easier to retrofit into existing libraries/ecosystems. (I think Kotlin works this way too?) I wonder if it would have applications in, for example, Golang.


Two more minor thoughts:

When their supervise block exits, they cancel all the nested fibers, similar to libdill. When Trio exits a nursery block, it stops and waits for all the nested tasks to complete. One trade-off is that this forces ZIO/libdill to have fiber handles, and a join operation to wait for a fiber to finish. Trio is able to get away with skipping both of those concepts. I can see how ZIO’s approach makes sense for them, since the emphasis on pure-functional style means they really want to join fibers to find out what they evaluated to, and the emphasis on high-level combinators like race and par hides the tedium of joining from most users. Trio makes side-effects a more first-class citizen, so we use those as our primitive for getting results out of tasks, instead of join. Oh huh, and it looks like more recently ZIO added the Trio style as an option, too.

Error handling: ZIO’s way of handling and propagating errors is sufficiently foreign to me that I don’t really know how to compare it to Trio! I worry that their tracebacks might not be very useful? And I noticed that they implement the “crash handler” pattern, where you register some callback to handle errors from unjoined fibers. (Or all crashed fibers?) This pattern always makes me wonder how the crash handler preserves enough context to know how to handle an error. But I am curious how it all fits together.

You’re welcome! Thanks for taking a look and getting past the pure FP part. I’m happy to discuss FP if you’re interested, but I definitely didn’t come here to evangelize FP. ;-)

FWIW, at a high level, the benefits gained from pure FP run in the same direction as the benefits of structured concurrency: they both put the caller in control and therefore make code easier to reason about.

The semantics are somewhat different in a world of pure FP with reified IO computations but aside from that I can see the similarities.

Thank you for highlighting the differences so clearly! For some reason it didn’t click for me until I read this. The potential for fiber leaks has always been something I didn’t like about the Zio design. It’s really interesting to see more clearly now how the Zio design could be modified to avoid that possibility. Making it visible in the type signature that running fibers can escape a scope would definitely make things easier to reason about.

The plan for the Swift library I have worked on has been to see if I can get away with keeping the fiber mechanism entirely private to the implementation by providing the right set of higher level primitives along the lines of par, race, etc. It’s not clear to me whether this is a good idea or not (and if so, what the minimum set of primitives would need to be).

So I am curious to learn more about use cases for fibers that need to outlive the scope that created them (but not the nursery that they were started with). These are the use cases that would motivate exposing lower level primitives (fibers, nurseries, etc) outside the library. Do you have any thoughts in this area?

This is very nice and makes the idea of exposing low level primitives like fork and join much more palatable.

I don’t know all of the details about Zio’s error handling, but one crucial bit is that there are two error channels in Zio. The main one is the E in the IO value. The other is the uncaught exception handler you mentioned which is necessary because of the Java exception model. So the design would be a lot cleaner if it weren’t for running on the JVM and having to deal with Java exceptions.
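A tiny sketch of what the typed channel looks like from the caller’s side (User, UserNotFound and findUser are made-up names; only the IO[E, A] shape comes from Zio):

case class User(id: Long, name: String)
case class UserNotFound(id: Long)

def findUser(id: Long): IO[UserNotFound, User] = ???

// Expected failures travel in the E parameter, so the caller has to deal
// with them (or keep them visible in the type):
val handled: IO[Nothing, Option[User]] =
  findUser(42L).fold(_ => None, Some(_))

// A raw Java exception thrown inside a fiber bypasses E entirely and
// surfaces through the uncaught error handler mentioned above.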

So my conjecture is that {nursery} is the right minimum set of primitives :-). I don’t know if this is correct! In some sense Trio is an experiment to test this conjecture. So far it seems to be working out pretty well though?

I feel like to some extent, you need the nursery to define what “the scope that created them” even means? For example, take the accept loop example we keep writing. I guess a pure functional version using ZIO notation might be something like:

def accept_loop(supervisor, listen_socket, handler) =
  for {
    socket <- listen_socket.accept()
    _      <- handler(socket).fork(supervisor)
    _      <- accept_loop(supervisor, listen_socket, handler)
  } yield ()

def tcp_server(port, handler) =
  for {
    ls <- open_listen_socket(port)
    _  <- with_supervisor(supervisor => accept_loop(supervisor, ls, handler))
  } yield ()

It seems like this kind of relies on being able to pass the supervisor around?

I’m not sure if this is the kind of thing you’re looking for or not.