Structured Concurrency in Ruby

Hi all,

I’ve been working on Polyphony, a new concurrency library for Ruby that implements structured concurrency using Ruby fibers (basically coroutines). People on this list might find it interesting.

The docs are here: https://digital-fabric.github.io/polyphony/ and the source code is here: https://github.com/digital-fabric/polyphony. Let me know if you have any questions.

Sharon


The echo server example, with its many clients possibly ECONNRESETing, led me to wonder: How is fiber priority handled? The worst case for echoing (off the top of my head) would be if there’s a tipping point where most clients are timing out waiting to receive their echo.

You probably mean this: https://digital-fabric.github.io/polyphony/getting-started/tutorial/#conclusion

The fiber run queue is a simple FIFO queue. It contains only fibers that can be resumed; in the case of the echo server example, that means only fibers whose connections have data ready. The routine in charge of switching between fibers is here: https://github.com/digital-fabric/polyphony/blob/master/ext/polyphony/thread.c#L111

When a fiber reads from a socket and gets back an EAGAIN, it calls this routine, which pulls the next fiber from the run queue and switches to it. If the run queue is empty, the routine will run the event reactor until at least one event watcher has fired, which results in fibers being added to the run queue.
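
Roughly, the switching logic looks like this (Ruby-flavored pseudocode only; the real routine is implemented in C, and the names below are illustrative rather than Polyphony’s actual API):

# Illustrative pseudocode of the switchpoint logic, not actual Polyphony code
def switchpoint
  loop do
    fiber = run_queue.shift
    return fiber.resume if fiber # switch to the next runnable fiber

    # Run queue is empty: run the event reactor until at least one watcher
    # fires; fired watchers put their fibers back on the run queue.
    run_event_reactor
  end
end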

With this design, the code does not run inside an event reactor loop. Instead, the event reactor is run only when there’s no more work to do (there’s also a mechanism for preventing event starvation in case the run queue never empties). Any fibers that are not in the run queue have no impact on the system, except for their memory footprint (about 8-10KB per fiber). So you can have a lot of inactive clients and the server will stay responsive to the active ones, because it is I/O bound and only does actual work for sockets that are ready to be read from or written to.

In other words, in order to DoS such a server, you’ll have to have lots of clients that are very very busy :-).

That’s pretty cool!

I don’t want to be nitpicky or gatekeeper-y, but I’m not sure whether this should actually be called “structured concurrency”. The core idea of structured concurrency is that you want to be able to treat function calls as black boxes — they start, [stuff happens], then they finish and are done. And that means that a function call shouldn’t be able to start a task that keeps running after the function finishes (or at least, not without some explicit ceremony).

The design you have, where child task lifetimes are scoped to the parent task lifetime, doesn’t have this property: functions can surprise their callers by returning while their tasks are still running. (They’ll eventually have to stop before the parent task finishes, but by that time the damage may already be done.)

This kind of design might be interesting in its own right, but I think it’s importantly different from “pure” structured concurrency primitives like nurseries or CSP’s parallel composition. So maybe we should use a different word for it to avoid confusion?

I don’t really know how to reply to this, so let me start by thanking you for your work, which has been a tremendous inspiration to me. My own little adventure in this field owes a lot to Trio.

I understand your argument, but my interest in the concept of “structured concurrency” is practical rather than theoretical. For me, structured concurrency means: “a concurrent task is guaranteed to terminate before its parent task does”. That said, I understand that the idea is that the source code clearly expresses where a concurrent task begins and where it ends. At least I got the error propagation stuff right (I think). :wink:

I guess I can come up with some control-flow constructs that fulfill those requirements. But there are two things I’d like to avoid: passing a nursery object to any method that wants to spin up a fiber, and building an entire layer of abstraction that prevents Polyphony from being integrated with other legacy Ruby code. I’d prefer Polyphony be useful rather than correct.

Frankly, I’m totally fine with you saying what I’m doing is not structured concurrency. As I said, I don’t really know how to react to this. I might call it “impure structured concurrency”, “inspired by structured concurrency”, or just continue calling it “structured concurrency” and note that it doesn’t really adhere to the prescribed maxims of structured concurrency. Let me think about it. :slight_smile:

Yeah, like I said, I’m not saying this to be gatekeeper-y… it’s not like concurrency is a solved problem, and there’s room for different kinds of experiments. But I do think it would be good for everyone if “structured concurrency” doesn’t become diluted into a vague term where no one knows exactly what it means :slight_smile:

Ha, I discovered nurseries because I was frustrated about error propagation, expected them to be kind of awkward and annoying, and then when I tried using them I was shocked at how practical and convenient they turned out to be. (E.g., I was originally dreading trying to write happy eyeballs, and then the library just kind of … showed me how to do it.) The “theory” was 100% driven by trying to figure out why they worked so well in practice, not the other way around.

Have you tried writing any code with nurseries? A lot of people seem to find it awkward at first, and then something “clicks” and suddenly they can’t imagine going back.

Most of the time, you don’t pass nurseries around, you just create them at the place where you want to use them. OTOH, having the option to pass them around can be very powerful b/c it gives you a controlled way to break the rule “child tasks end before parent tasks”. E.g. in a web server, a response handler might want to spawn a task that lives beyond the response, and giving it a nursery is a way to do that.

BTW, how are you handling error propagation? With nurseries there’s an obvious place for the child error to enter the parent (= the end of the nursery block), but I don’t quite see how this works for your version. If the parent is off doing something else when a child crashes, what happens? Or what if the parent has already exited?

b/c it gives you a controlled way to break the rule “child tasks end before parent tasks”.

I think the issue here is what you mean by “task”: do you mean a function (or “method”), or do you mean a coroutine (or “fiber”)? In Polyphony, fibers always end before parent fibers, but you can start a fiber in a method and its lifetime will be bounded by its parent fiber rather than by the method itself, so it can continue running after the method call has returned. Polyphony has no concept of nurseries. There’s just a hierarchy of fibers, but you can always make sure any fibers spawned in nested method calls are done by the time the call returns by doing this:

def some_method
  spin { ... }.await
end

The fiber created in order to wrap whatever operation is being done acts as a de-facto nursery, and you can pass it around just like a nursery. It seems to me that it’s just a case of differing default behaviors, but you can achieve the same behavior in both Trio and Polyphony.
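
For example, something like this (an untested sketch; start_background_job is a hypothetical name of mine, not part of Polyphony):

require 'polyphony'

def start_background_job(nursery)
  # spin a fiber on the given "nursery" fiber rather than on Fiber.current
  nursery.spin { sleep 0.1 }
end

def some_method
  spin do
    start_background_job(Fiber.current) # pass the wrapping fiber down
    Fiber.current.await_all_children    # wait for whatever was spun on it
  end.await
end

some_method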

BTW, how are you handling error propagation?

Any uncaught exception causes the termination of the current fiber and propagates up the fiber hierarchy until a suitable exception handler is found. Child fibers are always spun on the same thread as their parent fiber, so if a child fiber is running, the parent fiber is either suspended or waiting to be resumed. The error propagation mechanism simply schedules the parent fiber with the exception, causing it to be raised in the context of the parent fiber once it is resumed.
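
Here’s a minimal (untested) sketch of that mechanism:

require 'polyphony'

spin { raise 'boom' } # child of the main fiber crashes

begin
  suspend # main fiber suspends; it is resumed with the child's exception
rescue RuntimeError => e
  puts "caught in parent: #{e.message}"
end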

As I already mentioned, the parent cannot have terminated without first terminating all child fibers. In Polyphony, one needs to explicitly wait for child fibers, otherwise any child fiber will be terminated when the parent fiber is done running. There are multiple ways to wait for child fibers: Fiber#await_all_children, Fiber.await(...), Fiber.select(...), suspend, sleep… There’s also a #supervise method that allows waiting for all child fibers and optionally restarting them if they fail with an exception.

So this is also an area where there’s a difference in default behavior. In Trio, the default behavior is that a nursery waits for all child tasks to terminate. In Polyphony, the default is that you don’t wait, but you can wait for all child fibers if you need to.
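
To illustrate the difference (untested sketch):

require 'polyphony'

# Default: the child is terminated when its parent fiber is done running,
# so 'child done' is never printed.
spin do
  spin { sleep 1; puts 'child done' }
end.await

# Explicit wait: nursery-like behavior, 'child done' is printed after ~1s.
spin do
  spin { sleep 1; puts 'child done' }
  Fiber.current.await_all_children
end.await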

Does this make sense to you?

I mean what you’re calling a “fiber”.

Here’s the example in more detail:

  • We have an HTTP server
  • For each request, the server spawns a new fiber to run the user’s request handler, which will terminate when the response is sent
  • Now a request handler wants to spawn a background task that outlives this HTTP request. How does it do that?

This is why the structured concurrency article makes such a big deal about the analogy with “go to” :slight_smile:. Back in the 70s, folks made this exact same argument: “go to” is powerful enough to implement all the structured control flow operators, so you can just use it carefully, and everything will be fine.

The problem with this argument is that “go to” breaks abstractions – if a function uses “go to” internally, then that’s effectively part of its public semantics. Which means that if some random library in your dependency stack uses “go to” in a bad way, then surprise, you’re using it too! So even if you want to be careful to structure your code correctly, the only way to do that is to audit every library you use to make sure that they didn’t make any mistakes.

And that’s why you can’t just say well, I’m smart enough to use it responsibly. The benefits of structured programming come from knowing that everyone else is using it responsibly.

The same is true for primitives like go or spin – if you allow them anywhere, then you lose most of the structured concurrency benefits.

So this means that any operation can raise any arbitrary error from any unrelated child task, right? That seems really problematic to me… like in Python I might write:

# retry loop
for duration in backoff_schedule:
    try:
        return await get("https://...")
    except NetworkError:
        await sleep(duration)

But it sounds like in Polyphony’s current design, if some other function spawned a task that’s doing some unrelated networking with an unrelated peer, then that might suddenly propagate out of my get function, and be misinterpreted as a failure doing this HTTP request, and trigger the wrong response.

I think this difference is mostly a matter of taste. There are reasons why I think Trio’s default is more convenient, but you’re in such a different part of the design space that I wouldn’t expect them to necessarily carry over. (And Trio will probably add a way to mark certain tasks as auto-cancelled when the nursery is closed.)

In your previous reply you wrote:

… it gives you a controlled way to break the rule “child tasks end before parent tasks”.

If task = fiber, then child fibers end before parent fibers, which is indeed the case for Polyphony, so where is the problem here?

Let’s discuss the example you give for an HTTP server:

Now a request handler wants to spawn a background task that outlives this HTTP request. How does it do that?

There are multiple ways to achieve this, here’s one off the top of my head:

require 'polyphony'

$background_supervisor = spin { supervise } # wait for child fibers

def handle_request(req)
  ...
  $background_supervisor.spin { ... }
  ...
end

So a background_supervisor is a fiber that acts as a nursery. You can either keep it around as a global variable (as in the above example), or you can pass it as an argument, or use some other DI mechanism.

The same is true for primitives like go or spin – if you allow them anywhere, then you lose most of the structured concurrency benefits.

Using golang’s go is not like using Polyphony’s spin. Goroutines have no hierarchy, while fibers do. Goroutines have no error propagation mechanisms (that I know of, correct me if I’m wrong), while Polyphony fibers do. And if you call spin in the wrong place, your program might fail.

It is true that in Polyphony you can have a situation where some nested method call can spin up a fiber without the caller knowing it. But my reasoning is that the alternative (having to explicitly pass nursery objects around) would prevent achieving one important design goal that I set for myself, which is that it should be possible to integrate Polyphony with the vast majority of the existing Ruby ecosystem. As I wrote previously, I prefer Polyphony be useful rather than “pure” or “correct”.

So this means that any operation can raise any arbitrary error from any unrelated child task, right?

Yes. How else would you want it to be structured?

But it sounds like in Polyphony’s current design, if some other function spawned a task that’s doing some unrelated networking with an unrelated peer, then that might suddenly propagate out of my get function, and be misinterpreted as a failure doing this HTTP request, and trigger the wrong response.

That is correct, unless the spawned task that’s doing unrelated stuff is spawned in the context of another controller fiber, as in the code example above. I mean, if you follow that line of reasoning, maybe we should all just implement checked exceptions à la Java :smiley:

Thank you for this debate! Having to explain how Polyphony works and how it differs from Trio makes things clearer for me as well, and also shows me how Polyphony might be improved, especially regarding documentation and “developer education”.

BTW, you talked before about “happy eyeballs”; here’s a version of it in Polyphony that demonstrates some of the issues we’ve been discussing. I’ll be happy to explain how it works:

require 'polyphony'

def try_connect(target, supervisor)
  puts "trying #{target[2]}"
  sleep rand * 0.2
  socket = TCPSocket.new(target[2], target[1])
  puts "connected to #{target[2]}"
  # resume the supervisor fiber with the winning connection
  supervisor.schedule [target[2], socket]
rescue IOError, SystemCallError
  # ignore connection errors; other attempts may still succeed
end

def happy_eyeballs(hostname, port, max_wait_time: 0.010)
  targets = Socket.getaddrinfo(hostname, port, :INET, :STREAM)
  t0 = Time.now
  supervisor = Fiber.current
  # spawn staggered connection attempts, one every max_wait_time seconds
  spin do
    targets.each do |t|
      spin { try_connect(t, supervisor) }
      sleep(max_wait_time)
    end
    suspend
  end
  # wait up to 5 seconds for the first attempt to report success
  target, socket = move_on_after(5) { suspend }
  supervisor.shutdown_all_children
  if target
    puts format('success: %s (%.3fs)', target, Time.now - t0)
  else
    puts 'timed out'
  end
end

happy_eyeballs('debian.org', 'https')

idk, to me there’s a big difference between checked exceptions and making it impossible for a function to interpret exceptions reliably without knowing what else its caller is doing. What would your documentation suggest users do to fix that retry loop?

I think there’s a serious bug: if the caller has already spawned any other fibers, then happy_eyeballs will kill those. Obviously you can fix this now that you know about it, but to me the whole point of structured concurrency is that people can get these kinds of algorithms right without having to notice these kinds of tricky edge cases.
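
Concretely, something like this (untested sketch, reusing the happy_eyeballs definition above) would lose the caller’s unrelated fiber:

unrelated = spin { sleep 60; puts 'unrelated work done' } # the caller's own child fiber
happy_eyeballs('debian.org', 'https')
# supervisor is Fiber.current, i.e. the caller, so shutdown_all_children also
# terminates `unrelated`, even though it has nothing to do with happy_eyeballs.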

I guess this is the point where I’m probably missing something. In Trio, our libraries generally don’t pass nurseries around at all. E.g. our HTTP clients look just like the existing synchronous ones (just with async/await annotations added, but that doesn’t apply to you). And the existing Ruby ecosystem isn’t using nurseries or spin, so there’s no backwards compatibility issue there.

Can you give any examples of how existing libraries would break or become unusable if you used nurseries?