Project Loom – lightweight concurrency for the JVM

#1

I’m working on Project Loom, a project in OpenJDK to add support for fibers (essentially user mode threads) and delimited continuations to the Java platform. Project Loom is in prototyping and exploration phase right now.

In Project Loom we are exploring many of the concepts that you are discussing here, Nathaniel and Martin’s blogs are excellent references. We hope to have something written up on the Project Loom wiki page soon.

The current prototype has the notion of a scope in which fibers are scheduled. There is basic support for cancellation, deadlines, and reaping of terminated fibers (to collect the results of tasks and exceptions). Terminology is a problem and “scope” is already overused.

Here’s the equivalent of the example of #1 where you have a server accepting connections:

ServerSocket listener = ...
try (var scope = FiberScope.cancellable()) {
    while (...) {
        Socket s = listener.accept();
        scope.schedule(() -> handle(s));
    }
}

scope.schedule(task) schedules a fiber to execute a task, in this case it will handle a socket connection. The thread (or fiber) executing in the scope cannot exit until all fibers scheduled in the scope have terminated.

For #2, scopes are just objects so they live in the heap. You can pass them as parameters if you want but it may be more prudent to guard the reference (there a several discussion points there). A thread/fiber exited a scope causing it to be closed, no further fibers can be scheduled in the scope.

For #4, timeouts, a thread or fiber can enter a scope with a deadline (an instant in time) or a timeout (a maximum duration). If the deadline is reached or the timeout expires then all fibers scheduled in the scope are cancelled. Cancellation is a huge topic and there are significant challenges to retrofitting existing APIs.

The following is a more complete example that might be useful for the discussion here. It’s a method that returns the result of the first “successful” task. One a task completes successfully then all the outstanding tasks are cancelled. If no task succeeds it returns the exception from the first task to fail. The method enters a scope with a deadline so that all fibers are cancelled if the deadline expires before a result is returned:

    <V> V anySuccessful(Callable<? extends V>[] tasks, Instant deadline) throws Throwable {
        try (var scope = FiberScope.withDeadline(deadline)) {
            var queue = new FiberScope.TerminationQueue<>();
            Arrays.stream(tasks).forEach(task -> scope.schedule(task, queue));
            Throwable firstException = null;
            int remaining = tasks.length;
            while (remaining > 0) {
                try {
                    V result = queue.take().join();
                    // cancel any fibers that are still running
                    scope.fibers().forEach(Fiber::cancel);
                    return result;
                } catch (CompletionException e) {
                    if (firstException == null) {
                        firstException = e.getCause();
                    }
                }
                remaining--;
            }
            throw firstException;
        }
    }

This example uses a termination queue to collect fibers as they terminate. This is a different approach to automatically propagating exceptions.

There are a few other concepts such as nesting and a “non-cancellable” scope to shield fibers from cancellable during recovery, cleanup or critical operations. This may be relevant to some of the discussion here.

Structured Concurrency Kickoff
The terminology bikeshed thread
Structured Concurrency Kickoff
#2

Hey Alan, welcome! :wave:

Sounds like you’ve been thinking about this a lot and it’s a very complex project, so I’m looking forward to the wiki page with more details! In the mean time, I have a few questions to hopefully spur some discussion :slight_smile:

Have you considered separating the concept of cancellation from fibers? I found that this really changed how I approach things. And it makes sense… in general, any operation could and probably should support cancellation, even in sequential code – every I/O operation should have a timeout! The only way that concurrency changes this is that now you have to be prepared for some other fiber/thread to change your timeout. Having a generic, composable way to impose timeouts on arbitrary operations is really powerful.

Have you considered binding scopes more tightly to the creating frame? I’m imagining something like:

ServerSocket listener = ...
FiberScope.run(scope -> {
    while (...) {
        Socket s = listener.accept();
        scope.schedule(() -> handle(s));
    }
});

The potential advantage is that you effectively force the try-with-resources form, plus you get more control over the execution environment for the code inside the main block. For example, if a background fiber crashes with an exception, you have the option of automatically cancelling the loop and then propagating the exception from FiberScope.use.

We used to have something like this in Trio, but gave it up because it was causing lots of complications and not very useful. One problem we failed to solve was avoiding leaks for users that didn’t care about reaping – I see you avoid that by only creating zombies if a queue is explicitly created and passed to schedule? That’s clever. The other issue was that we wanted to propagate exceptions by default, so they couldn’t be accidentally dropped on the floor, but then there was always a question about whether we knew anyone was reading from the queue or not, so we could decide whether someone was taking responsibility for the exception or whether it was better to auto-propagate it… What do you think about auto-propagation? Is your plan to always require manual propagation?

Putting these together, we can write your example using code like:

<V> V anySuccessful(Callable<? extends V>[] tasks) throws Throwable {
    var queue = new PickYourFavoriteFiberSafeContainer<V>();
    FiberScope.run(scope ->
        Arrays.stream(tasks).forEach(task -> {
            scope.schedule(() -> {
                queue.append(task());
                scope.cancel();
            }
        })
    });
    return queue.pop();
}

It re-raises any exceptions (we have a MultiError construct to raise all of them, instead of just the first), supports deadlines, etc.

Yeah, this sounds like a huge issue. I’d love to hear more about your plans here.

Trio also has a concept of shielding you might find interesting. Interestingly, we only use it in very rare cases. For most cases, we have an interesting convention: for any closeable object where the close method is async (and thus cancellable), the rule is that it must succeed even if cancelled, though possibly in a graceless way (e.g., maybe it won’t send the protocol-specific message, but it’ll still close the socket). Reference.

I’d also love to hear more about how you plan to make cancellation usable. (I’m assuming that as Java designers, a major goal is to avoid a repeat of the Thread.stop disaster.) Is shielding the only way for users to manage/track cancellation points, or will there be other mechanisms as well? (Checked exceptions?)

#3

Have you considered separating the concept of cancellation from fibers?

Not at this time, mostly because cancellation is a deep topic with a lot of consequences that will require building up confidence. In addition, Threads already have an interrupt status to support the concept of thread interruption (similar concept but also different, the main one being that a Fiber’s cancel status cannot be reset).

Every I/O operation should have a timeout

Decomposing deadlines or timeout into timeouts for specific I/O operations is really hard to get right (your blog on this topic is really good). The approach in the prototype has a withDeadline method to enter a scope with a deadline. If the deadline expires before the scope has exited when all fibers scheduled in the scope are cancelled. If fibers are parked in blocking I/O operations then they will throw an exception and terminate. It’s possible that code catches the exception and doesn’t terminate quickly but there is nothing we can do about that. There are slew of discussion poinst around this of course and we will hopefully have more on this on the Loom wiki soon.

Have you considered binding scopes more tightly to the creating frame?

We will likely be iterating on the API for some time. But yes, initial prototypes had methods that took something like Consumer<FiberScope> and overloads to deal with tasks that return a value and of course checked exceptions. The approach leads to a lot capturing of effectively final local variables in the enclosing scope.

What do you think about auto-propagation? Is your plan to always require manual propagation?

We have experimented with propagation of exceptions but it can be argued that a bit “too magic”, esp. if the exception (and stack trace) is far removed from the code that scheduled the fiber. Our current approach requires an explicit join to collect results or exceptions.

Trio also has a concept of shielding you might find interesting. Interestingly, we only use it in very rare cases. … Is shielding the only way for users to manage/track cancellation points, or will there be other mechanisms as well?

Yes, I agree its usage will be rare but it is important for some cleanup operations. Asynchronous close is one example where closing a resource that is in use by another thread or fiber requires coordination that needs to be immune to cancellation to avoid leaving the resource in in consistent state.

Cancellation in the current prototype is cooperative, a fiber can check for cancellation at any time with Fiber.cancelled(). I hope to be able to point to something on Loom wiki soon that explains this in more detail.

Structured concurrency in Rust
#4

We have an initial wiki page on this topic. We’ll update this as the exploration and prototype evolves.

#5

At this time, a FiberScope object need to be guarded to avoid leaking to a callee that closes the scope. Another concern is the fibers method to obtain a stream of the fibers executing in the scope. The main use-cases for the fibers method is cancellation and debugging, both of these need to be re-examined.

This is a thing I personally feel quite strongly about: The scope should not be leaked to the child fiber. It allows child fiber to cancel its parent and its siblings which breaks encapsulation. For example, if you launch a well-debugged fiber function in a different context you can’t just expect it to work. You have to check out whether it cancels the scope and if so, you have to think about interactions with other fibers that happen to run in the same scope.

I’ve written about it a bit here: http://250bpm.com/blog:139

To put it shortly, I think that scope should be canceled automatically when the the parent fiber reaches end of the block. No waiting for the children. They get mercilessly cancelled. That way they don’t to have access to the scope – it’s not their task to cancel the scope in the first place.

The reasoning is that not wanting to wait for the children is more common scenario (e.g. when the accept loop exits we want child fibers, handling individual connections, to be canceled immediately).

The opposite scenario is when the children are performing useful computation and the parent fiber wants to get all the results before moving on. But note that in this case the results have to be sent to the parent fiber via a queue and thus the parent will be blocked on reading from the queue before exiting and cancelling the children.

As a convenience when scheduling fibers in a FiberScope, the API defines schedule methods that specify a termination queue that the fiber is queued to when it terminates.

Now, I don’t feel about this as strongly as about the previous point, but this seems to point in the direction of “typed fibers”. The idea there is that a particular scope can launch only one type of function.

    try (var scope = FiberScope<foo>.cancellable()) {
        scope.schedule();
        scope.schedule();
    }

I see two advantages to the approach. First, some discipline is forced upon the user. The scopes are no longer bags of random functionality. They have clear business logic. Second, the language runtime knows the signature of the function and thus can, for example, automatically provide a typed queue to store the results in.

var fiber1 = scope1.schedule(() -> foo());

Is it a good idea to return a fiber handle? It looks like the only thing it does is that it gives the user a way to mess up with the correct functioning of the scope.

#6

The scope should not be leaked to the child fiber. It allows child fiber to cancel its parent and its siblings which breaks encapsulation

As things stand, a FiberScope is owned by the strand that entered it (think strand or stack confinement) so it can’t be closed by another strand. If a child somehow gets a reference to the parent Fiber then it could set the fiber’s cancel status.

To put it shortly, I think that scope should be canceled automatically when the the parent fiber reaches end of the block. No waiting for the children. They get mercilessly cancelled.

This is a good discussion topic. Some prototyping and exploration but nothing decided on this.

Is it a good idea to return a fiber handle? It looks like the only thing it does is that it gives the user a way to mess up with the correct functioning of the scope.

handle = Fiber object. Yes, this is important as Fiber defines join to wait for the fiber to terminate and gets its result, it’s also important for cancellation.