Use case
Imagine a web server. It’s handling many HTTP connections in parallel. The connections may have some kind of timeout: If there’s nothing coming from the cleint for a minute, the server shuts the connection down to prevent resource wastage and DoS attacks.
When the server itself is being shut down it stops accepting new connections and gives existing connections 10 second to cleanly shut down. After 10 seconds it forcefully cancels any remaining connectons and exits.
The problem
Let’s start with a single connection. To prevent DoS, it should always specify a deadline when doing a blocking operation:
...
data = conn.read(deadline = now() + 60);
if(data == TIMEOUT) handle_connection_deadline();
...
So far so good. Now let’s imagine that server is shutting down. It sends a message to all the connections saying “I want to terminate in 10 seconds, please try to do a clean shut down!”
That makes life harder for the connection. Now it has to deal with two different deadlines:
real_deadline = min(connection_deadline, server_deadline);
data = conn.read(deadline = real_deadline);
if(data == TIMEOUT) {
if(now() > server_deadline) handle_server_deadline();
else handle_connection_deadline();
}
What sucks about the above pattern is that it has to be done for every single blocking operation in the connection thread.
But it gets worse.
What if instead of two levels (server and connection) there are three levels? A launches B, which in term launches C. If B is already in process of gracefully terminating C and gets a graceful termination request from A – with a different deadline – what is it going to do? Will it pass the new deadline to C? Will C listen for new graceful termination requests even though it’s already in process of gracefully terminating? Will it do min() three values instead of two? And what if there are four levels? What if there are more?
In short, gracefull termination doesn’t compose and it may even break encapsulation: Each level would have to know about all the levels above it.
Modest proposal
I’ve been banging my head against this problem for a year or so until I’ve came with something that actually works.
The problem is that if you try to add more syntactic machinery to deal with the problem (soft-cancelation vs. hard-cancelation or somesuch) the semantics tend to gets complex and the simplicity and elegance of the entire structured concurrency model just drowns in the complexity.
There’s one crucial observation to be made before we can get out of the mess. Namely, to quote Leo Tolstoy, all hard cancelations are alike; each soft cancelation is happens in its own way.
Compare a thread that writes to disk and a thread that handles a WebSocket connection. If asked to hard-cancel both would simply exit. If asked to shut down gracefully though, the former will try to flush the buffers to the disk. The latter will try to do terminal handshake with the peer.
To put it differently, hard cancelation can be fully handled by the language runtime. Soft cancelation, on the other hand, always requires a some application-specific manual work.
And once we accept the fact that soft-cancelation is mostly an application issue, the solution is not hard to see.
The child thread will get a graceful termination request from the parent, but the request won’t contain any deadline. It would be a simple signal saying “please stop doing the normal work and switch to the termination phase”. The child will then, for example, stop writing logs to a file and it will try to flush the buffers instead.
Note how simple the control flow of the child is and how it requires no complex deadline gymnastics.
Now, let’s have a look at the parent thread.
It sends the graceful termination request to the child (or children), then it waits for either a.) child terminating b.) it’s own deadline expiring. In the former case the graceful shutdown was successful and we can move on. In the latter case the graceful shutdown wasn’t successful and the child thread is still running. We have to hard-cancel it. The code would look like this:
channel_to_child = channel()
child = scope.launch(child_body(channel_to_child));
...
channel_to_child.send(termination_request);
res = scope.wait(deadline = 10s);
if(res == TIMEOUT || res == CANCELED) {
scope.cancel();
}
In the code above I am being verbose, to make it clear what’s happening. In reality, the last 4 lines can be combined in a single function, e.g. “scope.cancel(deadline = 10s)”.
Please note how the constuct composes. Each thread cares only about its own deadlines. All it has to know about the outer world is that it can get a graceful termination request from outside. Whether there are 5 or 10 nested levels of cancelation scopes above it, it doesn’t care. Also note how hard cancelation (res == CANCELED) cancels any ongoing soft cancelation attempts.
I’d love to hear what other people have to say on the topic!