Concurrency fuzzing

Anne_Archibald · September 27, 2023, 6:16pm

Async/await-style concurrency makes it harder to introduce deadlocks and other concurrency bugs than thread-based concurrency does. Nevertheless, such bugs are still possible. I think it should be possible to systematically test for race conditions and deadlocks. Has such a system been implemented?

My suggestion would use hypothesis. Hypothesis is a “fuzzer”: it is designed to take a function, a description of its inputs, and some invariants and explore the space of permissible inputs in an attempt to find inputs that make the function violate its invariants. Hypothesis can also “shrink” these inputs in an attempt to find a minimal example that causes the failure.

In an async/await program, there are a finite number of places where a context switch can occur, and a finite number of possible places execution can go next. Trio chooses among the currently runnable coroutines at random to make it clear that there are no guarantees which gets run. With appropriate plumbing, hypothesis should be able to explore the possibilities in search of a problem.

In fact, hypothesis has a “stateful testing” system, in which the system under test can be set up, taken through a sequence of operations, and then checked for its invariants. As with data fuzzing, these sequences of operations can be shrunk in an attempt to produce a minimal example.

A hypothetical hypothesis concurrency fuzzer would take over the event loop. Each time control returns to the event loop, either one or more coroutines are ready to go, or all coroutines are awaiting something external. The simplest approach here would have hypothesis exploring the possible different selections of coroutine to execute next. It would then explore these sequences to see if it could trigger a deadlock or exception. A more sophisticated approach could also explore the orderings of the external events (which may well need to be mocked anyway) - the HTTP request could come back first, or the timeout could expire first.
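To make the "simplest approach" concrete, here is a minimal sketch that uses only the standard library. It is not Trio and not hypothesis: `Checkpoint`, `increment`, `run` and `explore` are all made-up names, the "event loop" is a bare list of coroutines, and the search is a plain depth-first enumeration rather than anything hypothesis would do. It only shows that the scheduler's choices form a finite tree that can be walked.

```python
class Checkpoint:
    """Hypothetical awaitable that hands control back to the toy scheduler once."""

    def __await__(self):
        yield


async def increment(counter):
    # A read-modify-write split by a checkpoint: the classic lost update.
    value = counter["n"]
    await Checkpoint()
    counter["n"] = value + 1


def run(schedule_prefix):
    """Run two incrementers, following schedule_prefix and then always picking task 0.

    Returns the decisions taken, as (choice, number_of_options) pairs,
    together with the final counter value.
    """
    counter = {"n": 0}
    runnable = [increment(counter), increment(counter)]
    trace = []
    while runnable:
        step = len(trace)
        choice = schedule_prefix[step] if step < len(schedule_prefix) else 0
        trace.append((choice, len(runnable)))
        try:
            runnable[choice].send(None)  # resume until the next checkpoint
        except StopIteration:
            runnable.pop(choice)  # this coroutine has finished
    return trace, counter["n"]


def explore():
    """Enumerate every schedule depth-first and collect those that lose an update."""
    total, failures = 0, []
    pending = [[]]  # schedule prefixes still to be tried
    while pending:
        prefix = pending.pop()
        trace, result = run(prefix)
        total += 1
        choices = [choice for choice, _ in trace]
        if result != 2:
            failures.append(choices)
        # Branch on every decision made after the prefix that had an alternative.
        for step in range(len(prefix), len(trace)):
            for alternative in range(1, trace[step][1]):
                pending.append(choices[:step] + [alternative])
    return total, failures
```

Calling `explore()` on this toy visits 6 schedules, 4 of which end with the counter at 1 instead of 2. A real program has far too many schedules to enumerate like this, which is where a guided search would have to come in.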

This implementation (in Python) would not be trivial - it would need to include a custom event loop, and possibly mock implementations of I/O and other primitives at the other end of the async stack. But I don’t think it would necessarily require much knowledge of the internals of hypothesis: a concurrent execution tree is a stateful object, and the selection of actions at each stage is the set of coroutines that are currently ready to execute, as well as fudging any primitives whose readiness can be fudged. Clearly expressing the sequence of scheduling decisions should be doable, though not simple.
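As a rough illustration of what "expressing the sequence of scheduling decisions" could look like, the sketch below treats a schedule as a plain list of integers and shrinks a failing one greedily. Everything here is an assumption made for the example: the toy tasks, the `fails` and `shrink` names, and the shrinking strategy, which is a hand-rolled stand-in and not how hypothesis shrinks. Entries are taken modulo the number of runnable tasks so that every list of integers is a valid schedule, which is the property a shrinker needs.

```python
class Checkpoint:
    """Hypothetical awaitable that hands control back to the toy scheduler once."""

    def __await__(self):
        yield


async def increment(counter):
    value = counter["n"]
    await Checkpoint()
    counter["n"] = value + 1


def fails(schedule):
    """Replay a schedule against three incrementers; True if an update was lost.

    Each entry is reduced modulo the number of runnable tasks, and missing
    entries default to 0, so any list of ints can be replayed.
    """
    counter = {"n": 0}
    runnable = [increment(counter) for _ in range(3)]
    step = 0
    while runnable:
        raw = schedule[step] if step < len(schedule) else 0
        step += 1
        task = runnable[raw % len(runnable)]
        try:
            task.send(None)
        except StopIteration:
            runnable.remove(task)
    return counter["n"] != 3


def shrink(schedule):
    """Greedily delete or zero entries for as long as the failure persists."""
    schedule = list(schedule)
    improved = True
    while improved:
        improved = False
        for i in range(len(schedule)):
            deleted = schedule[:i] + schedule[i + 1:]
            zeroed = schedule[:i] + [0] + schedule[i + 1:]
            for candidate in (deleted, zeroed):
                if candidate != schedule and fails(candidate):
                    schedule = candidate
                    improved = True
                    break
            if improved:
                break  # restart the scan from the simpler schedule
    return schedule
```

For example, `shrink([2, 5, 1, 4, 3, 0, 2])` starts from a failing schedule and returns a shorter list that still fails and from which no single entry can be deleted without losing the failure.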

I am not aware of any system implementing this sort of concurrency fuzzing. Is anyone? Is it just a terrible idea?

njs · October 2, 2023, 2:15am

I’d love to see something like this in Trio, and there’s actually a bunch of notes and links in this old thread:

github.com/python-trio/trio

Tools for finding scheduler-dependent "heisenbugs"

opened 09:48AM - 02 Jul 17 UTC

njsmith

design discussion user happiness pytest-trio relevant debugging

Race conditions suck, but they're a common kind of bug in concurrent programs. We should help people test for them if we can.

Some interesting research out of MSR:

* [CHESS](https://www.microsoft.com/en-us/research/publication/finding-and-reproducing-heisenbugs-in-concurrent-programs/)
* [GAMBIT](https://www.microsoft.com/en-us/research/publication/gambit-effective-unit-testing-for-concurrency-libraries/)

CHESS assumes you have a preemptive multitasking system and works by hooking synchronization primitives, and explores different legal paths through the resulting happens-before graph. This makes it tricky to find data races, i.e., places where two threads access the same variable without using any synchronization primitive; you really want to be able to try schedules where threads get preempted in the middle of these. In practice, it sounds like the way they do this (see page 7) is to first run some heavyweight data race detector that instruments memory reads and writes, and then use the result to add new annotations for the CHESS scheduler.

For us, we don't AFAIK have any reasonable way to do data race detection in Python (maybe you could write something with PyPy? it's non-trivial), but we do have direct access to the scheduler, so we can in principle explore all possible preemption decisions directly instead of having to infer them from synchronization points. However, this is likely to be somewhat inefficient – for CHESS it *really* needs to cut down the space of preemption points because otherwise it has to consider a preemption at every instruction which is ridiculously intractable, and we're in a better position than that, but the average program does still have lots of cases where two tasks that aren't interacting at all have some preemption points and this generates an exponential space of boring possible schedules.

They have a lot of interesting stuff in the paper (section 4) about how to structure a search, recovering from non-determinism in the code-under-test, some (not enough) detail about fair scheduling (they want to handle code with spin-locks). It looks like [this is the fair scheduling paper](https://pdfs.semanticscholar.org/adec/c5f64d69031ee5048bac09d1da1fc7811e75.pdf); it looks like the main idea is that if a thread calls `sched_yield` or equivalent then it means it can't make progress, so you should lower its priority. That's... not a good signal in the trio context. Fortunately it's not clear that anyone wants to write spin-locks anyway... (we use the equivalent a few times in our test suite, but mostly only in the earliest tests I wrote before we had much infrastructure, and certainly it's a terrible thing to do in real code).

Possibly useful citation: "ConTest [15] is a lightweight testing tool that attempts to create scheduling variance without resorting to systematic generation of all executions. In contrast, CHESS obtains greater control over thread scheduling to offer higher coverage guarantees and better reproducibility."

GAMBIT then adds a smarter best-first search algorithm on top of the basic CHESS framework. A lot of the cleverness in GAMBIT is about figuring out which states are equivalent and thus can be collapsed, which we don't really have access to. But the overall ideas are probably relevant. (It's also possible that we could use something like instrumented synchronization primitives plus the DPOR algorithm to pick "interesting" schedules to try first, and then fall back on more brute-force randomized search.)

I suspect there are two general approaches that are likely to be most useful in our context:

* Hypothesis-style randomized exploration of the space of scheduling decisions, with some clever heuristics to guide the search towards edge cases that are likely to find bugs, and maybe even hypothesis-style minimization. (The papers above have some citations to other papers showing that low-preemption traces are often sufficient to demonstrate bugs in practice.)
* Providing some sort of mechanism for users to explicitly guide the exploration to a subset of cases that they've identified as a problem (either by intuition or especially for regression tests and increasing coverage). You can imagine relatively simple things like "use these priorities until task 3 hits the yield point on line 72 and then switch to task 5", maybe?

An intermediate form between these is one of the heuristics mentioned in the GAMBIT paper, of letting the programmer name some specific functions they want to "focus" on, and using that to increase the number of preemptions that happen within those functions. (This is sort of mentioned in the CHESS paper too when talking about their state space reduction techniques.)

Possible implementation for scheduler control: call an instrument hook before scheduling a batch, with a list of runnable tasks (useful as a general notification anyway), and let them optionally return something to control the scheduling of this batch.

See also: #77
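The instrument-hook idea at the end of the issue could look roughly like the sketch below. This is a toy batch scheduler written for illustration, not Trio's run loop or its instrument API: the `before_batch` hook name, the `ReverseBatches` instrument and the `run` function are all hypothetical. The hook receives the names of the runnable tasks and may return them in a different order, or `None` to leave the batch alone.

```python
class Checkpoint:
    """Hypothetical awaitable that hands control back to the toy scheduler once."""

    def __await__(self):
        yield


async def worker(name, log):
    for step in range(2):
        log.append(f"{name}{step}")
        await Checkpoint()


class ReverseBatches:
    """Hypothetical instrument that runs every batch in reverse order."""

    def before_batch(self, runnable_names):
        return list(reversed(runnable_names))


def run(tasks, instrument=None):
    """Run a dict of name -> coroutine in batches, consulting the hook per batch."""
    runnable = dict(tasks)
    while runnable:
        names = list(runnable)
        order = instrument.before_batch(names) if instrument else None
        if order is None:
            order = names  # the hook only observed, keep the default order
        if sorted(order) != sorted(names):
            raise ValueError("hook must return a permutation of the runnable tasks")
        for name in order:
            try:
                runnable[name].send(None)
            except StopIteration:
                del runnable[name]
```

With two workers `a` and `b`, the default run logs `a0, b0, a1, b1`, while passing `ReverseBatches()` logs `b0, a0, b1, a1`. A fuzzer or a user-written regression test would plug in at the same point.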

The biggest challenge is figuring out a good heuristic for exploring “interesting” schedules, because the space of valid schedules is so huge that you can easily waste all your time on running similar schedules over and over without finding the edge cases you’re looking for.

There are also a bunch of things that would make this more useful, like detecting when the test successfully finds a deadlock:

github.com/python-trio/trio

Global deadlock detector

opened 03:25AM - 05 Jun 19 UTC

njsmith

potential API breaker

We have an issue for fancier deadlock detection, and API support to make it more useful (#182). This is about a simpler issue: detecting when the entire program has deadlocked, i.e. no tasks are runnable or will ever be runnable again. This is not nearly as fancy, but it would catch lots of real-world deadlock cases (e.g. in tests), and is potentially *wayyy* simpler.

In particular, I believe a Trio program has deadlocked if:

* There are no runnable tasks
* There are no registered timeouts
* There are no tasks waiting on the `IOManager`
* No-one is blocked in `wait_all_tasks_blocked`

(Did I miss anything?)

However, there is one practical problem: the `EntryQueue` task is always blocked in the `IOManager`, waiting for someone to call `run_sync_soon`.

Practical example of why this is important: from the Trio scheduler's point of view, `run_sync_in_worker_thread` puts a task to sleep, and then later a call to `reschedule(...)` magically appears through `run_sync_soon`. So... it's entirely normal to be in a state where the whole program looks deadlocked except for the possibility of getting a `run_sync_soon`, and the program actually isn't deadlocked. But, of course, 99% of the time, there is absolutely and definitely no `run_sync_soon` call coming. There's just no way for Trio to know that.

So I guess to make this viable, we would need some way to recognize the 99% of cases where there is no chance of a `run_sync_soon`. I think that means, we need to refactor `TrioToken` so that it uses an acquire/release pattern: you acquire the token only if you plan to call `run_sync_soon`, and then when you're done with it you explicitly close it.

This will break the other usage of `TrioToken`, which is that you can compare them with `is` to check if two calls to `trio.run` are in fact the same. Maybe this is not even that useful? If it is though then we should split it off into a separate class, so that the *only* reason to acquire the `run_sync_soon`-object is because you're going to call `run_sync_soon`.

Given that, I think we could implement this by extending the code at the top of the event loop like:

```diff
 if runner.runq:
     timeout = 0
 elif runner.deadlines:
     deadline, _ = runner.deadlines.keys()[0]
     timeout = runner.clock.deadline_to_sleep_time(deadline)
 else:
-    timeout = _MAX_TIMEOUT
+    if not runner.io_manager.has_waits() and not runner.tokens_outstanding and not runner.waiting_for_idle:
+        # Deadlock detected! Dump a stack tree and crash, maybe...?
+    else:
+        timeout = _MAX_TIMEOUT
```

This is probably super-cheap too, because we only do the extra checks when there are no runnable tasks or deadlines. No runnable tasks means we're either about to go to sleep for a while, so taking some extra time here is "free", or else that we're about to detect I/O, but if there's outstanding I/O then you should probably have a deadline set...
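The four conditions in the issue, plus the outstanding-token condition it proposes, can be written down as a single predicate. The sketch below does that against a made-up `RunnerState` snapshot; the class and its field names are assumptions for illustration and do not correspond to Trio's actual runner attributes.

```python
from dataclasses import dataclass


@dataclass
class RunnerState:
    """Hypothetical snapshot of scheduler state; not Trio's real Runner."""

    runnable_tasks: int = 0
    registered_timeouts: int = 0
    io_waits: int = 0            # tasks waiting on the I/O manager
    tokens_outstanding: int = 0  # acquired handles that may still call run_sync_soon
    waiting_for_idle: int = 0    # tasks blocked in wait_all_tasks_blocked


def is_globally_deadlocked(state):
    """True when nothing is runnable and nothing could ever wake a task again."""
    return not (
        state.runnable_tasks
        or state.registered_timeouts
        or state.io_waits
        or state.tokens_outstanding
        or state.waiting_for_idle
    )
```

Under these assumptions an all-zero state counts as deadlocked, while a single outstanding token is enough to keep the run alive, which is the case the acquire/release refactoring is meant to distinguish.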

and having better testing shims for networking:

github.com/python-trio/trio

Mock network for testing

opened 07:53AM - 24 May 17 UTC

njsmith

design discussion pytest-trio relevant

@glyph gave a [great talk at PyCon this year](https://www.youtube.com/watch?v=0B…y5yfhkiRs) that involved using a virtual (= in memory, in python) networking layer to build a virtual server to test a real client.

As far as the virtual networking part goes, we have some of this, e.g. #107 has some pretty solid in-memory implementations of the stream abstraction. But it would be neat to virtualize more of networking, e.g. so in a test I can tell my real server code to listen on some-server.example.org:12345 and tell my real client code to connect to that and they magically get an in-memory connection between them.

Fixing #159 would reduce the amount of monkeypatching needed to do this, but OTOH I guess monkeypatching the whole `trio.socket` module is probably the simplest and most direct way to do this anyway... or we could hook in at the socket layer (have it check a special flag before allocating a new socket) or at the high-level networking layer (`open_tcp_stream` checks a special flag and then returns a `FakeSocketStream` etc.). Fundamentally there's going to be some global state because no-one will put up with passing around the whole library interface as an argument everywhere, literally every async library has some kind of contextual/global state they use to solve this problem, and I can't think why it would matter a huge amount whether that's `from twisted.internet import reactor` vs `asyncio._get_running_loop()` vs `trio.socket.socket()`. So I'm leaning towards not worrying about monkeypatching. (The one practical issue I can think of is if someone is trying to use trio in two threads simultaneously, then this will cause some problems because the monkeypatch would be global, not thread-local. Maybe we can make it thread-local somehow? Or maybe we just don't care, because there really isn't any good reason to run your test suite multi-*threaded* in Python.)

Oh, or here's a horrible wonderful idea: embed the fake network into the regular network namespace, so like if you try to bind to `257.1.1.1` or `example.trio-fake-tld` then the regular functions notice and return faked results (we could even encode test parameters into the name, like `getaddrinfo("example.ipv6.trio-fake-tld")` returns fake ipv6 addresses...). Of course this would be a bit of a problem for code that wants to like, use the ipaddress library to parse `getaddrinfo` results. There are the reserved ip address ranges, but that gets dicey because they *should* give errors in normal use... In practice the solution might be to stick to mostly intercepting things at the hostname level (e.g. `open_tcp_stream` doesn't even need to resolve anything when it sees a fake hostname), though we do need to have some answer when the user asks for `getpeername`. I guess we could treat all addresses as regular *until* someone invokes this functionality with a hostname, at which point some ip addresses become magical. BUT there would also still very much need to be a magic flag to make sure all this is opt-in at the `run` loop level, to make sure it could never be accidentally or maliciously invoked in real code, to avoid potential security bugs. At which point I suppose that magic flag could just make all hostnames/addresses magical. Oh well, I said it was a horrible (wonderful) idea :-). The bit about having hostnames determine host properties might still be a good idea.

There's also a big open question about how closely this API should mimic a real network. At the very least it would have to provide the interfaces to do things like set `TCP_NODELAY` (even as a no-op), for compatibility with code made to run on a real network. But there are also more subtle issues, like, should we simulate the large-but-finite buffers that real sockets have? Our existing in-memory stream implementations have either infinite buffering or zero buffering, both of which are often useful for testing, but neither of which is a great match to how networks actually work... and of course there are also all the usual questions about what kind of API to provide for manipulating the virtual network within a test.

I suspect that this is a big enough problem and with enough domain-specific open questions that this should be a separate special-purpose library? Though I guess if we want to hook the regular functions without monkeypatching then there will need to be some core API for that.

Prerequisite: We'll need run- or task-local storage (#2) to store the state of the virtual network.
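For a feel of the "listen on a fake hostname and get an in-memory connection" idea, here is a small sketch. It deliberately uses `asyncio` from the standard library instead of Trio so that it runs without dependencies, and it patches nothing: `FakeNetwork`, `MemoryStream` and their methods are invented for this example, and the streams use unbounded queues, i.e. the "infinite buffering" variant mentioned in the issue.

```python
import asyncio


class MemoryStream:
    """One end of an in-memory, unbounded, bidirectional pipe (hypothetical)."""

    def __init__(self, incoming, outgoing):
        self._incoming = incoming
        self._outgoing = outgoing

    async def send_all(self, data):
        await self._outgoing.put(bytes(data))

    async def receive_some(self):
        return await self._incoming.get()


class FakeNetwork:
    """Hypothetical registry mapping (host, port) to in-memory listeners."""

    def __init__(self):
        self._listeners = {}

    def listen(self, host, port):
        accept_queue = asyncio.Queue()
        self._listeners[(host, port)] = accept_queue
        return accept_queue

    async def connect(self, host, port):
        try:
            accept_queue = self._listeners[(host, port)]
        except KeyError:
            raise ConnectionRefusedError(f"nothing listening on {host}:{port}") from None
        client_to_server, server_to_client = asyncio.Queue(), asyncio.Queue()
        # Hand the server its end of the pipe, then return the client's end.
        await accept_queue.put(MemoryStream(client_to_server, server_to_client))
        return MemoryStream(server_to_client, client_to_server)


async def shouting_server(accept_queue):
    stream = await accept_queue.get()
    data = await stream.receive_some()
    await stream.send_all(data.upper())


async def demo():
    net = FakeNetwork()
    accept_queue = net.listen("some-server.example.org", 12345)
    server = asyncio.create_task(shouting_server(accept_queue))
    client = await net.connect("some-server.example.org", 12345)
    await client.send_all(b"ping")
    reply = await client.receive_some()
    await server
    return reply
```

`asyncio.run(demo())` returns `b"PING"` without touching a real socket. None of the harder questions from the issue (finite buffers, `getpeername`, hooking the regular functions) are addressed here.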
