Why are Python sockets so slow and what can be done?

Hello,

I want to share my findings, and maybe we can find a solution together.

Performance: a very important subject to me and, I believe, to many.

I’ve been testing Curio, Trio, and uvloop on CPython 3.7 and PyPy 6.0. Tests were performed on cloud Fedora 29 x64 servers (no ads, but if anyone is interested, it was Vultr with 6 CPUs and 16 GB). I used curl for validation and wrk for load testing. I ran wrk from two distinct machines to make sure the client side was not the bottleneck. Absolute numbers are not that important, because configurations vary, but I’ll give them anyway.

First, I wrote similar HTTP server applications for all frameworks, using httptools for parsing. The logic was simple: read an entire GET request (httptools tells me when it’s done), then send a Jinja2-generated response containing the client IP address and the time. This test seems more or less realistic to me. No POST support, no request body parsing, no URL routing. While all of these matter, none of them is I/O-layer specific, so I settled on the minimal valid server I could load test with existing HTTP test tools.
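To give an idea of the shape of these servers, the per-request logic was roughly like this (a simplified sketch rather than the actual gists; it assumes the whole request arrives in a single read, and names like handle and ParserCallbacks are illustrative):

import time
import httptools
import jinja2

template = jinja2.Template("<p>Hello {{ ip }}, the time is {{ now }}</p>")

class ParserCallbacks:
    def __init__(self):
        self.complete = False

    def on_message_complete(self):
        # httptools invokes this once the full request has been parsed
        self.complete = True

def handle(request_bytes, client_ip):
    callbacks = ParserCallbacks()
    parser = httptools.HttpRequestParser(callbacks)
    parser.feed_data(request_bytes)
    if not callbacks.complete:
        return None  # would need to read more from the socket first
    body = template.render(ip=client_ip, now=time.ctime()).encode()
    head = "HTTP/1.1 200 OK\r\nContent-Length: {}\r\n\r\n".format(len(body))
    return head.encode() + body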

So I performed the tests: 100 threads, 100 connections, 30 seconds each. Requests were 43 bytes, responses 379 bytes.

The results were quite disappointing to me. Trio and Curio are at least twice as slow as uvloop, processing 4K, 5K, and 11K requests per second respectively. Trio was significantly slower than Curio, but that’s not the point right now. Even running four process instances with SO_REUSEPORT did not help much: requests per second were higher, but uvloop was still at least twice as fast. CPU load was not high, and there was nothing in particular I could blame.

I started digging into why. I discovered very soon (OK, it was on the second day, but I want to look smart) that Curio and Trio have nothing to do with the performance issues. Python’s select/selectors module is not optimal. So my second test excluded Curio, Trio, and anything asyncio. I just created a simple epoll-based server and client. The client connects to the server and sends data as fast as it can. The server accepts the connection and reads data as fast as it can. I got 300-400 MB/s (that’s bytes, not bits), which I consider very good for Python on a single core. I am confident I could utilize 90% of a 10G network if using all cores. But when there are many short-lived connections, the results are not so pleasant. So I rewrote my HTTP server with raw epoll. I got 11K rps for uvloop and 5K rps for epoll. Surprisingly, poll behaved better, processing 6K rps. I think that’s because there was never a large number of handles to watch, fewer than 20.
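For reference, the raw epoll server I mean has roughly this shape (a simplified sketch, not my actual gist; error handling, HTTP parsing, and partial writes are omitted):

import select
import socket

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
server.bind(("0.0.0.0", 7777))
server.listen(128)
server.setblocking(False)

epoll = select.epoll()
epoll.register(server.fileno(), select.EPOLLIN)
connections = {}  # fileno -> socket

while True:
    for fileno, events in epoll.poll():
        if fileno == server.fileno():
            client, _ = server.accept()
            client.setblocking(False)
            connections[client.fileno()] = client
            epoll.register(client.fileno(), select.EPOLLIN)
        elif events & select.EPOLLIN:
            client = connections[fileno]
            data = client.recv(65536)
            if data:
                # a real server would parse the request and handle partial writes
                client.send(b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok")
            else:  # peer closed the connection
                epoll.unregister(fileno)
                del connections[fileno]
                client.close()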

No matter what I did (pypy3, TCP_NODELAY, all the other things), I was not able to get more than 6K rps with poll/epoll.

I increased the Jinja2 template size, adding some lorem ipsum so that the response was 3044 bytes. While rps dropped for all libraries, uvloop was still twice as fast as Trio, Curio, epoll, or poll. BTW, in this test raw epoll performed a bit better than poll, since there were more concurrently watched descriptors, up to 30.

My current problem is that, despite having read the libuv source code, I have no idea what they are doing to outperform the select module by that much. I am not a C or Linux networking guru; maybe I’m missing something very simple. Either way, the question is still open to me and I have no clue.

My next problem depends on the answer to the first one: what can be done to achieve comparable performance? Maybe 80% of uvloop would be OK, but 50% is a really sad result. One may say “That’s Python, what the heck did you expect from an interpreted language?!”, but I like Python too much to give up that easily on clean and fast code :slight_smile:

P.S. The alternative async community seems sparse, so I crossposted this; sorry if that causes any offense.

Hello!

Can you share your scripts? I don’t think anyone can answer a question like this based on just high-level theory; it almost certainly has to do with some tiny details of what you’re doing.

Also something to keep in mind: is your actual app going to have POST support, URL routing, etc.? If so, then these IO library differences are going to be less important… They’re still important and will make some difference, but if the non-IO parts of your code can only do 1000 rps, then none of these IO libraries will be the bottleneck. OTOH I’m still curious why it’s happening :slight_smile:

Cross-link to the corresponding thread on the dabeaz forum:

Sure! Here are the gists.


To get a baseline, I first checked to see if I could reproduce your results on my laptop. I used an empty test.html, and benchmarked with wrk -c 100 -d 30 http://localhost:7777.

Your epoll-test.py, with CPython 3.7, on my laptop:

Running 30s test @ http://127.0.0.1:7777
  2 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     9.43ms   89.80ms   1.89s    98.47%
    Req/Sec     3.85k     1.95k    9.88k    65.21%
  220232 requests in 30.03s, 40.33MB read
  Socket errors: connect 0, read 220230, write 0, timeout 12
Requests/sec:   7333.85
Transfer/sec:      1.34MB

Your uvloop-test.py, with CPython 3.7, on my laptop:

Running 30s test @ http://127.0.0.1:7777
  2 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     3.78ms    1.01ms  15.16ms   76.84%
    Req/Sec     7.52k     1.17k    9.94k    69.33%
  448748 requests in 30.03s, 80.88MB read
  Socket errors: connect 0, read 448711, write 0, timeout 0
Requests/sec:  14943.75
Transfer/sec:      2.69MB

So that reproduces the basic 2x difference you saw.

I peeked at epoll-loop.py in a profiler, and the first thing I noticed is… a bunch of time spent in asyncio (!!). It turns out that if you use the enable_async=True option to jinja2, then every time you call the regular synchronous render method, it starts up an asyncio loop and then uses it to call render_async. For our epoll program, starting and stopping an asyncio loop on every template render is just silly. So I removed the enable_async=True configuration from jinja, and that immediately made epoll-loop.py noticeably faster:

Running 30s test @ http://127.0.0.1:7777
  2 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    12.27ms   93.96ms   1.83s    97.34%
    Req/Sec     5.67k     3.16k   15.18k    70.10%
  330446 requests in 30.04s, 60.51MB read
  Socket errors: connect 0, read 330445, write 0, timeout 10
Requests/sec:  10999.61
Transfer/sec:      2.01MB

So now it’s about 74% as fast as uvloop, which isn’t too far from your 80% goal :slight_smile:
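For reference, the fix is just dropping that flag when building the environment; a rough sketch (the actual loader and template names in the gist may differ):

import jinja2

env = jinja2.Environment(
    loader=jinja2.FileSystemLoader("."),
    # enable_async=True,  # removed: with this on, every synchronous render()
    #                     # call spins up an asyncio loop to run render_async()
)
template = env.get_template("test.html")
html = template.render(ip="127.0.0.1")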

But we can do better. It took me a while to figure this out, but the epoll server’s terrible latency statistics were bugging me, and didn’t make any sense. (Why does it take the epoll server almost 2 seconds to respond to some requests, when it always processes every request in a perfectly fair round-robin fashion?) The problem is that you’re doing server_socket.listen(1), which dramatically limits the size of the server’s accept queue. This means that we’re only able to accept a small number of new connections per pass through the loop, and that probably some incoming connections are being dropped on the floor inside the kernel, and only retried after some timeout.

In modern systems, you basically always want to set the backlog to the maximum possible value (detailed analysis), which is 65535.
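(A hedged aside, based on general Linux behavior rather than anything specific to these scripts: the kernel silently clamps whatever backlog you pass to the net.core.somaxconn sysctl, so a huge value effectively means “the largest backlog the system allows”, and it can be worth checking that cap too.)

with open("/proc/sys/net/core/somaxconn") as f:
    print("kernel listen() backlog cap:", f.read().strip())
# raise it if needed (requires root): sysctl -w net.core.somaxconn=65535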

After I switched it to server_socket.listen(65535), I get:

Running 30s test @ http://127.0.0.1:7777
  2 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     5.67ms    1.09ms  29.81ms   75.82%
    Req/Sec     8.17k     1.23k   10.66k    68.17%
  487843 requests in 30.06s, 89.33MB read
  Socket errors: connect 0, read 487843, write 0, timeout 0
Requests/sec:  16228.61
Transfer/sec:      2.97MB

So, there you go, 109% as fast as uvloop :smiley:

The problem, though, is that since this code is doing “almost nothing” to start with, then “almost anything” you add will cause a large slowdown, at least in percentage terms. And you’ll have to add things if you want to do things like… implement HTTP properly, or stop writing raw epoll loops. So I think that 16k requests-per-second is pretty unrealistic. (And that’s true for uvloop as well.)

The truth is that Python isn’t as fast as some other languages. It has other advantages, like usability. But if you have a use case where you need a single process to handle >1k incoming connections per second, then it’s possible Python just isn’t the right tool. If you need a single process to handle >10k incoming connections per second, then Python is almost definitely not the right tool.

But, let’s put this in perspective… the wikimedia foundation served 17.03 billion pageviews last month – that’s all the wikipedias in every language, plus lots of other things (ref). If I did my math right, that’s an average of ~6k pageviews/second. They handle this using hundreds of servers across 8 datacenters. If you can serve 1k requests/second on 1 CPU, then that’s not good enough to handle every problem, but it’s good enough to handle a lot of problems.


Thanks for your great help and quick answer! This is a very important step forward in my research.

What profiler do you use? I’m stuck with cProfile, and either I’m using it the wrong way or it’s not that useful.

I retested my Curio and Trio servers with enable_async=False and the results are still much worse than epoll. I will try to figure out what’s wrong on my own and report back in a few days.

I do not consider Python slow. “Slow” is a very subjective term, of course.

I got 110K requests per second with nginx serving a static file and utilizing 100% of all cores, and 26K with epoll/SO_REUSEPORT dynamically generating content and utilizing 80% of all cores. Running with pypy gave me roughly the same result. Most CPU time was spent in the kernel.

A primitive server in Python, almost copy-pasted from a tutorial, is four times slower than a server written in C with the sole purpose of being fast. Not an apples-to-apples comparison, but that’s still a very good result to me! I think Python is quite fast for network servers in practice, since CPUs vastly outperform HBAs. Of course, not a lot of people need even 10K RPS, or even 1K RPS, and my dream of 10K full-blown HTTP requests per second serves my curiosity more than any actual need. 10K RPS is a lot! But having a fast technology stack is always pleasant, so I still want to do everything I can to get 10K RPS with Trio.

I used py-spy. And yeah, cProfile can be really misleading for tight loops, because it adds so much overhead that you basically just end up measuring how much overhead cProfile added.

I was all set to write something here about how this just wasn’t realistic, but then I did the math and… maybe?

On my laptop, the current epoll loop gets 16k req/sec, so each request takes ~60 µs. If you want to stay above 10k req/sec, then you have a budget of 100 µs for each request. So that leaves 40 µs to implement a friendly, full-featured async library API. A dict lookup is ~0.3 µs; an empty method call is ~1 µs; suspending/resuming a coroutine call stack is ~1-2 µs. So… it’s super tight, but maybe doable?
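(If anyone wants to sanity-check those per-operation costs on their own machine, something like this works; timeit runs a million iterations by default, so the printed totals read roughly as microseconds per operation. Results vary, and see the correction a couple of posts down.)

import timeit

d = {"x": 1}
print("dict lookup      :", timeit.timeit('d["x"]', globals=globals()))
print("empty method call:", timeit.timeit(
    "o.m()", setup="class O:\n    def m(self): pass\no = O()"))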


You’re overcomplicating it. My actual target is easier: 10K RPS on, let’s say, four cores with SO_REUSEPORT. That’s just 2500 RPS per core, thus 400 µs for each request. If the friendly, full-featured async library API fits into 100 µs, we have 300 µs left for everything else. It’s not easy, but a bit more than “maybe” :smiley:

Huh, actually I made a silly mistake here and was pessimistic by a factor of 10. A dict lookup is ~0.03 µs, suspending/resuming a coroutine call stack is ~0.1-0.2 µs, etc.

I suspect there are actually a smallish number of things we’re doing that are adding undue overhead. Some possibilities that come to mind:

  • Randomizing the scheduling order: with a 100 item list, randobj.shuffle(l) takes ~60 µs (!!). OTOH, doing if randobj.random() < 0.5: l.reverse() would also accomplish the goal (= preventing people from accidentally relying on scheduling details as a form of cross-task synchronization), and on our 100 item list it only takes ~0.2 µs.

  • There’s some silly overhead in trio’s socket send and recv, because of being too clever with code reuse. I mean, the code reuse part is fine, but it adds layers of indirection that are slow. Writing the code out more straightforwardly (maybe via some code generation) would be substantially less overhead.

  • Our cancellation machinery is substantially slower than it needs to be – not enough to matter much in the 1K RPS range, but maybe enough to matter at 10K RPS. We want to fix this anyway. See: [RFC] Refactor cancellation for great justice by oremanj · Pull Request #910 · python-trio/trio · GitHub

  • Right now we store timeouts in a simple sortedcontainers.SortedDict, to allow for O(1)-ish insertion, deletion, and peeking/popping the smallest value. This is simple and works fine for lower RPS, but it has some overhead, and it might eventually be better to replace it with a more complex data structure. Maybe: a list managed by the heapq module for the basic insertion/peek/pop operations, with deletions handled by tracking how many stale entries are in the list and, once there are “enough” stale entries, doing a scan through to delete them all (to amortize out the cost). A rough sketch of that idea follows below.
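To make that last idea concrete, here is a rough sketch of the heapq-plus-lazy-deletion approach (illustrative names only, not trio’s actual internals):

import heapq

class TimeoutQueue:
    def __init__(self):
        self._heap = []        # entries: (deadline, entry_id, task)
        self._stale = set()    # ids of removed entries still sitting in the heap
        self._next_id = 0

    def add(self, deadline, task):
        entry_id = self._next_id
        self._next_id += 1
        heapq.heappush(self._heap, (deadline, entry_id, task))
        return entry_id

    def remove(self, entry_id):
        # O(1): just remember that this entry no longer counts
        self._stale.add(entry_id)

    def pop_expired(self, now):
        expired = []
        while self._heap and self._heap[0][0] <= now:
            _, entry_id, task = heapq.heappop(self._heap)
            if entry_id in self._stale:
                self._stale.discard(entry_id)  # skip lazily-deleted entries
            else:
                expired.append(task)
        return expired

    def maybe_compact(self):
        # amortized cleanup: rebuild the heap once stale entries dominate
        if len(self._stale) > len(self._heap) // 2:
            self._heap = [e for e in self._heap if e[1] not in self._stale]
            heapq.heapify(self._heap)
            self._stale.clear()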


I keep trying to find something at a lower (lowest?) level. Taking into account my previous failures to write optimal code, I recheck everything many times, and that, you know, takes time.

Looks like just using the selectors module already cuts RPS by a third: 13,700 vs 9,200.
I know Trio does not use the selectors module, but Curio does, and the Trio/Curio results are similar. Also, it’s questionable whether Trio’s epoll wrapper is better performance-wise. I’ll try to find out next week.

Looks like PRAGMA reverse_unordered_selects; to me.

Is it even relevant for the success path?

Oh neat, I didn’t know about that. Yes, it’s a very similar idea :slight_smile:

Yeah, every time a task sleeps or does I/O, it has to check whether it’s been cancelled, which currently involves some non-trivial Python logic. In the future, though, it will probably become something very cheap like if task._cancel_status.cancelled: ....

Do I understand correctly that Trio registers and then unregisters with epoll for each accept call? So while my primitive epoll test registers the server socket for reading just once, Trio registers and unregisters the server socket once per accepted connection. Looks like it. While I understand why, it’s still sad.

When I applied this change to my epoll test code, the connection rate immediately dropped from 13K RPS to 9K RPS.

To be honest, when I think about it, I conclude that providing a synchronous-style API for asynchronous code is not that great an idea.

I’d rather have socket.listen(backlog, accept_handler), or maybe socket.listen(backlog, nursery, accept_handler), which is not that different from

while True:
    client_socket, client_address = await socket.accept()
    nursery.start_soon(accept_handler, client_socket, client_address)

but just this simple design decision leads to 30% faster code.

I’ll try to come up with a working prototype of a clean async API which is as fast as raw epoll. I hope it will not eventually become yet another alternative async library for Python :slight_smile:

My general philosophy is to make the API usable, and then figure out how to make the implementation fast :-). There are a lot of tricks for speeding up code you can try before you should conclude that the API has to change.

Curio has an optimization where it detects if a task sleeps on an epoll event -> wakes up -> then goes back to sleep on the same epoll event. In this case it skips unregistering/re-registering the epoll event. We haven’t implemented this for trio yet (mostly because it’s not clear to me what the best way to do it is, and whether it’s a net win for more realistic apps). But we could.

For the specific issue of accept loops, there’s also a much simpler trick: you can accept multiple connections after each epoll wakeup. See https://github.com/python-trio/trio/issues/14
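Roughly, that trick has this shape (a sketch only; wait_readable stands in for whatever “wait until this socket is readable” primitive the loop provides, and handle_client is made up):

async def accept_loop(server_socket, nursery, handle_client):
    server_socket.setblocking(False)
    while True:
        await wait_readable(server_socket)     # one epoll round-trip...
        while True:
            try:
                client, addr = server_socket.accept()
            except BlockingIOError:
                break                          # ...drains the whole backlog
            nursery.start_soon(handle_client, client, addr)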

This is probably why Curio performs better than Trio in my many-many-new-connections tests.

This actually means that at least one accept in the loop (the last one) was not required. So instead of 3x syscalls you now have at most 2x syscalls. It’s better, but still suboptimal.

I understand the motivation behind your design. Providing a proven and familiar API is a safe bet.

Could be. We can’t be certain without measuring, though. That test exercises a number of different code paths – scheduling, task creation, task teardown, … and at high incoming connection rates, trio might not be using epoll to accept connections at all.

When you call await sock.accept(), trio first tries calling accept, and if there’s already a socket ready to accept then it doesn’t touch epoll at all. Why should it – there’s nothing to wait for. It does still invoke the scheduler, though, to avoid accidentally starving out other tasks. Curio doesn’t invoke the scheduler in this case, which helps in some benchmarks but also means you can get really pathological behavior in other cases. There’s an issue on curio’s issue tracker where I measured that in their echo server benchmark, this optimization applied to recv and send lets them get something like ~10-20% more throughput, but if you have two clients connected to the same server, then one gets 100% of the bandwidth and the other gets 0%. Performance is complicated :slight_smile:

I think this analysis is too simplified. For a server that does nothing but accepting connections, under high accept load, I think trio currently averages ~2 syscalls per accept: the actual accept, plus an epoll_wait for each pass through the scheduler. (Because of what I said above.) If it did batched accepts, and on average each batch had N connections, then each batch would do N+4 syscalls (the actual accepts, plus registering, unregistering, and epoll_wait). So on average we have (N+4)/N syscalls per incoming connection. Depending on N, this might be much smaller than 2. Eliminating the register/unregister would reduce it to (N+2)/N. This might be substantial if N is small, or absolutely trivial if N is large.
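Plugging a few batch sizes into that (N+4)/N estimate:

for n in (1, 2, 5, 10, 20, 80):
    print(n, "accepts per batch ->", round((n + 4) / n, 2), "syscalls per connection")
# 1 -> 5.0, 2 -> 3.0, 5 -> 1.8, 10 -> 1.4, 20 -> 1.2, 80 -> 1.05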

But again, I wouldn’t assume that syscalls are the only important cost – they might be very cheap compared to other things in trio’s scheduling loop. That’s why it’s so important to always measure and confirm your hypotheses.

My progress so far.

I’ve successfully implemented an event loop which supports server sockets and timers and performs as fast as raw epoll. Actually it’s a bit faster, and I have no idea why, but I’m all the more glad. Not quite ready to publish the code yet.

Nice tracebacks are potentially supported, since every task remembers the parent task from which it was spawned.

Waiting for task completion, task cancellation, nurseries, and any kind of event objects are not implemented yet. While I foresee no problem there, I still want to implement some proof of concept to be sure it will not affect I/O performance.

As far as I have researched, any kind of sync-style “while True: socket.accept()” loop drops performance by at least a third, which is huge. Accepting fast enough is actually quite critical: according to my tests, after each epoll wakeup a batch of more than 20 new connections is accepted on average, up to 80 connections at peak.

Regarding API style, it’s not that bad. Sending and receiving look usual. I managed to keep the socket registered in epoll if the task keeps receiving or sending data. Accepting connections is a different story. At first accepting looked like this:

await server_socket.listen(backlog, connection_handler)

and to be honest I did not like this style much. There was no explicit call to accept, which was highly confusing, and the API provided no obvious way to stop listening, except closing the socket from another task. So I think that the new style

server_socket.listen(backlog)

while is_running:
    server_socket.accept(connection_handler)

is better, because this code is simpler, the structure is more familiar, and it is just as fast as the old one. I have found no way to keep a synchronous-style API without significantly degrading performance.

I’ll keep posting updates if that’s interesting.

Looks like having beautiful code is not free at all when it comes to I/O.

A loop which has everything in one function (four hundred lines) and accesses only local variables is as fast as plain epoll code.

A loop with just a bit of refactoring (four major classes instead of one) is 30% slower. Just moving the task queue from a local variable to a class attribute costs ~10% of performance.
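The size of that local-vs-attribute gap is easy to see with a micro-benchmark (an illustrative example, not my actual event loop code):

import timeit

setup = """
from collections import deque
class Loop:
    def __init__(self):
        self.task_queue = deque(range(100))
loop = Loop()                      # queue reached through an instance attribute
local_queue = deque(range(100))    # queue held in a plain variable
"""
# timeit inlines the setup into its timing function, so both names end up as
# locals; the first statement just pays an extra attribute lookup on every access.
print("attribute:", timeit.timeit(
    "loop.task_queue.append(loop.task_queue.popleft())", setup=setup))
print("local    :", timeit.timeit(
    "local_queue.append(local_queue.popleft())", setup=setup))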

While the results produced by both variants are the same, I additionally verified that I made no errors by logging all syscalls. The syscalls (parameters and sequence) are identical for both.

The fast code is fast, but otherwise terrible. Lots of very similar code. Maybe comments will help, but I don’t feel the code is maintainable the way it is. The total mess I have now reminds me of the bad days of my C programming.