Hello,
I want to share my findings and maybe find a solution together.
Performance. A very important subject to me and, I believe, to many.
I’ve been testing Curio, Trio and uvloop on CPython 3.7 and PyPy 6.0. Tests were performed on cloud Fedora 29 x64 servers (no ads, but if anyone is interested, it was Vultr with 6 CPUs and 16 GB). I used curl for validation and wrk for load testing. I ran wrk from two distinct machines to make sure the client side was not the bottleneck. Absolute numbers are not that important, because configurations vary, but I’ll give them anyway.
At first, I wrote similar HTTP server applications for all frameworks, using httptools for parsing. The logic was simple: read an entire GET request (httptools tells me when it’s done), then send a Jinja2-generated response with the client IP address and time. This test seems more or less realistic to me. No POST support, no request body parsing, no URL routing. While all of these matter for sure, none of them is I/O-layer specific, so I settled on the minimal valid server I could load test with existing HTTP test tools.
So I performed the tests: 100 threads, 100 connections, 30 seconds each. Requests were 43 bytes, responses 379 bytes.
The results were quite disappointing to me. Trio and Curio are at least twice as slow as uvloop, processing 4K, 5K and 11K requests per second respectively. Trio was significantly slower than Curio, but that’s not the point right now. Even running four process instances with SO_REUSEPORT did not help much: requests per second were higher, but uvloop was still at least twice as fast. CPU load was not high, and there was nothing I could blame in particular.
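For concreteness, the per-worker listener setup for the SO_REUSEPORT run looks something like this (a sketch; the `make_listener` name and port are mine, and SO_REUSEPORT needs Linux 3.9+):

```python
import socket

def make_listener(host="0.0.0.0", port=8080):
    # Each worker process creates its own listening socket on the same
    # port; SO_REUSEPORT lets the kernel load-balance incoming
    # connections across all of them.
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    sock.bind((host, port))
    sock.listen(1024)
    return sock
```

Each of the four processes calls this independently; no descriptor passing or fork-then-share tricks are needed.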
I started digging into why. I discovered very soon (OK, it was on the second day, but I want to look smart) that Curio and Trio have nothing to do with the performance issues: Python’s select/selectors module is not optimal. So my second test excluded Curio, Trio and anything asyncio. I just created a simple epoll-based server and client. The client connects to the server and sends data as fast as it can; the server accepts the connection and reads data as fast as it can. I got 300-400 MB/s (that’s bytes, not bits), which I consider very good for Python on a single core. I am confident I could utilize 90% of a 10G network using all cores. But when there are many short-lived connections, the results are not so pleasant. So I rewrote my HTTP server with raw epoll. I got 11K rps for uvloop and 5K rps for epoll. Surprisingly, poll behaved better, processing 6K rps. I think that’s because there was never a large number of handles to watch, fewer than 20.
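To make the "raw epoll" part concrete, here is a stripped-down sketch of that server’s accept/read/respond loop (Linux-only, since it uses select.epoll; it serves a single request and then returns, and the function name and canned response are mine, not my actual benchmark code):

```python
import select
import socket

RESPONSE = b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\nConnection: close\r\n\r\nok"

def serve_one_request(port):
    # Minimal level-triggered epoll HTTP server: accept, read until the
    # blank line that ends a GET request, send a canned response, close.
    lsock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    lsock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    lsock.bind(("127.0.0.1", port))
    lsock.listen(16)
    lsock.setblocking(False)

    ep = select.epoll()
    ep.register(lsock.fileno(), select.EPOLLIN)
    conns, bufs = {}, {}
    done = False
    while not done:
        for fd, _events in ep.poll(1.0):
            if fd == lsock.fileno():
                conn, _addr = lsock.accept()
                conn.setblocking(False)
                ep.register(conn.fileno(), select.EPOLLIN)
                conns[conn.fileno()], bufs[conn.fileno()] = conn, b""
            else:
                bufs[fd] += conns[fd].recv(65536)
                if b"\r\n\r\n" in bufs[fd]:  # whole GET request is in
                    conns[fd].sendall(RESPONSE)
                    ep.unregister(fd)
                    conns[fd].close()
                    done = True
    ep.close()
    lsock.close()
```

The real server keeps the loop running, parses with httptools instead of scanning for the blank line, and handles partial writes; none of that changes the epoll mechanics shown here.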
No matter what I did (playing with pypy3, TCP_NODELAY and all the other things), I was not able to get more than 6K rps with poll/epoll.
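For anyone reproducing this, the TCP_NODELAY tweak is just the standard setsockopt call, shown here on a fresh socket rather than an accepted connection:

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Disable Nagle's algorithm: small responses are flushed immediately
# instead of being held back to coalesce with later writes.
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
```

It can shave latency on small request/response exchanges, but as noted above it did not move the rps ceiling for me.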
I increased the Jinja2 template size, adding some lorem ipsum so that the response was 3044 bytes. While rps dropped for all libraries, uvloop was still twice as fast as Trio, Curio, epoll or poll. BTW, in this test raw epoll performed a bit better than poll, since there were more concurrently watched descriptors, up to 30.
My current problem is that, despite having read the libuv source code, I have no idea what they are doing to outperform the select module by that much. I am not a C or Linux networking guru, so maybe I am missing something very simple, but the question is still open to me and I have no clue.
My next problem depends on the answer to the first one: what can be done to achieve comparable performance? Maybe 80% of uvloop would be OK, but 50% is a really sad result. One may say, “That’s Python, what the heck did you expect from an interpreted language?!”, but I like Python too much to give up that easily on clean and fast code.
P.S. The alternative async community seems sparse, so I crossposted this; sorry if that causes any offense.