Why are Python sockets so slow and what can be done?

njs · March 27, 2019, 4:40am

Hey, sorry for the slow reply! I was mostly offline for the last few weeks…

Yeah, it’s tough

This is partly a problem with looking at percentages… if you’re doing almost nothing, then adding almost anything will cause a large relative increase. But remember, your original goal was just 2500 RPS. You have some budget for writing Python instead of C!

Which is good, because there’s another problem… async libraries live and die by their features+ecosystem. There’s only so much mindshare to go around, and being really fast will get you some attention, but to build a sustainable community you also need docs, tutorials, debuggers, pytest plugins, HTTP servers, HTTP clients, websockets, memcached, postgres and mysql, templating systems, Windows support, REPL integration, etc. etc. etc. etc. This is really hard in general, and even harder when potential collaborators get scared off by 400-line functions :-/

Any updates? What are you working on now?

adontz · March 27, 2019, 6:26am

I agree about the ecosystem. An async HTTP server is easy. Async PostgreSQL is not easy at all. I’m still lying to myself that what I do is just an experiment and that I will eventually throw out the entire code, so I try not to love it too much. I will publish everything on GitHub soon. I specifically mark performance-related decisions with a SPEED tag in comments. BTW, I hereby grant you explicit permission to shamelessly steal everything you like.

It turns out that careful line-by-line refactoring with mandatory performance checks is possible. Some innocent-looking code may significantly degrade performance. Method calls are terribly expensive, but I rewrote the code into a much more readable version anyway.
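
To put a rough number on the method call overhead, here is a minimal sketch of the kind of check I mean. The `Counter` class and its methods are made up for illustration and are not taken from the library; the absolute timings depend on the machine and the Python version.

```python
import timeit


class Counter:
    """Toy class; `bump` stands in for any small helper method."""

    def __init__(self):
        self.value = 0

    def bump(self):
        self.value += 1

    def run_with_method(self, n):
        # One extra method call per iteration.
        for _ in range(n):
            self.bump()
        return self.value

    def run_inlined(self, n):
        # Same work with the helper's body inlined into the loop.
        for _ in range(n):
            self.value += 1
        return self.value


def measure(n=100_000, repeat=5):
    """Return best-of-`repeat` timings in seconds for both variants."""
    with_method = min(
        timeit.repeat(lambda: Counter().run_with_method(n), number=1, repeat=repeat)
    )
    inlined = min(
        timeit.repeat(lambda: Counter().run_inlined(n), number=1, repeat=repeat)
    )
    return with_method, inlined


if __name__ == "__main__":
    with_method, inlined = measure()
    print(f"method call: {with_method:.4f}s, inlined: {inlined:.4f}s")
```

Running a comparison like this before and after each refactoring step is what makes the "mandatory performance checks" cheap enough to do line by line.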

Also, fun fact: writing exhaustive comments helps find subtle bugs. I know what I wanted to do, I write it down in plain words, compare it to the nearby code and, oh crap, it does not match that precisely. Bug found!

Plans/Ideas/News

More unit tests are necessary. Highest priority for now.
Support select. Select will give me some Windows support. I do not really think anyone writes high-performance Python code for Windows, certainly not me, but just being able to run there is nice.
Not sure about kqueue; I have no experience with FreeBSD. I will probably implement it in a virtual environment some day. It looks like I will have to emulate epoll logic with IOCP and kqueue to keep things simple. Not the general case for sure, just what I need. For now it looks like epoll’s “ready to write” logically equals kqueue/IOCP’s “previous write completed or no write was ever started”, or in other words “zero write requests are pending”. That second definition is important, because maybe optimal performance can be achieved only with “no more than N requests are pending” for some surprising value of N.
My nurseries definitely need an exception handling policy. Should I kill all children if one of them fails, or wait for the others and let them complete? Should I allow starting more children if one of the children has failed? I have no idea what the right answers are.
I intentionally avoid implementing SSL for now. I think it’s too tricky and I want everything else complete first. I also intentionally avoid pipes, UNIX sockets, etc. Only TCP and UDP.
No timeouts implemented yet. Should be easy once I decide what exactly to do. Thinking about the relationship between Nursery and TimeoutScope.
Implemented nice tracebacks. THANKS for the traceback constructor, you are awesome!
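
For the “ready to write” point, a minimal sketch of what readiness looks like through plain `select`, using only the standard library. The helper names (`is_writable`, `fill_send_buffer`, `demo`) are made up for this example. It shows only the readiness side: a socket with nothing queued reports writable, and stops reporting writable once the kernel send buffer is full. It says nothing about how IOCP completions behave.

```python
import select
import socket


def is_writable(sock, timeout=0.0):
    """Return True if select() reports `sock` as ready to write."""
    _, writable, _ = select.select([], [sock], [], timeout)
    return sock in writable


def fill_send_buffer(sock, chunk=b"x" * 65536):
    """Send until the kernel refuses more data; return the bytes queued."""
    sock.setblocking(False)
    queued = 0
    try:
        while True:
            queued += sock.send(chunk)
    except BlockingIOError:
        # The send buffer is full; nothing more can be queued right now.
        pass
    return queued


def demo():
    a, b = socket.socketpair()
    try:
        before = is_writable(a)       # nothing has been written yet
        queued = fill_send_buffer(a)  # the peer never reads, so buffers fill up
        after = is_writable(a)        # checked again with a full buffer
        return before, queued, after
    finally:
        a.close()
        b.close()


if __name__ == "__main__":
    print(demo())
```

So with readiness APIs the question is "is there room in the buffer", which is close to, but not exactly the same as, "zero write requests are pending".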

adontz · April 9, 2019, 7:02pm

So far,
14K RPS on Linux (epoll)
7K RPS on Windows (select)
Timeouts to be implemented. A few bugs to be fixed.

adontz · April 13, 2019, 7:29am

So, here is the library

Known limitations/features:

1. Most operations are O(1). “ab -c 100 -n 50000” gives around 10K RPS, usually more, never less than 7K in any environment I have access to. I have totally reached my performance goal.
2. The library is not general purpose; my interest is implementing platform-independent network servers, like HTTP. Because of that, some POSIX-specific calls that are rarely used in this context, like socketpair, are not supported, and I am not very motivated to implement them anyway.
3. EPOLLPRI, EPOLLHUP, EPOLLRDHUP are not processed. EPOLLHUP worries me the most. I have no simple unit test idea for EPOLLHUP, that’s why.
4. socket.shutdown is not implemented. I have no simple unit test idea for shutdown, that’s why.
5. I don’t know how to emulate “raise … from …” for coroutines, that’s why NurseryError is sometimes less informative.
6. IOCP and kqueue are not supported. On the other hand, select is supported and performance under Windows is not that terrible.
7. Only stream UNIX sockets are supported.
8. Server TLS is not implemented. I don’t know any way to implement the handshake in a non-blocking, async-friendly manner. I believe putting nginx in front of a Python web server and reverse proxying over a UNIX socket is the best practice, so I am not very motivated to implement server TLS anyway.
9. Client TLS is not implemented. I don’t know any real use case, so I am not very motivated to implement client TLS anyway.
10. Nursery serves as TimeoutScope too. SRP violated.
11. I am not sure if passing a Nursery to siblings should be considered valid behavior. For now, it will prevent valid cancellation on exception.
12. There is no way to wait for task completion, only for a Nursery.
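
On the non-blocking handshake question in the server TLS item: the standard `ssl` module can run the handshake step by step. With `ssl.MemoryBIO` and `SSLContext.wrap_bio`, `do_handshake()` raises `ssl.SSLWantReadError` (or `ssl.SSLWantWriteError`) instead of blocking, and the event loop moves bytes between the BIOs and the socket. Below is a minimal sketch of the first step only. It uses the client side because that needs no certificate; the function name and the hostname are made up for illustration, and this is not how the library does it.

```python
import ssl


def start_client_handshake(hostname="example.invalid"):
    """Drive the first handshake step without touching a socket.

    Returns the bytes TLS wants to send (the ClientHello record) and a flag
    saying whether the handshake is now waiting for data from the peer.
    """
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    incoming = ssl.MemoryBIO()  # bytes received from the peer go in here
    outgoing = ssl.MemoryBIO()  # bytes to send to the peer come out of here
    tls = ctx.wrap_bio(incoming, outgoing, server_hostname=hostname)

    wants_read = False
    try:
        tls.do_handshake()
    except ssl.SSLWantReadError:
        # Not a failure: the handshake cannot continue until the peer replies.
        wants_read = True
    return outgoing.read(), wants_read


if __name__ == "__main__":
    hello, wants_read = start_client_handshake()
    print(len(hello), wants_read)
```

In a real loop you would send the outgoing bytes, wait for the socket to become readable, feed what arrives into `incoming.write(...)`, and call `do_handshake()` again until it stops raising. The server side follows the same pattern with `server_side=True` and a context that has a certificate loaded.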

adontz · April 19, 2019, 6:27pm

I’ve found a strange bug. If I create a nice traceback and throw an exception, the state of the fake parent coroutine, in particular the cr_await attribute, is reset. Will dig into that later; tracebacks are disabled for now.

adontz · April 27, 2019, 8:54pm

3. Did not really have to process EPOLLPRI/EPOLLHUP/EPOLLRDHUP for networking. EPOLLERR is enough.
4. socket.shutdown implemented.
5. Assigning the __cause__ attribute of the exception object did the job. Perfectly documented in https://www.python.org/dev/peps/pep-3134/
8. Server-side TLS is implemented. The TLS handshake is damn expensive. Performance looks like this:
HTTP without keep-alive: 12K RPS
HTTP with keep-alive: 20K RPS
HTTPS without keep-alive: 150 RPS
HTTPS with keep-alive: 10K RPS
9. Client-side TLS is implemented.
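
For item 5, a minimal sketch of what assigning `__cause__` looks like when a scheduler catches a child's exception outside of a `raise … from …` statement. The `NurseryError` class here is a stand-in defined for the example, and `wrap_child_failure` and `run_child_and_wrap` are made-up helper names, not the library's code.

```python
class NurseryError(Exception):
    """Stand-in for the library's NurseryError (assumed for this sketch)."""


def wrap_child_failure(child_exc):
    """Build a NurseryError chained to `child_exc`, like `raise ... from ...`."""
    err = NurseryError("child task failed")
    # Assigning __cause__ also sets __suppress_context__ to True, which is
    # what `raise err from child_exc` would have done.
    err.__cause__ = child_exc
    return err


async def child():
    raise ValueError("boom")


def run_child_and_wrap():
    coro = child()
    try:
        coro.send(None)  # drive the coroutine by hand, as a scheduler would
    except ValueError as exc:
        return wrap_child_failure(exc)
    finally:
        coro.close()


if __name__ == "__main__":
    error = run_child_and_wrap()
    print(repr(error), "caused by", repr(error.__cause__))
```

When the resulting exception is raised later, the traceback shows the original error followed by "The above exception was the direct cause of the following exception".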

Going to focus on bugs, then implement a nice thread pool.

njs · June 9, 2019, 5:11am

Partly inspired by this thread, I just wrote some notes on how to do better benchmarking for Trio.

It would be fun to see how this hypothetical test client behaves on trio vs broomio.
