Why are Python sockets so slow and what can be done?

Hey, sorry for the slow reply! I was mostly offline for the last few weeks…

Yeah, it’s tough :frowning:

This is partly a problem with looking at percentages… if you’re doing almost nothing, then adding almost anything will cause a large relative increase. But remember your original goal was just 2500 RPS :slight_smile: You have some budget for writing Python instead of C!

Which is good, because there’s another problem… async libraries live and die by their features+ecosystem. There’s only so much mindshare to go around, and being really fast will get you some attention, but to build a sustainable community you also need docs, tutorials, debuggers, pytest plugins, HTTP servers, HTTP clients, websockets, memcached, postgres and mysql, templating systems, Windows support, REPL integration, etc. etc. etc. etc. This is really hard in general, and even harder when potential collaborators get scared off by 400-line functions :-/

Any updates? What are you working on now?

I agree about the ecosystem. An async HTTP server is easy. Async PostgreSQL is not easy at all. I’m still lying to myself that what I do is just an experiment and that I will eventually throw out the entire code. So I try not to love it too much :slight_smile: I will publish everything on GitHub soon. I specifically mark performance-related decisions with a SPEED tag in comments. BTW, I hereby grant you explicit permission to shamelessly steal everything you like :smiley:

It turns out that careful line-by-line refactoring with mandatory performance checks is possible. Some innocent-looking code may significantly degrade performance. Method calls are terribly expensive, but I rewrote the code into a much more readable version anyway.

Also, fun fact: writing exhaustive comments helps find subtle bugs. I know what I wanted to do, I write it down in plain words, compare it to the nearby code and, oh crap, it does not match that precisely. Bug found!

Plans/Ideas/News

  1. More unit tests are necessary. Highest priority for now.

  2. Support select. Select will give me some Windows support. I do not really think anyone writes high-performance Python code for Windows, not me for sure, but just being able to run is nice.
    Not sure about kqueue, I have no experience with FreeBSD. I will probably implement it in a virtual environment some day. It looks like I will have to emulate epoll logic on top of IOCP and kqueue to keep things simple. Not the general case for sure, just what I need. For now it looks like epoll’s “ready to write” logically equals kqueue/IOCP’s “previous write completed or no write was ever started”, or in other words “zero write requests are pending”. And that second definition is important, because maybe optimal performance can be achieved only with “no more than N requests are pending” for some surprising value of N.

  3. My nurseries definitely need an exception handling policy. Should I kill all children if one of them fails, or wait for the others and let them complete? Should I allow starting more children if one of the children has failed? I have no idea what the right answers are.

  4. I intentionally avoid implementing SSL for now. I think it’s too tricky and want everything else complete. Also intentionally avoid pipes, UNIX sockets, etc. Only TCP and UDP.

  5. No timeouts implemented yet. Should be easy once I decide what exactly to do. Thinking about the relationship between Nursery and TimeoutScope.

  6. Implemented nice tracebacks. THANKS for the traceback constructor, you are awesome! :heart:
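The “ready to write” notion from item 2 can be poked at with the standard library alone. This is a minimal sketch, not code from the library discussed here: `is_ready` is a hypothetical helper, and `selectors.DefaultSelector` simply picks epoll, kqueue, or select depending on what the platform offers.

```python
import selectors
import socket


def is_ready(sock, event, timeout=0.0):
    """Return True if `sock` is reported ready for `event` within `timeout` seconds.

    `event` is selectors.EVENT_READ or selectors.EVENT_WRITE.
    Hypothetical helper, for illustration only.
    """
    # DefaultSelector chooses the best mechanism available on this platform.
    with selectors.DefaultSelector() as sel:
        sel.register(sock, event)
        return any(mask & event for _key, mask in sel.select(timeout))


if __name__ == "__main__":
    a, b = socket.socketpair()
    a.setblocking(False)
    try:
        # A fresh socket has an empty send buffer, so it is writable at once.
        print("writable:", is_ready(a, selectors.EVENT_WRITE))
        # Nothing has been sent by the peer yet, so it is not readable.
        print("readable:", is_ready(a, selectors.EVENT_READ))
    finally:
        a.close()
        b.close()
```

With readiness-based APIs a socket with buffer space counts as writable even if nothing was ever written, which matches the “zero write requests are pending” reading above.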

So far,
14K RPS on Linux (epoll)
7K RPS on Windows (select)
Timeouts to be implemented. A few bugs to be fixed.

So, here is the library

Known limitations/features:

  1. Most operations are O(1). “ab -c 100 -n 50000” gives around 10K RPS, usually more, never less than 7K in any environment I have access to. I have totally reached my performance goal.
  2. Library is not general purpose, my interest is implementing platform independent network servers, like HTTP. Because of that some POSIX specific calls, rarely used in given context, like socketpair, are not supported and I am not very motivated to implement them anyway.
  3. EPOLLPRI, EPOLLHUP, EPOLLRDHUP are not processed. EPOLLHUP worries me the most. I have no simple unit test idea for EPOLLHUP, that’s why.
  4. socket.shutdown is not implemented. I have no simple unit test idea for shutdown, that’s why.
  5. I don’t know how to emulate “raise … from …” for coroutines, that’s why NurseryError sometimes is less informative.
  6. IOCP and kqueue are not supported. On the other hand select is supported and performance under Windows is not that terrible.
  7. Only stream UNIX sockets are supported.
  8. Server TLS is not implemented. I don’t know any way to implement handshake in non-blocking, async friendly manner. I believe putting nginx in front of python web server and reverse proxying by UNIX socket is the best practice, so I am not very motivated to implement server TLS anyway.
  9. Client TLS is not implemented. I don’t know any real use case, so I am not very motivated to implement client TLS anyway.
  10. Nursery serves as TimeoutScope too. SRP violated.
  11. I am not sure if passing Nursery to siblings should be considered a valid behavior. For now, it will prevent valid cancellation on exception.
  12. There is no way to wait for task completion, only for Nursery.
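On item 8: the standard `ssl` module does support a non-blocking handshake. Wrap the socket with `do_handshake_on_connect=False`, then call `do_handshake()` repeatedly and wait for readiness whenever it raises `ssl.SSLWantReadError` or `ssl.SSLWantWriteError`. A minimal sketch follows; `handshake_nonblocking` is a hypothetical name, and the `select.select` calls stand in for whatever an event loop would do (suspend the task until the socket is ready).

```python
import select
import ssl


def handshake_nonblocking(tls_sock, timeout=5.0):
    """Drive a TLS handshake on a non-blocking socket until it completes.

    `tls_sock` is assumed to come from
    SSLContext.wrap_socket(sock, do_handshake_on_connect=False)
    on a socket that was already set to non-blocking mode.
    """
    while True:
        try:
            tls_sock.do_handshake()
            return
        except ssl.SSLWantReadError:
            # OpenSSL needs more bytes from the peer before it can continue.
            readable, _, _ = select.select([tls_sock], [], [], timeout)
            if not readable:
                raise TimeoutError("TLS handshake timed out waiting to read")
        except ssl.SSLWantWriteError:
            # OpenSSL has handshake bytes to send but the send buffer is full.
            _, writable, _ = select.select([], [tls_sock], [], timeout)
            if not writable:
                raise TimeoutError("TLS handshake timed out waiting to write")
```

In a coroutine-based library the two `select.select` calls would become "wait until readable" and "wait until writable" suspension points, so other tasks keep running while the handshake is in flight.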

I’ve found a strange bug. If I create a nice traceback and throw an exception, the state of the fake parent coroutine, in particular its cr_await attribute, is reset. I will dig into that later; tracebacks are disabled for now.

Item 3: I did not really have to process EPOLLPRI/EPOLLHUP/EPOLLRDHUP for network sockets. EPOLLERR is enough.
Item 4: socket.shutdown is implemented.
Item 5: Assigning the `__cause__` attribute of the exception object did the job. Perfectly documented in https://www.python.org/dev/peps/pep-3134/
Item 8: Server-side TLS is implemented. The TLS handshake is damn expensive. Performance looks like this:
HTTP without keep-alive: 12K RPS
HTTP with keep-alive: 20K RPS
HTTPS without keep-alive: 150 RPS
HTTPS with keep-alive: 10K RPS
Item 9: Client-side TLS is implemented.
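For readers wondering what the `__cause__` trick looks like: assigning the attribute directly has the same effect as `raise ... from ...`, which is handy when the exception is built in one place and delivered somewhere else (for example thrown into a coroutine). A small sketch, where this `NurseryError` is only a stand-in class and `wrap_child_failure` is a hypothetical helper, not the library's code:

```python
import traceback


class NurseryError(Exception):
    """Stand-in for the library's NurseryError, for illustration only."""


def wrap_child_failure(child_exc):
    """Build a NurseryError chained to `child_exc` without using `raise ... from`."""
    err = NurseryError("child task failed")
    # Same effect as `raise err from child_exc` (PEP 3134).
    # Assigning __cause__ also sets __suppress_context__ to True.
    err.__cause__ = child_exc
    return err


if __name__ == "__main__":
    try:
        raise wrap_child_failure(ValueError("boom"))
    except NurseryError as exc:
        # The formatted traceback shows the ValueError as the direct cause.
        print("".join(traceback.format_exception(type(exc), exc, exc.__traceback__)))
```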

Going to focus on bugs, then implement a nice thread pool.

Partly inspired by this thread, I just wrote some notes on how to do better benchmarking for Trio.

It would be fun to see how this hypothetical test client behaves on trio vs broomio :slight_smile: