Show HN: 1M rows/s from Postgres to Python (magic.io)
296 points by 1st1 on Aug 4, 2016 | hide | past | favorite | 89 comments


At 3.7GHz that's roughly 3700 clock cycles per row. With something that low, I'd be really interested to see a breakdown of how many cycles were spent doing what.
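That per-row budget is simple arithmetic:

```python
# Back-of-the-envelope: a 3.7 GHz core retires ~3.7e9 cycles/s,
# so 1M rows/s leaves a budget of ~3700 cycles per row.
clock_hz = 3_700_000_000
rows_per_sec = 1_000_000
cycles_per_row = clock_hz // rows_per_sec
print(cycles_per_row)  # 3700
```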


In the 1M benchmark roughly half is spent decoding rows (in [1]). The rest is spread over interpreter overhead, GC in particular.

[1] https://github.com/MagicStack/asyncpg/blob/master/asyncpg/pr...


This is really awesome, I'm going to give it a try in my latest project. Has there been any apps/sites with large amounts of i/o running this yet? Curious how reliable it is at this stage.


We open-sourced asyncpg just a couple of weeks ago, and right now it has beta status. That said, we're not aware of any bugs in it, and we don't plan to add a lot of new features. So I guess you should give it a try, carefully ;)


> For fairness, all tests were run in a single-thread (GOMAXPROCS=1 for Go code) in async mode.

Isn't this just hobbling Go so you can beat it? Doesn't seem like something done "for fairness" at all.


Python and node.js applications are usually deployed in a multi-process configuration (Go apps just run with the default GOMAXPROCS). So in production, applications in all of these languages use all CPU cores.

For the benchmarks, we didn't want to complicate things with a multi-process setup; the idea was to compare the raw performance of the drivers themselves.


"usually deployed" where? Every Python app I've seen in Prod don't use multi process to bypass the limitation of GIL/#cores.


Are you kidding?

I run 80 servers with custom written daemons and it's always multi-process.

I just finished said infra at one of the worlds largest online betting sites, their legacy was multi-process and the new systems are multi-process.

This is really common


They don't use the multiprocess module but most of them are deployed with something like uWSGI or Gunicorn, which spawns the worker processes and forwards requests to them and the responses back.
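For illustration, a minimal `gunicorn.conf.py` along those lines. The worker-count formula is the common rule of thumb from Gunicorn's docs, and `bind` is a placeholder address:

```python
# gunicorn.conf.py -- minimal multi-process deployment sketch.
# Gunicorn forks `workers` processes; each is a separate interpreter
# with its own GIL, so all CPU cores get used without the app ever
# touching the multiprocessing module itself.
import multiprocessing

bind = "127.0.0.1:8000"                        # placeholder address
workers = multiprocessing.cpu_count() * 2 + 1  # common rule of thumb
```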


supervisord


The GIL only applies to multi-threading, not multi-processing. Running multiple processes is a great way to overcome the GIL.
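A minimal illustration: the same CPU-bound function distributed across processes sidesteps the GIL entirely, because each worker process has its own interpreter.

```python
# CPU-bound work parallelized across processes; each worker is a
# separate interpreter with its own GIL, so all cores can be busy.
from multiprocessing import Pool

def burn(n: int) -> int:
    # Deliberately CPU-bound: no I/O for the GIL to release around.
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    with Pool(processes=4) as pool:
        results = pool.map(burn, [100_000] * 4)
    print(len(results))  # 4
```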


At a cost of memory.


Does this library address any of the issues/concerns that Mike Bayer discussed in this earlier blog post?

http://techspot.zzzeek.org/2015/02/15/asynchronous-python-an...


the 1M row/sec part is easy to explain:

> asyncpg extensively uses PostgreSQL prepared statements. This is an important optimization feature, as it allows to avoid repeated parsing, analysis, and planning of queries. Additionally, asyncpg caches the data I/O pipeline for each prepared statement.

it would be great if psycopg2 did this, at least within the context of executemany(). You don't need async to use prepared statements. But even if it did, this wouldn't be an amazing speed-everywhere kind of thing, just for bulk INSERT/UPDATE/DELETE. A prepared statement is linked to a cursor and a statement handle, so it's not as though you can keep hundreds of them queued up in a cache without a great deal of complexity.
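The "parse once, execute many" idea is driver-independent; a small sketch using stdlib sqlite3 as a stand-in (sqlite3's executemany compiles the statement once and reuses it across all the parameter sets):

```python
# "Parse once, execute many": one statement, many parameter sets.
# sqlite3 stands in here for a PostgreSQL driver with prepared
# statements; the compiled statement is reused for every row.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [(1, "a"), (2, "b"), (3, "c")])
count = conn.execute("SELECT count(*) FROM t").fetchone()[0]
print(count)  # 3
```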

The TL;DR for my blog post you mention is that async itself usually has little to do with speed when talking to a database, because the database is typically on the same network as your application, you're typically using no more than a few dozen connections per process (PostgreSQL uses a process per connection, so using thousands of connections isn't feasible anyway), so the overhead of async, and especially Python 3's async IO, slows things down considerably without offering any opportunity to regain that speed. The GIL doesn't block on IO, so threads work just fine, and most applications are CRUD / transactional applications, which means you can't spawn off several dozen queries within a transaction at the same time anyway; you need to wait for each one to complete serially regardless.

I'd be very wary of a new PG driver, especially one that implements the wire protocol itself. pg8000 does this, and also does prepared statements, and while it doesn't take the performance approach seen here, it also has serious problems supporting all the features and datatypes that psycopg2 does. It's been around for years and reliably breaks on many PostgreSQL features (and PostgreSQL introduces new features like a firehose). Writing and maintaining a driver for PostgreSQL is not an easy task.


> the 1M row/sec part is easy to explain:

Prepared statements were used in all benchmarks where supported. Please don't dismiss the results so easily. The source of performance increases here is not the use of prepared statements, but efficient implementation of the binary type I/O and the protocol itself.

> I'd be very wary of a new PG driver, especially one that implements the wire protocol itself. pg8000 does this, and also does prepared statements, and while it doesn't take the performance approach seen here, it also has serious problems supporting all the features and datatypes that psycopg2 does.

asyncpg supports 100% of pg_catalog types (we have a test for that.) Additionally, psycopg2 doesn't _really_ support all the features and datatypes and simply returns unparsed text in many cases. You have to write lots of typecasters to convince it otherwise.

> It's been around for years and reliably breaks on many Postgresql features (and Postgresql introduces new features like a firehose). Writing and maintaining a driver for Postgresql is not an easy task.

Please. Simply because something has been around is not a reason to stifle new development and innovation.


I have no intention to stifle anything; I'm only raising the point that this is a really new driver, and it's likely that it will take a while before it approaches the stability of psycopg2. That is, the stability of psycopg2 is not to be so lightly thrown in the trash.

Does your driver seamlessly handle reading and writing of three-dimensional arrays of JSONB structures, and arrays of custom enumerated types ?

edit:

> but efficient implementation of the binary type I/O and the protocol itself.

that's great too! you don't need async for that either.

I guess I am frustrated that you folks put all this work into a great driver and then locked it in the largely useless dungeon of asyncio, where all the libraries are totally useless to those of us who don't think node.js is the programming model for all cases.


> I have no intention to stifle anything; I'm only raising the point that this is a really new driver, and it's likely that it will take a while before it approaches the stability of psycopg2. That is, the stability of psycopg2 is not to be so lightly thrown in the trash.

We in no way question the merits and the stability of psycopg2. We created asyncpg to solve the challenges we face in the development of EdgeDB; it's not just a toy project for fun.

> Does your driver seamlessly handle reading and writing of three-dimensional arrays of JSONB structures, and arrays of custom enumerated types ?

Yes it does.

> edit:

> > but efficient implementation of the binary type I/O and the protocol itself.

> that's great too! you don't need async for that either.

> I guess I am frustrated that you folks put all this work into a great driver and then locked it in the largely useless dungeon of asyncio, where all the libraries are totally useless to those of us who don't think node.js is the programming model for all cases.

I'm sorry to hear that. Nobody forces you to use asyncio or asyncpg. But please don't try to claim that something is a "useless dungeon" simply because you disagree.


The tone is indeed not ideal, but while I'm a big asyncio advocate, I think it would be beneficial to split your project in two, separating the IO from the rest (like hyper does: https://github.com/Lukasa/hyper).

The rationale being:

- Python is not just async; it's currently mostly sync. A lot of the community would benefit from your awesome project.
- You'd get contributions from both the async and sync communities.
- Async may sometimes not be worth it for DB work (http://techspot.zzzeek.org/2015/02/15/asynchronous-python-an...).
- Some people will prefer to use threads or gevent.
- You'll see compat plugins created for current ORMs.

That aside, I will use your project as is, since I'm deep in asyncio these days. Thank you.


> I think it would be beneficial if you would split your project in 2, separating the IO from the rest (like hyper does https://github.com/Lukasa/hyper).

I'd like to do that, but we don't have the resources. If a couple of capable devs volunteered on GitHub to do this, I'd be glad to review patches and guide the development.


I get that. I have the same problem. And it's not just the resources; what you need are sufficiently skilled people, and that's even harder to find.


> I guess I am frustrated that you folks put all this work into a great driver and then locked it in the largely useless dungeon of asyncio, where all the libraries are totally useless to those of us who don't think node.js is the programming model for all cases

I see your point. Nothing is really "locked away". The protocol parser and data codecs in asyncpg are independent from the I/O model and loop semantics. With effort, it's possible to build a sync interface on top. It's just not a priority, as our cases are async.


and it's just as frustrating to those of us who do use asyncio (because it really is excellent) when someone needlessly avoids it. It's harder to maintain something that does both.


> The TL;DR for my blog post you mention is that async itself usually has little to do with speed when talking to a database, because the database is typically on the same network as your application, you're typically using no more than a few dozen connections per process (PostgreSQL uses a process per connection, so using thousands of connections isn't feasible anyway), so the overhead of async, and especially Python 3's async IO, slows things down considerably without offering any opportunity to regain that speed.

This simply isn't true for all use cases. If you have to manage 100s-1000s of connections (some of which can be long-lived sessions, such as websockets), you have no option but the async approach.

To minimize the number of connections to the DB you should use a decent caching layer (redis/memcached will do). Once you have that, and use a connection pool, you'll be able to do a lot more with asyncio than with threads.
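The caching layer plus connection pool described above is the classic cache-aside pattern. A framework-free sketch, where the dict stands in for redis/memcached and `load_from_db` stands in for a pooled asyncpg query:

```python
# Cache-aside sketch: consult the cache first, fall back to the DB,
# then populate the cache. The dict stands in for redis/memcached,
# and `load_from_db` stands in for an asyncpg pool query.
import asyncio

async def cached_fetch(cache: dict, key: str, load_from_db):
    if key in cache:
        return cache[key]
    value = await load_from_db(key)  # only cache misses hit the DB
    cache[key] = value
    return value

async def demo():
    calls = []
    async def load_from_db(key):
        calls.append(key)            # track actual DB round-trips
        return key.upper()
    cache = {}
    await cached_fetch(cache, "user:1", load_from_db)
    await cached_fetch(cache, "user:1", load_from_db)  # served from cache
    return len(calls)

print(asyncio.run(demo()))  # 1
```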

> The GIL doesn't block on IO, so threads work just fine, and most applications are CRUD / transactional applications, which means you can't spawn off several dozen queries within a transaction at the same time anyway; you need to wait for each one to complete serially regardless.

If you have 10 threads and relatively low server load, the threads will work just fine. However, the GIL will make things really slow under load.


> This simply isn't true for all use cases.

Every time I talk about my blog post, I am super careful to qualify: in this case, "usually has little to do w/ speed", and other qualifications. The blog post does as well. But every time, I get the response "but that's not always TRUE!" We agree. It is not always true. But it is usually true :).

> If you have to manage 100-1000s of connections

Which, as I said, you usually are not. PostgreSQL connections are expensive. You do not want 100s-1000s of them.

> some of each can be long-living sessions such as websockets

you use websockets to talk to your PostgreSQL database? Where did I say I was talking about HTTP web services?

> However GIL will make things really slow under load.

not under IO load with a hundred threads or so. Thousands, sure. Otherwise no.


> Every time I talk about my blog post, I am super careful to qualify: in this case, "usually has little to do w/ speed", and other qualifications. The blog post does as well.

I created an asyncpg suite for your bigdata benchmark (https://bitbucket.org/zzzeek/bigdata). Here are the results obtained from running it on my laptop against local PostgreSQL 9.5.

300 connections:

    Python3.5.2 threads -- avg 9798.16 recs/sec
    Python3.5.2 gevent -- avg 9677.84 recs/sec
    Python3.5.2 uvloop/asyncpg -- avg 11558.24 r/sec
The modified suite is here: https://github.com/elprans/bigdata


that's great: you've written a very fast driver. My benchmarks compare psycopg2 to itself, under threads / gevent vs. asyncio, to show that asyncio is inherently slower. It follows that if you put your driver and all of its protocol improvements into a traditional blocking model, it would also be much faster. You've written something so fast that it overcomes the latency of asyncio. (This whole thread is a huge plug for your driver, btw.)

It's not that the latency of asyncio is really that big of a deal. The point I try to make is that asyncio != speed. For databases, 95% of the time it means a little less speed. Not a big deal, but if you aren't into writing explicit context-switching directives throughout 100% of your code when the OS does it for you just fine, it isn't worth it, unless you are in the zone of thousands of arbitrarily performing connections, which is really more of an incoming-client-connection thing, not usually a database thing.


> that's great, you've written a very fast driver. My benchmarks are comparing psycopg2 to itself,

OK, I ran the asyncio suite (with uvloop event loop):

    Python3.5.2 asyncio+psycopg2+uvloop -- avg 8356.84 recs/sec

Really not that much slower. There is also an important detail that you forgot to mention: the threaded benchmark uses notably more CPU time (about 30% more).


>> If you have to manage 100-1000s of connections

> Which I said, usually you are not. Postgresql connections are expensive. You do not want 100s-1000s.

I was talking about a micro-service that has to support a lot of connections and work with a DB.

>> some of each can be long-living sessions such as websockets

> you use websockets to talk to your Postgresql database? Where did I say I was talking about HTTP web services ?

So you were talking about programs that don't communicate with the outside world except to a DB?


If you write a program that has thousands of persistent websockets open, you do not want all those thousands of connections to each have their own, persistent, long lived database transaction. You will not like your process listing nor will you be happy about the amount of table and row locking going on. Transactions and connection scope should be short lived, and if you're using a connection pool, you definitely are going for that. As my blog post stated, you'd use a thread pool for the database access part, given that the http-facing side is async.


> Mike, I've never suggested to have a DB connection per HTTP request.

right, so my suggestion is, keep the DB code in a thread pool. So you can use the async for your 1000-client frontend and not force it into the DB layer where it makes things much more complicated (or what I do, just use gevent :) ).
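The pattern suggested here (async frontend, blocking DB driver confined to a thread pool) is a few lines with `run_in_executor`; in this sketch `blocking_query` is a stand-in for a psycopg2 call:

```python
# Async frontend, blocking DB work pushed into a thread pool.
# `blocking_query` is a stand-in for a real psycopg2 call.
import asyncio
from concurrent.futures import ThreadPoolExecutor

def blocking_query(user_id: int) -> dict:
    # Imagine psycopg2 work here; the GIL is released during its I/O.
    return {"id": user_id, "name": f"user{user_id}"}

async def handle_request(user_id: int) -> dict:
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor(max_workers=4) as pool:
        # The event loop stays free while the thread blocks on the DB.
        return await loop.run_in_executor(pool, blocking_query, user_id)

print(asyncio.run(handle_request(7))["name"])  # user7
```

In a real server you would create one executor at startup and reuse it across requests rather than building one per call as this sketch does.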


Sorry to interrupt: you two can click on a comment's timestamp to reply to it directly. Please continue, we are listening carefully.

And since I am already here @1st1: How hard would it be to create a synchronous version of your adapter?


A lot of the code in asyncpg is datatype parsers, buffer abstractions, the API facade and other IO-independent things. So the good news is that it would be possible to reuse a lot of code.

The protocol is implemented in two layers: the base layer is IO- and framework-independent; the higher level is designed for asyncio.

Long story short, it's possible to add a synchronous version, but it still will require a lot of work. The API will also have to be split in two.


It's possible to make a really nice async SQLAlchemy Core too, and even some of the ORM. Huge amount of work; I started on just the connection pool. It would be nice, but it would require all its own test coverage and everything. If I were the CEO of SQLAlchemy Inc., it would have been done. But that's not where my salary comes from :)


SQLAlchemy is clearly one of the best ORMs (maybe there is something in Java-land that I don't know about). It is beyond me how you haven't been guilted, by cash getting thrown in your direction, into making this your job.

Well, I like what bountysource.com and others are doing; maybe we will get there.


It's the tragedy of open source. See - https://news.ycombinator.com/item?id=10905845


I think you should open a bounty for it. As questions like "Can I use this with Django/SQLAlchemy/…?" showed, people love drop-in replacements that speed up their code, especially when it is at a low level like this.

I don't want to sound ungrateful, though. Thank you for your work, on this and on your other contributions. I would definitely use this when applicable. Are more performance-related projects like this and uvloop planned?


> I think you should open a bounty for it. As questions like "Can I use this with Django/SQLAlchemy/…?" showed, people love drop-in replacements that speed up their code, especially when it is at a low level like this.

We'll consider doing that. We have to keep in mind that supporting two IO modes would add a lot of maintenance overhead. Anyway, we can certainly continue the discussion on GitHub, and maybe we can find a champion to do the heavy lifting.

> Do not want to sound ungrateful, though. Thank you for your work also on you other contributions. Would definitely use this when applicable. Are more performance related projects like this and uvloop planned?

Thank you.

Right now most of our resources are consumed by getting the EdgeDB alpha out as soon as possible. We also have plans to create a micro web framework based on httptools, and a few other exciting ideas. Stay tuned :)

Also check out PEP 525 -- another cool thing we're trying to make happen in Python 3.6.


> right, so my suggestion is, keep the DB code in a thread pool. So you can use the async for your 1000-client frontend and not force it into the DB layer where it makes things much more complicated (or what I do, just use gevent :) ).

Looks like a totally valid suggestion for folks who write async code and want to continue using SQLAlchemy.

However, if you don't use SQLAlchemy (or another ORM) and just want to work with the DB directly in an async/await code base, a threadpool will only needlessly complicate things.


> If you write a program that has thousands of persistent websockets open, you do not want all those thousands of connections to each have their own, persistent, long lived database transaction. [..]

Mike, I've never suggested having a DB connection per HTTP request. Please re-read my other comment in this thread.


asyncpg was developed by @redcrowbar and me. Ask us anything ;)


Great work, thanks for sharing. Very impressive speed.

I didn't know about uvloop, that's a cool project:

http://magic.io/blog/uvloop-blazing-fast-python-networking/


Yep, and it's also much more stable now, 3 months after that article was posted. I plan to write a new blog post about it soon.


Do you have any numbers on queries that return millions of rows? If it's still fast, asyncpg could be nice for some ETL-type workloads.


We haven't benchmarked this specific use case. However, cursors in asyncpg do data prefetching (the exact number of prefetched rows is configurable), so I expect that asynchronous iteration should be pretty fast (and still faster than psycopg2).
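The shape of such a prefetching cursor can be sketched framework-free: rows come off the wire in batches of `prefetch`, while iteration hands them out one at a time. Here `fetch_batch` stands in for one round-trip on a server-side cursor:

```python
# Shape of a prefetching cursor: rows are fetched from the server in
# batches of `prefetch`, but iteration yields them one at a time.
# `fetch_batch` stands in for a round-trip on a server-side cursor.
import asyncio

async def iter_rows(fetch_batch, prefetch: int = 50):
    while True:
        batch = await fetch_batch(prefetch)   # one round-trip per batch
        if not batch:
            return
        for row in batch:
            yield row

async def demo():
    data = list(range(7))
    async def fetch_batch(n):
        nonlocal data
        batch, data = data[:n], data[n:]
        return batch
    return [r async for r in iter_rows(fetch_batch, prefetch=3)]

print(asyncio.run(demo()))  # [0, 1, 2, 3, 4, 5, 6]
```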


Maybe I am reading the benchmarks wrong, but does it mean I can get a 2x speedup in my end-to-end query execution just by moving from aiopg to asyncpg, without any server-side changes or query optimizations?


We intentionally benchmarked simple queries to measure the overhead of the driver implementation. So the 2x speedup really means a 50% reduction in overhead.


I'd love to read the docs for the protocol itself, is it planned? Do you have a link explaining it otherwise?

Great work!


The protocol is documented in the official PostgreSQL documentation: https://www.postgresql.org/docs/current/static/protocol.html


I meant a doc explaining roughly the components at https://github.com/MagicStack/asyncpg/tree/master/asyncpg/pr... and how they interact with each other at a high level as it doesn't seem to have comments.

Thanks for the link though, I'll have a look


Yeah, documenting internals sounds like a good idea. At least to make it easier for people to contribute and further improve asyncpg. We'll try to find some time to do that.


How can I use it with SQLAlchemy instead of psycopg2?


SQLAlchemy ORM is not async, so there is currently no way to use an async dialect with it. However, it is possible to write a SQLAlchemy Core adapter for asyncpg.


It might even be easier than writing an adapter if you just use SQLAlchemy Core's DSL as the generative mechanism for a SQL statement that is sent to asyncpg.
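One concrete wrinkle such an adapter has to handle: SQLAlchemy's default PostgreSQL dialect renders psycopg2-style `%(name)s` placeholders, while asyncpg expects positional `$1..$n`. A hypothetical minimal converter (`pyformat_to_dollar` is an illustration, not part of either library):

```python
# Rewrite psycopg2-style %(name)s placeholders to asyncpg's $1..$n
# and put the parameters into the matching positional order.
# Sketch only: a name used twice gets two distinct $n positions.
import re

def pyformat_to_dollar(sql: str, params: dict):
    order = []
    def repl(match):
        order.append(match.group(1))
        return f"${len(order)}"
    new_sql = re.sub(r"%\((\w+)\)s", repl, sql)
    return new_sql, [params[name] for name in order]

sql, args = pyformat_to_dollar(
    "SELECT * FROM users WHERE id = %(id)s AND name = %(name)s",
    {"id": 7, "name": "bob"},
)
print(sql)   # SELECT * FROM users WHERE id = $1 AND name = $2
print(args)  # [7, 'bob']
```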


There appears to be an adapter for Core already: https://github.com/CanopyTax/asyncpgsa


Author of asyncpgsa here, I am in the process of writing that, let me know if you have questions.


Does this automatically escape/convert things like datetimes in a query or does that need to be done separately? The docs are a little sparse on details.

Looks fantastic anyway!


Yes, query arguments are converted automatically. Results are converted into Python types as expected (including arrays and composite types).


Bit offtopic, but what are the plots made with?


Handmade with d3.js


Fastest setup of a PG client to date on OS X for me. No quirks or extra package installers like Fink necessary.


Using Cython means you can't use PyPy, right? Was that something you considered, e.g. with psycopg2cffi or a pure-Python driver? PyPy seems to be the de facto performant runtime for Python, and since performance is something you obviously care a lot about, I'm surprised it hasn't been mentioned in this thread.

Very excited to try out your code - thanks for sharing it.


A pure-Python implementation would probably never reach this level of performance, even under PyPy. Cython really makes it possible to basically drop down to C and write very efficient code.

That said, once PyPy supports 3.5, we will consider making a CFFI binding.


By using Python's asyncio you can't use PyPy anyway, because PyPy doesn't support 3.5 yet (and probably won't for a while), so by going async you are eliminating PyPy right there.


Why is node pg so slow relative to the Go and asyncpg drivers? I'm sure there is room for improvement here.


Primary author of node-postgres (both the JavaScript & libpq versions) here. Very good question. I'd say part of the relative speed is because it's been a mostly 1 person (me!) project for a loooong time & I haven't had a lot of time to do perf analysis to find where exactly to make it faster. I would absolutely love some help! My time for open source is pretty limited, and I try to spend most of it focused on triaging issues. If you'd like to help out please don't hesitate to open an issue, a pull request, or send me a gchat (contact info is in node-postgres readme) and we can discuss!


would be cool to also run jmh tests against JDBC and async PG drivers on Java.

Actually, what I also found is that Python wasn't slow in these cases, but it was slow when you do map operations on the lists and reorder the data or convert it to other structures.


> would be cool to also make jmh tests against jdbc and async pg on java.

Pull requests to https://github.com/MagicStack/pgbench are welcome :-)


We've been using Python (2.7) and Postgres where I work, and I must say I really miss the async feature. It's really convenient to be able to parallelize and saturate I/O very quickly. Right now we're using a combination of threads and processes without proper messaging, which leads to deadlocks and other nasty things. That's one thing I've been quite pleased with when using Scala.


Awesome work! Can't wait to see more about EdgeDB.


Would this work with Django?


asyncpg was designed for asyncio (and async/await). So it can be used in any asyncio, Tornado (and soon Twisted) program.

Django doesn't support asyncio, it's built for blocking IO model. So, unfortunately, it's not possible to use asyncpg with it.


But could it be, if the code were made blocking? I.e., as if the async part didn't matter?


Theoretically yes, but it would require us to seriously re-architect the protocol implementation to support several IO modes.


I know it's not related to asyncio, but for Django you should follow this: http://channels.readthedocs.io/en/latest/index.html


I wouldn't mind knowing this as well!


Wonder how this (Erlang driver) would stack up in the benchmarks (devel branch): https://github.com/epgsql/epgsql


Consider making a PR to https://github.com/MagicStack/pgbench :)


How do you develop your Cython code? Do you use PyCharm? Did you start off from pure Python code? How easy is debugging and refactoring?


I use SublimeText with a Cython highlighter; @redcrowbar uses Atom. I started coding uvloop & asyncpg directly in Cython, basically from scratch.

Debugging & refactoring are pretty much as easy as for pure Python code (which is a huge win over C). Sometimes when you're writing low-level code you can get a segfault, in which case lldb (or gdb) help you understand what's going on relatively quickly.


So do you use cygdb to debug Cython code or directly debug the generated C-API code?


I usually debug the generated C code directly (usually it's quite easy to understand the bug once you see where the program crashes.)


Cool! I wish there were a Django wrapper, but then I realised it is designed for asyncio.


What about the complexity of the rows? For instance, how many columns per row?


12, in the particular case of the 1M rows/s benchmark. The details are available in the benchmark report: http://magic.io/blog/asyncpg-1m-rows-from-postgres-to-python...


Would this work/be binary-compatible with Amazon Red Shift?


Should be. AFAIK psycopg2 works with Redshift just fine, so should asyncpg.

The only thing I'm not sure about is what kind of authentication mode Redshift is using. asyncpg doesn't support all of them, but that's easy to fix.


Haven't tried, but, AFAIK, Redshift speaks the standard PostgreSQL protocol, so asyncpg should work.


I'd love to see raw, command-line psql added to these benchmarks.


Time for a rollback, Uber.



