Does this library address any of the issues/concerns that Mike Bayer discussed i...

zzzeek · on Aug 4, 2016

the 1M row/sec part is easy to explain:

> asyncpg extensively uses PostgreSQL prepared statements. This is an important optimization feature, as it allows to avoid repeated parsing, analysis, and planning of queries. Additionally, asyncpg caches the data I/O pipeline for each prepared statement.

it would be great if psycopg2 did this, at least within the context of executemany(). You don't need async to use prepared statements. But even if they did this, it wouldn't be this amazing speed-everywhere kind of thing, just for when you do a bulk INSERT/UPDATE/DELETE. The prepared statement is linked to a cursor and a statement handle so it's not like you can have hundreds of these just queued up in a cache without a great deal of complexity.

The TL;DR; for my blog post you mention is that async itself usually has little to do with speed when talking to a database, because the database is typically on the same network as your application, you're using typically no more than a few dozen connections per process (postgresql uses a process per connection, so using thousands of connections isn't feasible anyway), so the overhead of async and especially Python 3's async IO slows things down considerably without offering any opportunity to regain that speed. The GIL doesn't block on IO so threads work just fine, and most applications are CRUD / transactional applications which means you can't spawn off several dozen queries within a transaction at the same time anyway, you need to wait for each one to complete serially regardless.

I'd be very wary of a new PG driver especially one that implements the wire protocol itself. pg8000 does this, and also does prepared statements, and while it doesn't have the speed approach here, it also has serious problems supporting all the features and datatypes that psycopg2 does. It's been around for years and reliably breaks on many Postgresql features (and Postgresql introduces new features like a firehose). Writing and maintaining a driver for Postgresql is not an easy task.

RedCrowbar · on Aug 4, 2016

> the 1M row/sec part is easy to explain:

Prepared statements were used in all benchmarks where supported. Please don't dismiss the results so easily. The source of performance increases here is not the use of prepared statements, but efficient implementation of the binary type I/O and the protocol itself.

> I'd be very wary of a new PG driver especially one that implements the wire protocol itself. > pg8000 does this, and also does prepared statements, and while it doesn't have the speed approach here, it also has serious problems supporting all the features and datatypes that psycopg2 does.

asyncpg supports 100% of pg_catalog types (we have a test for that.) Additionally, psycopg2 doesn't _really_ support all the features and datatypes and simply returns unparsed text in many cases. You have to write lots of typecasters to convince it otherwise.

> It's been around for years and reliably breaks on many Postgresql features (and Postgresql introduces new features like a firehose). Writing and maintaining a driver for Postgresql is not an easy task.

Please. Simply because something has been around is not the reason to stiffle new development and innovation.

zzzeek · on Aug 4, 2016

I have no intention to stifle anything, I'm only raising the point that this seems like a really new driver and it's likely that it would take a while before it approaches the stability of psycopg2. That is, the stability of psycogp2 is not to be so lightly thrown in the trash.

Does your driver seamlessly handle reading and writing of three-dimensional arrays of JSONB structures, and arrays of custom enumerated types ?

edit:

> but efficient implementation of the binary type I/O and the protocol itself.

that's great too! you don't need async for that either.

I guess I am frustrated that you folks put all this work into a great driver and then locked it in the largely useless dungeon of asyncio, where all the libraries are totally useless to those of us who don't think node.js is the programming model for all cases.

RedCrowbar · on Aug 4, 2016

> I have no intention to stifle anything, I'm only raising the point that this seems like a really new driver and it's likely that it would take a while before it approaches the stability of psycopg2. That is, the stability of psycogp2 is not to be so lightly thrown in the trash.

We in no way question the merits and the stability if psycopg2. We created asyncpg to solve the challenges we face in the development of EdgeDB, it's not just a toy project for fun.

> Does your driver seamlessly handle reading and writing of three-dimensional arrays of JSONB structures, and arrays of custom enumerated types ?

Yes it does.

> edit:

> > but efficient implementation of the binary type I/O and the protocol itself.

> that's great too! you don't need async for that either.

> I guess I am frustrated that you folks put all this work into a great driver and then locked it in the largely useless dungeon of asyncio, where all the libraries are totally useless to those of us who don't think node.js is the programming model for all cases.

I'm sorry to hear that. Nobody forces you to use asyncio or asyncpg. But please don't try to claim that something is a "useless dungeon" simply because you disagree.

sametmax · on Aug 5, 2016

The tone used is indeed not the proper one, but while I'm a big asyncio advocate, I think it would be beneficial if you would split your project in 2, separating the IO from the rest (like hyper does https://github.com/Lukasa/hyper).

The rational being:

- Python is not just async. It currently majoritarly sync. A lot of the community would benefit from your awesome project; - You'd have contrib from the async and sync community; - async may sometime not be worth it for db (http://techspot.zzzeek.org/2015/02/15/asynchronous-python-an... - some people will prefer to use threads or gevent; - you will see compat plugin created for current ORM;

That aside, I will use your project as is, since I'm deep in asyncio these days. Thank you.

1st1 · on Aug 5, 2016

> I think it would be beneficial if you would split your project in 2, separating the IO from the rest (like hyper does https://github.com/Lukasa/hyper).

I'd like to do that, but we don't have resources. If we had a couple capable devs volunteering on the GH to do this, I'd be glad to review patches and guide the development.

sametmax · on Aug 7, 2016

I get that. I have the same problem. And it's not just the ressources, what you need are skilled enough ressources, and that's even harder.

RedCrowbar · on Aug 5, 2016

> I guess I am frustrated that you folks put all this work into a great driver and then locked it in the largely useless dungeon of asyncio, where all the libraries are totally useless to those of us who don't think node.js is the programming model for all cases

I see your point. Nothing is really "locked away". The protocol parser and data codecs in asyncpg are independent from the I/O model and loop semantics. With effort, it's possible to build a sync interface on top. It's just not a priority, as our cases are async.

rickycook · on Aug 4, 2016

and it's just as frustrating to those of us that do use asyncio (because it really is excellent) when someone needlessly ties it to not use async. it's harder to maintain something that does both

1st1 · on Aug 4, 2016

> The TL;DR; for my blog post you mention is that async itself usually has little to do with speed when talking to a database, because the database is typically on the same network as your application, you're using typically no more than a few dozen connections per process (postgresql uses a process per connection, so using thousands of connections isn't feasible anyway), so the overhead of async and especially Python 3's async IO slows things down considerably without offering any opportunity to regain that speed.

This simply isn't true for all use cases. If you have to manage 100-1000s of connections (some of each can be long-living sessions such as websockets) you have no other option but to use the async approach.

To minimize the number of connections to the DB you should use a decent caching layer (redis/memcached will do). Once you have that, and a use a connection pool, you'll be able to do a lot more with asyncio than with threads.

> The GIL doesn't block on IO so threads work just fine, and most applications are CRUD / transactional applications which means you can't spawn off several dozen queries within a transaction at the same time anyway, you need to wait for each one to complete serially regardless.

If you have 10 threads and relatively low server load - the threads will work just fine. However GIL will make things really slow under load.

zzzeek · on Aug 4, 2016

> This simply isn't true for all use cases.

Everytime I talk about my blog post, I am super careful to qualify: in this case, "usually has little to do w/ speed" and other qualifications. The blog post as well. But everytime, I get a response, "but that's not always TRUE!" We agree. It is not always true. But it is usually true :).

> If you have to manage 100-1000s of connections

Which I said, usually you are not. Postgresql connections are expensive. You do not want 100s-1000s.

> some of each can be long-living sessions such as websockets

you use websockets to talk to your Postgresql database? Where did I say I was talking about HTTP web services ?

> However GIL will make things really slow under load.

not IO load with a hundred threads or so. Thousands, sure. Otherwise no.

RedCrowbar · on Aug 4, 2016

> Everytime I talk about my blog post, I am super careful to qualify: in this case, "usually has little to do w/ speed" and other qualifications. The blog post as well.

I created an asyncpg suite for your bigdata benchmark (https://bitbucket.org/zzzeek/bigdata). Here are the results obtained from running it on my laptop against local PostgreSQL 9.5.

300 connections:

    Python3.5.2 threads -- avg 9798.16 recs/sec
    Python3.5.2 gevent -- avg 9677.84 recs/sec
    Python3.5.2 uvloop/asyncpg -- avg 11558.24 r/sec

The modified suite is here: https://github.com/elprans/bigdata

zzzeek · on Aug 4, 2016

that's great, you've written a very fast driver. My benchmarks are comparing psycopg2 to itself, under threads / gevent vs. asyncio, to show that asyncio is inherently slower. It follows that if you wrote your driver and all of its protocol improvements into a traditional blocking model, it would also be much faster. You've written something so fast that it overcomes the latency of asyncio (this whole thread is a huge plug for your driver, btw).

It's not that the latency of asyncio is really that big of a deal. The point I try to make is, asyncio != speed. For databases, 95% of the time, it means, a little less speed. Not a big deal, but if you aren't into writing explicit context switching directives throughout 100% of your code when the OS does it for you just fine, it isn't worth it, unless you are in the zone of thousands of arbitrarily performing connections, which is really more of an incoming client connection thing, not usually a database thing.

RedCrowbar · on Aug 4, 2016

> that's great, you've written a very fast driver. My benchmarks are comparing psycopg2 to itself,

OK, I ran the asyncio suite (with uvloop event loop):

Python3.5.2 asyncio+psycopg+uvloop -- avg 8356.84 recs/sec

Really not that slower. There is also an important detail that you forget to mention: the threaded benchmark uses notably more CPU time (about 30% more).

1st1 · on Aug 4, 2016

>> If you have to manage 100-1000s of connections

> Which I said, usually you are not. Postgresql connections are expensive. You do not want 100s-1000s.

I was talking about a micro-service that has to support a lot of connections and work with a DB.

>> some of each can be long-living sessions such as websockets

> you use websockets to talk to your Postgresql database? Where did I say I was talking about HTTP web services ?

So you were talking about programs that don't communicate with the outside world except to a DB?

zzzeek · on Aug 4, 2016

If you write a program that has thousands of persistent websockets open, you do not want all those thousands of connections to each have their own, persistent, long lived database transaction. You will not like your process listing nor will you be happy about the amount of table and row locking going on. Transactions and connection scope should be short lived, and if you're using a connection pool, you definitely are going for that. As my blog post stated, you'd use a thread pool for the database access part, given that the http-facing side is async.

zzzeek · on Aug 4, 2016

> Mike, I've never suggested to have a DB connection per HTTP request.

right, so my suggestion is, keep the DB code in a thread pool. So you can use the async for your 1000-client frontend and not force it into the DB layer where it makes things much more complicated (or what I do, just use gevent :) ).

allendoerfer · on Aug 4, 2016

Sorry to interrupt: You two can click on the timestamp to answer your posts. Please continue, we are listening carefully.

And since I am already here @1st1: How hard would it be to create a synchronous version of your adapter?

1st1 · on Aug 4, 2016

A lot of code in asyncpg is datatypes parsers, buffer abstractions, API facade and other IO independent things. So good news is that it would be possible to reuse a lot of code.

Protocol is implemented in two layers -- the base layer is IO and framework independent. The higher level is designed for asyncio.

Long story short, it's possible to add a synchronous version, but it still will require a lot of work. The API will also have to be split in two.

zzzeek · on Aug 4, 2016

It's possible to make a really nice async sqlalchemy core too, and even some of the ORM. huge amount of work, i started on just the connection pool. Would be nice but require all its own test coverage and everything. If i was the CEO of sqlalchemy inc, it would have been done. But that's not where my salary comes from :)

allendoerfer · on Aug 5, 2016

SQLAlchemy is clearly one of the best ORMs (maybe there is something in Java-land that I don't know). It is beyond me how you are not guilted by cash getting thrown in your direction into making this your job.

Well, I like what bountysource.com and others are doing, maybe we will get there.

infinite8s · on Aug 5, 2016

It's the tragedy of open source. See - https://news.ycombinator.com/item?id=10905845

allendoerfer · on Aug 4, 2016

I think you should open a bounty for it. As questions like "Can I use this with Django/SQLAlchemy/…?" showed, people love drop-in replacements that speed up their code, especially when it is at a low level like this.

Do not want to sound ungrateful, though. Thank you for your work also on you other contributions. Would definitely use this when applicable. Are more performance related projects like this and uvloop planned?

1st1 · on Aug 4, 2016

> I think you should open a bounty for it. As questions like "Can I use this with Django/SQLAlchemy/…?" showed, people love drop-in replacements that speed up their code, especially when it is at a low level like this.

We'll consider doing that. We have to keep in mind that supporting two IO modes will add a lot of maintenance overhead. Anyways, we can certainly continue the discussion on GH, and maybe we can find a champion to do the heavy-lifting.

> Do not want to sound ungrateful, though. Thank you for your work also on you other contributions. Would definitely use this when applicable. Are more performance related projects like this and uvloop planned?

Thank you.

Right now most of our resources are consumed by getting EdgeDB alpha out as soon as possible. We also have plans to create a micro web framework based on httptools and few other exciting ideas. Stay tuned :)

Also check out PEP 525 -- another cool thing we're trying to make happen in Python 3.6.

1st1 · on Aug 4, 2016

> right, so my suggestion is, keep the DB code in a thread pool. So you can use the async for your 1000-client frontend and not force it into the DB layer where it makes things much more complicated (or what I do, just use gevent :) ).

Looks like a totally valid suggestion for folks who write async code and want to continue to use SQL Alchemy.

However, if you don't use SQL Alchemy (or another ORM) and just want to work with the DB directly in an async/await code base, having a threadpool will only needlessly complicate things.

1st1 · on Aug 4, 2016

> If you write a program that has thousands of persistent websockets open, you do not want all those thousands of connections to each have their own, persistent, long lived database transaction. [..]

Mike, I've never suggested to have a DB connection per HTTP request. Please re-read my other comment in this thread.