Overall I agree with this. For dumping just the schema definitions, I'm not that happy with it. My primary use-case for a schema dump is to run it after applying migrations in dev/ci, so that if two engineers each open their own branch with incompatible schema migrations, someone gets a git conflict.
In this situation, pg_dump is really annoying: it will give you a schema-only dump, but that dump is not "stable", in the sense that
- dump the schema
- create a new database from the schema
- dump the schema again
often gives you two different files, due to minor re-ordering of tables/statements.
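To make the churn concrete, here's a minimal sketch (simulated with hard-coded statements rather than real pg_dump output, so the table names are made up) of how two dumps of the same schema can differ only in ordering, which is enough to produce diffs and git conflicts:

```python
import difflib

# Simulated schema-only dumps: same statements, different order --
# the kind of churn a pg_dump -> restore -> pg_dump round trip can produce.
dump_a = "CREATE TABLE users (id bigint PRIMARY KEY);\nCREATE TABLE posts (id bigint PRIMARY KEY);\n"
dump_b = "CREATE TABLE posts (id bigint PRIMARY KEY);\nCREATE TABLE users (id bigint PRIMARY KEY);\n"

# Semantically identical schemas...
assert sorted(dump_a.splitlines()) == sorted(dump_b.splitlines())

# ...but the files differ, so git reports a change on every re-dump.
diff = list(difflib.unified_diff(dump_a.splitlines(), dump_b.splitlines(), lineterm=""))
assert diff
```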
The other use case I rely on: having a single place where the schema is defined is actually super helpful for debugging, and can be an entrypoint for other tools. Sometimes it's nicer to be able to open that up in your editor rather than explore through `psql` or read through many smaller migration files.
Because pg_dump does not generate a stable dump (i.e. same schema == same file contents), you're constantly getting noisy, irrelevant changes to the file, which makes it essentially unusable (or at least very annoying) if you keep that file committed.
> Now, I expect that somebody is going to point out that pg_dump does not solve the dependency problem 100% reliably in all cases. And that's correct. However, I think that complaint just shows how spoiled PostgreSQL users are. Recreating a complex database schema without a dependency-aware tool like pg_dump is an absolute nightmare.
If you're interested in a tool that is dependency-aware, AND allows you to explicitly specify dependencies for the cases that pg_dump doesn't handle (primarily between functions and other objects, iirc), check out my CLI tool pgmigrate [0]. You don't need to use it to manage your migrations; you can just use the `dump` subcommand. It's a little under-documented, and it's missing some obvious improvements (it can't handle multiple Postgres schemas/namespaces right now, and only lets you dump one at a time), but I've been relying on it and using it with pleasure.
Classic problems of state-based vs. migration-based management of your database code. What's wrong with information schema and an order by? If you are using migrations, then the creation dates should generally be in the right order as well.
I’m not sure I understand — can you elaborate on “what’s wrong with an information schema and an order by”? I don’t see how that’s connected to the problems I described with pg_dump or the solution offered by pgmigrate, but I’d be curious to learn another way of solving the problem.
Just the problem of the ordering of state in one database not matching another database after you've run migrations.
Information schema is the authoritative source for all the information about the meta-state in most database systems (or at least the cross-platform one), which means you can write a query against these views and (if I am not mistaken) re-assemble identical state by applying the objects in the order they were originally created.
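For what it's worth, a sketch of what such a query might look like on Postgres. Note the assumption: information_schema itself exposes no creation timestamp, so this falls back to `pg_catalog` OIDs, which usually (not always) increase with creation order. Running it needs a live database and a driver such as psycopg; only the query text is shown here.

```python
# Approximate "tables in creation order" on Postgres. The catalog
# column names are real; using OIDs as a creation-order proxy is the
# assumption, and it can be wrong after OID wraparound or restores.
TABLES_IN_APPROX_CREATION_ORDER = """
SELECT c.relname
FROM pg_catalog.pg_class AS c
JOIN pg_catalog.pg_namespace AS n ON n.oid = c.relnamespace
WHERE n.nspname = 'public'
  AND c.relkind = 'r'      -- ordinary tables only
ORDER BY c.oid;            -- OID as a proxy for creation order
"""
```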
I guess the dependency order is nice, but I’ve been frustrated that Postgres doesn’t provide an efficient binary data format that user code can read or write easily. The SQL, text, and CSV formats are all reasonable and nice for small databases, but if you need to move and especially transform many gigabytes of Postgres data, you are out of luck.
The on-disk binary formats of pg_dump and COPY are two different, opaque formats; the COPY docs have this to say about the binary format:
> To determine the appropriate binary format for the actual tuple data you should consult the PostgreSQL source, in particular the send and recv functions for each column’s data type (typically these functions are found in the src/backend/utils/adt/ directory of the source distribution)
I didn’t find an easy-to-use 3rd-party library for decoding or reading these native data types, although maybe one exists (in which case I’d chalk my pain up to Google getting worse…)
It feels like such a loss to spend CPU cores parsing ASCII CSV, or even worse, interpreting and executing millions of SQL INSERT statements. To approach any kind of reasonable throughput near the maximum of disk or network IO, the only answer is to have a big batch job export the whole DB into a reasonable binary format like Parquet/Arrow/whatever, and then use a replication job to keep it up to date using deltas. You know, if you have the time and money to set up a whole data-warehouse thing.
Then your transform job can read the 3rd-party binary data format relatively quickly and do its transform, but when it comes to moving the data back into Postgres, the fastest path I’ve found is to turn it back into CSV, stream the CSV into Postgres using COPY, and spend valuable Postgres CPU cycles having it re-parse the CSV.
Has anyone found a faster pathway to write bulk data into Postgres, perhaps one that doesn’t involve at least one trip through ASCII?
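One pathway that skips ASCII entirely is `COPY ... WITH (FORMAT binary)`: the PGCOPY framing described in the COPY docs is simple enough to emit from user code. A minimal sketch for a hypothetical two-column table of (int4, text); the framing follows the docs, while the table shape and any driver wiring are assumptions:

```python
import struct

def pgcopy_binary(rows):
    """Encode rows of (int4, text) in PostgreSQL's binary COPY format.

    Framing per the COPY docs: an 11-byte signature, an int32 flags
    field, an int32 header-extension length, then one tuple per row,
    then a 16-bit -1 trailer. Only int4 and text are handled here.
    """
    out = bytearray()
    out += b"PGCOPY\n\xff\r\n\x00"            # fixed 11-byte signature
    out += struct.pack("!ii", 0, 0)           # flags, header-extension length
    for n, s in rows:
        out += struct.pack("!h", 2)           # field count for this tuple
        out += struct.pack("!ii", 4, n)       # int4: byte length, then value
        encoded = s.encode("utf-8")
        out += struct.pack("!i", len(encoded)) + encoded  # text: length-prefixed
    out += struct.pack("!h", -1)              # end-of-data trailer
    return bytes(out)

payload = pgcopy_binary([(1, "one"), (2, "two")])
```

The resulting bytes can then be streamed to the server via a driver's COPY support (e.g. psycopg) against `COPY mytable FROM STDIN WITH (FORMAT binary)`; whether this actually beats CSV parsing for your workload is worth benchmarking rather than assuming.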
It's been pulled closed-source by Broadcom now, but if anyone is interested, there was a fairly cool from-scratch Go implementation of the same basic ideological approach as pg_dump, for Greenplum. At the very least, it's useful for seeing the catalog queries it takes to pull each piece of information that goes into a schema.
[0]: https://github.com/peterldowns/pgmigrate