For people interested in more detail about Postgres internals I cannot recommend https://www.interdb.jp/pg/ enough - an excellent text with a lot of depth.
Is it though? I mean, isn't the appeal of a relational DBMS that it is a black box to the user, and that it's the query planner's job to yield an efficient search for a given goal? That job is easier if the DBMS developers are free to choose the internal representation as needed and aren't bound by previously disclosed details.
> If the page size is too small, rows won’t fit inside the page and if it’s too large there is risk of write failure because hardware generally can only guarantee atomicity for a fixed size blocks which can vary disk to disk (usually ranges from 512 bytes to 4096 bytes).
Pretty sure the 8kB page size is not about atomicity. It's true old storage mostly promised 512B atomicity (AFAIK it depends on the operation, but that's irrelevant here) and we rely on that in some places. But even if it were 4kB, that would still be just half of an 8kB page. So an 8kB page might still get torn, and we can't rely on that.
The simple truth is that this is a compromise - smaller pages benefit OLTP, larger pages benefit OLAP. And 8kB is somewhere in the middle. That's all.
You can get better performance from e.g. 4KB pages, especially if that aligns with filesystem / SSD pages, etc.
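For reference, the page size is baked in at build time (there's a ./configure --with-blocksize option, so a non-8kB build means recompiling), and torn pages are handled separately by writing full page images to the WAL. Both are visible from psql:

    -- Compiled-in page size (8192 on a stock build):
    SHOW block_size;

    -- 'on' by default; full-page writes after a checkpoint are the
    -- actual protection against torn 8kB pages:
    SHOW full_page_writes;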
Why do some databases choose segment files for persistence, while others choose a key-value store?
From my perspective, using a key-value store is very convenient because you can then outsource persistence to a distributed, fault-tolerant key-value store.
Many query plans (e.g. sorts, joins, aggregation) involve scanning large swaths of data. The importance of sequential access for such operations cannot be overstated. Storage, processors, and networks are designed to make sequential access performant.
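A rough way to see this in practice, using a made-up orders table that's bigger than RAM:

    -- Large aggregations are typically served by a sequential scan;
    -- BUFFERS shows how many 8kB pages were touched, and a Seq Scan
    -- reads them in file order.
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT customer_id, sum(amount)
    FROM orders
    GROUP BY customer_id;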
Some SQL databases are backed by key-value stores, but they generally store large segments in the values rather than individual records (e.g. Snowflake, Athena).
Segments also allow you to store some per-segment statistics, for example the min and max of each numeric column, which allows the optimiser to skip entire segments when filtering. More tricks can be found in the Snowflake paper.
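Within Postgres itself, the closest analogue is probably a BRIN index - it keeps min/max per block range (128 pages by default) so scans can skip whole ranges. A sketch with a made-up events table:

    -- Min/max per block range; on a roughly insertion-ordered column
    -- this lets a range predicate skip most of the table.
    CREATE INDEX events_created_brin ON events USING brin (created_at);

    SELECT count(*)
    FROM events
    WHERE created_at >= now() - interval '1 day';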
There was another article posted today about LSM trees, and how they use segment files. I think at the end of the day the "distributed, fault tolerant, key-value storage" is using segment files too. By having your database use segment files directly you're just skipping a middleman.
Not sure that I qualify as an expert, but in the case of Postgres the choice was probably influenced by the fact that readily available, distributed, fault tolerant, key-value storage wouldn't exist for at least another decade.
> VACCUM as expected defragmented page by moving the tuples around.
I wasn't paying full attention to the code blocks initially so these two statements confused me until I went back. VACUUM FULL is what was actually run - VACUUM would not have moved tuples around, but if VACUUM was run before the INSERT, then that empty spot from the DELETE could have been reused.
Yeah, VACUUM only shuffles rows within a page, to maximize the amount of free space available for new data.
VACUUM FULL essentially rebuilds the segment files - it creates new files and shovels all rows from the old ones. In the past it was implemented differently, by actually moving rows (a bit like defrag tools in Windows) but that was very expensive and inefficient.
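You can watch the difference through ctid (the physical location of a row: page number plus item number). A rough sketch with a throwaway table:

    CREATE TABLE t AS
    SELECT g AS id, repeat('x', 500) AS pad
    FROM generate_series(1, 1000) g;

    DELETE FROM t WHERE id % 2 = 0;

    SELECT ctid, id FROM t LIMIT 5;   -- note the (page, item) positions

    VACUUM t;        -- dead tuples become reusable, but live rows keep
                     -- their ctid; the files are not rewritten
    SELECT ctid, id FROM t LIMIT 5;

    VACUUM FULL t;   -- rewrites the table into new files; rows are
                     -- packed together and get brand new ctids
    SELECT ctid, id FROM t LIMIT 5;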
It might be interesting to compare how postgres' storage (~LSM+b-tree?) differs from b-epsilon trees which is the basis of betrfs discussed recently: https://news.ycombinator.com/item?id=29403320
As you said, it's not easy. This is the long and the short of it. Especially if you want to compress multiple pages together and somehow reasonably handle compression of toasted values across multiple rows and pages.
That said, there has been some progress in the general direction of allowing this: supporting new storage implementations, e.g. zheap (https://wiki.postgresql.org/wiki/Zheap). But it's a large effort - consider that it has to implement new crash recovery, for starters.
There are extensions that do this, e.g. TimescaleDB with its column-oriented storage (delta-of-delta + RLE for integers, Gorilla for reals, dictionary encoding for the rest).
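For the record, enabling it looks roughly like this, going from memory of the TimescaleDB docs (table and column names are made up, and 'metrics' is assumed to already be a hypertable):

    ALTER TABLE metrics SET (
        timescaledb.compress,
        timescaledb.compress_segmentby = 'device_id',
        timescaledb.compress_orderby   = 'ts DESC'
    );

    -- Automatically compress chunks once they are a week old.
    SELECT add_compression_policy('metrics', INTERVAL '7 days');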
Postgres performs compression on datatypes like array (as part of TOAST). Compression adds overhead, and I'm sure they have a reason not to compress everything - and I'm sure that reason has to do with prioritizing CPU over IOPS.
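There are at least a few knobs for that TOAST compression - a sketch, with made-up table/column names (SET COMPRESSION needs PG 14 or newer, if I remember right):

    -- How much space a value actually takes on disk after TOAST compression:
    SELECT pg_column_size(big_array_column) FROM some_table LIMIT 5;

    -- Pick the compression algorithm per column (PG 14+):
    ALTER TABLE some_table ALTER COLUMN big_array_column SET COMPRESSION lz4;

    -- Or opt a column out of compression entirely (still stored out of line):
    ALTER TABLE some_table ALTER COLUMN big_array_column SET STORAGE EXTERNAL;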
Compressed storage would bring huge performance and size-on-disk wins. I suspect the issue is the difficulty and resources needed to add compression on top of the existing design rather than a decision that the tradeoffs are not worth it.
Well, the simple truth is that adding efficient compression to the row storage (which is what heap does) is not really possible. Or more precisely - you can do that outside the database by using a filesystem with compression (like ZFS), and doing it within the database won't give you much advantage. Which is probably why no one has really proposed implementing it - the cost/gain ratio is just not worth it.
The problem with heap and compression is that it mixes a lot of different data (i.e. data from different columns), which is not great for compression ratio. To address that, it's necessary to use some variant of a columnar format, and allowing such stuff is one of the goals of the table AM API, zedstore AM etc.
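The entry point for that is the USING clause on CREATE TABLE (PG 12+): heap is simply the default table access method, and a columnar AM would plug in the same way. The second statement below uses a made-up AM name, since zedstore isn't something you can just install as an extension today, as far as I know:

    -- Explicitly the default today:
    CREATE TABLE rows_t (id int, payload text) USING heap;

    -- With a hypothetical columnar access method provided by an extension:
    -- CREATE TABLE cols_t (id int, payload text) USING some_columnar_am;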