How PostgreSQL stores rows (ketansingh.me)
340 points by mattrighetti on Feb 9, 2022 | 41 comments


For people interested in more details about Postgres internals, I cannot recommend https://www.interdb.jp/pg/ enough; an excellent text with lots of detail.


What a great resource. Thank you for the recommendation!


Is it though? I mean, isn't the appeal of a relational DBMS that it is a black box to the user and it's the query planner's job to yield an efficient search for a given goal? That job is easier if the DBMS developers are free to choose the internal representation as needed and aren't bound by previously disclosed details.


> I mean, isn't the appeal of a relational DBMS that it is a black box to the user and it's the query planner's job to yield an efficient search for a given goal?

I wish


The book is still great, and some of us like to dig into database internals :)

It's by no means a requirement to use the database


Thank you! Sometimes the HN comments really pay off.


I wish I could +10 you.

I agree, a fantastic resource.


> If the page size is too small, rows won’t fit inside the page and if it’s too large there is risk of write failure because hardware generally can only guarantee atomicity for a fixed size blocks which can vary disk to disk (usually ranges from 512 bytes to 4096 bytes).

Pretty sure the 8KB page size is not about atomicity. It's true old storage mostly promised 512B atomicity (AFAIK it depends on the operation, but that's irrelevant here) and we rely on that in some places. But even if it were 4KB, that would still be only half of an 8KB page. So an 8KB page might still get torn, and we can't rely on atomicity.

The simple truth is that this is a compromise - smaller pages benefit OLTP, larger pages benefit OLAP. And 8KB is somewhere in the middle. That's all.

You can get better performance from e.g. 4KB pages, especially if that aligns with filesystem / SSD pages, etc.
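
(For reference, the compiled-in page size is easy to check, and changing it means rebuilding from source; a minimal sketch:)

  -- Show the page size the server was built with (8192 bytes by default).
  SHOW block_size;

  -- Changing it requires recompiling, e.g. for a 4KB build:
  --   ./configure --with-blocksize=4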


if this post interests you at all I highly recommend watching "Intro to Database Systems" https://www.youtube.com/playlist?list=PLSE8ODhjZXjbohkNBWQs_...

I recently (as in 6 hours ago) had to relay to my boss some information about the inner workings of Postgres. Studying this course helped me do that.


The Postgres documentation is also excellent, far better than that of many other software packages. Postgres' academic lineage really shines there.


I wrote a PostgreSQL extension years ago that visualises pages (as text): https://github.com/hollobon/pagevis
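
(If you only need the raw numbers rather than a picture, the contrib pageinspect extension exposes similar page-level details; a quick sketch, assuming a table named t:)

  CREATE EXTENSION pageinspect;

  -- Page header of block 0: free-space pointers, LSN, checksum, ...
  SELECT * FROM page_header(get_raw_page('t', 0));

  -- One row per line pointer: offset, length, xmin/xmax of each tuple.
  SELECT lp, lp_off, lp_len, t_xmin, t_xmax
  FROM heap_page_items(get_raw_page('t', 0));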


This is why I love PG, people create these amazing extensions.

I wish Mongo/WT had a way to visualize the on-disk structures...


I would love to learn from database experts here:

Why do some databases use segment files for persistence, while others use a key-value store?

From my perspective, using key-value storage is very convenient because you can then outsource persistence to a distributed, fault-tolerant key-value store.


Many query plans (e.g. sorts, joins, aggregation) involve scanning large swaths of data. The importance of sequential access for such operations cannot be overstated. Storage, processors, and networks are designed to make sequential access performant.

Some SQL databases are backed by key-value stores, but they generally store large segments in the values rather than individual records (e.g. Snowflake, Athena).


Segments also let you store per-segment statistics, for example the min and max of each numeric column, which allow the optimiser to skip entire segments when filtering. More tricks can be found in the Snowflake paper.
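
Postgres' BRIN indexes use the same trick, keeping min/max per range of heap blocks so scans can skip ranges that can't match; a rough sketch with a hypothetical measurements table:

  -- One min/max summary per range of heap blocks (128 pages by default).
  CREATE INDEX ON measurements USING brin (recorded_at);

  -- Block ranges whose min/max don't overlap the predicate are skipped.
  SELECT count(*) FROM measurements
  WHERE recorded_at >= now() - interval '1 day';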


There was another article posted today about LSM trees, and how they use segment files. I think at the end of the day the "distributed, fault tolerant, key-value storage" is using segment files too. By having your database use segment files directly you're just skipping a middleman.



There is no conflict: key-value is an interface, segment files are an implementation. You can implement a key-value store using segment files too.


Not sure that I qualify as an expert, but in the case of Postgres the choice was probably influenced by the fact that readily available, distributed, fault tolerant, key-value storage wouldn't exist for at least another decade.


It's all very well until you discover the distributed, fault-tolerant, key-value storage is implemented using a database.


But what if that database is using a key value store?


it's turtles all the way down


it's tuples all the way down


perfect, thank you


Here is another article that expands on how Postgres stores text data.

https://hakibenita.com/sql-medium-text-performance


Would also recommend reading "Designing Data-Intensive Applications" https://www.oreilly.com/library/view/designing-data-intensiv...

The chapter "Data Structures That Power Your Database" offers a great overview of the various storage mechanisms used by databases.


> What if we run VACCUM ?

> VACCUM as expected defragmented page by moving the tuples around.

I wasn't paying full attention to the code blocks initially so these two statements confused me until I went back. VACUUM FULL is what was actually run - VACUUM would not have moved tuples around, but if VACUUM was run before the INSERT, then that empty spot from the DELETE could have been reused.


A normal vacuum will also defragment page contents, see PageRepairFragmentation()[1], which is called during normal vacuum from lazy_vacuum_page().

[1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f...


Yeah, VACUUM only shuffles rows within a page, to maximize the amount of free space available for new data.

VACUUM FULL essentially rebuilds the segment files - it creates new files and shovels all rows from the old ones. In the past it was implemented differently, by actually moving rows (a bit like defrag tools in Windows) but that was very expensive and inefficient.
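
An easy way to see the difference, assuming a throwaway table t (ctid is the row's (page, line pointer) location):

  CREATE TABLE t (id int, payload text);
  INSERT INTO t SELECT g, repeat('x', 500) FROM generate_series(1, 40) g;
  DELETE FROM t WHERE id % 2 = 0;

  -- Plain VACUUM frees the deleted tuples and compacts the data area
  -- within each page; surviving rows stay on their original page.
  VACUUM t;
  SELECT id, ctid FROM t;

  -- VACUUM FULL rewrites the table into new files, so surviving rows
  -- are packed from page 0 onward and get new ctids.
  VACUUM FULL t;
  SELECT id, ctid FROM t;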


Cool article! Any idea where to find a similar explanation for MySQL?


It might be interesting to compare how postgres' storage (~LSM+b-tree?) differs from b-epsilon trees which is the basis of betrfs discussed recently: https://news.ycombinator.com/item?id=29403320


I know it's not easy, but what's preventing PostgreSQL from having compressed pages?


This post glosses over it, but larger fields are stored via the TOAST mechanism [1], which does support compression.

[1]: https://www.postgresql.org/docs/current/storage-toast.html
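
For what it's worth, newer versions also let you choose the TOAST compression method per column; a sketch assuming PostgreSQL 14+ built with lz4 support:

  CREATE TABLE docs (id int, body text);
  ALTER TABLE docs ALTER COLUMN body SET COMPRESSION lz4;

  INSERT INTO docs VALUES (1, repeat('some compressible text ', 1000));

  -- Reports the method actually used for the stored value (lz4/pglz/NULL).
  SELECT pg_column_compression(body) FROM docs;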


As you said, it's not easy. This is the long and the short of it. Especially if you want to compress multiple pages together and somehow reasonably handle compression of toasted values across multiple rows and pages.

That said, there has been some progress in the general direction of allowing this: supporting new storage implementations, e.g. zheap (https://wiki.postgresql.org/wiki/Zheap). But it's a large effort. Consider that it has to implement new crash recovery, for starters.


Under some conditions they do [0]. How useful it ends up being depends on your data - pretty effective in my experience.

[0] https://www.postgresql.org/docs/current/storage-toast.html


You can, if you wish, run PostgreSQL on a compressed filesystem.

But you'd better be sure that its storage guarantees match what Postgres needs, else you'll risk database corruption.


There are extensions that do this, e.g. TimescaleDB with its column-oriented compression (delta-of-delta + RLE for integers, Gorilla for floats, dictionary encoding for others).
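
Roughly what that looks like (a sketch; the metrics hypertable and its columns are made up):

  -- Column-oriented compression, batched per device and ordered by time.
  ALTER TABLE metrics SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'device_id',
    timescaledb.compress_orderby   = 'ts DESC'
  );

  -- Automatically compress chunks older than a week.
  SELECT add_compression_policy('metrics', INTERVAL '7 days');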


Postgres performs compression on datatypes like arrays. Compression adds overhead. I'm sure they have a reason not to compress everything, and I'm sure that reason has to do with prioritizing CPU over IOPS.


Compressed storage would bring huge performance and size-on-disk wins. I suspect the issue is the difficulty and resources needed to add compression on top of the existing design, rather than a decision that the tradeoffs are not worth it.


Well, the simple truth is that adding efficient compression to row storage (which is what the heap is) is not really possible. Or more precisely: you can do it outside the database by using a filesystem with compression (like ZFS), and doing it within the database won't give you much advantage. Which is probably why no one has really proposed implementing it; the cost/gain ratio is just not worth it.

The problem with heap and compression is that it mixes a lot of different data (i.e. data from different columns), which is not great for the compression ratio. To address that, it's necessary to use some variant of a columnar format, and allowing such things is one of the goals of the table AM API, the zedstore AM, etc.


A bit off topic, but I have created a simple tool which converts simple text to a database script: Postgres, MSSQL and MySQL are supported.

https://text2db.com/



