For people interested in more detail about Postgres internals I cannot recommend https://www.interdb.jp/pg/ enough - an excellent text with a lot of depth.
Is it though? I mean, isn't the appeal of a relational DBMS that it is a black box to the user, and that it's the query planner's job to yield an efficient search for a given goal? That job is easier if the DBMS developers are free to choose the internal representation as needed and aren't bound by previously disclosed details.
> If the page size is too small, rows won’t fit inside the page and if it’s too large there is risk of write failure because hardware generally can only guarantee atomicity for a fixed size blocks which can vary disk to disk (usually ranges from 512 bytes to 4096 bytes).
Pretty sure the 8kB page size is not about atomicity. It's true old storage mostly promised 512B atomicity (AFAIK it depends on the operation, but that's irrelevant here) and we rely on that in some places. But even if it were 4kB, that would still be just half of an 8kB page. So an 8kB page might still get torn, and we can't rely on that.
The simple truth is that this is a compromise - smaller pages benefit OLTP, larger pages benefit OLAP. And 8kB is somewhere in the middle. That's all.
You can get better performance from e.g. 4KB pages, especially if that aligns with filesystem / SSD pages, etc.
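For reference, the page size is baked in at build time (there's a ./configure --with-blocksize option, so a non-8kB build means recompiling), and torn pages are handled separately by writing full page images to the WAL. Both are visible from psql:

    -- Compiled-in page size (8192 on a stock build):
    SHOW block_size;

    -- 'on' by default; full-page writes after a checkpoint are the
    -- actual protection against torn 8kB pages:
    SHOW full_page_writes;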
Why do some databases choose segment files for persistence, while others choose a key-value store?
From my perspective, using a key-value store is very convenient because you can then outsource persistence to a distributed, fault-tolerant key-value store.
Many query plans (e.g. sorts, joins, aggregation) involve scanning large swaths of data. The importance of sequential access for such operations cannot be overstated. Storage, processors, and networks are designed to make sequential access performant.
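A rough way to see this in practice, using a made-up orders table that's bigger than RAM:

    -- Large aggregations are typically served by a sequential scan;
    -- BUFFERS shows how many 8kB pages were touched, and a Seq Scan
    -- reads them in file order.
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT customer_id, sum(amount)
    FROM orders
    GROUP BY customer_id;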
Some SQL databases are backed by key-value stores, but they generally store large segments in the values rather than individual records (e.g. Snowflake, Athena).
Segments also allow you to store some per-segment statistics, for example the min and max of each numeric column, which allows the optimiser to skip entire segments when filtering. More tricks can be found in the Snowflake paper.
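Within Postgres itself, the closest analogue is probably a BRIN index - it keeps min/max per block range (128 pages by default) so scans can skip whole ranges. A sketch with a made-up events table:

    -- Min/max per block range; on a roughly insertion-ordered column
    -- this lets a range predicate skip most of the table.
    CREATE INDEX events_created_brin ON events USING brin (created_at);

    SELECT count(*)
    FROM events
    WHERE created_at >= now() - interval '1 day';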
There was another article posted today about LSM trees, and how they use segment files. I think at the end of the day the "distributed, fault tolerant, key-value storage" is using segment files too. By having your database use segment files directly you're just skipping a middleman.
Not sure that I qualify as an expert, but in the case of Postgres the choice was probably influenced by the fact that readily available, distributed, fault tolerant, key-value storage wouldn't exist for at least another decade.
> VACCUM as expected defragmented page by moving the tuples around.
I wasn't paying full attention to the code blocks initially so these two statements confused me until I went back. VACUUM FULL is what was actually run - VACUUM would not have moved tuples around, but if VACUUM was run before the INSERT, then that empty spot from the DELETE could have been reused.
Yeah, VACUUM only shuffles rows within a page, to maximize the amount of free space available for new data.
VACUUM FULL essentially rebuilds the segment files - it creates new files and shovels all rows from the old ones. In the past it was implemented differently, by actually moving rows (a bit like defrag tools in Windows) but that was very expensive and inefficient.
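You can watch the difference through ctid (the physical location of a row: page number plus item number). A rough sketch with a throwaway table:

    CREATE TABLE t AS
    SELECT g AS id, repeat('x', 500) AS pad
    FROM generate_series(1, 1000) g;

    DELETE FROM t WHERE id % 2 = 0;

    SELECT ctid, id FROM t LIMIT 5;   -- note the (page, item) positions

    VACUUM t;        -- dead tuples become reusable, but live rows keep
                     -- their ctid; the files are not rewritten
    SELECT ctid, id FROM t LIMIT 5;

    VACUUM FULL t;   -- rewrites the table into new files; rows are
                     -- packed together and get brand new ctids
    SELECT ctid, id FROM t LIMIT 5;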
It might be interesting to compare how postgres' storage (~LSM+b-tree?) differs from b-epsilon trees which is the basis of betrfs discussed recently: https://news.ycombinator.com/item?id=29403320
As you said, it's not easy. This is the long and the short of it. Especially if you want to compress multiple pages together and somehow reasonably handle compression of toasted values across multiple rows and pages.
That said, there has been some progress in the general direction of allowing this: supporting new storage implementations, e.g. zheap (https://wiki.postgresql.org/wiki/Zheap). But it's a large effort - consider that it has to implement new crash recovery, for starters.
There are extensions that do this, e.g. TimescaleDB with its column-oriented storage (delta-of-delta + RLE for integers, Gorilla for reals, dictionary encoding for the rest).
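For the record, enabling it looks roughly like this, going from memory of the TimescaleDB docs (table and column names are made up, and 'metrics' is assumed to already be a hypertable):

    ALTER TABLE metrics SET (
        timescaledb.compress,
        timescaledb.compress_segmentby = 'device_id',
        timescaledb.compress_orderby   = 'ts DESC'
    );

    -- Automatically compress chunks once they are a week old.
    SELECT add_compression_policy('metrics', INTERVAL '7 days');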
Postgres performs compression on datatypes like array (as part of TOAST). Compression adds overhead, and I'm sure they have a reason not to compress everything - and I'm sure that reason has to do with prioritizing CPU over IOPS.
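There are at least a few knobs for that TOAST compression - a sketch, with made-up table/column names (SET COMPRESSION needs PG 14 or newer, if I remember right):

    -- How much space a value actually takes on disk after TOAST compression:
    SELECT pg_column_size(big_array_column) FROM some_table LIMIT 5;

    -- Pick the compression algorithm per column (PG 14+):
    ALTER TABLE some_table ALTER COLUMN big_array_column SET COMPRESSION lz4;

    -- Or opt a column out of compression entirely (still stored out of line):
    ALTER TABLE some_table ALTER COLUMN big_array_column SET STORAGE EXTERNAL;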
Compressed storage would bring huge performance and size-on-disk wins. I suspect the issue is the difficulty and resources needed to add compression on top of the existing design rather than a decision that the tradeoffs are not worth it.
Well, the simple truth is that adding efficient compression to the row storage (which is what heap does) is not really possible. Or more precisely - you can do that outside the database by using a filesystem with compression (like ZFS), and doing it within the database won't give you much advantage. Which is probably why no one has really proposed implementing it - the cost/gain ratio is just not worth it.
The problem with heap and compression is that it mixes a lot of different data (i.e. data from different columns), which is not great for compression ratio. To address that, it's necessary to use some variant of a columnar format, and allowing such stuff is one of the goals of the table AM API, zedstore AM etc.
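The entry point for that is the USING clause on CREATE TABLE (PG 12+): heap is simply the default table access method, and a columnar AM would plug in the same way. The second statement below uses a made-up AM name, since zedstore isn't something you can just install as an extension today, as far as I know:

    -- Explicitly the default today:
    CREATE TABLE rows_t (id int, payload text) USING heap;

    -- With a hypothetical columnar access method provided by an extension:
    -- CREATE TABLE cols_t (id int, payload text) USING some_columnar_am;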