
ZFS is Sun.

Btrfs is Oracle.

And neither has anything to do with a database, IMHO.



Oracle DBAs used to recommend using raw block devices to back the DB, a long, long time ago.

I never did get one to explain why. I suspect it was received wisdom from older VFS designs that cached disk I/O, and the effect that had on the DB engine. Almost certainly overtaken by disk controller logic, filesystem caches, striping and RAID, and increasing memory sizes.

There are good pages on the web discussing optimising ZFS for PostgreSQL, tuned to write disk-block-aligned data through the ZFS L2ARC.


> Almost certainly overtaken by disk controller logic, filesystem caches, striping and RAID, and increasing memory sizes.

In practice this might be true if the default settings of each layer happen to align sympathetically, and/or you know what you're doing and how to tune them all. But from first principles it stands to reason that the extra abstraction layer of a file system isn't providing the database storage engine with anything useful.

Same argument for unikernels, distroless containers, LiquidVM (https://docs.oracle.com/cd/E13223_01/wls-ve/docs92-v11/confi...) etc…


Yes. It's still a valid proposition that raw disk is fastest if you map cleanly onto the block abstraction the integrated controller offers, if only because you avoid the indirect call stacks otherwise needed to carry data into the device.
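
To make that concrete, here's a minimal sketch (Linux, a hypothetical /dev/sdb1, illustrative sizes, not a recommendation) of writing one block straight to a raw device with O_DIRECT, bypassing the page cache and any filesystem:

    /* Hypothetical sketch: write one 8 KiB block directly to a raw device.
     * O_DIRECT skips the page cache; buffer and offset must be aligned to
     * the device's logical block size. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define BLOCK 8192  /* multiple of the device's logical block size */

    int main(void) {
        int fd = open("/dev/sdb1", O_WRONLY | O_DIRECT);  /* illustrative device */
        if (fd < 0) { perror("open"); return 1; }

        void *buf;
        /* O_DIRECT requires an aligned buffer (and aligned file offset) */
        if (posix_memalign(&buf, 4096, BLOCK) != 0) { perror("posix_memalign"); return 1; }
        memset(buf, 0xAB, BLOCK);

        /* one syscall, no page-cache copy, no filesystem in between */
        if (pwrite(fd, buf, BLOCK, 0) != BLOCK) { perror("pwrite"); return 1; }

        free(buf);
        close(fd);
        return 0;
    }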

But that said: the DBA in question constructed a concatenated drive, not a stripe, and we were slightly dismayed when we found out it was working at the speed of one disk interface and using only the front pack, not all of them in some kind of parallel or interleaved manner. Not a very smart DBA, it turned out.


The reason for using raw/block devices back then (or Veritas VxFS, which created additional raw devices for accessing VxFS files in a manner that bypassed the regular OS filesystem I/O code path and its restrictions on reads and writes) was concurrent I/O.

In a busy database there are lots of processes reading a datafile and multiple processes writing to it (like DBWR checkpointing dirty blocks from the buffer cache, but also some others), so this caused unnecessary file-level contention in the OS kernel. Note that Oracle has I/O concurrency control built in: you can't have two processes accidentally writing the same block to storage at the same time, or a reader reading a "torn block" due to an ongoing write (the serialization happens at the buffer lock/pin level in Oracle code).

So if you wanted concurrent I/O, you had to use raw devices or Veritas VxFS back then. Later on, concurrent I/O became available for some other filesystems too, like the "cio" mount option on AIX JFS2, or "forcedirectio" on Solaris UFS.
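
To illustrate what concurrent I/O buys you, a hypothetical sketch (made-up file name and sizes): two threads issuing pwrite() to disjoint blocks of the same open file. On a raw device, or a filesystem mounted for concurrent/direct I/O, these writes can proceed in parallel; on an older filesystem with a per-file writer lock they serialize in the kernel even though they never touch the same block. The database itself (e.g. Oracle's buffer lock/pin layer) is what guarantees the blocks really are disjoint.

    /* Hypothetical sketch: two writers hitting disjoint 8 KiB blocks of the
     * same file or device node via pwrite(). Whether they actually run in
     * parallel depends on the filesystem's file-level locking. */
    #include <fcntl.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define BLOCK 8192

    static int fd;

    static void *writer(void *arg) {
        long blockno = (long)arg;
        char buf[BLOCK];
        memset(buf, (int)blockno, sizeof buf);
        /* pwrite carries its own offset, so there is no shared seek pointer */
        if (pwrite(fd, buf, sizeof buf, blockno * (off_t)BLOCK) != (ssize_t)sizeof buf)
            perror("pwrite");
        return NULL;
    }

    int main(void) {
        fd = open("datafile.dbf", O_WRONLY);   /* illustrative; could be a raw device node */
        if (fd < 0) { perror("open"); return 1; }

        pthread_t t1, t2;
        pthread_create(&t1, NULL, writer, (void *)1L);
        pthread_create(&t2, NULL, writer, (void *)2L);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        close(fd);
        return 0;
    }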




