One thing I have long been suspicious of, but never swept the parameter space to know for sure, is how one should set sunit, swidth, agcount, or agsize to exploit the parallelism of an SSD and a modern CPU. It used to be that sunit, swidth, and agsize were critical for RAID arrays because getting them wrong meant weird stripe misalignment, and if you didn't set agsize all of your metadata load would land on a subset of the array. These days an SSD has far more natural parallelism than any RAID ever had, and we have more CPU cores than ever. Are we still supposed to be setting sunit, swidth, and agsize? Or does the flash translation layer mean it doesn't matter? And how best to run mkfs.xfs when your SSD offers LBA sizes other than 512?
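For reference, a sketch of the knobs in question (device name hypothetical; on a bare SSD the mkfs.xfs defaults are usually sane, so this is illustration, not a recommendation):

```shell
# Hypothetical device; everything on it will be destroyed.
DEV=/dev/nvme0n1

# Ask the drive what sector sizes it reports before formatting.
lsblk -o NAME,PHY-SEC,LOG-SEC "$DEV"

# Force a 4096-byte sector size and an explicit allocation group count.
# su/sw (stripe unit/width) only matter when there is a real stripe
# geometry underneath; on a single SSD they are normally left unset.
mkfs.xfs -f -s size=4096 -d agcount=16 "$DEV"

# Inspect what mkfs actually chose (agcount, sector size, etc.).
xfs_info "$DEV"
```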
Not that I'm advocating everyone spam him with their XFS questions, but Dave Chinner has always been helpful and responsive whenever I had informed questions for him. If you raised such a question on lkml or fsdevel directed at him, it wouldn't surprise me one bit if you (and everyone subscribed) saw a helpful response in a timely manner.
You might be right, and I have no reason to doubt the expertise and helpfulness of this individual. However, in my experience, it’s often true that the developers of software are not experts in the systems-level performance and efficiency questions. So I’d love to read about some hard data relevant to the question, some kind of metadata-heavy benchmark or something.
You might want to look up XFS benchmarks from a few years ago and compare them to today's if you think Dave is not an expert on (file)system performance.
I'm a fan of XFS. I've used it for over a decade for all systems that don't need ZFS.
In fact, due to the lifespan of the headless box under my desk (i.e., it predates bootable ZFS root partitions under Linux), its root partition is still XFS, while the actual storage is ZFS-managed.
I think the main issue with ZFS is that you need to learn a lot of concepts to understand how to do things correctly and not fuck up.
To this day I keep some personal notes as warning reminders to be careful about certain operations. I can't remember the exact terminology (I believe these are called "features" in ZFS), but what happened to me not long ago was that if I created a ZFS filesystem (along with a pool) from Ubuntu, it enabled some features that ZFS on FreeBSD didn't recognize, and it wouldn't let me mount the pool.
The other way round, creating it from FreeBSD and then mounting it on Ubuntu, worked as expected. The thing with these "features" is that they can be "upgraded", and if you do so, the pool will again become unmountable on FreeBSD. I spent a whole day on this until, for some reason, I decided to print the feature list, compare the two side by side, and do some searching online.
> if I created a ZFS filesystem (along with a pool) from Ubuntu, it enabled some features that ZFS on FreeBSD didn't recognize, and it wouldn't let me mount the pool.
I mean... shrugs ...you could format it as XFS and not be able to mount it on FreeBSD either? Feature flags seem like a good way to solve this problem.
I was actually going to ask that... Is XFS a good "portable" filesystem across GNU/Linux and the different BSD flavors, with native support or through FUSE? The use case is USB drives. I don't mind losing visibility on Windows or macOS; I just want it to work flawlessly between these without major effort.
I could be wrong but I didn't go with UFS because apparently there are significant implementation differences among BSDs. I was kind of surprised to discover that.
FreeBSD and NetBSD essentially derive from the same source, 4.4BSD-Lite from March 1994. OpenBSD is a fork of NetBSD from 1995.
There's been a lot of divergence, and there's not really anything pushing towards convergence. FreeBSD modified UFS to meet their needs and desires, and the other BSDs went in other directions. There wasn't a lot of clamoring for an on-disk format for data interchange, so there's no big reason to keep the divergent UFSes compatible. Exchange data via networks, tape archives, tars written to block devices, or tarfiles on a widely recognized filesystem (MS-DOS FAT will do).
This pattern of a shared source and divergent development isn't super common. It's pretty rare commercially, and few opensource projects have forks that diverge and stay active for decades.
As a BSD user I think the divergence is a good thing. It means there are substantial differences, which lets you pick an OS that's really tailored to your use case.
In the Linux world there are mainly just userland differences. I like it the BSD way. Of course Linux sees much more commercial input, which I consider a bad thing and one reason I use FreeBSD so much. But for commercial interests it's good to have as much in common as possible, to have more potential customers.
You would probably not encounter that issue these days since FreeBSD 13 switched out the old FreeBSD ZFS tree in favor of the same OpenZFS you'd get on Linux https://cgit.freebsd.org/src/commit/?id=9e5787d2284e
The other comments already covered this, but yes, that is arguably the correct use of features: only enable the features supported by your target hosts (many can only be set at filesystem creation time; others can be enabled later, but only one way, never back).
At the time you last ran it (as a sibling comment also said), FreeBSD had not yet switched to upstream OpenZFS and still ran its own older code. To have FreeBSD compatibility back then, you would have had to enable only the features FreeBSD supported.
Generating the filesystem on FreeBSD and not running "zfs upgrade" on Ubuntu would be one way of doing this, avoiding having to set features manually at all.
Today this is not an issue, everyone runs the same code.
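And on current OpenZFS (2.1+) you can pin this down explicitly with the pool's `compatibility` property instead of remembering which features to avoid; pool and device names here are hypothetical:

```shell
# Create a pool that only enables features a given target understands.
# Predefined feature sets ship in /usr/share/zfs/compatibility.d/.
zpool create -o compatibility=openzfs-2.0-freebsd tank /dev/sdb

# Check which features ended up enabled vs. disabled.
zpool get all tank | grep 'feature@'

# With compatibility set, "zpool upgrade" refuses to enable
# anything outside that set, so an accidental upgrade can't
# render the pool unmountable on the other OS.
```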
Ages ago XFS had a rather nasty behavior in case of power failure - any files opened for write would be deleted after restart. From what I remember this was by design. Has this changed?
What you describe was common on the SGI systems we used at work. Some setups had a configuration file that was constantly written to and read from, and that file would (most of the time) be empty after a power failure. (I don't know, by the way, why the SGI systems didn't have a power-failure emergency-shutdown mode; the power supplies kept power for several seconds. But anyway.)
However: this _never_ happened with XFS on Linux systems. Exact same software. I don't know why. But XFS has been incredibly stable, not only on my personal boxes but also on everything we have provided to customers at work. We need non-varying sustained write rates for huge amounts of data, and XFS is smooth, much better than e.g. ext4 when we compared them (the tests were done years ago; we haven't retested, as XFS just works).
I stayed away from XFS because it had another bad behaviour: after a crash it would replay the log and happily continue. After a couple of crashes the filesystem became so corrupted that even the log replay failed and fsck was useless.
I also tried running fsck after every crash, but that didn't help either (some crashes seemed to mess up the filesystem badly). In the end I stayed with JFS (which I was also testing at the time, together with ReiserFS) because it had the best balance of speed and CPU usage back then.
Dynamically allocated inodes (useful if you work with a very large number of files; ext4 can run out of inodes and refuse to create new files even if you have lots of free space), more stable performance under load (latency and throughput vary less from run to run), and reflinks. In return you lose BSD and Windows compatibility, if you ever need those, and average performance is somewhat lower (it used to be a lot lower, but XFS has caught up very close to ext4).
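The inode difference is easy to see from userspace, by the way; on ext4 the inode total is a fixed number decided at mkfs time, while XFS allocates inodes on demand:

```shell
# Show inode totals and usage for the root filesystem.
# On ext4 the "Inodes" column is fixed when the filesystem is created;
# on XFS it grows as needed (up to the imaxpct limit), so you
# practically never hit "No space left on device" with free space left.
df -i /
```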
Doubtful. Although they've both improved over the years, the conventional wisdom was that XFS' benefits were seen on very large volumes, and ext4 was more efficient for small reads/writes and metadata operations. This explains why XFS is more niche now that btrfs and zfs are around.
RHEL is the only distro I know that defaults to XFS.
How large is "very large", though, and has that changed with time? The last time I used XFS, I would have considered 500GB to be pretty large. Nowadays that's kinda mediocre. I have a 24TB RAID5 that's still on ext4 though; I imagine that qualifies as at least "fairly large"?
The last time I had to participate in a RHEL install, the installer would do ext4 if <16TB, xfs if >16TB.
I find this size unusually arbitrary, but I suspect Red Hat found unwanted behavior in some ext related code. This was after the known issue in e2fsprogs that was fixed around a decade ago preventing fscking of >16TB; RHEL of a decade ago was either "xfs by default" or "xfs if >2TB" or similar, and the installer clearly changed since then.
Casual Googling also says my experience of "RHEL says XFS above 16TB" is out of date, and it's now "XFS above 100TB". And like, look, if you're doing 100TB, use ZFS, stop fucking around and do it right.
One thing that ext4 has and XFS does not is extremely delayed writes: ext4 can postpone writes for tens of seconds, sometimes minutes.
This has various fun implications for software that cares about durable writes, like databases.
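Which is why software that needs durability can't rely on writeback timing on any filesystem; the classic pattern is write, fsync, then rename into place (sketched here with GNU coreutils, whose `sync FILE` calls fsync(2) on its argument; paths are illustrative):

```shell
# Write to a temp file, force it to stable storage, then atomically
# rename it into place, so a crash leaves either old or new contents.
printf 'balance=42\n' > /tmp/state.tmp
sync /tmp/state.tmp        # fsync the data before it becomes visible
mv /tmp/state.tmp /tmp/state
sync -f /tmp/state         # flush the containing filesystem for the rename
```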
XFS is incredibly robust. I manage many petabytes of storage on XFS machines; some have XFS volumes in the 2 PiB range. And it's also extremely fast. I remember when I created my first directory with several billion files and it just worked :)
My exact experience as well. We use XFS for every system at work which needs lots of data input/output and large amounts of files. Never had a filesystem failure.
If you look at database benchmarks on TPC.org, you will find that many of them are on XFS (as a consequence of using RHEL/CentOS), which was designed with database performance in mind.
It used to be necessary for Oracle to use raw partitions to get the very best database performance. I haven't heard of anyone doing that for many years.
"The default implementation in most kernels is simply a file-global lock placed at the in-memory inode, making sure there can be only one writer per inode. Implementers of databases hate that because it limits the write concurrency on any single file to One. This is, for example, why Oracle recommends that you make tablespaces from multiple files, each no larger than one GB.
"XFS, in O_DIRECT mode, removes this lock and allows atomic, concurrent writes, making database people very happy."
Developers at Oracle have posted a number of blog posts on their code changes for Linux XFS.
Note that Oracle still uses "raw" block devices for high-performance configurations. It's called Oracle Automatic Storage Management (ASM), basically you give it a bunch of block devices without any partitions or file systems on them and ASM does its own "extent" placement, striping, mirroring, snapshots, etc. And Oracle's Exadata scale-out appliance also uses block devices without any filesystems directly, the DB nodes talk to any storage nodes over InfiniBand/RDMA (older hardware) or RDMA over Ethernet/RoCE.
Btrfs is the filesystem for which they are most responsible.
They explicitly do not support its use for their database.
"[Btrfs] was initially designed at Oracle Corporation in 2007 for use in Linux, and since November 2013, the file system's on-disk format has been declared stable in the Linux kernel."
Oracle DBAs used to recommend using raw block devices to back the DB, a long long time ago.
I never did get one to explain why. I suspect it was received wisdom about VFS models that cached disk I/O and their effect on the DB engine. Almost certainly overtaken by disk controller logic, filesystem caches, striping and RAID, and increasing memory sizes.
There are good pages on the Web discussing optimising ZFS for Postgres, tuned to write disk-block-aligned data through the ZFS ARC/L2ARC.
> Almost certainly overtaken by both disk controller logic, file systems cache, striping and RAID and increasing memory sizes.
In practice this might be true if the default settings of each layer happen to align sympathetically, and/or you know what you're doing and how to tune them all. But from first principles it stands to reason that the extra abstraction layer of a filesystem isn't providing the database storage engine with anything useful?
Yes. It's still a valid proposition that raw disk is fastest if you map cleanly onto the block abstraction the integrated controller offers, if only because you avoid the indirect call stacks that carry data into the device.
But that said: the DBA in question constructed a concatenated drive, not a stripe, and was slightly dismayed when we found out it was working at the speed of one disk interface and using only the front pack, not all of them in some kind of parallel or interleaved manner. Not a very smart DBA, it turned out.
The reason for using raw/block devices back then (or Veritas VxFS, which created additional raw devices for accessing VxFS files in a way that bypassed the regular OS filesystem I/O codepath and its restrictions on reads & writes) was concurrent I/O.
In a busy database there are lots of processes reading a datafile and multiple processes writing (like DBWR checkpointing dirty blocks from buffer cache, but also some others), so this caused unnecessary file level contention in the OS kernel. Note that Oracle has I/O concurrency control built in, you can't have two processes accidentally writing the same block to the storage at the same time or a reader reading a "torn block" due to an ongoing write (the serialization happens at buffer lock/pin level in Oracle code).
So you wanted to have concurrent I/O and you either had to use raw devices or Veritas VxFS back then. Later on concurrent I/O became available for some other filesystems too, like Solaris UFS having the "cio" mount option, etc.
I use both XFS and ZFS. XFS in any situation where it is a single drive (ie, my laptop), ZFS in any situation where I'm doing actual fault tolerant high performance storage stuff.
I don't see where the ext family really matters in modern storage. If I ever decide to stop using XFS, there are flash-dedicated filesystems that will beat ext in performance and reliability on SSDs and are a better fit for the future.
SSDs are very likely to put all those copies into a single physical block underneath. Since ZFS makes backups easy, better stick to copies=1 and do backups often.
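The backup story referred to is snapshot plus send/receive; pool and dataset names here are hypothetical:

```shell
# Take a read-only, point-in-time snapshot (cheap, copy-on-write).
zfs snapshot tank/home@2024-01-01

# Replicate it to a pool on another disk or host.
zfs send tank/home@2024-01-01 | zfs receive backup/home

# Later, send only the delta between two snapshots (-i = incremental).
zfs send -i @2024-01-01 tank/home@2024-02-01 | zfs receive backup/home
```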
I shouldn't think that's true. Why wouldn't the firmware be splitting writes across different physical chips for performance and wear leveling reasons?
What do you mean? The discussion in this particular sub-thread is about running ZFS on a single drive (think laptop). Does it have some kind of mechanism to send the write for a copy "later enough" that it will likely end on a different physical block?
I meant "primarily this copies mechanism is targeted towards multiple devices setup".
With a single SSD it's indeed prone to the caveat that was pointed out; even if the copies aren't mapped to the same storage "area", SSDs often fail completely.
Also worth noting that __when__ narrowed to a single-disk setup, ZFS can be interchanged with Btrfs: almost the same set of features, but less overhead and complexity.
That sucks. I know a lot of Android users that prefer f2fs over ext* due to reliability and performance concerns, and seem to at least have some concrete reason to believe so.
I'm too lazy to summarize this beyond saying both ext4 and XFS have their advantages and disadvantages, and both are still under active development. ZFS is a funny and problematic one that does lots of cool tricks but isn't properly free and gobbles resources.
ZFS (or rather, OpenZFS) is absolutely Free Software and Open Source under any definition. Richard Stallman will tell you the same thing.
It just has an unfortunate license that's not GPL-compatible and this causes problems with GPL kernels (e.g. Linux), which is a very different thing than "not being properly free".
However, the wiki article also notes that Ubuntu includes it by default as a binary module. That seems like a sensible thing to do from a legal perspective, since it's generally accepted that binary modules are not "linked to" the kernel in such a way as to trigger GPL provisions relating to such linkage.
"Don't use ZFS. It's that simple. It was always more of a buzzword than anything else, I feel, and the licensing issues just make it a non-starter for me."
-Linus Torvalds
ZFS is absolutely Free Software, it's just that its license isn't GPLv2 compatible. (Bearing in mind that by the same stroke, GPLv3 isn't GPLv2 compatible.)
While we're on filesystems, what would be the better choice for a home NAS array with, say, 10-20 individual drives and, say, 200-300 TB raw storage: ZFS or btrfs? Let's assume that all drives in the array are connected via a common SAS or SATA 3 backplane and that the array is on a local network with at least 1Gbps speed.
Is there any other fs that would even merit consideration at this point?
How does the inclusion of SSDs as cache drives for the array modify this?
ZFS for sure. RAID on btrfs is still not quite stable, and commercial deployments such as Synology use a different software RAID layer instead of the built-in support.
PS: I'm just considering reliability here. In terms of performance I don't know.
Performance is better on ZFS too, especially for BitTorrent and VM images. Just remember to set recordsize appropriately (typically 1M for large-file storage).
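For reference, a sketch of that tuning (dataset names hypothetical; recordsize only affects newly written files):

```shell
# Large sequential files (media, torrents): bigger records,
# less metadata overhead per byte stored.
zfs set recordsize=1M tank/media

# VM images / databases: match the guest or DB block size to avoid
# read-modify-write amplification on small random writes.
zfs set recordsize=64K tank/vm

# Verify what each dataset ended up with.
zfs get recordsize tank/media tank/vm
```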
> Also, the Unix community as a whole was challenged by Cutler and Custer, who showed with NTFS for Windows NT 4.0 what was possible if you redesign from scratch.
NTFS is a slightly modified ODS-11. It was not a redesign from scratch.
What are you trying to say? LVM sits underneath a filesystem; it's not an alternative or replacement for XFS (in fact, they make a pretty good combination).