One thing I have long been suspicious of, but never swept the parameter space to know for sure, is how one should set sunit, swidth, agcount, or agsize to exploit the parallelism of an SSD and a modern CPU. It used to be that sunit, swidth, and agsize were critical for RAID arrays because getting them wrong meant weird stripe misalignment, and if you didn't set agsize all of your metadata load would land on a subset of the array. These days an SSD has far more natural parallelism than any RAID ever had, and we have more CPU cores than ever. Are we still supposed to be setting sunit, swidth, and agsize? Or does the flash translation layer mean it doesn't matter? And how best to run mkfs.xfs when your SSD offers LBA sizes other than 512?
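For reference, a sketch of the knobs in question (device name hypothetical; on a bare SSD the mkfs.xfs defaults are usually sane, so this is illustration, not a recommendation):

```shell
# Hypothetical device; everything on it will be destroyed.
DEV=/dev/nvme0n1

# Ask the drive what sector sizes it reports before formatting.
lsblk -o NAME,PHY-SEC,LOG-SEC "$DEV"

# Force a 4096-byte sector size and an explicit allocation group count.
# su/sw (stripe unit/width) only matter when there is a real stripe
# geometry underneath; on a single SSD they are normally left unset.
mkfs.xfs -f -s size=4096 -d agcount=16 "$DEV"

# Inspect what mkfs actually chose (agcount, sector size, etc.).
xfs_info "$DEV"
```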
Not that I'm advocating everyone spam him with their XFS questions, but Dave Chinner has always been helpful and responsive whenever I had informed questions for him. If you raised such a question on lkml or fsdevel directed at him, it wouldn't surprise me one bit if you (and everyone subscribed) saw a helpful response in a timely manner.
You might be right, and I have no reason to doubt the expertise and helpfulness of this individual. However, in my experience, it’s often true that the developers of software are not experts in the systems-level performance and efficiency questions. So I’d love to read about some hard data relevant to the question, some kind of metadata-heavy benchmark or something.
You might want to look up XFS benchmarks from a few years ago and compare them to today's if you think Dave is not an expert on (file)system performance.
I'm a fan of XFS. I've used it for over a decade for all systems that don't need ZFS.
In fact, due to the lifespan of the headless box under my desk (i.e., it predates bootable ZFS root partitions under Linux), its root partition is still XFS, while the actual storage is ZFS-managed.
I think the main issue with ZFS is that you need to learn a lot of concepts to understand how to do things correctly and not fuck up.
To this day I keep some personal notes as warning reminders to be careful about certain operations. I can't remember the exact terminology (I believe these are called "features" in ZFS), but what happened to me not long ago was that if I created a ZFS filesystem (along with a pool) from Ubuntu, it enabled some features that ZFS on FreeBSD didn't recognize, and it wouldn't let me mount the pool.
The other way round, creating it from FreeBSD and then mounting it on Ubuntu, worked as expected. The thing with these "features" is that they can be "upgraded", and if you do so, the pool will again become unmountable on FreeBSD. I spent a whole day on this until, for some reason, I decided to print the feature list, compare the two side by side, and do some searching online.
> if I created a ZFS filesystem (along with a pool) from Ubuntu, it enabled some features that ZFS on FreeBSD didn't recognize, and it wouldn't let me mount the pool.
I mean... shrugs ...you could format it as XFS and not be able to mount it on FreeBSD either? Feature flags seem like a good way to solve this problem.
I was actually going to ask that... Is XFS a good "portable" filesystem across GNU/Linux and the different BSD flavors, with native support or through FUSE? The use case is USB drives. I don't mind losing visibility on Windows or macOS; I just want it to work flawlessly between these without major effort.
I could be wrong but I didn't go with UFS because apparently there are significant implementation differences among BSDs. I was kind of surprised to discover that.
FreeBSD and NetBSD essentially derive from the same source, 4.4BSD-Lite from March 1994. OpenBSD is a fork of NetBSD from 1995.
There's been a lot of divergence, and there's not really anything pushing towards convergence. FreeBSD modified UFS to meet their needs and desires, and the other BSDs went in other directions. There wasn't a lot of clamoring for an on-disk format for data interchange, so there's no big reason to keep the divergent UFSes compatible. Exchange data via networks, tape archives, tars written to block devices, or tarfiles on a widely recognized filesystem (MS-DOS FAT will do).
This pattern of a shared source and divergent development isn't super common. It's pretty rare commercially, and few opensource projects have forks that diverge and stay active for decades.
As a BSD user I think the divergence is a good thing. It means there are substantial differences, which lets you pick an OS that's really tailored to your use case.
In the Linux world there are mainly just userland differences. I like it the BSD way. Of course Linux sees much more commercial input, which I consider a bad thing and one reason I use FreeBSD so much. But for commercial interests it's good to have as much in common as possible, to have more potential customers.
You would probably not encounter that issue these days since FreeBSD 13 switched out the old FreeBSD ZFS tree in favor of the same OpenZFS you'd get on Linux https://cgit.freebsd.org/src/commit/?id=9e5787d2284e
The other comments already covered this, but yes, that is arguably the correct use of features: only enable the features supported by your target hosts (many can only be set at filesystem creation time; others can be enabled later, but only one way, never back).
At the time you last ran it (as a sibling comment also said), FreeBSD had not yet switched to upstream OpenZFS and still ran its own older code. To have FreeBSD compatibility back then, you would have had to enable only the features FreeBSD supported.
Generating the filesystem on FreeBSD and not running "zfs upgrade" on Ubuntu would be one way of doing this, avoiding having to set features manually at all.
Today this is not an issue, everyone runs the same code.
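And on current OpenZFS (2.1+) you can pin this down explicitly with the pool's `compatibility` property instead of remembering which features to avoid; pool and device names here are hypothetical:

```shell
# Create a pool that only enables features a given target understands.
# Predefined feature sets ship in /usr/share/zfs/compatibility.d/.
zpool create -o compatibility=openzfs-2.0-freebsd tank /dev/sdb

# Check which features ended up enabled vs. disabled.
zpool get all tank | grep 'feature@'

# With compatibility set, "zpool upgrade" refuses to enable
# anything outside that set, so an accidental upgrade can't
# render the pool unmountable on the other OS.
```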
Ages ago XFS had a rather nasty behavior in case of power failure - any files opened for write would be deleted after restart. From what I remember this was by design. Has this changed?
What you describe was common on the SGI systems we used at work. Some setups had a configuration file that was constantly written to and read from, and that file would (most of the time) be empty after a power failure. (I don't know, by the way, why the SGI systems didn't have a power-failure emergency-shutdown mode; the power supplies kept power for several seconds. But anyway.)
However: this _never_ happened with XFS on Linux systems. Exact same software. I don't know why. But XFS has been incredibly stable, not only on my personal boxes but also on everything we have provided to customers at work. We need non-varying sustained write rates for huge amounts of data, and XFS is smooth, much better than e.g. ext4 when we compared them (the tests were done years ago; we haven't retested, as XFS just works).
I stayed away from XFS because it had another bad behaviour: after a crash it would replay the log and happily continue. After a couple of crashes the filesystem became so corrupted that even the log replay failed and fsck was useless.
I also tried running fsck after every crash, but that didn't help either (some crashes seemed to mess up the filesystem badly). In the end I stayed with JFS (which I was also testing at the time, together with ReiserFS) because it had the best balance of speed and CPU usage back then.
Dynamically allocated inodes (useful if you work with a very large number of files; ext4 can run out of inodes and refuse to create new files even if you have lots of free space), more stable performance under load (latency and throughput vary less from run to run), and reflinks. In return you lose BSD and Windows compatibility, if you ever need those, and average performance is somewhat lower (it used to be a lot lower, but XFS has caught up very close to ext4).
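The inode difference is easy to see from userspace, by the way; on ext4 the inode total is a fixed number decided at mkfs time, while XFS allocates inodes on demand:

```shell
# Show inode totals and usage for the root filesystem.
# On ext4 the "Inodes" column is fixed when the filesystem is created;
# on XFS it grows as needed (up to the imaxpct limit), so you
# practically never hit "No space left on device" with free space left.
df -i /
```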
Doubtful. Although they've both improved over the years, the conventional wisdom was that XFS' benefits were seen on very large volumes, and ext4 was more efficient for small reads/writes and metadata operations. This explains why XFS is more niche now that btrfs and zfs are around.
RHEL is the only distro I know that defaults to XFS.
How large is "very large", though, and has that changed with time? The last time I used XFS, I would have considered 500GB to be pretty large. Nowadays that's kinda mediocre. I have a 24TB RAID5 that's still on ext4 though; I imagine that qualifies as at least "fairly large"?
The last time I had to participate in a RHEL install, the installer would do ext4 if <16TB, xfs if >16TB.
I find this size unusually arbitrary, but I suspect Red Hat found unwanted behavior in some ext related code. This was after the known issue in e2fsprogs that was fixed around a decade ago preventing fscking of >16TB; RHEL of a decade ago was either "xfs by default" or "xfs if >2TB" or similar, and the installer clearly changed since then.
Casual Googling also says my experience of "RHEL says XFS above 16TB" is out of date, and it's now "XFS above 100TB". And like, look, if you're doing 100TB, use ZFS, stop fucking around and do it right.
One thing that ext4 has and XFS does not is extremely delayed writes: ext4 can postpone writes for tens of seconds, sometimes minutes.
This has various fun implications for software that cares about durable writes, like databases.
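Which is why software that needs durability can't rely on writeback timing on any filesystem; the classic pattern is write, fsync, then rename into place (sketched here with GNU coreutils, whose `sync FILE` calls fsync(2) on its argument; paths are illustrative):

```shell
# Write to a temp file, force it to stable storage, then atomically
# rename it into place, so a crash leaves either old or new contents.
printf 'balance=42\n' > /tmp/state.tmp
sync /tmp/state.tmp        # fsync the data before it becomes visible
mv /tmp/state.tmp /tmp/state
sync -f /tmp/state         # flush the containing filesystem for the rename
```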
XFS is incredibly robust. I manage many petabytes of storage on XFS machines; some have XFS volumes in the 2 PiB range. And it's also extremely fast. I remember when I created my first directory with several billion files and it just worked :)
My exact experience as well. We use XFS for every system at work which needs lots of data input/output and large amounts of files. Never had a filesystem failure.
If you look at database benchmarks on TPC.org, you will find that many of them are on XFS (as a consequence of using RHEL/CentOS), which was designed with database performance in mind.
It used to be necessary for Oracle to use raw partitions to get the very best database performance. I haven't heard of anyone doing that for many years.
"The default implementation in most kernels is simply a file-global lock placed at the in-memory inode, making sure there can be only one writer per inode. Implementers of databases hate that because it limits the write concurrency on any single file to One. This is, for example, why Oracle recommends that you make tablespaces from multiple files, each no larger than one GB.
"XFS, in O_DIRECT mode, removes this lock and allows atomic, concurrent writes, making database people very happy."
Developers at Oracle have posted a number of blog posts on their code changes for Linux XFS.
Note that Oracle still uses "raw" block devices for high-performance configurations. It's called Oracle Automatic Storage Management (ASM), basically you give it a bunch of block devices without any partitions or file systems on them and ASM does its own "extent" placement, striping, mirroring, snapshots, etc. And Oracle's Exadata scale-out appliance also uses block devices without any filesystems directly, the DB nodes talk to any storage nodes over InfiniBand/RDMA (older hardware) or RDMA over Ethernet/RoCE.
Btrfs is the filesystem for which they are most responsible.
They explicitly do not support its use for their database.
"[Btrfs] was initially designed at Oracle Corporation in 2007 for use in Linux, and since November 2013, the file system's on-disk format has been declared stable in the Linux kernel."
Oracle DBAs used to recommend using raw block devices to back the DB, a long long time ago.
I never did get one to explain why. I suspect it was received wisdom about VFS models that cached disk I/O and their effect on the DB engine. Almost certainly overtaken by disk controller logic, filesystem caches, striping and RAID, and increasing memory sizes.
There are good pages on the Web discussing optimising ZFS for Postgres, tuned to write disk-block-aligned data through the ZFS ARC/L2ARC.
> Almost certainly overtaken by both disk controller logic, file systems cache, striping and RAID and increasing memory sizes.
In practice this might be true if the default settings of each layer happen to align sympathetically, and/or you know what you're doing and how to tune them all. But from first principles it stands to reason that the extra abstraction layer of a filesystem isn't providing the database storage engine with anything useful?
Yes. It's still a valid proposition that raw disk is fastest if you map cleanly onto the block abstraction the integrated controller offers, if only because you avoid the indirect call stacks that carry data into the device.
But that said: the DBA in question constructed a concatenated drive, not a stripe, and was slightly dismayed when we found out it was working at the speed of one disk interface and using only the front pack, not all of them in some kind of parallel or interleaved manner. Not a very smart DBA, it turned out.
The reason for using raw/block devices back then (or Veritas VxFS, which created additional raw devices for accessing VxFS files in a way that bypassed the regular OS filesystem I/O codepath and its restrictions on reads & writes) was concurrent I/O.
In a busy database there are lots of processes reading a datafile and multiple processes writing (like DBWR checkpointing dirty blocks from buffer cache, but also some others), so this caused unnecessary file level contention in the OS kernel. Note that Oracle has I/O concurrency control built in, you can't have two processes accidentally writing the same block to the storage at the same time or a reader reading a "torn block" due to an ongoing write (the serialization happens at buffer lock/pin level in Oracle code).
So you wanted to have concurrent I/O and you either had to use raw devices or Veritas VxFS back then. Later on concurrent I/O became available for some other filesystems too, like Solaris UFS having the "cio" mount option, etc.
I use both XFS and ZFS. XFS in any situation where it is a single drive (ie, my laptop), ZFS in any situation where I'm doing actual fault tolerant high performance storage stuff.
I don't see where the ext family really matters in modern storage. If I ever decide to stop using XFS, there are flash-dedicated filesystems that will beat ext in performance and reliability on SSDs and are a better fit for the future.
SSDs are very likely to put all those copies into a single physical block underneath. Since ZFS makes backups easy, better stick to copies=1 and do backups often.
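The backup story referred to is snapshot plus send/receive; pool and dataset names here are hypothetical:

```shell
# Take a read-only, point-in-time snapshot (cheap, copy-on-write).
zfs snapshot tank/home@2024-01-01

# Replicate it to a pool on another disk or host.
zfs send tank/home@2024-01-01 | zfs receive backup/home

# Later, send only the delta between two snapshots (-i = incremental).
zfs send -i @2024-01-01 tank/home@2024-02-01 | zfs receive backup/home
```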
I shouldn't think that's true. Why wouldn't the firmware be splitting writes across different physical chips for performance and wear leveling reasons?
What do you mean? The discussion in this particular sub-thread is about running ZFS on a single drive (think laptop). Does it have some kind of mechanism to send the write for a copy "later enough" that it will likely end on a different physical block?
I meant "primarily this copies mechanism is targeted towards multiple devices setup".
With a single SSD it's indeed prone to the caveat that was pointed out; even if the copies aren't mapped to the same storage "area", SSDs often fail completely.
Also worth noting that __when__ narrowed to a single-disk setup, ZFS can be interchanged with Btrfs: almost the same set of features, but less overhead and complexity.
That sucks. I know a lot of Android users that prefer f2fs over ext* due to reliability and performance concerns, and seem to at least have some concrete reason to believe so.
I'm too lazy to summarize this beyond saying both ext4 and XFS have their advantages and disadvantages, and both are still under active development. ZFS is a funny and problematic one that does lots of cool tricks but isn't properly free and gobbles resources.
ZFS (or rather, OpenZFS) is absolutely Free Software and Open Source under any definition. Richard Stallman will tell you the same thing.
It just has an unfortunate license that's not GPL-compatible and this causes problems with GPL kernels (e.g. Linux), which is a very different thing than "not being properly free".
However, the wiki article also notes that Ubuntu includes it by default as a binary module. That seems like a sensible thing to do from a legal perspective, since it's generally accepted that binary modules are not "linked to" the kernel in such a way as to trigger GPL provisions relating to such linkage.
"Don't use ZFS. It's that simple. It was always more of a buzzword than anything else, I feel, and the licensing issues just make it a non-starter for me."
-Linus Torvalds
ZFS is absolutely Free Software, it's just that its license isn't GPLv2 compatible. (Bearing in mind that by the same stroke, GPLv3 isn't GPLv2 compatible.)
While we're on filesystems, what would be the better choice for a home NAS array with, say, 10-20 individual drives and, say, 200-300 TB raw storage: ZFS or btrfs? Let's assume that all drives in the array are connected via a common SAS or SATA 3 backplane and that the array is on a local network with at least 1Gbps speed.
Is there any other fs that would even merit consideration at this point?
How does the inclusion of SSDs as cache drives for the array modify this?
ZFS for sure. RAID on btrfs is still not quite stable, and commercial deployments such as Synology use a different software RAID layer instead of the built-in support.
PS: I'm just considering reliability here. In terms of performance I don't know.
Performance is better on ZFS too, especially for BitTorrent and VM images. Just remember to set recordsize appropriately (typically 1M for large-file storage).
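For reference, a sketch of that tuning (dataset names hypothetical; recordsize only affects newly written files):

```shell
# Large sequential files (media, torrents): bigger records,
# less metadata overhead per byte stored.
zfs set recordsize=1M tank/media

# VM images / databases: match the guest or DB block size to avoid
# read-modify-write amplification on small random writes.
zfs set recordsize=64K tank/vm

# Verify what each dataset ended up with.
zfs get recordsize tank/media tank/vm
```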
> Also, the Unix community as a whole was challenged by Cutler and Custer, who showed with NTFS for Windows NT 4.0 what was possible if you redesign from scratch.
NTFS is a slightly modified ODS-11. It was not a redesign from scratch.
What are you trying to say? LVM sits underneath a filesystem; it's not an alternative or replacement for XFS (in fact, they make a pretty good combination).