This is actually something we're actively working on! Nhat Pham is working on a patch series called "virtual swap space" (https://lwn.net/Articles/1059201/) which decouples zswap from its backing store entirely. The goal is to consolidate on a single implementation with proper MM integration rather than maintaining two systems with very different failure modes. It should be out in the next few months, hopefully.
There are quite a few numbers in the article, although of course I'm happy to hear any more you'd like presented.
* A counterintuitive 25% reduction in disk writes at Instagram after enabling zswap
* Eventual ~5:1 compression ratio on Django workloads with zswap + zstd
* 20-30 minute OOM stalls at Cloudflare with the OOM killer never once firing under zram
The LRU inversion argument is plain from the code presented and is a logical consequence of how swap priority and zram's block device architecture interact; I'm not sure numbers would add much there.
> The LRU inversion argument is plain from the code presented and is a logical consequence of how swap priority and zram's block device architecture interact; I'm not sure numbers would add much there.
Yes, it is all very plausible, but run times for a given workload (on a given, documented system) known to cause memory pressure to the point of swapping, compared across vanilla Linux (default or some appropriate swappiness), zram, and zswap, would be appreciated.
https://linuxblog.io/zswap-better-than-zram/ at least qualifies that zswap performs better when a fast NVMe drive is available as the swap device, while zram remains superior on machines with slow storage or no swap device at all.
Thank you for reading and your critique! What you're describing is definitely a real problem, but I'd challenge slightly and suggest the outcome is usually the inverse of what you might expect.
One of the counterintuitive things here is that _having_ disk swap can actually _decrease_ disk I/O. In fact this is so important to us on some storage tiers that it is essential to how we operate. Now, that sounds like patent nonsense, but hear me out :-)
With a zram-only setup, once zram is full, there is nowhere for anonymous pages to go. The kernel can't evict them to disk because there is no disk swap, so when it needs to free memory it has no choice but to reclaim file cache instead. If you don't allow the kernel to choose which page is colder across both anonymous and file-backed memory, and instead force it to reclaim only file cache, you will inevitably reclaim file pages that needed to stay resident to avoid disk activity, and the resulting refaults hit the same slow DRAM-less SSD you were trying to protect.
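To make that concrete, here's a toy model of the reclaim choice (the page ages and per-page I/O costs below are made up for illustration; this is not kernel code): a file-only policy must evict warm file cache, which soon refaults, while a global policy can pick the genuinely coldest pages.

```python
def disk_ios(evict_policy, pressure=4):
    """Count disk I/Os caused by evicting `pressure` pages under each policy.

    Memory holds 4 very cold anonymous pages and 4 recently used file
    pages; 'age' is time since last access (hypothetical numbers).
    """
    anon_pages = [("anon", age) for age in (100, 90, 80, 70)]  # very cold
    file_pages = [("file", age) for age in (5, 4, 3, 2)]       # warm
    all_pages = anon_pages + file_pages

    if evict_policy == "file-only":
        # zram-only setup: anon pages can't leave RAM, so only file cache
        # is eligible for reclaim, however warm it is.
        victims = sorted(file_pages, key=lambda p: -p[1])[:pressure]
    else:  # "global": pick the coldest pages regardless of type
        victims = sorted(all_pages, key=lambda p: -p[1])[:pressure]

    io = 0
    for kind, _age in victims:
        if kind == "anon":
            io += 1  # one swap write; the page is cold, so rarely read back
        else:
            io += 2  # write-back plus a refault read soon after (it was warm)
    return io

assert disk_ios("global") < disk_ios("file-only")
```

The exact costs are arbitrary, but the direction matches what we see in production: letting reclaim pick across both pools avoids evicting file pages that immediately refault.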
In the article I mentioned that in some cases enabling zswap reduced disk writes by up to 25% compared to having no swap at all. Of course, the exact numbers will vary across workloads, but the direction holds across most workloads that accumulate cold anonymous pages over time, and we've seen it hold on constrained environments like BMCs, servers, desktop, VR headsets, etc.
So, counter-intuitively, zswap with an appropriately sized swap device may well reduce disk I/O in your case rather than increase it. If that's not what you see, that's exactly the kind of real-world data that helps us improve things on the mm side, and we'd love to hear about it :-)
1. Thanks for partially (in paragraph 4 but not paragraph 5) preempting the obvious objection. Distinguishing between disk reads and writes is very important for consumer SSDs, and you quoted exactly the right metric in paragraph 4: reduction of writes, almost regardless of the total I/O. Reads without writes are tolerable. Writes stall everything badly.
2. The comparison in paragraph 4 is between no-swap and zswap, and the results are plausible. But the relevant comparison here is a three-way one, between no-swap, zram, and zswap.
3. It's important to tune earlyoom "properly" when using zram as the only swap. Setting the "-m" argument too low causes earlyoom to miss obvious overloads that thrash the disk through page cache and memory-mapped files. On the other hand, I could not find the right balance between unexpected OOM kills and missed brownouts, simply because the usage levels of RAM and zram-based swap are the only signals earlyoom has available for a decision. Perhaps systemd-oomd will fare better. The article does mention the need to tune the userspace OOM killer to an uncomfortable degree.
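(For reference, systemd-oomd keys off PSI stall time rather than raw usage levels, which is exactly the signal earlyoom lacks. A quick sketch of reading /proc/pressure/memory; the sample string below is made up, and in real use you'd read the file itself:)

```python
def parse_psi(text):
    """Parse the documented /proc/pressure/memory format:
    'some avg10=0.00 avg60=0.00 avg300=0.00 total=0' (and a 'full' line)."""
    out = {}
    for line in text.splitlines():
        kind, *fields = line.split()
        out[kind] = {k: float(v) for k, v in (f.split("=") for f in fields)}
    return out

# Hypothetical sample; in practice: open("/proc/pressure/memory").read()
sample = ("some avg10=12.34 avg60=3.40 avg300=1.10 total=123456\n"
          "full avg10=4.50 avg60=1.20 avg300=0.30 total=45678")
psi = parse_psi(sample)

# An oomd-style trigger: act on sustained full-stall, not on "free RAM".
under_pressure = psi["full"]["avg10"] > 1.0
assert under_pressure
```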
I have already tried zswap with a swap file on a bad SSD, but, admittedly, not together with earlyoom. With an SSD that cannot sustain even 10 MB/s of synchronous writes, it browns out, while zram + earlyoom can be tuned not to brown out (at the expense of OOM kills on a subjectively perfectly well-performing system). I will try backing-store-less zswap when it's ready.
And I agree that, on an enterprise SSD like Micron 7450 PRO, zswap is the way to go - and I doubt that Meta uses consumer SSDs.
Hey there! Article author here -- just found out this was posted here and going through the comments :-)
One of the earliest test versions of my patch actually did inode reuse using slabs just like you're suggesting, but there are a few practical issues:
1. Performance implications. We use tmpfs internally within the kernel in a lock-free manner as part of some latency-sensitive operations, and using slabs complicates that somewhat. The fact that the kernel itself uses tmpfs internally makes this situation quite different from other filesystems.
2. Back when I was writing the patch, each memory cgroup had its own set of slabs, which greatly complicated being able to reuse inodes as slabs between different services (since they run in different memcgs).
After it became clear that slab recycling wouldn't work, I wrote a test patch that uses IDA instead, but I found that the performance implications were also not tenable. There are other alternative solutions but they increase code complexity/maintenance non-trivially and aren't really worth it.
A 64-bit per-superblock inode space resolves this issue without introducing any of these problems -- before you go through 2^64-1 inodes, you're going to hit other practical constraints anyway, at least for the time being :-)
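Conceptually the scheme is just a per-superblock monotonically increasing counter, so inode numbers are never reused even when in-memory inode objects are recycled. A toy sketch (not the kernel code, which batches per-CPU for performance):

```python
import itertools
import threading

class SuperBlock:
    """Toy model of a per-superblock 64-bit inode number space."""

    def __init__(self):
        self._lock = threading.Lock()
        self._next_ino = itertools.count(1)  # 0 reserved; exhausts only at 2**64-1

    def alloc_ino(self):
        # Monotonic hand-out: a freed inode's number is never given out again,
        # so observers holding a stale ino can't be fooled by reuse.
        with self._lock:
            return next(self._next_ino)

sb_a, sb_b = SuperBlock(), SuperBlock()
assert sb_a.alloc_ino() == 1
assert sb_a.alloc_ino() == 2
assert sb_b.alloc_ino() == 1  # spaces are per superblock, so no cross-sb contention
```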
Oh that's interesting, thank you! Note that when I said should, 1) it carries no weight and 2) I was referring to the old impl, not what the fix should be. Going 64-bit sounds like a good option, hopefully it can become the default.
Hey! Facebook engineer here. If you have it, can you send me the User-Agent for these requests? That would definitely help speed up narrowing down what's happening here. If you can provide me the hostname being requested in the Host header, that would be great too.
I just sent you an e-mail, you can also reply to that instead if you prefer not to share those details here. :-)
I'm not sure I'd publicly post my email like that, if I worked at FB. But congratulations on your promotion to "official technical contact for all facebook issues forever".
Don't think I used my email for anything important during my time at FB. If it gets out of hand he could just request a new primary email and use the above one for "spam".
Feels odd to read that Facebook uses Office 365/Exchange for email. They haven't built their fsuite yet? I thought they would simply promote Facebook Messenger internally. I'm only half joking.
Most communication is via Workplace (group posts and chat). Emails aren't very common any more - mainly for communication with external people and alerting.
But at least for the email/calendar backend it's Exchange.
The internal replacement clients for calendar and other things are killer... I have yet to find replacements.
For the most part, though, they use Facebook internally for messaging and regular communication (technically it's now Workplace, but before it was just Facebook).
I don't want to share my website for personal reasons, but here is some data from cloudflare dashboard (a request made on 11 Jun, 2020 21:30:55 from Ireland, I have 3 requests in the same second from 2 different IPs)
user-agent: facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)
ip 1: 2a03:2880:22ff:3::face:b00c (1 request)
ip 2: 2a03:2880:22ff:b::face:b00c (2 requests)
ASN: AS32934 FACEBOOK
I can, by the way, confirm this issue. I work at a large newspaper in Norway and around a year ago we saw the same thing: thousands of requests per second until we blocked it. And after we blocked it, traffic to our Facebook page also plummeted. I assume Facebook considered our website down and thus wouldn't give users content from our Facebook page either, as that would serve them content that would give a bad user experience. The Facebook traffic did not normalize until the attack stopped AND we had told Facebook to reindex all our content.
If you want more info, send me an email and I'll dig out some logs etc. thu at db.no
I had no idea anyone would find this so fast, so I thought it would be safe to change the url to match the final title before posting anywhere -- how wrong I was.
I mean, it really depends on your application how non-trivial the performance improvement will be, but this statement isn't theoretical -- memory bound systems are a major case where being able to transfer out cold pages to swap can be a big win. In such systems, having optimal efficiency is all about having this balancing act between overall memory use without causing excessive memory pressure -- swap can not only reduce pressure, but often is able to allow reclaiming enough pages that we can increase application performance when memory is the constraining factor.
The real question is why those pages are being held in RAM. If they're needed, swapping them out will induce latency. If they're a leak or not needed, the application should be fixed so it doesn't allocate swathes of RAM it does not use.
There are some systems which are memory-bound by nature, not as a consequence of poor optimisation, so it's not really as simple as "needed" or "not needed". As a basic example, in compression, more memory available means that we can use a larger window size, and therefore have the opportunity to achieve higher compression ratios. There are plenty of more complex examples -- a lot of mapreduce work can be made more efficient with more memory available, for example.
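The window-size point is easy to demonstrate with zlib, whose wbits parameter is the log2 of the LZ77 window. If the data's repeats lie farther apart than a small window can see, only the larger window can deduplicate them (the data below is synthetic, purely for illustration):

```python
import random
import zlib

def deflate(data, wbits):
    # wbits in 9..15 selects a 512-byte to 32 KiB history window
    c = zlib.compressobj(6, zlib.DEFLATED, wbits)
    return c.compress(data) + c.flush()

random.seed(0)
block = bytes(random.getrandbits(8) for _ in range(4096))  # incompressible noise
data = block * 8  # identical blocks repeat at a distance of 4096 bytes

small = deflate(data, 9)   # 512-byte window: repeats are out of reach
large = deflate(data, 15)  # 32 KiB window: each repeat becomes a back-reference

assert len(large) < len(small)
```

Same input, same algorithm; the only difference is how much memory the compressor may use for history, and the ratio changes dramatically.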
Indeed. None of the above are typically used (as in most of the time) on desktop systems where swap is the most problematic.
As for compression, the only engine I know of that wants more than 128 MB of RAM is lrzip and other rzip derivatives.
Common offenders that bog down the system in swap for me as a developer are the web browser, the JVM (Android), and Electron-based apps (messengers, too).
I would also like a source that substantiates the claim that using swap in map-reduce workloads actually helps.
Or perhaps in database workloads. Or on any machine with relatively fixed workload.
> how much time did I spend waiting to swap-in a page
You can do this with eBPF/BCC by using funclatency (https://github.com/iovisor/bcc/blob/master/tools/funclatency...) to trace swap related kernel calls. It depends on exactly what you want, but take a look at mm/swap.c and you'll probably find a function which results in the semantics you want.
> Once desktop systems and applications support required APIs to handle saving state before being shut down.
SIGTERM and friends? :-)
If your application is just dropping state on the floor as a result of having an intentionally trappable signal being sent to it or its children, that seems like a bug.
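A minimal sketch of the pattern in Python (POSIX-only; the state shape here is made up, and a real application would persist to disk and then exit): install a SIGTERM handler that flushes state before shutdown.

```python
import json
import os
import signal

STATE = {"unsaved_edits": ["draft paragraph"]}  # hypothetical in-memory state
saved = {}

def on_terminate(signum, frame):
    # Persist state before exiting; a real app would write to disk,
    # fsync, and then call sys.exit() here.
    saved["state"] = json.dumps(STATE)

signal.signal(signal.SIGTERM, on_terminate)

# Simulate the supervisor's shutdown request (what systemd/init sends first).
os.kill(os.getpid(), signal.SIGTERM)

assert json.loads(saved["state"]) == STATE  # nothing was dropped on the floor
```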
> I never saw programs I would actually want to use try to allocate that much
What about applications that have memory-bound performance characteristics? In these cases, saving a bit of memory often directly translates into throughput, which translates into $$$.
This isn't theoretical: a bunch of services I've run in the past and run currently literally make more money because of swap. By using memory more efficiently and monitoring memory pressure metrics instead of just "freeness" (which is not really measurable anyway), we allow more efficient use of the machine overall.