You're being downvoted, but it's true about Linux for at least two reasons to this day:
1. dm-crypt threads are unfair to the rest of the system's processes [1]. On dm-crypt systems, regular user processes can effectively raise their scheduling priority, in a multithreaded fashion, just by generating heavy IO on dm-crypt storage.
2. Under memory pressure, the VM system in Linux will enter a thrashing state even when there is no swap configured at all. I don't have a reference on hand, but it's been discussed on lkml multiple times without a solution; I suspect the recent PSI changes are intended as a step toward one.

What happens is that clean, file-backed pages for things like shared libraries and executable programs become a thrashing set resembling anonymous-memory swapping. As various processes get scheduled, they access pages which were recently discarded from the page cache (evictable because of their clean, file-backed status) while other processes ran under pressure, and those pages must now be read back in from the backing store. This ping-ponging drags everything down until either an OOM occurs or the pressure is otherwise relieved. It often manifests as a pausing/freezing desktop with the disk activity light blazing, and it's only made worse by the aforementioned dm-crypt problem if these files reside on such volumes.
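For what it's worth, the PSI interface mentioned above does make these stalls measurable. A minimal sketch (Python; assumes Linux >= 4.20 with CONFIG_PSI — the parser itself is mine, only the /proc path and field names are the kernel's):

```python
def parse_psi(text):
    """Parse PSI output, e.g. the contents of /proc/pressure/memory.

    Each line looks like:
      some avg10=1.23 avg60=0.50 avg300=0.10 total=123456
    "some" = at least one task stalled on memory; "full" = all
    non-idle tasks stalled at once.
    """
    stats = {}
    for line in text.splitlines():
        kind, *fields = line.split()
        stats[kind] = {k: float(v) for k, v in (f.split("=") for f in fields)}
    return stats

# On a PSI-enabled kernel you would feed it the real file:
#   with open("/proc/pressure/memory") as f:
#       pressure = parse_psi(f.read())
sample = ("some avg10=1.23 avg60=0.50 avg300=0.10 total=123456\n"
          "full avg10=0.00 avg60=0.00 avg300=0.00 total=0\n")
print(parse_psi(sample)["some"]["avg10"])  # 1.23
```

A sustained non-zero "full" average is exactly the "everything is thrashing" state described above.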
This behaviour (and perhaps other causes with similar symptoms for desktop users) is what drove me away from helping to make Linux ‘ready for the desktop’. The kernel philosophy was incompatible with the needs of desktop users. Those developing for the desktop (like myself) didn’t have the expertise required to make the kernel do the Right Thing, and couldn’t find enough people willing or able to help.
Things may have changed over the years. I’ve been running a Linux desktop recently and haven’t seen this kind of issue yet (the kind where you need the magic SysRq keys to ask the kernel to kill everything, sync, and reboot), but reading your post, perhaps that’s because RAM is much more plentiful these days.
It's probably the plentiful-RAM thing. I haven't looked recently into whether the Arch mainline kernel kconfig is just bad or not, but the OOM killer is trash for me. I used to have the "recommended" swap == RAM size, but then the memory manager never even tried to cull pages until it was OOM, and it froze up trying to swap constantly. I'm currently running a 16/4 split and will probably drop to 16/1, because any time I hit memory limits everything just freezes permanently rather than the OOM killer getting invoked. I've hit it twice this week trying to render in Kdenlive and run a debug build of Krita...
If more people reproduced the dm-crypt issue and commented on the bugzilla issue, it'd put more pressure on upstream to revert the known offending commit.
For some reason they seem to be prioritizing the supposedly improved dm-crypt performance over fairness under load, even though it makes our modern machines behave like computers from the '90s: unable to play MP3s while accessing the disk without audio underruns.
I assume it's because they're not hearing enough complaints.
I’ve attempted to use Linux, and desktop freezing is the norm even on machines that run fine under Windows. Admittedly the machines might be underpowered, but that calls into question the commonly held belief that desktop Linux is better for low-spec machines.
I have had issues at times in the past, but with things like core dumps on systems with spinning disks and 128 GB of RAM. The OOM behaviour on Linux can be brutally frustrating. But it's still light years ahead of Windows for development...
> At least windows doesn't freeze your whole desktop under heavy IO.
Windows does this as well, and unlike Linux, Windows still risks permanent damage when available disk space runs low. It was even worse in the XP days, but these issues are still there.
And it's a particularly interesting issue, because this problem mirrors the congestion-control failure observed on most networks in recent years. We've all seen it: on a busy network, latency increases by two orders of magnitude, ruining other network activities like web browsing, even though they require only a little bandwidth. The simplest demo is uploading a large file while watching ping latency; it jumps from 100ms to 2000ms. But this shouldn't happen, because TCP congestion control was designed precisely to solve it.
It turns out that the cause of this problem, known as bufferbloat, is the accumulated effect of excessive buffering throughout the network stack: mostly the system packet queue, but also the routers, switches, drivers, and hardware, since RAM is cheap nowadays. TCP congestion control works like this: if packet loss is detected, send at a lower rate. But when there are large buffers on the path "for improving performance", packets are never lost even when the path is congested; instead, they are put into a huge buffer, so TCP never slows down properly as designed, and during slow start it believes it's on its way to the moon. Moreover, all the buffers are FIFO, which means that by the time your new packet has a chance to get out, it's probably no longer relevant: it takes seconds to move from the tail of the queue to the head, and the connection will have timed out already.
Solutions include killing buffers and limiting their length (byte queue limits, TCP small queues). Another innovation is new queue-management algorithms: we don't have to use a mindless FIFO queue; we can make it smarter. As a result, CoDel ("controlled delay") and fq_codel were invented. They are designed to deliver newly arrived packets promptly and to drop packets that have been sitting in the queue too long, to keep your traffic flowing.
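To make the idea concrete, here's a toy sketch of CoDel's core loop (Python; heavily simplified. Real CoDel spaces successive drops on a control-law schedule, while this version simply drops once a standing queue persists past the interval. The 5 ms / 100 ms constants are the published defaults):

```python
import collections

class CoDelQueue:
    """Toy CoDel: drop from the head once packets have been sitting in
    the queue longer than TARGET for at least INTERVAL."""
    TARGET = 0.005     # 5 ms: acceptable per-packet sojourn time
    INTERVAL = 0.100   # 100 ms: how long a standing queue is tolerated

    def __init__(self):
        self.q = collections.deque()
        self.first_above = None  # when sojourn time first exceeded TARGET

    def enqueue(self, pkt, now):
        self.q.append((pkt, now))  # timestamp on arrival

    def dequeue(self, now):
        while self.q:
            pkt, arrived = self.q.popleft()
            sojourn = now - arrived
            if sojourn < self.TARGET:
                self.first_above = None   # queue is draining fine
                return pkt
            if self.first_above is None:
                # First packet seen above TARGET: note the time but
                # still deliver it, giving the queue a chance to drain.
                self.first_above = now
                return pkt
            if now - self.first_above >= self.INTERVAL:
                continue  # standing queue confirmed: drop, try the next
            return pkt
        return None
```

The key point is that dropping is driven by how long a packet *waited*, not by how full the buffer is, which is what makes the same idea portable to IO queues.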
And people realized the Linux I/O freeze is a variant of bufferbloat, and the very same ideas of the CoDel algorithm can be applied to the Linux I/O freeze problem.
Another interesting aspect is that the problem is NOT OBSERVABLE if the network is fast enough or the traffic is low, because no buffering occurs, so it will never be caught in many benchmarks. On the other hand, when you start uploading a large file over a slow network, or start copying a large file to a USB thumb drive on Linux...
> And people realized the Linux I/O freeze is a variant of bufferbloat, and the very same ideas of the CoDel algorithm can be applied to the Linux I/O freeze problem.
There are myriad causes for poor interactivity on Linux systems under heavy disk IO. I've already described the two I personally observe the most often in another post here [1], and they have nothing at all in common with bufferbloat.
Linux doesn't need to do less buffering. It needs to be less willing to evict recently used buffers even under pressure, more willing to let processes OOM, and a bridging of the CPU and IO scheduling domains so arbitrary processes can't hog CPU resources via plain IO on what are effectively CPU-backed IO layers like dmcrypt.
But it gets complicated very quickly; there are reasons why this isn't fixed already.
One obvious problem is the asynchronous, transparent nature of the page cache. Behind the scenes, pages are faulted in and out on demand, which generates potentially large amounts of IO. If you need to charge the cost of this IO to processes to inform scheduling decisions, which process pays the bill? The process you're trying to fault in, or the process responsible for the pressure behind the eviction you're undoing? And how does this kind of complexity relate to bufferbloat?
> It needs to be less willing to evict recently used buffers even under pressure, more willing to let processes OOM
I've had similar experiences. On Windows, an application that requests way too much memory (whether through bugs or poor coding) leads to an unresponsive system while the kernel is busy paging away.
On Linux I've experienced the system killing system processes when under memory pressure, leading to crashes or an unusable system.
I don't understand why the OS would allow a program to allocate more than available physical memory, at least without asking the user, given the severe consequences.
Overcommit is a very deliberate feature, but its time may have passed. Keep in mind this is all from a time when RAM was so expensive swapping to spinning disks was a requirement just to run programs taking advantage of a 32-bit address space.
You can tune the overcommit ratio on Linux, but if memory serves (no pun intended) the last time I played with eliminating overcommit, a bunch of programs that liked to allocate big virtual address spaces ceased functioning.
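The virtual-reservation side of this is easy to see from user space. A small sketch (Python; under Linux's default overcommit policy, the reservation succeeds instantly and costs almost nothing until the pages are touched):

```python
import mmap

# Reserve 1 GiB of anonymous address space. With default overcommit,
# nothing is charged against physical RAM yet; resident memory stays
# tiny until the pages are actually written to.
reservation = mmap.mmap(-1, 1 << 30)
print(len(reservation))  # 1073741824
reservation.close()
```

With strict accounting (`vm.overcommit_memory=2`) the same call can fail with ENOMEM even though no page was ever going to be touched, which is exactly the class of large-address-space programs the parent comment saw breaking.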
Yeah, I know it was a feature at one point... but the OS should at least punish the program that overcommits, rather than bringing the rest of the system down (either by effectively grinding to a halt or by killing important processes).
What is the "correct" handling of swap on an HDD supposed to be like? It is going to be slow no matter what you do. Windows also locks up for long periods of time if you use up almost all the RAM and it has to swap to HDD.
When I say lock up, I mean the UI completely stops. As in, my i3 bar stops updating for several minutes. Not even the Linux magic commands let me recover.
On Windows things may be unresponsive, but at least ctrl-alt-del responds, and at least the mouse moves!
The main difficulty is I can't tell if my machine has crashed vs is overloaded if the UI doesn't do anything for several minutes.
> Not even the Linux magic commands let me recover.
Are you sure you've set up the magic SysRq sysctls correctly? Ubuntu ships with the magic oomkill disabled by default.
On all machines I've tried, whenever I've needed it, magic oomkill has always worked, and I've been thankful of the fact that it's implemented down in the kernel.
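For reference, the relevant knob is `kernel.sysrq`, a bitmask. A sketch of a drop-in config (the path is conventional, not mandated; the bit values are from the kernel's SysRq documentation):

```
# /etc/sysctl.d/90-sysrq.conf
# kernel.sysrq is a bitmask; 1 enables all SysRq functions.
# Ubuntu's default (176 = 16+32+128) allows sync, remount
# read-only and reboot, but masks bit 64, which is what
# Alt+SysRq+F (manual oom-kill) and the other
# process-signalling keys require.
kernel.sysrq = 1
```

Apply it with `sudo sysctl --system`, or temporarily with `sudo sysctl kernel.sysrq=1`.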
If you renice the UI processes to a higher-than-normal priority, it should work more like it does in Windows. (This used to be somewhat risky, but today Linux UI is not going to hog your system resources.) The underlying problem is that when memory is tight, Linux starts evicting "clean" pages from the page cache that actually will need to be accessed shortly afterwards (i.e. they're part of the working set of some running process), and thrashing occurs. There's no easy solution to this issue, other than making user programs more responsive to memory pressure in the first place. (This could extend as far as some sort of seamless checkpoint+resume support, like what you see in mobile OS's today.)
> Well, why can't distributions that come with recommended GUIs just set that higher-than-normal priority by default?
Feel free to file bugs for your preferred distro. It would be especially appropriate to do this for critical UI processes like xorg/wayland or lightdm, and for "lightweight" desktops like xfce/lxde that aren't going to cause resource pressure under foreseeable conditions, even when run at higher-than-normal priority.
"Correct" handling of swap would mean mostly leaving the window manager and its dependencies in memory. Individual application windows may stop responding, but everything else should be pretty quick. And small things like terminal emulators should get priority to stay in ram too.
Not that I'd recommend this, but my work MBP has 16GB of ram, and my typical software development setup (JVM, IntelliJ, Xcode, gradle) easily uses up 30GB. It swaps a lot but generally OSX does a good job of keeping the window manager and foreground applications at priority so I can still use my machine while this is happening.
I attribute this to the fact that the darwin kernel has a keen awareness of what threads directly affect the user interface and which do not (even including the handling of XPC calls across process boundaries... if your work drives the UI, you get scheduling/RAM priority). I don't think the linux kernel has nearly this level of awareness.
> ... the darwin kernel has a keen awareness of what threads directly affect the user interface and which do not (even including the handling of XPC calls across process boundaries... if your work drives the UI, you get scheduling/RAM priority). I don't think the linux kernel has nearly this level of awareness.
You're talking about priority inheritance in the kernel. In Linux, this is in development as part of the PREEMPT_RT ("real-time") patches, already available experimentally in a number of distributions.
I've experienced a total freeze before once one of my programs started swapping/thrashing. The entire desktop froze, not just the one program. This was in the past two years or so, so it's not a solved problem.
Case in point: I recently tried unzipping the Boost library on an up-to-date Windows 10, and after trying to move the frozen progress window after a minute, the whole desktop promptly crashed. I have to say, the experience is better than it used to be, because at least the taskbar reappeared on its own. (Decompression succeeded on the second attempt after leaving the whole computer well alone ... but it certainly took its time even on a high-end desktop computer.)
Could you please leave personal swipes out of your comments here? They have a degrading effect on discussion and evoke worse from others. Your comment would be just fine without the second sentence.
Edit: I'm getting downvoted, probably by people who have never used Linux.