Even more importantly, does your data *have to* fit in RAM? There are tons of pr...

sevensor · on Feb 13, 2020

There's also the "not all memory is RAM" trick: plan ahead with enough swap to fit all the data you intend to process, and just pretend that you have enough RAM. Let the virtual memory subsystem worry about whether or not it fits in RAM. Whether this works well or horribly depends on your data layout and access patterns.

beached_whale · on Feb 13, 2020

Don't even need to do that. Just mmap it and the virtual memory system will handle it.

mrmrcoleman · on Feb 13, 2020

Interesting. Can you provide some examples of where this is the correct approach?

tmountain · on Feb 13, 2020

This is how mongodb originally managed all its data. It used memory mapped files to store the data and let the underlying OS memory management facilities do what they were designed to do. This saved the mongodb devs a ton of complexity in building their own custom cache and let them get to market much faster. The downside is that since virtual memory is shared between processes, other competing processes could potentially mess with your working set (pushing warm data out, etc). The other downside is that since your turning over the management of that “memory” to the OS, you lose fine grained control that can be used to optimize for your specific use case.

xchaotic · on Feb 13, 2020

Except nowadays with Docker / Kubr you can safely assume the db engine will be the only tenant of a given vm /pod whatever so I think it’s better to let OS do memory management than fight it

joshvm · on Feb 13, 2020

Might not be exactly the same use case, but a simple example is compiling large libraries on constrained/embedded platforms. Building OpenCV on a Pi certainly used to require adding a gig of swap.

jdc · on Feb 13, 2020

With the Varnish HTTP cache the authors started out with a very "mmap or bust" type of approach, but later added a malloc-based backend.

globular-toast · on Feb 13, 2020

Escapes many devs? Really? I used to work with biologists who thought they needed to run their scripts on a supercomputer because the first line read their entire file into an array. But if I saw someone who calls themselves a "dev" doing this I'd consider them incompetent.

saint_abroad · on Feb 13, 2020

I once got into an argument with a senior technical interviewer because he wanted a quick solution of an in-memory sort of an unbounded set of log files.

Needless to say I wasn't recommended for the job, and it taught me a valuable lesson: if you don't first give them what they want, you can't give them what they actually need.

ludamad · on Feb 13, 2020

Plenty of devs that don't do any sort of file streaming, say those who started with Game Maker or another specialized domain

EdwardDiego · on Feb 13, 2020

I've spent a lot of time writing Spark code, and its ability to store data in a column oriented format in RAM is the only reason why - disk is goddamned slow.

As soon as you're touching it more than once, sticking it in RAM upon reading makes everything much faster.