I tried to do something similar well over a decade ago during an internal hackathon (the motivation back then was to speed up destructive integration tests). My idea was to back the VM memory with a file on tmpfs, and simply `cp --reflink` it to get a copy-on-write clone. Then you wouldn't need to bother with userfaultfd or slow storage, as the kernel would just magically do the right thing.
Unfortunately, the Linux kernel didn't support reflink on tmpfs (and still doesn't), and I'm not genius enough to have been able to implement that within 24 hours. :-)
I still believe it'd be nice to implement reflink for tmpfs, though. It's the perfect interface for copy-on-write forking of VM memory.
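For anyone curious what that interface looks like below `cp`, here is a rough sketch of a reflink attempt via the `FICLONE` ioctl, with a plain byte copy as the fallback. The `clone_file` helper is a made-up name for illustration, and the hard-coded `FICLONE` value is an assumption (it is the usual x86/ARM encoding; other architectures may differ). On tmpfs I'd expect the ioctl to fail and the fallback path to run.

```python
import fcntl
import os
import shutil
import tempfile

# FICLONE from <linux/fs.h>, i.e. _IOW(0x94, 9, int).
# Assumption: this is the common x86/ARM encoding; check your platform.
FICLONE = 0x40049409


def clone_file(src_path, dst_path):
    """Hypothetical helper: reflink src to dst if the filesystem allows it.

    Returns True if the clone shares blocks copy-on-write (reflink worked),
    False if we had to fall back to copying the bytes.
    """
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        try:
            # Ask the kernel to make dst share src's extents.
            fcntl.ioctl(dst.fileno(), FICLONE, src.fileno())
            return True
        except OSError:
            # Filesystem without reflink support: copy the data instead.
            shutil.copyfileobj(src, dst)
            return False


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        base = os.path.join(d, "vm-memory.img")
        with open(base, "wb") as f:
            f.write(b"\x00" * 4096)
        shared = clone_file(base, os.path.join(d, "vm-memory.clone"))
        print("reflinked" if shared else "fell back to a full copy")
```

Either way the clone ends up with the same contents; the difference is only whether the kernel shares the underlying blocks until one side writes.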
Glad to see the approach validated at scale! I hadn't seen your blog posts until they were linked here; I'm going to dig into the userfaultfd path. Would love to chat if you're open to it.
The first version we launched used the exact same approach (MAP_PRIVATE). Later on, we bypassed the file system by using shared memory and userfaultfd, because ultimately the NVMe became the bottleneck (https://codesandbox.io/blog/cloning-microvms-using-userfault... and https://codesandbox.io/blog/how-we-scale-our-microvm-infrast...).
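As a toy illustration of the MAP_PRIVATE idea (not our actual implementation), the sketch below maps the same snapshot file twice with `MAP_PRIVATE`. Each mapping gets its own copy-on-write view, so a write in one clone is invisible to the other and never reaches the file. The `private_clone` helper and the file name are made up for the example.

```python
import mmap
import os
import tempfile


def private_clone(path):
    """Hypothetical helper: map a snapshot file copy-on-write.

    Writes land in private pages owned by this mapping; the file on disk
    and any other mapping of it are left untouched.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        return mmap.mmap(
            fd,
            size,
            flags=mmap.MAP_PRIVATE,
            prot=mmap.PROT_READ | mmap.PROT_WRITE,
        )
    finally:
        # The mapping stays valid after the descriptor is closed.
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        snapshot = os.path.join(d, "memory.snapshot")
        with open(snapshot, "wb") as f:
            f.write(b"\x00" * mmap.PAGESIZE)

        vm_a = private_clone(snapshot)
        vm_b = private_clone(snapshot)
        vm_a[0:4] = b"fork"  # Triggers a private copy of the first page.

        print(vm_a[0:4], vm_b[0:4])
        vm_a.close()
        vm_b.close()
```

The catch is that every first touch of a page is served from the backing file, which is why the storage device can end up as the bottleneck.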