The only thing that fsync() does is take data that is already in the operating system's buffers for a file descriptor and write it physically to the disk. If the thread that runs every 5 seconds is really just fsync'ing, then when the process gets killed, any un-synced data is still in those buffers and will be available to read when it restarts, exactly the same as if fsync() had been called. The OS is responsible for maintaining a consistent view of the filesystem such that, unless the machine totally fails (or you circumvent the FS and actually inspect the live block device), fsync() appears to be a no-op.
It sounds like what you're saying is that the documentation is wrong, and that the translog thread is actually pulling data from Elasticsearch's internal buffers and writing it to files. In that case, the documentation which refers to that operation as "fsync" is very badly misleading because it disguises what failure modes it's actually protecting against.
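To make the distinction concrete, here is a rough sketch of the other failure mode: data that is still sitting in the application's own buffer when the process is killed. It says nothing about how Elasticsearch's translog is actually implemented; Python's buffered file object stands in for "internal buffers", and `buffer_then_die` is a made-up helper name. It assumes a POSIX system (it uses `os.fork`) and a payload smaller than the default buffer.

```python
import os
import signal


def buffer_then_die(path: str, payload: bytes) -> int:
    """Fork a child that buffers `payload` in user space, then is SIGKILLed.

    Returns the child's raw wait status. Hypothetical helper for illustration.
    """
    pid = os.fork()
    if pid == 0:
        # Child: open() creates the file, but a small write() on a buffered
        # file object only copies bytes into the process's own memory.
        f = open(path, "wb")
        f.write(payload)  # no write() syscall has been issued yet
        os.kill(os.getpid(), signal.SIGKILL)  # die with no chance to flush
        os._exit(1)  # not reached
    _, status = os.waitpid(pid, 0)
    return status
```

After the child dies, the file exists but is empty: the payload never reached the kernel, so no fsync() could have saved it. That is a different failure mode from the one fsync() protects against.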
The point of a return from fsync is that you are guaranteed the file has been written to disk[1]. If you don't block on fsync, you can't guarantee the file was written to disk, because the server may have died in any number of ways.
[1] This guarantee occasionally fails too: if you have a battery-backed NVRAM RAID controller, the guarantee is that the write has hit the NVRAM controller, with the expectation that it will hit a disk before the battery dies. Throw in a 72-hour power outage, a controller failure, or a massive disk failure, and you can't even guarantee that.
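For reference, "blocking on fsync" looks roughly like the sketch below: the caller does not get control back until the kernel reports that the data has been handed to the storage device. `durable_append` is a hypothetical name, not anything from Elasticsearch, and the guarantee is subject to the same hardware caveats as in [1].

```python
import os


def durable_append(path: str, payload: bytes) -> None:
    """Append `payload` to `path` and return only after fsync() succeeds.

    Hypothetical helper for illustration.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        view = memoryview(payload)
        while len(view) > 0:
            # write() may accept fewer bytes than requested, so loop.
            written = os.write(fd, view)
            view = view[written:]
        # Block until the kernel has pushed the file's data to the device.
        os.fsync(fd)
    finally:
        os.close(fd)
```

A caller that acknowledges a request only after `durable_append` returns can claim the write survives a machine crash. A caller that acknowledges first and lets a background thread sync later cannot. Note that for a newly created file, the parent directory may also need to be fsync'ed, which this sketch leaves out.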
No, I understand that. Maybe I'm not explaining my point properly, so I'll try again:
If you issue a write() syscall from a process, and the syscall succeeds, then the data that was written is present in the OS's cached view of the filesystem, even if the process dies a nanosecond later. That view is shared consistently by all processes on the system. It's true that the changes may not actually be stored persistently on disk, but that difference is unobservable unless something happens to make the kernel lose its cached data.
So from the test suite's point of view, unless part of the test involves actually killing VMs rather than just processes, it should not be possible for the results to depend on whether or not fsync() was called.
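A small experiment along these lines, as a sketch: a child process issues a raw write() with no fsync() and is then killed with SIGKILL, and the parent reads the file back. `write_then_die` is a made-up name, and the code assumes a POSIX system and a small payload that a single write() accepts in full.

```python
import os
import signal


def write_then_die(path: str, payload: bytes) -> int:
    """Fork a child that write()s `payload` without fsync(), then is SIGKILLed.

    Returns the child's raw wait status. Hypothetical helper for illustration.
    """
    pid = os.fork()
    if pid == 0:
        # Child: a raw write() syscall hands the bytes to the kernel.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
        os.write(fd, payload)
        # No fsync(), no close(): the process is killed immediately.
        os.kill(os.getpid(), signal.SIGKILL)
        os._exit(1)  # not reached
    _, status = os.waitpid(pid, 0)
    return status


def read_back(path: str) -> bytes:
    """Read the file the way any other process on the same machine would."""
    with open(path, "rb") as fh:
        return fh.read()
```

Here `read_back` returns the full payload even though the writer was killed before any fsync(), because the data lives in the kernel's cache rather than in the dead process. Only losing the kernel's cache (power loss, kernel panic, killing the VM) would make the missing fsync() observable.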