Hacker News

I worked at a storage company, and the scariest thing I learned is that your data can be corrupt even though the drive itself says it was written correctly. The only way to really be sure is to check, after writing your files, that they match the originals. Now whenever I do a backup, I always go through the files one more time and do a byte-by-byte comparison before trusting that everything is okay.
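
That verify pass doesn't need anything fancy: hashing both sides gives the same assurance as a literal byte-by-byte compare. A minimal Python sketch (the paths in the usage comment are placeholders):

```python
import hashlib

def file_digest(path: str, chunk: int = 1 << 20) -> str:
    """SHA-256 of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

# After the backup, source and copy should hash identically:
# file_digest("/data/photo.jpg") == file_digest("/backup/photo.jpg")
```

Note the caveat discussed further down the thread: a read-back like this may be served from cache rather than the platter.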


This is true. Which is why we really, really need checksummed filesystems. I am very worried that this hasn't made its way into mainstream computing yet, especially given the growing drive sizes and massive CPU speed increases.


I run a 10x3TB ZFS raidz2 array at home. I've seen 18 checksum errors at the device level in the last year - these are corruption from the device that ZFS detected with a checksum, and was able to correct using redundancy. If you're not checksumming at some level in your system, you should be outsourcing your storage to someone else; consumer level hardware with commodity file systems isn't good enough.


You know, I'm not really sure I buy this. I worked for a storage company in the past, and I put a simple checksumming algorithm in our code, sort of like the ZFS one. Turns out that, two or three obscure software bugs later, that thing stopped firing randomly and started picking out kernel bugs. Once we nailed a few of those, the errors became "harder". By that I mean we stopped getting data that the drives claimed was good but we thought was bad.
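
For reference, the checksum ZFS uses by default (fletcher4) really is simple: four 64-bit running sums over little-endian 32-bit words. A toy Python version of the idea (the real thing is heavily tuned C; assuming here that the input length is a multiple of 4, as ZFS blocks always are):

```python
MASK64 = (1 << 64) - 1

def fletcher4(data: bytes) -> tuple:
    """Toy fletcher4: four 64-bit sums over 32-bit little-endian words.
    Assumes len(data) is a multiple of 4."""
    a = b = c = d = 0
    for i in range(0, len(data), 4):
        w = int.from_bytes(data[i:i + 4], "little")
        a = (a + w) & MASK64  # plain sum of words
        b = (b + a) & MASK64  # sum of sums -- sensitive to word order
        c = (c + b) & MASK64
        d = (d + c) & MASK64
    return (a, b, c, d)
```

The higher-order sums make the checksum sensitive to the order of the data, not just its content, so swapped sectors get caught too.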

Modern drives are ECC'ed to hell and back; on enterprise systems (aka ones with ECC RAM and ECC'ed buses), a sector that comes back bad is likely the result of a software/firmware bug somewhere, and in many cases was written (or simply not written) bad.

Put another way, if you read a sector off a disk and conclude that it was bad, and a reread returns the same data, it was probably written bad. The fun part is then taking the resulting "bad" data and looking at it.

Reminds me of early in my career: a Linux machine we were running a CVS server on reported corrupted CVS files once or twice a year, and when looking at them, I often found data from other files stuck in the middle, often in 4k-sized regions.


> you should be outsourcing your storage to someone else

Well, I'd need to be sure that "someone else" does things properly. My experience with various "someone elses" so far hasn't been stellar — most services I've tried were just a fancy sticker placed onto the same thing that I'm doing myself.


How does checksumming help if the data is in cache and waiting to be written? For example: I have 1MB of data; I write it, but it stays in the buffer cache after being written, so when you compute the checksum you are computing it on the buffer cache.

On Linux you have to drop_caches and then re-read to get the checksum to be sure. A per-buffer or per-file drop_caches isn't available, as far as I know. If you do a system-wide drop_caches, you are invalidating the good pages along with the bad ones.
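
(One advisory per-file knob that does seem to exist on Linux is posix_fadvise with POSIX_FADV_DONTNEED, which asks the kernel to drop that file's clean pages. It's only a hint, so this is best-effort, but it avoids the system-wide drop_caches. A sketch:)

```python
import hashlib
import os

def checksum_bypassing_cache(path: str) -> str:
    """Best effort: fsync dirty pages, hint the kernel to drop this
    file's clean pages, then re-read so data (mostly) comes from disk."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)  # make sure nothing is still dirty
        # Advisory only: the kernel is free to keep the pages anyway.
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
        h = hashlib.sha256()
        while block := os.read(fd, 1 << 20):
            h.update(block)
        return h.hexdigest()
    finally:
        os.close(fd)
```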

What if the device maintains a cache as well, in addition to the buffer cache?

Can someone clarify ?


How do you know you put good data into the cache in the first place?

There's always going to be a place where errors can creep in. There are no absolute guarantees; it's a numbers game. We've got more and more data, so the chance of corruption increases even if the per-bit probability stays the same. Checksumming reduces the per-bit probability across a whole layer of the stack - the layer where the data lives longest. That's the win.
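
The numbers game is easy to make concrete. Assuming independent errors at some per-bit rate (the 1e-15 below is an illustrative figure in the ballpark of vendors' quoted unrecoverable-read-error rates, not a measurement):

```python
def p_any_corruption(size_tb: float, p_bit: float = 1e-15) -> float:
    """P(at least one bad bit) when reading size_tb terabytes,
    assuming independent per-bit errors at rate p_bit."""
    n_bits = size_tb * 8e12  # 1 TB = 8e12 bits
    return 1 - (1 - p_bit) ** n_bits

for tb in (1, 10, 30):
    print(f"{tb:>3} TB: {p_any_corruption(tb):.1%}")
```

Even with a constant per-bit rate, the probability of seeing at least one corruption climbs steadily with the amount of data, which is exactly the point above.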


Agree wholeheartedly.

I was asking this thinking of open(<file>, O_DIRECT|O_RDONLY), which bypasses the buffer cache and reads directly from the disk; that at least solves the buffer-cache part, I guess. The disk cache is another thing, i.e. if we disable it we are good, at the cost of performance.
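
The fiddly part of O_DIRECT is that the buffer, offset, and length must be aligned; an anonymous mmap gives a page-aligned buffer. A sketch (4096 is an assumed block size, and it falls back to a normal read on filesystems like tmpfs that reject O_DIRECT):

```python
import hashlib
import mmap
import os

BLOCK = 4096  # assumed logical block size; O_DIRECT needs aligned I/O

def checksum_o_direct(path: str) -> str:
    """Read a file via O_DIRECT (bypassing the buffer cache) and hash it."""
    try:
        fd = os.open(path, os.O_RDONLY | os.O_DIRECT)
    except OSError:
        fd = os.open(path, os.O_RDONLY)  # e.g. tmpfs rejects O_DIRECT
    try:
        buf = mmap.mmap(-1, BLOCK)       # anonymous mmap is page-aligned
        size = os.fstat(fd).st_size
        h, off = hashlib.sha256(), 0
        while off < size:
            n = os.preadv(fd, [buf], off)  # aligned read from the device
            if n == 0:
                break
            h.update(buf[:min(n, size - off)])
            off += n
        return h.hexdigest()
    finally:
        os.close(fd)
```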

I was pointing out that tests can do these kinds of things.


I advocate taking it further with clustered filesystems on inexpensive nodes. Good design of those can solve problems on this side plus system-level. Probably also need inexpensive, commodity knockoffs of things like Stratus and NonStop. Where reliability tops speed, could use high-end embedded stuff like Freescale's to keep the nodes cheap. Some of them support lock-step, too.


GlusterFS now supports its own (out-of-band) checksumming. So you could have a Btrfs brick and an XFS brick to hedge your fs bets, and also set up GlusterFS volumes to favor XFS for things like VM images and databases, and Btrfs for everything else.


Neat idea. Appreciate the tip on GlusterFS.


As a counterpoint, I have 6x3TB ZFS raidz2 on FreeBSD at home. I resilver every month and have only had one checksum error, which I attribute to a cable going bad, given that it hasn't repeated.

Still agree that we need checksumming filesystems though. That and ECC RAM, to make the data written more trustworthy.


> had one checksum error that turned out to be a cable going bad given it hasn't repeated

I wouldn't assume that it was a cable error. The SATA interface has a CRC check. So the odds are quite high that a single error would simply result in a retransmission.

Of course, a plethora of detected SATA CRC errors and resulting retransmissions means that an undetected error could readily slip through. There should be error logs reporting on the occurrence of retransmissions, but I'm not enough of a software person to know how possible/easy it is to get that information from a drive or operating system.

OTOH, as you mention later in your post, a single bit error in non-ECC RAM could easily result in a single block having a checksum error. Exactly what you saw!


Hard drives also have spare sectors, so if a defect is detected at one spot in the disk, it will probably never touch that spot again.

Simply observing that an error only occurred once does almost nothing to narrow down the possible causes. You have to also be keeping track of all the error reporting facilities (SMART, PCIe AER, ECC RAM, etc.).


I simply don't have the upstream bandwidth necessary to back up 1TB (my estimate of essential data) offsite - it'd take months and take my ADSL line out of use for anything else.

I also have expensive power costs so running something like a Synology DS415 would cost $50 in power a year while barely using it - although that's better than older models.


Did you get any details on these 18 errors? Were they single bit flips?


No, unfortunately. I can't rule out the possibility of physical bus errors (like cable going bad or poor physical connection - in my case, there is one fairly expensive SAS cable per 4 drives, as I'm using a bunch of SAS/SATA backplanes with hotswap caddies); I do think that's probably more likely (or non-ECC RAM bitflip) than on-disk corruption.

But the exact nature of the problem is a distinction without a huge amount of difference to me. If I was copying those files, the copies would be silently corrupt. If I was transcoding or playing videos, the output would have glitches. Etc.

With this many HDDs, there are necessarily more components in the setup, and more things that can go wrong. Meanwhile, I'm not a business customer with profitable clients I can sell extra reliability to, so it's not the most expensive kit I could buy. I went as far as getting WD Red drives, and even then they were misconfigured by default, with an overly aggressive idle timer (8 seconds!) that needed tweaking.

The main thing is: more and bigger drives means increased probability of corruption.


Fortunately, ZFS on Linux is excellent, and is a two-liner on modern Ubuntu LTS. (Add PPA, install zfs.)


Is it? I've heard several complaints about bugs in FUSE.


ZFS on Linux[0] doesn't use FUSE.

[0] http://zfsonlinux.org/


Thanks, I was unaware. Apparently there is both native ZFS, and FUSE-backed ZFS for Linux:

https://en.wikipedia.org/wiki/ZFS#Linux


I've a friend who uses it. Can't say it is not buggy: he hit the bug where unlinked files weren't removed, and got to 95% usage before finding out he had to reboot/unmount to clean things up.

But that means his pool performance is now shit.


    > The only way to really be sure is to check your files after
    > writing them that they match.
This is assuming that the underlying block device would forcibly flush those queued writes to disk and then re-read them, rather than just serve them up from the pending write queue without flushing them first.

You generally can't make that assumption about a black box, so reading back your writes guarantees nothing.

Unless you're intimately familiar with your underlying block device you really can't guarantee anything about writes going to physical hardware. All you can do is read its documentation and hope for the best.

If you need a general hack that's pretty much guaranteed to flush your writes to a physical disk, it would be something like:

    After your write, append X random bytes to a file where X is
    greater than your block device's advertised internal memory, then
    call fsync().
Even then you have no guarantees that those writes wouldn't be flushed to the medium while leaving the writes you care about in the block device's internal memory.
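
That hack, sketched below. The 256 MiB default is an assumption you'd replace with your drive's advertised cache size, and as said above, even this guarantees nothing:

```python
import os

def paranoid_flush(scratch_path: str, cache_size: int = 256 << 20) -> None:
    """Append more bytes than the drive's advertised cache (assumed
    256 MiB here) to a scratch file, then fsync, hoping to push earlier
    writes out of the device's internal memory. Best-effort only."""
    with open(scratch_path, "ab") as f:
        written = 0
        while written < cache_size:
            f.write(os.urandom(1 << 20))  # 1 MiB of filler at a time
            written += 1 << 20
        f.flush()
        os.fsync(f.fileno())
    os.unlink(scratch_path)  # the filler is disposable
```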


This is why end-to-end data integrity with something like T10-PI is a necessity. The kernel block-layer already generates and validates the integrity for us, if the underlying drive supports it, but all major filesystems really need to start supporting it as well.
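
For concreteness: T10-PI appends an 8-byte tuple to each 512-byte sector, whose first field is a guard tag, a CRC-16 of the sector data using polynomial 0x8BB7. A bit-by-bit Python sketch of that guard CRC (the kernel uses table-driven or hardware-accelerated versions):

```python
def crc16_t10dif(data: bytes, crc: int = 0) -> int:
    """CRC-16/T10-DIF: polynomial 0x8BB7, no reflection, no final XOR."""
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x8BB7) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc
```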


I don't think that's a necessity for all workflows. Just think about it: that would require all of us buying enterprise 520- or 528-byte-sector drives to store the extra checksum information, and a whole new API up to the application level to confirm, point to point, that the data in the app is the data on the drive on writes, and the data on the drive is the data in the app on reads. It's not like T10-PI comes for free just by doing any one thing; it implies changes throughout the chain.



