There is one more piece to the puzzle to make git perfect for every use case I can think of: store large files as a list of blobs broken down by some rolling hash, a la rsync/borg/bup.
That would e.g. make it reasonable to check in virtual machine images or iso images into a repository. Extra storage (and by extension, network bandwidth) would be proportional to change size.
git has delta compression for text as an optimization, but it's not used on big binary files and isn't even online (it only happens when making a pack). This would provide it online for large files.
Junio posted a patch that did that ages ago, but it was pushed back until after the sha1->sha256 extension.
Sometimes they do - e.g. if you replace a file in the ISO with one of the same size up to block alignment, which is common when e.g. editing a text file or recompiling an executable with a minor change. They almost always do when it's a VM image representing a disk - only some blocks change on every write.
However, with self-synchronizing hashes of the kind used by rsync, bup, and borg, it doesn't matter - you could have a 1TB file, delete a single byte at position 100, and you would only need to store or transfer one new block (average size 8KB for rsync, configurable for borg) if you already have a copy of the version before the change.
It's somewhat comparable to diff/patch, but not exactly. It's worse in that change granularity is only guaranteed on average; it's better in that it works well on binary files, does not require a specific reference version to diff against (it can reference all previous history), and efficiently supports reordering as well as small changes. If you divide a 4000-line text file into four 1000-line sections and reorder them 1,2,3,4 -> 3,1,4,2, you will find the diff/patch to be about as long as a new copy, whereas a self-synchronizing hash decomposition will hardly take any space for the reordered file given the original.
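To make the self-synchronizing idea concrete, here is a toy content-defined chunker - not the actual rsync/bup/borg algorithm; the window size, hash, and boundary mask are arbitrary choices for illustration, scaled down so the effect is visible on a few KB of data:

```python
import random

WINDOW = 16                         # bytes in the rolling-hash window
MASK = (1 << 6) - 1                 # boundary when low 6 bits are zero -> ~64-byte avg chunks
POW = pow(31, WINDOW - 1, 1 << 32)  # 31**(WINDOW-1) mod 2**32, for O(1) rolling

def chunks(data: bytes):
    """Cut `data` wherever a rolling hash of the last WINDOW bytes has its
    low bits all zero. Boundaries depend only on local content, so they
    realign shortly after any insertion, deletion, or move."""
    out, start, h = [], 0, 0
    for i, b in enumerate(data):
        if i < WINDOW:
            h = (h * 31 + b) & 0xFFFFFFFF
        else:  # slide the window: drop the oldest byte, add the new one
            h = ((h - data[i - WINDOW] * POW) * 31 + b) & 0xFFFFFFFF
        if (h & MASK) == 0:
            out.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        out.append(data[start:])
    return out

random.seed(0)
data = bytes(random.randrange(256) for _ in range(4096))

# Delete one byte: only the chunk(s) around the edit are new.
edited = data[:100] + data[101:]
old, new = set(chunks(data)), set(chunks(edited))

# Reorder four 1KB sections 1,2,3,4 -> 3,1,4,2: only chunks straddling
# the seams are new; everything interior to a section is reused.
s = [data[i:i + 1024] for i in range(0, 4096, 1024)]
ro = set(chunks(s[2] + s[0] + s[3] + s[1]))
```

With real tools the average chunk is KB-scale rather than 64 bytes, so per-chunk hash overhead stays a small fraction of the data.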
rsync uses an 8KB chunk size, so a 1GB file has about 125,000 chunks. (And if every chunk needs 16 bytes of hash data to send, that's about 2MB - pretty efficient, especially if it can spot reorders.) Though according to Wikipedia it only does this if the target file has the same size, so adding new files to ISOs might not work in rsync's case - but still, the possibility is there for diff algorithms and version control systems.
No, the target doesn't have to be the same size. As an optimization, if size and modification time are the same, rsync will assume no change and will not hash at all (though you can force it to).
But it will definitely use hashes when the size differs (unless forced to copy whole files, or when copying between local filesystems).
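The weak checksum rsync slides over the receiver's file is the key trick: it can be updated in O(1) per byte instead of rehashing each window. A sketch following the Adler-32-style scheme the rsync documentation describes (function names and the modulus here are my own simplifications):

```python
M = 1 << 16

def weak_checksum(block: bytes):
    """Adler-32-style pair of running sums over a block."""
    a = sum(block) % M
    b = sum((len(block) - i) * x for i, x in enumerate(block)) % M
    return a, b

def roll(a, b, out_byte, in_byte, blocklen):
    """Slide the window one byte to the right in O(1):
    drop `out_byte` from the front, append `in_byte` at the back."""
    a = (a - out_byte + in_byte) % M
    b = (b - blocklen * out_byte + a) % M
    return a, b

data = b"the quick brown fox jumps over the lazy dog" * 20
n = 16
a, b = weak_checksum(data[:n])
for k in range(1, len(data) - n + 1):
    a, b = roll(a, b, data[k - 1], data[k + n - 1], n)
    # rolling matches recomputing from scratch at every offset
    assert (a, b) == weak_checksum(data[k:k + n])
```

Matches on this weak checksum are then confirmed with a strong hash, so false positives only cost an extra comparison.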
It really depends on the type of file. ("Other large blob types" is a rather broad category.)
One obvious example where you could have a lot of common blocks (even following the offset where a change was made) is zip files. The zip format basically compresses each file individually and then concatenates all that together.
Let's say you have a build and it packages the results up as a big zip file. (Java builds often do this. A jar is a special type of zip file.) If you change a few source files and rebuild, and if your build is deterministic (and/or incremental), then the new zip file will contain a lot of the same stuff as the previous version. And if your zip archiver is deterministic (pretty safe assumption), it should produce a zip file that is mostly the same sequences of bytes as the previous zip file, even if there are changed files in the middle.
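This is easy to check with Python's zipfile - a toy illustration where fixed timestamps stand in for a deterministic archiver (the member names and contents are made up):

```python
import io
import zipfile

def make_zip(members):
    """Build a zip in memory with fixed timestamps so the output
    depends only on the member names and contents."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as z:
        for name, text in members:
            info = zipfile.ZipInfo(name, date_time=(1980, 1, 1, 0, 0, 0))
            info.compress_type = zipfile.ZIP_DEFLATED
            z.writestr(info, text)
    return buf.getvalue()

v1 = make_zip([("a.txt", "alpha " * 500),
               ("b.txt", "bravo " * 500),
               ("c.txt", "charlie " * 500)])
v2 = make_zip([("a.txt", "alpha " * 500),
               ("b.txt", "BRAVO " * 500),   # only this member changed
               ("c.txt", "charlie " * 500)])

# The entry for a.txt is byte-identical in both archives, so they share
# a long common prefix even though a member in the middle changed.
prefix = next(i for i, (x, y) in enumerate(zip(v1, v2)) if x != y)
```

With many members and one change, everything before the changed member's entry (and typically much after it) is reusable by a chunk-level deduplicator.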
If you write a .tar.gz archive, then one change in the middle will throw everything off from that point on, because it compresses the whole archive instead of individual files. In theory a binary diff can work around this by first undoing the gzip that was done to create each large blob, then doing a binary diff on that, and then arranging to be able to recreate what gzip did. Obviously that's messy.
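The locality difference can be sketched with zlib directly - compressing members individually (zip-like) versus compressing the concatenation (tar.gz-like), using made-up toy data:

```python
import random
import zlib

random.seed(1)
vocab = [bytes([97 + random.randrange(26)]) * random.randrange(3, 8) for _ in range(40)]
files = [b" ".join(random.choice(vocab) for _ in range(400)) for _ in range(10)]

edited = list(files)
edited[4] = b"X" + files[4][1:]      # one-byte change in the middle member

def zip_like(fs):                    # each member compressed on its own
    return b"".join(zlib.compress(f) for f in fs)

def targz_like(fs):                  # the whole concatenation compressed at once
    return zlib.compress(b"".join(fs))

def common_suffix(x, y):
    n = 0
    while n < min(len(x), len(y)) and x[-1 - n] == y[-1 - n]:
        n += 1
    return n

# zip-like: members after the edit compress to identical bytes, so the two
# outputs share a long common suffix. tar.gz-like: the change feeds into the
# shared compression state (and the final checksum), so the tails diverge.
suffix_zip = common_suffix(zip_like(files), zip_like(edited))
suffix_gz = common_suffix(targz_like(files), targz_like(edited))
```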
Of course, not every file is an archive. Some are filesystems. But any writable filesystem (notably not including ISOs) that is capable of being used on a hard disk will of necessity not rewrite everything. If it did, changing one file on a filesystem would take hours because the rest of the partition would have to be rewritten.
Another obvious type of big blob is multimedia. I don't know a lot of specifics, but I would think file formats meant for editors would keep changes localized to reduce IO (for example, so that changes in a non-linear video editor don't need to rewrite a giant file), while formats meant for export and delivery might change the whole file since they're aiming for small size.
They have a non-essential copy of the directory at the end for speed; tools exist to rebuild it from the entries inside the file if it is corrupted. But it is usually very small (the only real-life exception I've met is the HVSC archive, where the directory size is very significant - so they zip it again).
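Concretely, that trailing end-of-central-directory record stores the directory's size and offset; a minimal sketch of locating both structures by their signatures (field offsets per the zip format spec):

```python
import io
import struct
import zipfile

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("hello.txt", "hello world")
data = buf.getvalue()

# End-of-central-directory record: signature PK\x05\x06, with the central
# directory's byte size at offset +12 and its start offset at +16.
eocd = data.rfind(b"PK\x05\x06")
cd_size, cd_offset = struct.unpack("<II", data[eocd + 12:eocd + 20])

# Central directory entries start with PK\x01\x02; the local file headers
# (PK\x03\x04) at the front of the file are what repair tools walk to
# rebuild a lost directory.
```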