There is one more piece to the puzzle to make git perfect for every use case I can think of: store large files as a list of blobs broken down by some rolling hash, a la rsync/borg/bup.
That would e.g. make it reasonable to check in virtual machine images or iso images into a repository. Extra storage (and by extension, network bandwidth) would be proportional to change size.
git has delta compression for text as an optimization, but it's not used on big binary files and isn't even online (it only happens when making a pack). This would provide it online for large files.
Junio posted a patch that did that ages ago, but it was pushed back until after the sha1->sha256 extension.
Sometimes they do - e.g. if you replace a file in the ISO with one of the same size up to block alignment, which is common when e.g. editing a text file or recompiling an executable with a minor change. They almost always do when it's a VM image representing a disk - only some blocks change on every write.
However, with self-synchronizing hashes of the kind used by rsync, bup, and borg, it doesn't matter - you could have a 1TB file, delete a single byte at position 100, and you would only need to store or transfer one new block (average size 8KB for rsync, configurable for borg) if you already have a copy of the version before the change.
It's somewhat comparable to diff/patch, but not exactly. It's worse in that change granularity is only guaranteed on average; it's better in that it works well on binary files, does not require a specific reference version to diff against (it can reference all previous history), and efficiently supports reordering as well as small changes. If you divide a 4000-line text file into four 1000-line sections and reorder them 1,2,3,4 -> 3,1,4,2, you will find the diff/patch to be about as long as a new copy, whereas a self-synchronizing hash decomposition will hardly take any space for the reordered file given the original.
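To make the self-synchronizing idea concrete, here is a toy content-defined chunker - not the actual rsync/bup/borg algorithm; the window size, hash, and boundary mask are arbitrary choices for illustration, scaled down so the effect is visible on a few KB of data:

```python
import random

WINDOW = 16                         # bytes in the rolling-hash window
MASK = (1 << 6) - 1                 # boundary when low 6 bits are zero -> ~64-byte avg chunks
POW = pow(31, WINDOW - 1, 1 << 32)  # 31**(WINDOW-1) mod 2**32, for O(1) rolling

def chunks(data: bytes):
    """Cut `data` wherever a rolling hash of the last WINDOW bytes has its
    low bits all zero. Boundaries depend only on local content, so they
    realign shortly after any insertion, deletion, or move."""
    out, start, h = [], 0, 0
    for i, b in enumerate(data):
        if i < WINDOW:
            h = (h * 31 + b) & 0xFFFFFFFF
        else:  # slide the window: drop the oldest byte, add the new one
            h = ((h - data[i - WINDOW] * POW) * 31 + b) & 0xFFFFFFFF
        if (h & MASK) == 0:
            out.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        out.append(data[start:])
    return out

random.seed(0)
data = bytes(random.randrange(256) for _ in range(4096))

# Delete one byte: only the chunk(s) around the edit are new.
edited = data[:100] + data[101:]
old, new = set(chunks(data)), set(chunks(edited))

# Reorder four 1KB sections 1,2,3,4 -> 3,1,4,2: only chunks straddling
# the seams are new; everything interior to a section is reused.
s = [data[i:i + 1024] for i in range(0, 4096, 1024)]
ro = set(chunks(s[2] + s[0] + s[3] + s[1]))
```

With real tools the average chunk is KB-scale rather than 64 bytes, so per-chunk hash overhead stays a small fraction of the data.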
rsync uses an 8KB chunk size, so a 1GB file has about 125,000 chunks. (And if every chunk needs 16 bytes of hash data to send, that's about 2MB - pretty efficient, especially if it can spot reorders.) Though according to Wikipedia it only does this if the target file has the same size, so adding new files to ISOs might not work in rsync's case - but still, the possibility is there for diff algorithms and version control systems.
No, the target doesn't have to be the same size. As an optimization, if size and modification time are the same, rsync will assume no change and will not hash at all (though you can force it to).
But it will definitely use hashes when the size differs (unless forced to copy whole files, or when copying between local filesystems).
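The weak checksum rsync slides over the receiver's file is the key trick: it can be updated in O(1) per byte instead of rehashing each window. A sketch following the Adler-32-style scheme the rsync documentation describes (function names and the modulus here are my own simplifications):

```python
M = 1 << 16

def weak_checksum(block: bytes):
    """Adler-32-style pair of running sums over a block."""
    a = sum(block) % M
    b = sum((len(block) - i) * x for i, x in enumerate(block)) % M
    return a, b

def roll(a, b, out_byte, in_byte, blocklen):
    """Slide the window one byte to the right in O(1):
    drop `out_byte` from the front, append `in_byte` at the back."""
    a = (a - out_byte + in_byte) % M
    b = (b - blocklen * out_byte + a) % M
    return a, b

data = b"the quick brown fox jumps over the lazy dog" * 20
n = 16
a, b = weak_checksum(data[:n])
for k in range(1, len(data) - n + 1):
    a, b = roll(a, b, data[k - 1], data[k + n - 1], n)
    # rolling matches recomputing from scratch at every offset
    assert (a, b) == weak_checksum(data[k:k + n])
```

Matches on this weak checksum are then confirmed with a strong hash, so false positives only cost an extra comparison.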
It really depends on the type of file. ("Other large blob types" is a rather broad category.)
One obvious example where you could have a lot of common blocks (even following the offset where a change was made) is zip files. The zip format basically compresses each file individually and then concatenates all that together.
Let's say you have a build and it packages the results up as a big zip file. (Java builds often do this. A jar is a special type of zip file.) If you change a few source files and rebuild, and if your build is deterministic (and/or incremental), then the new zip file will contain a lot of the same stuff as the previous version. And if your zip archiver is deterministic (pretty safe assumption), it should produce a zip file that is mostly the same sequences of bytes as the previous zip file, even if there are changed files in the middle.
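This is easy to check with Python's zipfile - a toy illustration where fixed timestamps stand in for a deterministic archiver (the member names and contents are made up):

```python
import io
import zipfile

def make_zip(members):
    """Build a zip in memory with fixed timestamps so the output
    depends only on the member names and contents."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as z:
        for name, text in members:
            info = zipfile.ZipInfo(name, date_time=(1980, 1, 1, 0, 0, 0))
            info.compress_type = zipfile.ZIP_DEFLATED
            z.writestr(info, text)
    return buf.getvalue()

v1 = make_zip([("a.txt", "alpha " * 500),
               ("b.txt", "bravo " * 500),
               ("c.txt", "charlie " * 500)])
v2 = make_zip([("a.txt", "alpha " * 500),
               ("b.txt", "BRAVO " * 500),   # only this member changed
               ("c.txt", "charlie " * 500)])

# The entry for a.txt is byte-identical in both archives, so they share
# a long common prefix even though a member in the middle changed.
prefix = next(i for i, (x, y) in enumerate(zip(v1, v2)) if x != y)
```

With many members and one change, everything before the changed member's entry (and typically much after it) is reusable by a chunk-level deduplicator.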
If you write a .tar.gz archive, then one change in the middle will throw everything off from that point on, because it compresses the whole archive instead of individual files. In theory a binary diff can work around this by first undoing the gzip that was done to create each large blob, then doing a binary diff on that, and then arranging to be able to recreate what gzip did. Obviously that's messy.
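The locality difference can be sketched with zlib directly - compressing members individually (zip-like) versus compressing the concatenation (tar.gz-like), using made-up toy data:

```python
import random
import zlib

random.seed(1)
vocab = [bytes([97 + random.randrange(26)]) * random.randrange(3, 8) for _ in range(40)]
files = [b" ".join(random.choice(vocab) for _ in range(400)) for _ in range(10)]

edited = list(files)
edited[4] = b"X" + files[4][1:]      # one-byte change in the middle member

def zip_like(fs):                    # each member compressed on its own
    return b"".join(zlib.compress(f) for f in fs)

def targz_like(fs):                  # the whole concatenation compressed at once
    return zlib.compress(b"".join(fs))

def common_suffix(x, y):
    n = 0
    while n < min(len(x), len(y)) and x[-1 - n] == y[-1 - n]:
        n += 1
    return n

# zip-like: members after the edit compress to identical bytes, so the two
# outputs share a long common suffix. tar.gz-like: the change feeds into the
# shared compression state (and the final checksum), so the tails diverge.
suffix_zip = common_suffix(zip_like(files), zip_like(edited))
suffix_gz = common_suffix(targz_like(files), targz_like(edited))
```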
Of course, not every file is an archive. Some are filesystems. But any writable filesystem (notably not including ISOs) that is capable of being used on a hard disk will of necessity not rewrite everything. If it did, changing one file on a filesystem would take hours because the rest of the partition would have to be rewritten.
Another obvious type of big blob is multimedia. I don't know a lot of specifics, but I would think file formats meant for editors would keep changes localized to reduce IO (for example, so that changes in a non-linear video editor don't need to rewrite a giant file), while formats meant for export and delivery might change the whole file since they're aiming for small size.
They have a non-essential copy of the directory at the end for speed; tools exist to rebuild it from the entries inside the file if it is corrupted. But it is usually very small (the only real-life exception I've met is the HVSC archive, where the directory size is very significant - so they zip it again).
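Concretely, that trailing end-of-central-directory record stores the directory's size and offset; a minimal sketch of locating both structures by their signatures (field offsets per the zip format spec):

```python
import io
import struct
import zipfile

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("hello.txt", "hello world")
data = buf.getvalue()

# End-of-central-directory record: signature PK\x05\x06, with the central
# directory's byte size at offset +12 and its start offset at +16.
eocd = data.rfind(b"PK\x05\x06")
cd_size, cd_offset = struct.unpack("<II", data[eocd + 12:eocd + 20])

# Central directory entries start with PK\x01\x02; the local file headers
# (PK\x03\x04) at the front of the file are what repair tools walk to
# rebuild a lost directory.
```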