Craziest thing I ever used SQLite for: partial file deduplication (2022) (sqlite.org)
343 points by ics on March 26, 2023 | 73 comments


I used a similar trick to stuff 14 Linux distros onto a 6 GiB SD card image using the btrfs filesystem. The distros share a lot of common files, so this works well to save space. (btrfs supports the same data block sharing CoW feature as APFS.)

https://xnux.eu/p-boot-demo/

I had to write a custom tool to do it efficiently during image creation.


This was neat. Found the important bits in https://megous.com/git/pinephone-multi-boot/tree/tools/main....


You savage. Kudos.


That's really cool. There are a bunch of tools that will deduplicate files via symlinks or hard links, but being able to do block-level dedupe simply by leaning on the filesystem is nice.

It sometimes feels like games are made to thwart this sort of thing. They often use packfiles, basically filesystems within files, optimized to look up assets quickly. They also allowed an optimized data layout back when consoles had slow spinning hard drives. The upshot is that a tiny patch inserting a line of code in a script may shift hundreds of megabytes of other data in the packfiles, so the block hashes no longer match up. Do any filesystems model inserts in some way? I'm pretty sure Steam updates can handle situations like that. I frequently see updates which download a tiny amount (kilobytes) but write a huge amount to disk (gigabytes), and I can't think of any other cause (assuming developers aren't using hilariously uncompressed assets).


Yes, Steamworks encourages developers to align assets to certain boundaries to help ensure efficient patching later, see the documentation here: https://partner.steamgames.com/doc/sdk/uploading


Interesting, it looks like Steam doesn't have any special handling for inserts/offsets and just works with 1 MB blocks. The extra disk I/O probably just comes from rewriting large files:

> Next - to update the pack file on a client device, SteamPipe builds the new version alongside the old version. When all new files are built, it then "commits" the update by deleting old files and moving the new files in. What this means is that to update a 25 GB pack file, SteamPipe will always build a new, 25GB file. If the update only requires 10 bytes of changes to that file, SteamPipe will need to copy almost the entire 25GB from the old file to the new file. Depending on the client storage hardware, this can be a very slow process.


I think SQLite + WAL could be a compelling solution for the game patching scenario.

There are already SQLite clustering technologies that utilize WAL frame shipping to replicate data. I don't see why a Steam patch couldn't just be a range of WAL frames to be applied to a client's database.
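
For anyone wondering what "shipping WAL frames" means mechanically: the documented WAL layout is a 32-byte header followed by frames, each a 24-byte frame header plus one page image, with all integers big-endian. A rough Python sketch of walking those frames (illustrative only, not any replicator's actual code):

  import struct

  WAL_HEADER_SIZE = 32
  FRAME_HEADER_SIZE = 24

  def wal_frames(path):
      """Yield (frame_index, page_number, commit_size) for each frame in a -wal file."""
      with open(path, "rb") as f:
          header = f.read(WAL_HEADER_SIZE)
          magic, _version, page_size = struct.unpack(">III", header[:12])
          assert magic in (0x377F0682, 0x377F0683), "not a SQLite WAL file"
          index = 0
          while True:
              frame_header = f.read(FRAME_HEADER_SIZE)
              if len(frame_header) < FRAME_HEADER_SIZE:
                  break
              page_number, commit_size = struct.unpack(">II", frame_header[:8])
              page = f.read(page_size)          # the raw page image
              if len(page) < page_size:
                  break
              # commit_size != 0 marks a commit frame; a "patch" could be any
              # contiguous run of frames ending on one of these boundaries.
              yield index, page_number, commit_size
              index += 1

Applying such a range on the client would then amount to replaying those page images into its copy of the database, which is roughly the idea behind WAL-shipping replication.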


Yes, but I think this workflow depends on the assumption that random accesses carry no cost overhead, which only became true in the current console generation, with its mandatory fast SSDs.

As long as you were potentially seeking around a spinning platter, you needed to have way more control over data locality than was possible with something like SQLite.


> There are already SQLite clustering technologies that utilize WAL frame shipping in order to replicate data.

Would you mind pointing out some that do this? Searching online isn't turning up anything about WAL frame shipping for SQLite. :(

Was thinking about exactly this scenario a few days ago, as I have a potential use for it. :)


Ahhh, at least Litestream seems to do this:

  https://litestream.io/how-it-works/
Awesome. :)


I think that ship has already sailed.

With HDDs or SSDs that might've been feasible, but with NVMe you really want as little as possible between raw storage and the game engine.


Yeah, it would probably be easier to move back from packfiles to individual files, assuming the main motivation for packfiles was data layout for HDDs and not filesystem overhead.


Initially packfiles were absolutely about filesystem overhead; "a lot of small files" was terrible performance-wise. OSes have gotten better though. I don't know about the Windows side, but on Linux the "few tens of thousands of small files" case really isn't that bad anymore.

But I'm really talking about the ability to load an asset by directly reading a blob of data, versus having to munge it through SQLite code. Or even just mmap'ing the data directly.


Windows is still extremely painful with many small files (at least via the Win32 API; maybe going directly through the Nt* functions is slightly better), and it only gets worse once you add "antivirus" malware that wants to hook into all FS access.


Interesting, is this a supported SQLite feature? I understand that when the WAL file gets too big (e.g. 4 MB), SQLite will checkpoint its contents back into the main database. How would one generate a WAL file big enough to contain a game patch?


As far as I can tell, Steam has two mechanisms. The first is chunk-based matching using rolling checksums, something like zsync/precomputed rsync. This allows switching between arbitrary build manifests and reuses matching chunks from the previous files. The second is based on computing patches at a smaller granularity, but generating these is only triggered when enough clients download the update. They are more efficient, but valid only for updating between the exact pair of build manifests they were generated for.

When compressed packfiles are used, they should be built with the equivalent of gzip's --rsyncable option, which restarts the compression stream based on rolling hash values; otherwise the patching methods are defeated. For example, UnityFS packfiles with chunked LZ4 compression normally don't do that, which results in huge updates. However, it is possible to repack them into a patch-friendly form while keeping file format compatibility.


If there is any compression going on you don't really have a choice. Long ago I actually wrote code that updated files (one field per record, no size changes) inside a .zip file, but the archive creation specifically forced the files in question to be stored with no compression.

(Legacy evolution--all the files of a job were packed into a single .zip. Then a need came along to update some of them. Nothing else would even notice that the files hadn't been compressed, but the one task could very quickly write the changes without replacing the file. Update the relevant byte, recalc the CRC from the in-memory copy, write it to both locations.)


Some games just add an update pak that overrides the changed files, without changing the original pak file. The installation size of such games might grow a lot over updates, wasting a lot of space by keeping the old files around, but at least the files are immutable and rollbacks of updates are theoretically possible.


Steam resisted any patching mechanism for updates for a long time. Even when they revamped the update system [0] they instead opted to split updates into separate files for their own pack format [1]. They seem to have relented though - I guess other developers were not as motivated to optimize their asset layout for Steam when Valve is the one paying for the bandwidth.

[0] https://developer.valvesoftware.com/wiki/SteamPipe

[1] https://developer.valvesoftware.com/wiki/VPK


What about using content-defined chunking? Then inserting a few lines shouldn't cause all the subsequent chunks to shift.


That's not quite right. The contents of the file would still shift; you'd just still be able to deduplicate most chunks between multiple files... but only in your own copies, because the filesystem still uses constant-size chunks.


But I think they're talking about doing content-defined chunking of the pack or archive file itself - inserting data should only affect the chunks at the insertion point, plus one. And since it's content-defined, those chunks are necessarily not constant size.

I.e., this is how backup tools like Arq or Duplicacy work.


Backup tools don't necessarily model mutation, since their purpose is to encode immutable snapshots of data. The modern ones that use deduped content storage also generally like to make snapshots independent of one another, so you can prune snapshots without worrying about invalidating another dependent snapshot. As a result, introducing a new version of a file likely has the same O(n) cost as backing up a new file, as long as they both contain content chunks already found in the storage area.

These backup tools also don't generally have to optimize random access to parts of a file. They may just store a linear sequence of chunk IDs to represent one file version. To bring this back to an active system that supports random access by a program with patching, I think you'd really need to adapt a copy-on-write filesystem to use content-defined chunking instead of fixed offset chunks. Then, your insertion is likely an operation on some tree structure that represents the list of chunk IDs as the leaves of the tree. But, this tree would now have to encode more byte offset info in the interior tree nodes, since it would vary at the leaves instead of each leaf representing a fixed size chunk.
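
To make the content-defined part concrete, here is a minimal chunker sketch in the gear/rolling-hash style (the table, mask, and size limits are arbitrary illustration values, not any particular tool's parameters). Because boundaries depend only on nearby byte content, an insertion only disturbs the chunks around the insertion point:

  import hashlib
  import random

  random.seed(0)
  GEAR = [random.getrandbits(64) for _ in range(256)]   # one random value per byte

  MASK = (1 << 13) - 1                  # ~8 KiB average chunk size
  MIN_CHUNK, MAX_CHUNK = 2048, 65536

  def chunks(data: bytes):
      """Yield (offset, length, digest) for content-defined chunks of data."""
      start, h = 0, 0
      for i, byte in enumerate(data):
          h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFFFFFFFFFF
          size = i + 1 - start
          if ((h & MASK) == 0 and size >= MIN_CHUNK) or size >= MAX_CHUNK:
              yield start, size, hashlib.sha256(data[start:i + 1]).hexdigest()
              start, h = i + 1, 0
      if start < len(data):
          yield start, len(data) - start, hashlib.sha256(data[start:]).hexdigest()

Chunk two versions of a file that differ by one inserted line and the chunk lists will match apart from the one or two chunks straddling the edit, so an index keyed by chunk digest still deduplicates the rest.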


Coincidentally, I just saw that the 2023 "USENIX Test of Time Award" went to an analysis of real-world files for deduplication purposes. I found Figure 1 particularly interesting: partial deduplication didn't save much over whole-file deduplication in practice.

https://www.usenix.org/conferences/test-of-time-awards


You'd expect that in general, but for developers shipping content updates, partial dedup would often win.


Very cool and probably a fun exercise, but I would probably put the data on a ZFS volume with dedupe instead, which, from reading this implementation, is pretty much the same thing from my layman's perspective. I could also add compression to the same dataset.


> I would probably put the data on a ZFS volume with dedupe instead

But doesn't ZFS dedupe eat RAM for breakfast? Double-digit GB of RAM per TB of data, IIRC?


The DDT (deduplication table) used to require ~1GB RAM for every 1TB data written over the life of the file system. Deleting data from the file system wouldn't remove the dedupe references, you'd have to recreate the pool entirely and start over with a new DDT.

However, there are now special devices that can be used to store the DDT. Typically this is done with two SSDs configured as a mirrored vdev for the DDT metadata. This reduces the memory overhead, but it costs some performance and still has the same limitation that the DDT size can only be reduced by re-creating the pool.


Yes, this is what I do. I wouldn't enable dedupe for all datasets; in this case, for instance, he wanted dedupe on his games, so I would just enable it for that dataset.

But I do run disk-wide compression with no problems and have done so on all my datasets for many years now, and it's been a tremendous space saver. Especially on machines where I have a lot of VMs/containers, it's not unusual for me to see a compression ratio of 2 with good old lz4; it will be interesting to see what damage zstd will do once I start experimenting with that.


+1 for compression! Unless you have very specific performance needs, there's no reason not to use compression. In fact, TrueNAS (previously known as FreeNAS) has it enabled by default.

I've yet to compare lz4 to zstd myself, but I've read great things about it.


It's usually along the lines of 1GB per TB. Some factors can affect that number but I've found it pretty accurate. Note that's 1GB of RAM that ends up wired to ZFS and never usable by the rest of the system.


No, only for deduplication, which is optional and doesn't make sense for most workloads. And usage depends on record (i.e. block) size and other factors. Usually 1GB per TB of deduplicated storage is the max. ZFS applies settings like this per dataset, so you can dedup a small "filesystem" of data without deduping all 40TB in the pool.

It will use lots of RAM for ARC (adaptive read cache) but that can be limited.

ZFS RAM usage is greatly overblown in my opinion and experience.


Agreed. People forget how much more RAM we have now versus when ZFS was developed. It's like old emacs swapping jokes.


My recent copy-on-write footgun was using ZFS on Ubuntu. There seems to be some automatic config that makes ZFS snapshots each time you apt install something. This led to wondering why, no matter how many files I deleted, my disk usage was not dropping from 100%: all those deleted files still existed in the snapshots. Coupled with some bug that makes the snapshots report their size as much smaller than they really are.


ZFS is fun but it locks up for unexplained reasons sometimes


I don't know if it's the same issue you have in mind, but I can 100% reproduce ZFS hanging and thinking a disk is unusable.

Steps to reproduce:

* Get an external HDD. I literally bought a new, different external HDD because I thought the problem was the old one (spoiler: nope, the problem still could be reproduced 100% on the new disk).

* Create a zpool for the whole disk.

* Create a dataset.

* Try to rsync several hundreds of GB to the ZFS dataset.

* Wait for a minute or two.

* Notice how it stops transferring data, and gives a weird error complaining the disk is unhealthy, faulty, or something (I don't remember the exact terms).

No amount of `zpool clear` or `zpool scrub` will fix it. I gave up and just formatted it as ext4 like all my other backup disks.

My use case for this was having this external HDD as a backup. The plan was to format this as ZFS, copy data from all my other external HDDs to this one, format the other external HDDs with ZFS, and then start rotating between them.

---

Another way to reproduce this is with torrents. When I downloaded torrents directly to an external HDD, it also hung and threw some errors, but in those cases it could be fixed with a `zpool clear` and scrubs, so it wasn't that bad (it wasn't literally unrecoverable, like the case I mentioned above).

---

So this leads me to believe there's something weird between ZFS, external HDDs, and trying to write too fast.

The whole point was to be able to run `zpool scrub` on those external HDDs. But like I said, I gave up on that for now. So the current plan is to try to build a NAS and do the same attempt, but with internal HDDs.


Maybe a failure in the connection or controller. But most likely the external disk was using SMR (shingled magnetic recording), which shouldn't be mixed with ZFS. There are different types of SMR, and some ZFS issues with SMR disks have been fixed in the past. servethehome.com has a detailed article (with benchmarks) on why those two technologies should not be mixed.


Any more info?

That sounds like a problem specific to the setup you were using. (?)

Saying that because if it was something that commonly happened, then either a) it would have been fixed, or b) people would have stopped using it. :)


If the specific setup is an external HDD (or maybe a very slow disk, and trying to write too fast to it), then I can make sense of parent's comment.

Like I mentioned in my sibling comment, I can 100% reproduce something that sounds like what parent mentioned (most recent attempt was like 1-2 months ago); but for my specific case, I can see how ZFS on an external HDD might not be that common.


I suggest playing with the ZFS tunings to add a write throttle. My hunch is your disk's buffer is filling and then blocking.


I think that there was an electrical problem (mismatched chargers on a laptop) in the first instance of mine.

Then there's encrypted datasets being very slow, especially when copying between two pools.

Then there's having two pools on the same disk corrupting each other. Though in this case the data might have been recoverable using some recovery tool (like a "modified zdb").

Also, I read on some website somewhere (maybe zfsonlinux.org) that USB devices probably shouldn't be used with ZFS.

Honestly I "wish" some bunch of crazy filesystem people would just clean room ZFS...


That sounds like a good idea, you just gave me new keywords to search for my next attempt.

Thanks!


Listen, people have to start using it to stop using it :=)


That's neat! In retrospect, did the database make the problem more tractable from an algorithmic or practical standpoint, or was it mostly just using the tools you're familiar with? If I were to approach the same problem, I likely would have tried to keep all the data in RAM and serialize it out to a file for incremental runs.


IME a database does not algorithmically improve anything, but it does put at your disposal some very efficient data structures that are definitely nontrivial to replicate. This is especially the case if you have to work with datasets that don't fit in memory!

An added bonus is that you can use SQL.


A friend works for a financial company that gets applicants from .NET and Linux backgrounds. He observed that the .NET devs almost always reached for a database and SQL to solve the same problems that the Linux people would solve with pipelines of commands.

Both were valid in his environment since they were both a big Oracle and SQL Server shop while having high Linux usage as well. I must ask him if PowerShell has changed things at all.


Definitely the latter.

There are many other products which implement "keep track of partial file contents to deduplicate", for example casync, borg, and restic.

None of them use SQLite for metadata storage, which kinda makes sense - in the author's schema, each block uses a (fileid, block_index, hash) tuple, which is 2/3 wasted space, since the fileids are mostly the same (if the files are big) and block_index is always just increasing. A custom data structure would be both faster and more space-efficient.
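
For reference, this is my reading of the kind of table being described (not the article's exact DDL); the redundancy sits in the first two columns:

  import sqlite3

  db = sqlite3.connect("blocks.db")
  db.execute("""
      CREATE TABLE IF NOT EXISTS blocks (
          file_id     INTEGER NOT NULL,   -- repeated for every block of the file
          block_index INTEGER NOT NULL,   -- always just 0, 1, 2, ... within a file
          hash        INTEGER NOT NULL,   -- truncated content hash of the block
          PRIMARY KEY (file_id, block_index)
      ) WITHOUT ROWID
  """)

(WITHOUT ROWID at least avoids a separate index for the composite key, but a per-file blob or run-length encoding of the hash list would still be far denser, which is the point above.)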


I have used SQLite to deduplicate CSVs that couldn't all fit into memory.

I imagine Dask would be a better tool for this job, but I wasn't familiar with it at the time, and SQLite did let me deduplicate using Python with no external dependencies.
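
A minimal sketch of that trick (file names are placeholders); the point is that the UNIQUE index lives on disk, so the set of rows already seen never has to fit in RAM:

  import csv
  import sqlite3

  db = sqlite3.connect("dedup-scratch.db")      # on-disk scratch space, not RAM
  db.execute("CREATE TABLE IF NOT EXISTS seen (row TEXT UNIQUE)")

  with open("input.csv", newline="") as f:
      for record in csv.reader(f):
          # INSERT OR IGNORE silently drops rows already present;
          # SQLite's index does the duplicate check on disk.
          db.execute("INSERT OR IGNORE INTO seen (row) VALUES (?)",
                     ("\x1f".join(record),))
  db.commit()

  with open("deduped.csv", "w", newline="") as f:
      writer = csv.writer(f)
      for (row,) in db.execute("SELECT row FROM seen"):
          writer.writerow(row.split("\x1f"))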


"There is no file path, because macOS lets you look up files by id. The hash values are cryptographic hashes truncated to 64 bits and reinterpreted as integers."

Is the author implying that APFS or HFS uses this method to calculate the file ID? I am unable to find any information regarding this. From what I understand, w.r.t APFS, the file ID is a combination of the inode OID and genID.


The hash values are part of his implementation and don’t have anything to do with APFS or HFS. The sentence after your quote says, “I picked the BLAKE3 hash function…” indicating it was something he chose, not a FS implementation detail.


That's what I was leaning towards. The wording is weird to me. Thanks.


I assumed they are talking about inodes.



Also relevant in the context of partial file deduplication (to get robustness wrt inserts and deletions): https://en.wikipedia.org/wiki/Rolling_hash


Using a proper hash table with a Bloom filter would save you the needless pass through a b-tree, though. Much faster and much less memory.


With a b-tree you can do even better. Instead of hashing entire files ahead of time, you can avoid hashing more than the minimum prefix required to conclude a file is unique, by building the b-tree lazily. I don't think it can be done with SQLite though, as it requires hashing the files while traversing the b-tree.

https://lib.rs/crates/dupe-krill#readme-nerding-out-about-th...
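
A rough illustration of the "only hash as much prefix as needed" idea (this is not dupe-krill's actual lazily-built b-tree, just the same principle expressed as group refinement: read files block by block, split groups on the block hash, and stop reading any file as soon as it is alone in its group):

  import hashlib
  import os
  from collections import defaultdict

  BLOCK = 64 * 1024

  def duplicate_sets(paths):
      """Group identical files, hashing only as many prefix blocks as needed."""
      # Files of different sizes can never be duplicates, so pre-group by size.
      by_size = defaultdict(list)
      for p in paths:
          by_size[os.path.getsize(p)].append(p)

      pending = [g for g in by_size.values() if len(g) > 1]
      handles = {p: open(p, "rb") for g in pending for p in g}
      result = []
      while pending:
          group = pending.pop()
          blocks = {p: handles[p].read(BLOCK) for p in group}
          if not any(blocks.values()):          # all files reached EOF together,
              result.append(group)              # so every block so far was equal
              continue
          refined = defaultdict(list)
          for p in group:
              refined[hashlib.blake2b(blocks[p]).digest()].append(p)
          for sub in refined.values():
              if len(sub) > 1:                  # a lone file is unique: stop reading it
                  pending.append(sub)
      for f in handles.values():
          f.close()
      return result

A file with a unique size or a unique first block never gets read beyond that point, which is where the savings over hash-everything-up-front come from.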



The other day I found out that not only ZFS and Btrfs have a dedup feature, but also XFS (which I use on my home partition) supports a feature called reflinks aka block sharing. So I threw duperemove at ~/.virtualenvs and ~/.cache/bazel and freed up a good 15GB of disk space. Good stuff.


Oh, hey Mark

(Not (purely) a reference to the “oh hey Mark” meme, I’m vaguely acquainted with Mark from LUG days)


Does anyone else have any other unorthodox use cases? I love SQLite, and am always happy to ram this square peg into a round hole.


PostgreSQL is Turing-complete, as proven by implementations of a cyclic tag system[0] and Turing machine[1]. The Mandelbrot set[2] and travelling-salesman problem[3] have also been implemented in it.

Transact-SQL is also Turing-complete, as proven by a Brainfuck implementation[4].

With that, you can theoretically compute anything :).

[0]: https://wiki.postgresql.org/wiki/Cyclic_Tag_System

[1]: https://blog.coelho.net/database/2013/08/17/turing-sql-1.htm...

[2]: https://wiki.postgresql.org/wiki/Mandelbrot_set

[3]: https://web.archive.org/web/20201111224603/http://assets.en....

[4]: https://stackoverflow.com/questions/900055/is-sql-or-even-ts...


Does any of the above apply to SQLite?


Coincidentally I just wrote this about using SQLite for managing i18n strings: https://dev.to/yawaramin/translations-without-the-tears-1n9k


Oh sweet. I've definitely used SQLite (in a zsh script) for file deduplication. Very simple, mostly just rows consisting of paths and file content hashes.

But partial file deduplication is something else...
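
The whole-file version really is that small; a Python equivalent of such a script might look like this (directory and file names are placeholders):

  import hashlib
  import os
  import sqlite3

  db = sqlite3.connect("files.db")
  db.execute("CREATE TABLE IF NOT EXISTS files (path TEXT PRIMARY KEY, hash TEXT)")

  for root, _dirs, names in os.walk("/some/directory"):
      for name in names:
          path = os.path.join(root, name)
          with open(path, "rb") as f:
              digest = hashlib.sha256(f.read()).hexdigest()   # fine for smallish files
          db.execute("INSERT OR REPLACE INTO files VALUES (?, ?)", (path, digest))
  db.commit()

  # Every hash that appears more than once is a set of identical files.
  for digest, count in db.execute(
          "SELECT hash, COUNT(*) FROM files GROUP BY hash HAVING COUNT(*) > 1"):
      print(digest, count)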


Reminds me of this article from Riot Games about how they deliver patches: https://technology.riotgames.com/news/supercharging-data-del...


I appreciate seeing this article that I wrote 4 years ago resurface :) Happy to answer any questions about it or the tech behind it.

We also use a SQLite database, and in a manner similar to the OP’s article. We use it to track which content-defined chunks are at which offsets in which files on disk. We deduplicate those chunks on the CDN so you have to download less data when updating the game, but on disk we need to recreate the files as originally uploaded because that’s how games load them.

We’ve expanded the use of this tech quite a bit since the article. For example, my team has started using the patcher internally to manage our 2-million-file vendored code and binaries repo, as a replacement for Git LFS of sorts. Cloning the repo from scratch became 10 times faster, and so did switching branches, especially for remote devs, since the files are served through CloudFront.

Some of the more interesting work we’ve done recently has been optimizing the core content-defined chunking algorithm. The GearHash rolling hash algorithm from FastCDC is decently fast, but we were able to improve chunking speeds by 5x with vectorized implementations and using AES-NI instructions in a clever way to avoid the GearHash table lookups. It can do 11GB/sec per core now using AVX-512 instructions.

Another thing we did recently was build a tool for repacking Unreal Engine Pak files so game assets maintain a consistent layout across versions, similar to what garaetjjte mentioned in a comment above about UnityFS files. This reduced the disk I/O needed to update VALORANT by over 90% and cut update times by half for players. The combination of content-defined chunking with tooling to make game data files friendlier to differential patching can make a huge difference.


Thanks for the link, that was a fun read.


This is quite amazing. I actually built something even crazier with SQLite, where I broke files up into parts and then made hashes of the parts, so in total it would use less space than the sum of the files' sizes. I used this for a similarity engine where I would try to see how similar different files were.


I don't get it. How does this SQLite database interact with the APFS volume?


Slightly off topic -> there was a project I saw on GitHub that claimed to support system administration using relational tables. Something like everything in SQL. I thought it might be a cool idea.



Looks familiar. Highly appreciated.


This sounds like it could be bolted onto/into rsync on the server side to present filesystems larger than the server can actually store.



