
Pure speculation follows.

I wonder if they could be calculating hashes of files and sending them off? That would be useful for automated exfiltration and targeting.

For example:

    1. Calculate the SHA-256 hashes for files in places of interest.
    2. Report the hashes upstream.
    3. Hey, this file matches one that the FBI/NSA is looking for via NSL.
    4. Download more stuff. Also identify the person and their location.
    5. Send agents/drones after them.

This is unlikely, but still in the realm of possibility. It's also untestable without more information. (Packet captures from the DLP device would be far more helpful in determining if anything of the sort is happening.)


To calculate the hash, it needs to read the whole file, which this post claims it isn't doing.
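To make the cost concrete, here is a minimal Python sketch of a full-file SHA-256. The function name `sha256_file` and the block size are illustrative assumptions, not anything taken from Dropbox. Every byte of the file passes through `read()`, which is the I/O the post says it did not observe.

```python
import hashlib

def sha256_file(path, block_size=1 << 20):
    """Return the SHA-256 hex digest of the whole file at `path`.

    The file is streamed in `block_size` pieces so memory use stays flat,
    but every byte is still read from disk.
    """
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read() until it returns b"".
        for block in iter(lambda: f.read(block_size), b""):
            h.update(block)
    return h.hexdigest()
```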


I appear to have overlooked this detail. Good catch! :)


Did the author actually verify this with strace (or the Mac/Windows equivalent)?

It sounds like he guessed this based on I/O activity of the process. It could be enough to hash the beginning of the files, and compare the rest if a match is found in the database.
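A rough Python sketch of that two-stage check, purely as an illustration: the 4096-byte prefix length, the function names, and the shape of the `known` lookup table are all assumptions made for the example, not a description of what any real client does.

```python
import hashlib

PREFIX = 4096  # hypothetical prefix length

def sha256_prefix(path, n=PREFIX):
    """Hash only the first `n` bytes of the file (cheap)."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read(n)).hexdigest()

def sha256_full(path):
    """Hash the entire file (expensive for large files)."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(65536), b""):
            h.update(block)
    return h.hexdigest()

def matches_known(path, known):
    """`known` maps a prefix hash to the set of full-file hashes sharing it."""
    candidates = known.get(sha256_prefix(path))
    if not candidates:
        return False  # rejected after reading only the prefix
    # Only files whose prefix matched pay for a full read.
    return sha256_full(path) in candidates
```

With this shape, most files would show only a small read at the start, and a full read would happen only on a prefix match.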


Dropbox doesn't read the file content. There is also no proof that Dropbox directly accesses those files.


Not really. One could get a unique enough hash by reading just the first, let's say, 10,000 bytes of each file, and it would be faster than hashing the whole file.

Edit: I was bored enough to write it: http://pastebin.com/NJEvnG1d
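For readers who don't want to follow the link, a minimal Python sketch of the prefix-hash idea looks like this. It is not the pastebin code; the name `prefix_hash` and the default limit are assumptions chosen to match the 10,000-byte figure above.

```python
import hashlib

def prefix_hash(path, limit=10_000):
    """SHA-256 of at most the first `limit` bytes of the file at `path`."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read(limit)).hexdigest()
```

Any two files that share their first `limit` bytes produce the same digest, so this only works as a fast first-pass filter.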


I wrote something that was hashing audiobook files that was taking forever, so I tried using the first N bytes (likely much more than 10kB), but soon found that for any given audiobook, each chapter's MP3 had a large identical header on the front end - I imagine that it was a cover image embedded in the metadata.

I think in the end I just started taking the data from the end of the file, but if you're going with subsets, it's probably better to use a pseudo-randomly selected subset rather than a sequential subset. It doesn't have to be a different pseudo-random subset for each file, but I imagine there's an ideal noise profile in the sampling (maybe white noise is best).
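One way to sketch the sampled approach in Python, under loudly stated assumptions: the block count, block size, seed, and the decision to mix the file length into the digest are all choices made for this example, and uniformly random offsets stand in for whatever the ideal sampling profile turns out to be.

```python
import hashlib
import os
import random

def sampled_hash(path, blocks=16, block_size=4096, seed=0):
    """Hash the file length plus `blocks` pseudo-randomly placed blocks.

    A fixed seed means every file of the same size is sampled at the
    same offsets, so identical files always get identical digests.
    """
    size = os.path.getsize(path)
    h = hashlib.sha256()
    h.update(str(size).encode())  # assumption: length is part of the fingerprint
    with open(path, "rb") as f:
        if size <= blocks * block_size:
            h.update(f.read())  # small file: just hash all of it
        else:
            rng = random.Random(seed)
            offsets = sorted(
                rng.randrange(size - block_size + 1) for _ in range(blocks)
            )
            for off in offsets:
                f.seek(off)
                h.update(f.read(block_size))
    return h.hexdigest()
```

Because the offsets are spread over the whole file, a shared header (such as an embedded cover image) only defeats the fingerprint if every sample happens to land inside it.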


Of course you are correct (not sure why my comment was downvoted), but in the context of getting a unique-enough hash QUICKLY, being right 99.999% of the time across a set of millions of files is good enough. If one needs better hashing, they can hash the whole file, but that is quite heavy on large files and pointless if the application doesn't need it.


Maybe:

    File > 12 KB: First 4 KB, last 4 KB, middle 4 KB
    File <= 12 KB: Just hash the damn file
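That rule can be sketched in a few lines of Python. The name `three_point_hash` is made up for the example, and "middle" is taken here to mean the 4 KB block centered in the file, which is one reasonable reading of the rule.

```python
import hashlib
import os

CHUNK = 4 * 1024  # 4 KB, as in the rule above

def three_point_hash(path):
    """Hash the first, middle, and last 4 KB, or the whole file if <= 12 KB."""
    size = os.path.getsize(path)
    h = hashlib.sha256()
    with open(path, "rb") as f:
        if size <= 3 * CHUNK:
            h.update(f.read())
        else:
            for off in (0, (size - CHUNK) // 2, size - CHUNK):
                f.seek(off)
                h.update(f.read(CHUNK))
    return h.hexdigest()
```

The trade-off is that a change anywhere outside those three windows goes undetected.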


> Hey, this file matches one that the FBI/NSA is looking for via NSL.

Or why not more mundane: another user shares a file with you. Dropbox knows that you already have the same file somewhere on your filesystem outside the Dropbox folder (or a partial match). It doesn't have to transfer that data to you.

But I agree with those saying it's probably a result of some implementation issue (Finder extension or working around some shortcoming in monitoring just the Dropbox directory).


This is pretty close to how their deduplication used to work, and isn't too different from how rsync works.

Of course, this gave rise to the ability to transfer files (even non-public files) quickly between Dropbox accounts provided knowledge of the hashes of its chunks, and Dropbox has since changed their deduplication. See https://github.com/driverdan/dropship
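As a rough illustration of hash-based block deduplication in general (not Dropbox's actual protocol), here is a Python sketch. The fixed 4 MB block size, the function names, and the `known_hashes` set standing in for the server's index are all assumptions made for the example.

```python
import hashlib

BLOCK_SIZE = 4 * 1024 * 1024  # assumed block size for this sketch

def block_hashes(data, block_size=BLOCK_SIZE):
    """Split `data` into fixed-size blocks and return each block's SHA-256."""
    return [
        hashlib.sha256(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]

def blocks_to_upload(data, known_hashes, block_size=BLOCK_SIZE):
    """Return indices of blocks whose hash is not in `known_hashes`.

    `known_hashes` is a hypothetical stand-in for the server-side index.
    """
    return [
        i for i, digest in enumerate(block_hashes(data, block_size))
        if digest not in known_hashes
    ]
```

If the server treats a matching hash as proof that the client holds the block, then knowing the hashes alone is enough to claim a file, which is the weakness described above.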


Why would a TLA look for a file with known (hashable) content? Is that common? Is that to be seen as looking for "contraband", i.e. if you have a certain file, for instance some known part of a rootkit, you might be an evil attacking hacker? I don't quite follow.

Also, the typo "Drobpox" was fun, that's a good alias when feeling suspicious. :)


Tracking spread of leaked documents?


Or things like child pornography. It isn't all about state secrets, y'know.


Wouldn't they need to have the file already if they know its hash?


"Hey Dropbox, go find everyone who has a copy of a file with this SHA-256 hash, it's for national security purposes. Thanks"



