Persistent and unblockable browser cookies using last-modified HTTP header (nikcub.appspot.com)
87 points by nikcub on Aug 19, 2011 | 32 comments


As long as caching exists, there will always be tricks like this. Here's another example: Encode a unique ID in an image and send that to the browser with a long cache time. Then on subsequent requests, use JavaScript canvas to read the image and decode the unique ID from it.
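A rough sketch of that canvas trick in TypeScript; the /tracker.png endpoint and the 24-bit-ID-in-the-first-pixel encoding are invented for illustration (and it only works same-origin, since a cross-origin image would taint the canvas):

    // Read back a 24-bit ID that the server packed into the RGB channels of the
    // first pixel of a long-cached image. Alpha is kept at 255 in the image so
    // the RGB bytes survive the canvas round-trip unchanged.
    async function readCachedId(): Promise<number> {
      const img = new Image();
      img.src = "/tracker.png"; // hypothetical endpoint, served with a far-future cache lifetime
      await img.decode();

      const canvas = document.createElement("canvas");
      canvas.width = img.width;
      canvas.height = img.height;
      const ctx = canvas.getContext("2d")!;
      ctx.drawImage(img, 0, 0);

      const [r, g, b] = ctx.getImageData(0, 0, 1, 1).data;
      return (r << 16) | (g << 8) | b; // reassemble the ID from the pixel bytes
    }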

I've already completely disabled disk caching in Firefox by setting browser.cache.disk.enable to false. I'm seriously considering disabling the in-memory cache too, via browser.cache.memory.enable.

EDIT: In fact, I've just gone and disabled in-memory caching. It will be interesting to see whether the change is noticeable in my normal usage.
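For reference, those same two prefs in user.js form (equivalent to flipping them in about:config):

    user_pref("browser.cache.disk.enable", false);
    user_pref("browser.cache.memory.enable", false);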


Even the canvas/JS trick is not necessary. Have your website load a styles1.css file that is really generated by a script and has a long expiry; that can then in turn @import a styles2.css with a unique ID embedded in the imported URL (or alternatively reference an image carrying the unique ID). As usual you would try to identify the user by existing cookies first, so this is a fallback.

The styles2.css has little or no caching enabled, so it's requested reasonably often by the browser; it can then set a cookie, or you can just correlate the information on the server side.

That doesn't require JavaScript, just CSS (if requiring JS is OK, you could serve some innocently named "helper_functions.js" that sets window.uuid = "some unique value" and force it to be cached).
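A rough sketch of that CSS-only variant as a small Node/TypeScript handler; the endpoint names, cache lifetimes, and correlation logic are invented for illustration, not taken from the article:

    import { createServer } from "node:http";
    import { randomUUID } from "node:crypto";

    createServer((req, res) => {
      const url = new URL(req.url ?? "/", "http://localhost");

      if (url.pathname === "/styles1.css") {
        // Cached "forever": the @import URL, with its unique ID, is replayed
        // from the browser cache on every later visit.
        const uid = randomUUID();
        res.writeHead(200, {
          "Content-Type": "text/css",
          "Cache-Control": "public, max-age=31536000",
        });
        res.end(`@import url("/styles2.css?uid=${uid}");\n`);
      } else if (url.pathname === "/styles2.css") {
        // Never cached: requested on every page view, so the server sees the
        // same uid again and can correlate visits (or set a cookie here).
        console.log("seen uid", url.searchParams.get("uid"));
        res.writeHead(200, { "Content-Type": "text/css", "Cache-Control": "no-store" });
        res.end("/* nothing to see here */\n");
      } else {
        res.writeHead(404);
        res.end();
      }
    }).listen(8080);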

The site that started this: http://samy.pl/evercookie/


I agree that caching and privacy are mutually exclusive. On the other hand, disabling the cache generates unnecessary network and server load.


The plugin mentioned at the end won't work - not unless it's going to have a tedious whitelist.

Since many sites use third-party requests for things like remotely hosted images, fonts, and even Google-hosted jQuery, blocking them would almost certainly break the page.

What I would like is a warning when the Last-Modified value appears to be a malformed date or not a date at all. PHP's strtotime can convert almost anything to a date/time (if it really is a date) - could something like it be ported? Optionally, when the value doesn't appear to be a date within the past 30 days, never cache it.

Another solution for a pure privacy mode is to never send the Last-Modified value back, ever. It would hurt servers and page load times because things would never be cached and would always be served as a 200 instead of a 304, but for a pure privacy mode that may be necessary.

Last but not least, we could just dial down our cache time limit to, say, an hour max. It would still give some information to the trackers, but not across browser sessions or after the computer is turned off. Since Firefox doesn't have a time limit on the cache, using a memory-only cache is the only easy workaround for now.
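A rough sketch of the "only honour plausible dates" check suggested above, in TypeScript rather than PHP; the 30-day window is the one proposed in this comment:

    // Treat a Last-Modified value as cacheable only if it parses as a real date
    // and falls within the last 30 days.
    function isPlausibleLastModified(header: string, now = Date.now()): boolean {
      const t = Date.parse(header); // NaN for malformed dates
      if (Number.isNaN(t)) return false;
      const THIRTY_DAYS = 30 * 24 * 60 * 60 * 1000;
      return t <= now && now - t <= THIRTY_DAYS;
    }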


HTTP defines the date format precisely (although there are three valid formats), and browsers can already parse it for other purposes, like cache expiry (http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html). I imagine this bug will be fixed pretty quickly, since the HTTP replies being sent are out of spec.


Killing the If-Modified-Since header probably doesn't substantially increase privacy compared to the alternatives. Remember, the header doesn't stick around unless the browser caches the response anyway.

For example, Google Chrome's incognito mode clears this cached "cookie" along with all legitimate cookies when you close the browser (by clearing the cache), just as you'd expect of other cookies.


Re blocking third-party requests: there is already a well-known Firefox addon that does this, called RequestPolicy. If you have the patience to spend ten seconds training it the first time you visit each new website, it protects you from all sorts of bloat, tracking, and XSS/CSRF issues.


>The plugin mentioned at the end won't work - not unless it's going to have a tedious whitelist.

RequestPolicy (a Firefox addon) works just fine, and is not much more tedious than NoScript or CookieMonster, for example.

While remotely hosted images may be a strong argument, if the page breaks due to missing fonts or scripts, that's really only the site's fault.


Very interesting find, one of those where you read the title and immediately realize it's so obvious, you just never thought about it before!

One thing to note: any date at all could still be used as an identifier, say 01/17/2157 identifies you. What could be done is to restrict the accepted range to (present time - 3 months) <= Last-Modified <= present time.

That effectively reduces the number of distinct values they can track to 7,776,000 (90 days' worth of seconds). Rounding off the seconds would reduce it to 129,600 (90 days' worth of minutes).


Ensuring that If-Modified-Since is a proper date might not help much there. The evil site could just encode the unique ID in the time (and/or date) portion of the header. A possible workaround would be for the browser to randomly shift the time it sends a few seconds into the past, so the cache still mostly works but the value can't be reliably used for anything else.
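A sketch of that fuzzing idea, written as something a browser or privacy proxy might do to the outgoing header; the 30-second jitter is an arbitrary choice:

    // Shift If-Modified-Since a random few seconds into the past before sending
    // it, so the seconds field can't carry a stable identifier. Malformed values
    // are dropped entirely.
    function fuzzIfModifiedSince(value: string): string | undefined {
      const t = Date.parse(value);
      if (Number.isNaN(t)) return undefined;
      const jitterMs = Math.floor(Math.random() * 30_000); // 0-30 s into the past
      return new Date(t - jitterMs).toUTCString(); // RFC 1123 format, as HTTP expects
    }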


Correct me if I'm wrong, but I would imagine web servers rely on the current implementation as much as browsers do. That is, they just do a string comparison with the Last-Modified date on the server. This seems way easier to implement and might even be more correct (it would detect if the Last-Modified date has changed between two dates in the past, which might signify something fishy has happened).

Perhaps someone in the know can tell us: do the major web servers out there rely on string comparison, or do they really parse out the dates?


They rely on dates - they have to, since they need to work out whether one date is after another.

Lexicographic comparison does not work with inconsistent (but still valid) date formats: e.g. 9/08/2011 sorts after 10/08/2011 if compared lexicographically.

In practice, most server software probably relies on the date being in the W3C-specified format. If parsing fails (as it would for 9/08/2011), the server ignores the date instead of attempting a lexicographic comparison.

(I've never written a web server, but have written both client and server side software that works like this.)


Why do they need to work out whether the date they get back is before or after the one on the server? If it changes at all, presumably the browser has the wrong version so they should send down a 200 response with the correct version. If all the server needs to do is check whether the date has changed, and all the major browsers just return back whatever they got sent, why bother with parsing dates at all?


The server needs to parse the date sent by the browser to compare it against the date the file was last modified (which is retrieved from the operating system, probably as UNIX time).


As mentioned by sirclueless, that's not necessarily the case. Since Last-Modified is sent by the server and simply replayed by the browser, the server can just make sure the two strings are identical.

Browser: "I want index.html"

Server: "index.html was last modified on 'Pungenday, the 9th day of Bureaucracy, 3177 at 14:53'. Here are its contents."

[Time passes]

Browser: "I want index.html. I have a cached copy that you said was last modified on 'Pungenday, the 9th day of Bureaucracy, 3177 at 14:53'. Is there a newer version?"

Server: "Nope, your Last-Modified is the same thing I would tell you if I sent it right now."

Note that because the server sets the contents of last-modified for the browser, it can simply check if it's identical to what it would currently send as the last-modified header for that request.
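A rough sketch of that exact-match behaviour on the server side (Node/TypeScript, one hard-coded file, details simplified for illustration):

    import { createServer } from "node:http";
    import { readFileSync, statSync } from "node:fs";

    createServer((req, res) => {
      const path = "index.html"; // a single file, for illustration
      const lastModified = statSync(path).mtime.toUTCString();

      // Exact string comparison: no date parsing, just "is this the same
      // Last-Modified value I would send right now?"
      if (req.headers["if-modified-since"] === lastModified) {
        res.writeHead(304, { "Last-Modified": lastModified });
        res.end();
        return;
      }
      res.writeHead(200, { "Content-Type": "text/html", "Last-Modified": lastModified });
      res.end(readFileSync(path));
    }).listen(8080);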


> Server: "Nope, your Last-Modified is the same thing I would tell you if I sent it right now."

To get that information the web server must ask the file system when the file was last modified and compare against that. The RFC (2616, 14.25) recommends checking that the submitted time is earlier than the current time (which means the date must be parsed); however, some implementations do just compare the values for equality, as you suggest.

In this case, it would be the web server that is not following the RFC (by sending an arbitrary Last-Modified header), not the browser.


I've written server-side software that generated a meaningful Last-Modified header (it's very useful information) and also did an exact comparison instead of a time-based one. I did the exact comparison because I realized that it was very hard to guarantee correct results otherwise (correct being that I never 304'd a request unless the client already had the exact version of the page that I would have served). The problem is that there are a number of situations in a web server that can cause page modification time to go backwards. For example, several ways of doing rollback to previous versions of content will also roll back the page time, such as simply renaming an old version of the file to the current name.

To do correct time-based If-Modified-Since comparisons you really need to guarantee that every change moves your Last-Modified time forward, no matter what. My view is that this is surprisingly hard once you start looking at corner cases. Certainly it's not something that a web server serving general file content can ever guarantee; there are too many ways to shuffle files around behind the web server's back.


If we force If-Modified-Since to be within the last year and limit it to one-minute resolution, that leaves about 19 bits' worth of uniqueness for the site to play with (525,600 minutes in a year, roughly 2^19) - probably not enough for tracking on a large affiliate network?


Can n image GETs (or similar) be used to produce 19*n bits of uniqueness, presuming that the affiliate network is willing to handle a factor of n more requests?


Oof, you're right. These are better than cookies because you can set them for specific URLs. Correlating requests wouldn't be too hard: just look for consecutive requests from the same IP with the same referrer.


Perhaps, but it's still 19 bits that can be set independently of any other identifying traits like the User-Agent string.
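A sketch of how a tracker might spread an identifier across several cached resources under those constraints (dates within the past year at one-minute resolution, so roughly 19 bits each); the chunking scheme is invented for illustration, and on later visits the server would either invert the minute offsets or just store the exact date strings it issued:

    // Split a large ID into 19-bit chunks and encode each chunk as a
    // Last-Modified date at 1-minute resolution within the past year.
    const MINUTE = 60_000;
    const BITS_PER_RESOURCE = 19;

    function encodeChunkAsDate(chunk: number, now = Date.now()): string {
      // "chunk" minutes before now, rounded to a whole minute.
      const t = Math.floor(now / MINUTE) * MINUTE - chunk * MINUTE;
      return new Date(t).toUTCString();
    }

    function encodeId(id: bigint, resources: number): string[] {
      const mask = (1n << BigInt(BITS_PER_RESOURCE)) - 1n;
      const headers: string[] = [];
      for (let i = 0; i < resources; i++) {
        const chunk = Number((id >> BigInt(i * BITS_PER_RESOURCE)) & mask);
        headers.push(encodeChunkAsDate(chunk)); // one header per tracking image
      }
      return headers;
    }

    // e.g. an ID split across three image requests (3 x 19 = 57 bits of capacity):
    console.log(encodeId(123456789012345n, 3));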


I am surprised that it works like that. I would have expected the browser to just remember the last time it accessed a resource and send that time along.


That seems like a pretty good idea to me; it would make the Last-Modified response header unnecessary. I'm guessing it's not done this way because of clock-synchronization issues? (E.g., if someone has their computer's clock set to the year 2025, they can never get a refreshed copy from servers that use the header.)


That is a surprisingly clever and evil hack.


>The privacy plugin that I am working on, Parley, would solve the cross-site tracking aspect of this bug, since it blocks all third party requests.

https://addons.mozilla.org/en-US/firefox/addon/requestpolicy...


Should have mentioned that it is a Chrome extension,

and the reason I haven't released it yet is that I am experimenting with a number of features such as cookie rewriting, cache invalidation by rewriting requests, forcing SSL, etc.


As the maintainer of an open source proxy (Seeks), I'm playing with randomizing the last-modified header, on demand. The randomizing procedure is triggered by a regexp over the requested URL.

Does anyone here know if (and where) a useful list of websites that use such tracking methods has been compiled?


Just remove the Last-Modified header from all incoming server responses? And remove the If-Modified-Since header from all browser requests? If these can be set to arbitrary strings without affecting browser performance, user experience, etc., why have these headers at all?


I'm not a fan of virtualization for apps that are maintained and should be able to coexist under one kernel, but I may reconsider since browser and plugin vendors are still not offering a permanent Chinese wall between my activities on separate sites.


Because the sites themselves don't work with a Chinese wall between sites. Yourfavoritesite.com is probably loading images from images.yourfavoritesite.com, or possibly akamai.net, and scripts from jquery.com. Making a browser able to distinguish between that and black-hat cross-site stuff is an extremely difficult task that we still haven't gotten quite right.


I wish we had sandboxed tabs, i.e. you create a sandboxed tab instead of a normal tab, and it has its own cache, its own cookie store, etc. I could open my bank's website in a separate sandboxed tab and not have to worry about sites in other tabs hitting it with CSRF attacks and the like.


As a UK-based web developer, all these esoteric tracking methods make me hate the ICO's recent idiocy even more.



