On Cache Invalidation: Why Is It Hard? (2018) (yihui.org)
110 points by steventey on April 5, 2021 | 70 comments


Invalidating caches is super easy when you don't have to do it. You just pick some element of the cache and blow it away; nothing is affected other than performance.

What is hard is invalidating a cache entry when the item must be removed because it no longer exists outside the cache. If that is not done, you get a stale cache entry, which is incorrect and possibly breaks the system.

So now suppose the cache is multi-threaded, distributed and whatever. Now you have edge cases and race conditions. Is it okay if the item is gone, but the cache remains for just a little window of time? If not, can we remove it from the cache first, then for real? What if in between those two, something accesses it so it reappears in the cache? Pardon me, did I say the cache? I meant seventeen different caches throughout the distributed system ...


And now imagine all of this logic has to be implemented in the form of silicon transistors that cannot be changed without incurring many millions of $ in losses if a tiny bug slips through. And the tiniest bug will most likely make every application under the sun crash within a few seconds, or at least deliver wrong results.

That's basically what happens today in bleeding-edge CPU architecture development.


In the area of caching among multiple cores, we have well-researched cache coherency protocols that can be implemented.

We can assume all the processors that are relevant are online; no processor core will try to invalidate an item while being cut off from communication, which is then re-established later. (If a processor comes online dynamically, we can assume it's coming in with a clear cache.)

Bus snooping is possible: every cache can see every memory access, and based on that, it can invalidate (or even update) entries that would be made stale due to stores. In a distributed system, no such thing is possible; you can't feasibly have 10,000 geographically distributed nodes all watching every transaction.
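
As a toy model of the snooping idea, nothing like real MESI hardware, just the shape of it (all names here are made up for illustration):

    # Toy model of snooping: every cache is attached to one shared "bus"; a
    # store broadcast by any core drops the matching line in every other cache.
    class Bus:
        def __init__(self):
            self.caches = []

        def broadcast_store(self, address, writer):
            for cache in self.caches:
                if cache is not writer:
                    cache.invalidate(address)

    class SnoopingCache:
        def __init__(self, bus):
            self.lines = {}           # address -> cached value
            bus.caches.append(self)

        def invalidate(self, address):
            self.lines.pop(address, None)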

We can also assume that any remaining race conditions are handled by the software: that the software will use tools like memory barriers in the right places; we don't have to solve every possible concurrency problem that interacts with caching effects.


Careful: memory barriers/fences have nothing to do with caches; they establish an observed ordering when the processor uses out-of-order execution. The cache still has to be strongly consistent at all times, or else there's nothing a memory barrier can do to save it.


>Is it okay if the item is gone, but the cache remains for just a little window of time? If not, can we remove it from the cache first, then for real?

This is typically called eventual consistency [1], which is common when working with distributed systems: in the same way, you may write an entry and it doesn't become immediately available. This is often the case in my day-to-day work, and you "just" have to work under the assumption that the data sometimes isn't the most recent.

[1] https://en.wikipedia.org/wiki/Eventual_consistency


Eventual consistency, if I recall correctly, stands for “race conditions”. During the NoSQL craze I read one opinion piece on it that I thought was pretty well put together. It boiled down to this: eventual consistency data stores are easy to implement so lots of people create them. But they are not easy to use, just easy to get started on. They are fine until they aren’t and they are overused as a solution to anything distributed.

I personally much prefer to start with a system that has strong consistency guarantees and then relax those guarantees exactly where necessary. Trying to work it from the other end will show you exactly why data store developers often choose EC even if it’s not the best pattern.


It's perfectly fine to use an EC system for many use cases. Caches are a perfect example. If one reader happens to get a stale entry from a cache, but that stale entry is sufficient for your task, why pay the price of transactions?

As an example, I built a system with exclusively commutative, restricted operations. What that means is that I can look at an item and know, regardless of when that item was written, that only certain operations will be applied in certain ways.

To serve a query, such as "Find me the item with an element X greater than 100", I don't need to find every item's consistent state of X. If an item's X from cache is, say, 50, and I know that X is restricted to only ever shrink in value, I don't need to hit the database for a consistent view of X.
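
A sketch of that check under the stated invariant (the item ids, cache and db interfaces here are made up for illustration):

    # X is guaranteed by the application to only ever shrink. If the cached,
    # possibly stale, value already fails the predicate, the fresh value must
    # fail it too, so the consistent database read can be skipped.
    def matches_x_greater_than(item_id, threshold, cache, db):
        cached_x = cache.get(item_id)          # may be stale
        if cached_x is not None and cached_x <= threshold:
            return False                       # true X <= stale X <= threshold
        x = db.read_consistent_x(item_id)      # only pay for consistency when needed
        return x > threshold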

Application level constraints like this are much more powerful than database transactions and radically more efficient, which is why EC systems can perform so well.


To establish that using an EC system is "perfectly safe" for any given use case, stronger properties are required than mere EC, since that by itself provides no safety guarantees at all, only an assertion that data will "eventually" be reconciled in some arbitrary way. One example would be "strong eventual consistency" as provided by CRDTs ("conflict-free replicated data types").


Yep, exactly. It's actually pretty trivial for some work. If you can do that, you get the best of both worlds. Lots of queries can be answered with stale data, which means you drop the massive overhead of strong consistency and transactional workloads.


Not if you are caching permissions. Or bank account balances. Or game scores. Or a million other things that are not Facebook comments.


OK, don't do that? Or do it carefully.

We do it to answer very correctness-sensitive questions around security, but it doesn't matter because stale answers are still valid if you know how those answers could have possibly been updated - and that's just an application invariant.


In that very specialized case this works. Does your system have any guarantees beyond eventual consistency where eventual could mean hours or days?

My point is that in general, EC is not a feature. Nobody sets out hoping to find a database that provides EC. They usually set out to find a database that can be globally distributed and have strong ACID guarantees. When confronted with various cost constraints, they eventually settle for a system that makes trade-offs where part of the price is EC. They then work around EC, usually not completely but enough that most of the time the system works fine. But EC is not in and of itself good or desirable; it's just a lesser of several evils. Moreover, of the evils it does compete with, it's not necessarily the least, just the easiest to implement and, as a result, the most popular.

Your argument of “it works in this one case and it works well” is a bit of a straw man in that no cache at all also works in some cases, but that doesn’t make it a general solution. I have successfully used an EC system for a decently sized (at the time at least) dataset and it worked well but it was only because that particular workflow naturally allowed for EC semantics (streaming updates every few seconds/source). But I sure as hell wouldn’t want to build a bank on EC.


> In that very specialized case this works.

The specialized case is where you can ensure a few application level constraints about your data, which so far in my experience are extremely valuable constraints. It's maybe a less-trodden road, but not a difficult one.

The benefit is massive improvements to performance and reduction in complexity - you eliminate the need for a complex consensus system.

It is far more than "it works in this one case": EC removes a massive cost in databases that is often unnecessary (transactional logic), and in return it gives huge improvements in other areas. Specifically, and relevant to this thread (because the article is on caching), this is particularly desirable in the area of caching.

This is, as another user mentioned, called strong eventual consistency.

Your comment compared EC to race conditions, which I think is quite a negative way to view it, so I wanted to point out that EC is not "strictly worse" or buggy or whatever.


I think you are saying what I am saying: in specialized cases EC is fine and a good cost saving measure.

My only addition to that is that its popularity may be due to the ease of developing EC databases vs ones with distributed consensus algorithms. Personally, I prefer to start with a system not based on EC and then add EC where necessary, whereas it sounds like you prefer to start with EC and add constraints. I think your approach is more popular, but in my personal work experience it leads to more fragile systems, which is why I advocate for at least understanding why that choice is being made.

EC systems aren’t inherently buggy. They just by themselves don’t include guarantees that you might find useful or desirable for general workloads.


> There are only two hard things in Computer Science: cache invalidation and naming things.

I've always heard it as "There are only two hard things in Computer Science: cache invalidation, naming things and off-by-one errors."


Nitpick but I've always preferred to say it as "naming things and cache invalidation" in that order, which IMO emphasizes the irony of creating labels for things and then needing to forget about them.


That's clever!

I think off-by-one errors got a bit easier once we started normalizing loops like `for cat in cats`, but that doesn't really fly when you need to do multiple array accesses while iterating through the loop, so I suppose they're here to stay...


"Cache invalidation, naming things and off-by-one errors. And cache invalidation."


Don't race forget conditions about.


Wow that’s pretty funny. I’m going to use that someday.


The way I see it there are only two ways to perfectly invalidate a cache.

The first is to only cache the results of pure functions whose arguments can be perfectly equality-checked. But if your arguments aren't trivial to compare, this can incur some overhead even on cache hits. It also doesn't work for a stateful system, because not all relevant information gets passed in as arguments.

So the second way is to track every possible mutation to any data in the system, associate each piece of data with all caches that depend on it, and "push" invalidations to those caches when something changes. In the context of a front-end app, this is what MobX does. Salsa is a Rust library that does something similar and is used by rust-analyzer. Broadly this falls under the term "reactivity", though I've seen the word used to describe several different related concepts.

Most of the invalidation strategies people use in practice are imperfect and ad-hoc and depend on domain knowledge. Often you get a version of #2, but done manually instead of automatically: "I know that when X happens, we need to bust cache Y, so I'll manually code that behavior". This is the "hard" version that the adage refers to.
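
A minimal sketch of the push approach, nothing like the real MobX or Salsa internals, just the dependency bookkeeping:

    # Toy "push" invalidation: each cached computation declares the data keys
    # it reads; writing any of those keys pushes an invalidation to the cache.
    class ReactiveStore:
        def __init__(self):
            self.data = {}            # key -> value (the mutable state)
            self.dependents = {}      # data key -> set of cache keys
            self.cache = {}           # cache key -> cached result

        def cached(self, cache_key, reads, compute):
            if cache_key not in self.cache:
                self.cache[cache_key] = compute(self.data)
                for k in reads:
                    self.dependents.setdefault(k, set()).add(cache_key)
            return self.cache[cache_key]

        def write(self, key, value):
            self.data[key] = value
            for cache_key in self.dependents.pop(key, set()):
                self.cache.pop(cache_key, None)    # push the invalidation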


> track every possible mutation to any data in the system, associate each piece of data with all caches that depend on it, and "push" invalidations to those caches when something changes

It's fairly easily done in the backend if you use event-sourcing. All mutations go through explicit events, so you just need an event listener that listens to all relevant events and invalidates the cache.

You can still have a race condition between the event being dispatched and the cache being invalidated, though (or the opposite, depending on when your event listeners are triggered)
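
A rough sketch of that listener shape (the event types, fields and cache API here are made up for illustration):

    # Hypothetical event-sourced setup: every mutation is an explicit event,
    # and one subscriber's only job is to evict the entries the event affects.
    def make_invalidator(cache):
        def on_event(event):
            if event.type in ("UserRenamed", "UserDeleted"):   # the "relevant" events
                cache.delete("user:%s" % event.user_id)
        return on_event

    # event_bus.subscribe(make_invalidator(cache))
    # The dispatch-vs-invalidate ordering race mentioned above still applies.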


Here's my take.

If you could cache all the inputs that go into a pure function, you'd never need to invalidate the cache. Your cache becomes just a lazily-created tabulation of your function - which is fixed, as a map is just another representation of a mathematical function.

So cache invalidation happens only where your cache key doesn't contain all the parameters of the cached computation - where your cache isn't a mathematical function. The problem is that, in the real world, it's almost impossible to fully capture all the parameters of a computation in a way that gives you a usable cache (one that trades time/CPU for memory). So we need to track possible changes in the "hidden state", and invalidate a cache entry when we detect such a change. That's where all the complexity sits.

For instance, take the problem of a function countWords(file) that counts words in a file. How would we cache its results? If we key by file name, the cache will be wrong when the file gets changed. Keying by the hash of the file contents is out of the question, because hashing is at least O(n) with respect to file size, just like countWords(), so we gain nothing. Keying by file name + modification date will fail if the clock changes, or if the file gets updated faster than the time resolution we're using. Not to mention, we'll have false positives if the date changes but the contents stay the same.

But assuming that the filesystem is sane and nobody messes with the clock, what else could happen? The definition of what a "word" is can change, making the entire cache bad. We may have found a bug in countWords(), updated the code, and now the cached results are also bad. The contents of the cache may have been changed externally (e.g. if it's shared by multiple functions, or by multiple processes using countWords() with different definitions of "word" or with a partially deployed bugfix). The code of the cache itself may have been updated to fix a storage bug there. Etc.

At some point you have to decide which of those issues you care about and which you ignore - and for those you care about, how to monitor for changes, propagate them to the cache, and do it fast enough that none of the cache users will see invalid data in between detection and propagation. That's why cache invalidation is hard.
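
For concreteness, a sketch of the name-plus-metadata keying from the example above, with some of those caveats left as comments (assuming a POSIX-ish os.stat and a whitespace definition of "word"):

    import os

    word_cache = {}

    def count_words(path):
        st = os.stat(path)
        # Key by path + size + mtime: wrong if the clock moves backwards, if
        # the file is rewritten within the mtime resolution, or if the
        # definition of "word" (i.e. this code) changes without a cache flush.
        key = (path, st.st_size, st.st_mtime_ns)
        if key not in word_cache:
            with open(path, encoding="utf-8") as f:
                word_cache[key] = sum(len(line.split()) for line in f)
        return word_cache[key]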


Every pure function is part of a bigger machine that eventually involves mutable state, or by definition that pure function has no reason to exist. Ergo, cache invalidation is hard.


I don't follow. Being part of a bigger machine isn't proof that it's hard.


>Every pure function is part of a bigger machine that's eventually mutable state, or by definition that pure function has no reason to exist

Not necessarily. The universe could be just one huge pure function.


But the universe is constantly changing. So you will need to know the specific bits of the universe that your function depends on or you will never get a hit.


Is the universe changing, or just your references to it?


So the references are mutable. Gotcha.


Aye, functional core, imperative shell (us!).


Ha, I guess it's possible. But if that's true then it seems that some of the intermediate steps in "public static void universe()" use mutable state that can result in cache invalidation difficulties.


Start by doing the easiest thing: nuke all caches whenever anything changes! It's guaranteed to avoid all invalidation edge cases, and it's "good enough" performance-wise for many applications. If and when that proves insufficient, start implementing more fine-grained invalidation. Doing otherwise is just premature optimisation.


You still have a race condition, as long as your backend serves more than one request at a time.

If you nuke the cache _before_ performing the change, another request might make the changed item reappear in the cache before the change is completed.

If you nuke the cache _after_ performing the change, requests might sneak in between the change and the cache nuke and get stale data.


Caches seem to be much easier to handle using the actor model for concurrency.

Actors can have their own cache, and since they queue messages and handle them in order, one at a time, the cache will never be out of sync.

It does come at several other costs though.


Actors can indeed make caches super easy.

If your entries are simple enough (add/remove/update) and can be expressed as messages, wiring up a simple pub-sub cache is an afternoon project.
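
As a toy sketch of that kind of cache actor in Python (a thread plus a queue standing in for a real actor runtime):

    import queue, threading

    # Toy cache "actor": one worker thread owns the dict, and all reads and
    # writes arrive as messages that are applied strictly in order.
    class CacheActor:
        def __init__(self):
            self.mailbox = queue.Queue()
            threading.Thread(target=self._run, daemon=True).start()

        def _run(self):
            cache = {}                      # owned by this thread only
            while True:
                op, key, value, reply = self.mailbox.get()
                if op == "put":
                    cache[key] = value
                elif op == "invalidate":
                    cache.pop(key, None)
                elif op == "get":
                    reply.put(cache.get(key))

        def get(self, key):
            reply = queue.Queue(maxsize=1)
            self.mailbox.put(("get", key, None, reply))
            return reply.get()

        def put(self, key, value):
            self.mailbox.put(("put", key, value, None))

        def invalidate(self, key):
            self.mailbox.put(("invalidate", key, None, None))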


Then you should lock the cache while the change is being performed?


I'm not sure why you got downvoted. A lock could easily solve that problem. Specifically, you could use a read-write lock where readers either keep it open for an entire request or check an epoch number each time they take the lock and abort/restart the request if it changes.
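
A sketch of the epoch-check variant, using a plain mutex instead of a true read-write lock for brevity (the store/cache shapes here are illustrative):

    import threading

    # Writers bump an epoch while applying the change and the invalidation
    # together; readers snapshot the epoch and restart if a write landed
    # while they were loading.
    lock = threading.Lock()
    epoch = 0
    cache = {}

    def write(key, value, store):
        global epoch
        with lock:
            epoch += 1
            store[key] = value           # the real change...
            cache.pop(key, None)         # ...and the invalidation, atomically

    def read(key, store):
        while True:
            with lock:
                start_epoch = epoch
                hit = cache.get(key)
            if hit is not None:
                return hit
            value = store[key]           # slow path, outside the lock
            with lock:
                if epoch == start_epoch:
                    cache[key] = value
                    return value
            # a write snuck in between; restart the read as described above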


Adding a cache to improve Availability and Partition tolerance while pretending that your (now) distributed system is also Consistent is not "hard", it's impossible. Sorry, CAP applies to caches too.

You inflict this hard problem on yourself when you try to throw away the time context of the data. Just... don't do that. Contextualize data with a commit log sequence id and all of these problems vanish. If the cache has data valid up to sequence id XYZ, include that context in the response -- that's all a cache can say anyways:

    client: hey can I have /foo/bar
    cache: sure, /foo/bar was 'abc' as of #XYZ
    (or)
    client: hey can I have /foo/bar as of at least #XYZ+7
    cache: sure let me look that up...
And thus the cache invalidation problem is reduced to the naming problem, and you have at least one fewer off-by-one error. Change my mind. :)
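
A sketch of that contract (a toy cache fed from a hypothetical commit log):

    # Toy cache keyed by the commit-log position it has applied: every answer
    # carries "as of #seq", and a request can demand a minimum sequence id.
    class LogBackedCache:
        def __init__(self):
            self.applied_seq = 0      # highest commit-log sequence id applied
            self.entries = {}         # key -> value

        def apply(self, seq, key, value):
            self.entries[key] = value
            self.applied_seq = seq

        def get(self, key, min_seq=0):
            if self.applied_seq < min_seq:
                return None           # not caught up; caller waits or goes upstream
            return self.entries.get(key), self.applied_seq

    # cache.get("/foo/bar")            -> ("abc", the seq it was valid as of)
    # cache.get("/foo/bar", min_seq=7) -> None until the log catches up to #7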


Who said caches were there to improve Availability or Partition tolerance in a Consistent system? Inside Consistent systems, caches only exist to make calculations go faster.


The main problem is that sticking a cache system in front of a database turns it into a homemade distributed database. Oftentimes your use case is simple enough, and your consistency requirements weak enough, that it doesn't cause issues. But these homebrew systems are not made with the rigor and focus of a dedicated distributed database solution, and this will eventually bite you: hopefully gently, in a way that may make sense to just live with, but sometimes very painfully, and that is when you remember that cache invalidation is indeed hard.

It seems that the "correct" solution is building the cache into the database: allow tying the cached result to the snapshot and inputs it was generated against, and use the regular controls for stale-read tolerance to ensure that the cache entry is up to date. This even lets you use cache entries within a strongly consistent transaction (although staleness will probably hurt your hit rate).


Everything is cached nowadays: Instruction branches, registers, memory, disk, database query results, proxy responses, DOM, GUI elements. Most data gets cached at multiple levels already.

In any system, whether poorly or well designed, one may discover that hardcoding responses to high-level/networked queries indeed improves performance. So it becomes tempting to duplicate the application's responses in front of it, often without much concern for whether duplicating the application outside of it is a good idea in that specific case or not.


Interestingly enough, in my entire career so far, for things that should have been cached but weren't, or that were already cached and working, the one invalidation point was the save page of the entity. Sometimes it's not that hard for the vast majority of cases.


Indeed. I’m not even sure the post was talking about cache invalidation, more like what is the correct abstraction by which you cache something.

I only truly understood cache invalidation problems when I read Instagram’s engineering posts - then it dawned on me how hard it is for them to maintain a cached user profile database that has to spread over the entire world for a billion users (and potentially constantly be invalidated as users from around the world like their post etc).


On the multi-core systems of today, why are we using memory and the cache as a communication device between threads running on different cores? Don't we have the spare transistors and space on the die to implement purpose-built communication hardware between cpu cores which does not fall victim to false sharing, and the like? Isn't the current setup a bit hacky?


Almost all communication between devices is memory-mapped I/O these days. x86 has I/O ports, but they're mostly used for devices that were specified long ago (although some are still relevant).

If there was a core to core bulk messaging system, it would be memory mapped as well, and you'd need to store the messages in some sort of memory, so why add a specialized message queue, when you can just use memory?

You can do things like add a message to a mailbox in memory, and then send an interrupt to the other processor to indicate it's ready.


> If there was a core to core bulk messaging system, it would be memory mapped as well, and you'd need to store the messages in some sort of memory, so why add a specialized message queue, when you can just use memory?

So one could code inter-thread communication without having to think about inducing large numbers of cache misses and degrading performance.


I think there are some embedded platforms that used message passing as a primary form of communication. However, those are really out there and not at all mainstream. I think what you're talking about is coming. For example, for ARM there is a network-on-chip architecture that I could see being used as a backbone for this type of system. However, you need software support for it. It's not enough to just build the hardware.


I think this is primarily about backwards compatibility. New CPUs, especially from Intel, were about running existing software faster. The shared memory paradigm goes back to single core (and mainframes) and is pervasive, in code, in languages, in libraries. It's really hard to rewrite all the software on this planet to use some channel mechanism between cores and that also makes it tricky when the OS needs to schedule threads. I think there were some historical examples of this sort of architecture (Transputers?) and generally they didn't fare that well...

If you made a new CPU where there's no shared memory between cores I'm not quite sure who would use it. It's also not that clear how much simpler it makes things on the chip and for the developers.


I think you have to choose between easy and fast in this case. A separate dedicated memory controller that all of the cores are connected to that handles all of the cache would make this problem a lot easier, but would defeat the purpose of the L1 cache.


The question is, why are we dealing with cross-core cache invalidation at all. When core A writes to memory location X, then core B reads from memory location X, why do we insist that core B reads what core A wrote? Almost by definition, the only reason to make such a requirement is that we want to use main memory as a mechanism to communicate between cores.

An alternative architecture would be to say that different cores should not be using the same region of memory. If they attempt to do so, it is an error and exactly what happens is undefined. We only need to worry about cache invalidation when we transfer ownership of a region of memory between cores.

If we had a non-memory method of cross-core communication, we would need to pass memory between cores far less often, which would greatly simplify the problem of memory cache invalidation.


> When core A writes to memory location X, then core B reads from memory location X, why do we insist that core B reads what core A wrote?

Because that's what the Intel programmer's model of memory specifies. This is not necessarily true on ARM: https://community.arm.com/developer/ip-products/processors/b...

> different cores should not be using the same region of memory

Generally known as "NUMA"; this is a viable programmer's model, but it's different from what people are used to, and requires either software changes or a lot of performance-impacting compatibility layers when software accesses pre-existing global variables.

(You could certainly get NUMA multiprocessing systems back in the day - to make effective use you had to pin processes to cores, because the cost of migrating memory about was considerable otherwise)


> the only reason to make such a requirement is that we want to use main memory as a mechanism to communicate between cores.

For a very liberal definition of "communicate" that statement is true. The only reason for multiple threads to be sharing an address space is if they mutually depend on how the others are interacting with the data in there.

The reason for cross-core cache invalidation is that otherwise it becomes impossible to reason about behavior of any data structures written and read by different threads. Imagine trying to build multiple consumer queue without any cache coherence protocols.


> When core A writes to memory location X, then core B reads from memory location X, why do we insist that core B reads what core A wrote?

Well, on many platforms we don't necessarily require this to be true immediately.


I'd argue the classic memory model is much easier to model in a developer's mind than the alternative.


The Cell Broadband Engine in PlayStation 3 had this kind of message-passing communication between the main PPC core and the SPE cores.


Is this not exactly what AMD's "Infinity Fabric" does?


The current system is a bit hacky, but we just have no languages that crystallize the use case clearly enough for this to happen at the hardware level.

Were we all programming in Erlang, Rust and Go, maybe we'd see hardware follow.


I've been implementing simple caching for an enterprise system at many banks over the last couple of years. For us it has been very, very simple. The system has the following characteristics:

- It's a big system with distributed parts. At any given time any piece of data might in fact be different in different parts of the system (before any change is properly distributed to the entire system).

- Often a single piece of data is accessed a lot in very short succession.

Since the system is kind of already built to handle the fact that the data is only eventually consistent, caching for performance gains has been very simple for us. We just implement a simple cache with a short lifetime. That's it.

Those are two characteristics that I try to look out for. Also, when I build something new, I try to build the system to be able to live in a world where data is not necessarily always the latest (as long as there is some kind of mechanism for getting notified that the data is out of sync when trying to change it, it seems to work).
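
The "simple cache with a short lifetime" above might look something like this minimal sketch (the TTL value is illustrative):

    import time

    # Entries expire after a short TTL; staleness inside that window is
    # accepted, since the surrounding system is only eventually consistent.
    class TTLCache:
        def __init__(self, ttl_seconds=5.0):
            self.ttl = ttl_seconds
            self.entries = {}         # key -> (expires_at, value)

        def get_or_load(self, key, load):
            entry = self.entries.get(key)
            if entry is not None and entry[0] > time.monotonic():
                return entry[1]
            value = load(key)
            self.entries[key] = (time.monotonic() + self.ttl, value)
            return value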


Cache layers are just (‘just’) distributed databases with narrow requirements. CAP theorem applies, etc. It shouldn’t be a surprise that it’s hard, but due to the aforementioned narrow application space it might be easy to disassociate caches from distributed systems.


Relevant, or at least entertaining: http://thecodelesscode.com/case/220?lang=pt


Cache invalidation isn’t hard, it’s just where hard things show up for easy problems. Cache invalidation in a nutshell:

For some given memory constraint, give me the same result as the underlying computation, but with lookup performance, for known inputs.

That’s not hard because of memory constraints (excepting very specialized use cases), that’s hard because the things that benefit from caching usually have implicit inputs. You’re trying to cache “database query” where “database” has other stuff mutating state. You’re not caching the call, you’re caching the side effects it depends on.

This use case is trivially solved if you plan for it. Use the algorithms from a history-based system. Caching the state of a given object in git is easy: its cache key is its hash. Either that part of the tree changed or it didn't.

The key is to remove implicit dependencies from your cache path. Everything that you cache states its dependencies; anything that changes in them reports back.

You can get more specific than that for finer grained control but we’re already well past invalidation being the hard problem. It’s just dependency granularity.
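
A sketch of the content-addressed version of that idea, where a cache entry is keyed by a hash of everything it declares it depends on, so a changed dependency simply stops matching (illustrative, not git's actual object model):

    import hashlib, json

    result_cache = {}

    def dependency_key(name, dependencies):
        # Hash the computation's name plus its declared dependencies; any
        # change in a dependency yields a new key, so stale entries never hit.
        blob = json.dumps([name, dependencies], sort_keys=True, default=str)
        return hashlib.sha256(blob.encode()).hexdigest()

    def cached(name, dependencies, compute):
        key = dependency_key(name, dependencies)
        if key not in result_cache:
            result_cache[key] = compute()
        return result_cache[key]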


Which is great until your backdraft cache invalidation signals start clogging things up because they're happening too often.

Caching isn't always something you can pre-plan for; you're not going to always have a clean, fully encapsulated space like a git database. Quite often the need for caching only comes about as a result of profiling. I've seen plenty of crazy cache systems that turned out to be completely unnecessary and incurred much technical debt in my days because we guessed wrong about the performance profile and expected usage graph at the start.

Sometimes only a subsection of the queries need to be cached, which is important because you may be facing constraints where a full cache would saturate your storage space.

Like in the game of go, it looks simple but the devil's in the details. Caching in the real world is hard, and because of that cache invalidation is also hard.


> Which is great until your backdraft cache invalidation signals start clogging things up because they're happening too often.

There are plenty of simple solutions. Batching, throttling, scheduling, partitioning. Any or several could help.

> you're not going to always have a clean, fully encapsulated space like a git database

This was kinda my point: the problem isn’t cache invalidation, it’s external concurrent state.

This, and the rest of your response, comes down to the hard problems being design, analysis, and ability to adapt. Those problems show up in cache complexity because it reveals where those other considerations are suboptimal.


If I had a dollar for every caching bug I've encountered, I'd be a rich man.


Or maybe, you'd have the same dollar a million times... you never know...


If I actually have the dollar multiple times, we're in nobel prize territory.

If there's one physical dollar that I own multiple times, that basically just means I keep depositing it at the bank and keep being paid with the same dollar. My net worth is still a million, so I'm happy with that.


Ah, the old saying:

* There are only two things that are hard in computer science: cache invalidation, naming things, and off-by-one errors.


my approach to caching

1. cache expensive computation on read.

2. invalidate cache entry on update/delete.

for CRUD-like applications this is trivial, since you have separate interfaces for read (GET) and update (POST).
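
In sketch form (with the actual load/write calls passed in, since those depend on the application):

    cache = {}

    # 1. cache the expensive computation on read
    def get_item(item_id, load):
        if item_id not in cache:
            cache[item_id] = load(item_id)    # the expensive DB call / computation
        return cache[item_id]

    # 2. invalidate the entry on update/delete
    def update_item(item_id, fields, write):
        write(item_id, fields)                # persist the change first...
        cache.pop(item_id, None)              # ...then drop the stale entry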


There are many problems that may arise, although they may not be too important for your use case.

- On write, how do you know which entries need to be invalidated? Imagine data that is computed by multi-way joins. Any input that has changed may require invalidating all affected cache entries.

- Your cache is rarely strongly consistent with your database. So your update and invalidation will happen at slightly different times which can cause issues.


Thank you for the great questions.

Re 1: I cannot give you a universal answer, but what I usually use is a multi-tiered cache:

1. a global cache [using the filesystem, or an in-memory db]

2. a cache with scope limited to the current user session [using the webserver's session mechanism]

Add more tiers if your app's architecture needs it (AZ level, for example). You can also use different cache eviction strategies for each cache level (manual eviction/lifetime eviction/LRU/etc.).

Once you have this, it becomes easier to figure out which entries at which levels need to be invalidated. The hard thing becomes maintaining this logic, but it is a good trade-off since you get a great performance boost and scalability.
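
A sketch of that tiered lookup (two illustrative tiers, checked in order and backfilled on a miss):

    # Check the narrowest tier first (per-session), then the shared global
    # tier; a miss falls through to compute() and backfills both.
    def tiered_get(key, session_cache, global_cache, compute):
        if key in session_cache:
            return session_cache[key]
        if key in global_cache:
            session_cache[key] = global_cache[key]
            return session_cache[key]
        value = compute(key)
        global_cache[key] = value
        session_cache[key] = value
        return value

    def tiered_invalidate(key, session_cache, global_cache):
        # Other sessions' copies are left to their own eviction policy
        # (lifetime/LRU/etc.), as described above.
        session_cache.pop(key, None)
        global_cache.pop(key, None)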

Re 2: an UPDATE/DELETE to the database usually takes a couple of microseconds on most RDBMSs when using an index lookup, and cache invalidation takes about the same time. In my use cases, having this 4-5 ms window when a cache entry is outdated is pretty good.



