
Don’t use lru_cache. It stores pointers to the values, so if a mutable value is changed downstream it gets changed in the cache.
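
For example (a minimal sketch; the function and values are made up):

    from functools import lru_cache

    @lru_cache(maxsize=None)
    def get_config(name):
        # Pretend this is expensive; it returns a mutable dict.
        return {"name": name, "retries": 3}

    cfg = get_config("db")
    cfg["retries"] = 99       # mutate the returned value downstream

    print(get_config("db"))   # {'name': 'db', 'retries': 99} -- the cached value "changed" too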

That’s not proper caching.



This is how I would expect it to work. Caching mutable collections is a legitimate use case, distinct from caching the immutable values those collections contain.

`lru_cache` does not wrap your cached values in container types on its own, so if you’re getting bit by mutable values it’s probably because what you cached was a mutable container (like `list`, `dict`, or `set`). If you use primitive values in the cache, or immutable containers (like `tuple`, `namedtuple`, or `frozenset`), you won’t hit this problem.

If you were to manually cache values yourself using a `dict`, you’d see similar issues with mutability when storing `list` objects as values. It’s not a problem specific to `lru_cache`.
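
A hand-rolled sketch to illustrate (the names are made up):

    _cache = {}

    def get_rows(key):
        # Naive manual memoization with a plain dict.
        if key not in _cache:
            _cache[key] = [1, 2, 3]   # stand-in for an expensive computation
        return _cache[key]

    rows = get_rows("a")
    rows.append(4)                    # in-place change to the cached list

    print(get_rows("a"))              # [1, 2, 3, 4]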


Can you show me a caching framework/library written in something other than Python that exhibits this behavior?

Edit: Can you show me a caching backend other than local memory (Redis, for example) that implements this behavior?


This isn’t special behavior that lru_cache “implements”; it’s intrinsic to how references to mutable objects work.

Imagine you `lru_cache` a zero-argument function that sleeps for a long time, then creates a new temp file with a random name and returns a writable handle for it. It should be no surprise that a cached version would have to behave differently: Only one random file would ever be made, and repeated calls to the cached function would return the same handle that was opened originally.
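
Roughly this, minus the sleep (a sketch, not an exact recipe):

    import tempfile
    from functools import lru_cache

    @lru_cache(maxsize=None)
    def open_scratch_file():
        # Imagine a long sleep here, then a fresh temp file with a random name.
        return tempfile.NamedTemporaryFile(mode="w+", delete=False)

    f1 = open_scratch_file()
    f2 = open_scratch_file()
    print(f1 is f2)     # True: both calls hand back the same handle
    f1.write("hello")   # f1 and f2 are literally the same object, so writes and seeks are shared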

If you didn’t expect this, you might complain that writes and seeks suddenly and mysteriously persist across what should be distinct handles, and maybe this `lru_cache` thing isn’t as harmless as advertised. But it’s not the cache’s fault that you cached an object that can mutate state outside the cache, like on a filesystem.

Logically, the behavior remains the same if instead of actual file handles we simulate files in memory, or if we use a mutable list of strings representing lines of data. Generalizing, you could cache any mutable Python collection and see that in-place changes you make to it, much like data written to a file, will still be there when you read the cache the next time.

The reason you don’t see “frameworks” for this is that tracking references to live Python objects from outside the Python process is pointless — objects are garbage-collected and aren’t guaranteed to stay at the same memory location from one moment to the next. Further, if the lists themselves are small enough to fit in memory, there’s no need for an out-of-process cache just to hold references to them.


Stepping back, I think part of your surprise at `lru_cache` stems from familiarity with out-of-core or distributed caches, where the immutability of cached values is simply imposed by the architecture. In a distributed cache, modifying the cached value in-place means making another API call, so you can’t mutate the cached object accidentally because you mistook it for a copy.

The only way you can have this confusion is if the actual cached object could be somehow returned. That only happens when everything in the cache is actually just a Python object in your same process.


> Don’t use lru_cache

I disagree; it works wonders for expensive functions that return strings and ints. Maybe "don't use lru_cache with mutable values" would be more accurate.

Thanks for noting that edge case though, I've never thought about that.


This behavior might be justifiable - if it were included in the docs.

If you want to reduce resource consumption, you could zip the results, or deduplicate the values using a separate dict.

There’s nothing unPythonic about a function returning mutable values.

A common use for caching is wrapping a call to a remote service. If the response is JSON, it’s likely to be mutable.

Nothing weird or edgy about that.


Kinda the same as using mutable values (lists and dicts maybe being the most common pitfalls) as default arguments to functions. They're shared between all function calls.
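
The classic version of that pitfall, for anyone who hasn't hit it:

    def append_item(item, bucket=[]):   # the default list is created once, at definition time
        bucket.append(item)
        return bucket

    print(append_item(1))   # [1]
    print(append_item(2))   # [1, 2] -- same list shared across calls

    def append_item_fixed(item, bucket=None):
        if bucket is None:
            bucket = []     # fresh list per call
        bucket.append(item)
        return bucket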


Not to be mean, but you can't expect caching to fix a fundamental design issue in your code.


Imagine if you created a ticket for a dev to create a cache. The backend is Redis, not local memory.

The dev creates a PR.

The cache is a decorator. It always returns a CachedValue object. The CachedValue object has an API for reading and writing the returned value.

If you write to the CachedValue object it will update Redis with the new, altered value.

It updates Redis not only for the value stored under that key, but for all identical values as well.

This behavior was not included in the ticket. It is not mentioned in the documentation.

It only occurs when the value is mutable.

What do you say to the developer? Job well done? Or what the f*ck is wrong with you?

I’m a nice person so I’d say something nice. I’d ask what they were thinking.

But if their reason was “this is how caches work” I’d take a serious look at their resume and ask whether they might need some mentoring.


This seems like it would have been an easy fix on their end.

Was the issue memory and storage?


It’d be annoying for every type used in a cache to need to properly implement __deepcopy__(), and it’d impose a significant performance cost, especially if the cached objects are large (and there’s a good chance they are, given you’ve felt the need to cache them rather than rebuild them from scratch).

Much better to stick with the same assignment semantics used throughout Python and let people choose to deepcopy() the objects they read from the cache if they really really want to modify them later; it would even work as a simple decorator to stack onto the existing one for such cases (sketched below).
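
Something along these lines (a rough sketch; `deepcopy_result` is a made-up name):

    import functools
    from copy import deepcopy

    def deepcopy_result(func):
        # Hand callers a private copy of whatever the wrapped (cached) function returns.
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            return deepcopy(func(*args, **kwargs))
        return wrapper

    @deepcopy_result
    @functools.lru_cache(maxsize=None)
    def load_settings(name):
        return {"name": name, "flags": ["a", "b"]}   # mutable, but callers can safely modify their copy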

Aside: If anything in Python should be deep-copied by default, it’s definitely default parameters :) Deep copying caches makes much less sense than that.


For anyone wondering what exactly the assignment semantics are, see: https://nedbatchelder.com/text/names1.html and https://www.youtube.com/watch?v=_AEJHKGk9ns


> An important fact about assignment: assignment never copies data.

Is that really what's going on here? I'm in way too deep to be sure what's best for beginner programmers, but I feel like Python must surely optimise...

    sheep = 1
    goats = sheep
    sheep = sheep + 10
... by simply copying that 1 into goats, rather than tracking that goats is for now an alias to the same value as sheep and then updating that information when sheep changes on the next line.

Now, if we imagine those numbers are way bigger (say 200 digits), Python still just works, whereas the low-level languages I spend most time with will object because 200-digit integers don't fit in a machine word. You could imagine that copying isn't cheaper in that case, but I don't think I buy it. A 200-digit integer is a relatively small data structure; it's still probably cheaper to copy it than to mess about with small objects which need garbage collecting.


The semantics of assignments in Python are not the same as assignment in C. When you assign a local like `x = some_expression` in Python, you can read it as, “Evaluate `some_expression` now, and call that result `x` in this local namespace.”

The behavior that results from your example follows from this rule. First, evaluate `1` and call it `sheep`. Then evaluate whatever `sheep` is, once, to get `1` (the same object in memory as every other literal `1` in Python) and call it `goats`.

The last line is where the rule matters: The statement `sheep = sheep + 10` can be read as, “Evaluate `sheep + 10` and call the result `sheep`.” The statement reassigns the name `sheep` in the local namespace to point to a different object, one created by evaluating `sheep + 10`. The actual memory location that `sheep` referred to previously (containing the `int` object `1`) is not changed at all — assignment to a local will never change the value of any other local.

This is easy to remember if you recall that a local namespace is effectively just a `dict`. Your example is equivalent to:

    d = {}
    d["sheep"] = 1
    d["goats"] = d["sheep"]
    d["sheep"] = d["sheep"] + 10
It should be clear even to beginners that `d["goats"]` has a final value of `1`, not `11`, because the right-hand side of `d["goats"] = d["sheep"]` is only evaluated once, and at that time it evaluates to `1`. Assignment using locals behaves in exactly the same way.


>Is that really what's going on here? I'm in way too deep to be sure what's best for beginner programmers, but I feel like Python must surely optimise...

For these particular numbers, CPython has one optimization I know of: Small integers (from -5 to 256) are pre-initialized and shared.

See https://docs.python.org/3/c-api/long.html#c.PyLong_FromLong

This is generally invisible to the python programmer, but leads to https://wsvincent.com/python-wat-integer-cache/
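
A quick demo at the CPython REPL (an implementation detail, not a language guarantee; running the same lines in a script can give different results because the compiler shares constants within a module):

    >>> x = 256
    >>> y = 256
    >>> x is y
    True          # 256 comes from the shared small-int cache
    >>> x = 257
    >>> y = 257
    >>> x is y
    False         # 257 is outside the cached range, so each statement builds a new object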


Thanks! That's a wild ride.

On my system these "cached" integers seem to each be 32-byte objects, so over 8kB of RAM is used by CPython to "cache" the integers -5 through 256 in this way.

There's similar craziness over in String town. If I mint the exact same string a dozen times from a constant, those all have the same id, presumably Python has a similar "cache" of such constant strings. But if I assemble the same result string with concatenation, each of the identical strings has a different id (this is in 3.9)

So, my model of what's going on in a Python program was completely wrong. But the simple pedagogic model in the article was also wrong, just not in a way that's going to trip up new Python programmers.


>presumably Python has a similar "cache" of such constant strings.

Not really. You're hitting constant folding: https://arpitbhayani.me/blogs/constant-folding-python

This isn't a pre-made list of certain strings that should be cached; it's the compiler noticing that you mentioned the same constant a bunch of times.

Also in general you would see a lot of things with the same id because python uses references all over the place. E.g. assignment never copies.
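
For example (CPython-specific behavior):

    a = "hello"
    b = "hel" + "lo"    # folded to the constant "hello" at compile time
    print(a is b)       # True: both names end up pointing at the same constant object

    prefix = "hel"
    c = prefix + "lo"   # concatenation happens at runtime
    print(a is c)       # typically False, even though...
    print(a == c)       # ...they compare equal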


> by simply copying that 1 into goats

I don't think it does that. You can try it yourself:

    >>> x = 123456
    >>> y = x
    >>> id(x)
    139871830996560
    >>> id(y)
    139871830996560


Might be interesting as an option for the @lru_cache decorator to be able to specify a function to call before handing a cached value to the user. Then you could just do

    @lru_cache(post=deepcopy, ...)
to have a new (copied) instance in cases where that's required. Or do whatever else you needed. Maybe you happen to know some details about the returnee that would let you get away with copying less and could call my_copy instead.

Something something power of function composition.
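
You can get most of the way there today by composing it yourself (a sketch; `cached_with_post` is a made-up name, not part of functools):

    import functools
    from copy import deepcopy

    def cached_with_post(post, maxsize=128):
        # lru_cache plus a hook applied to every value handed back to the caller.
        def decorator(func):
            cached = functools.lru_cache(maxsize=maxsize)(func)
            @functools.wraps(func)
            def wrapper(*args, **kwargs):
                return post(cached(*args, **kwargs))
            wrapper.cache_clear = cached.cache_clear   # keep the usual cache controls reachable
            return wrapper
        return decorator

    @cached_with_post(post=deepcopy)
    def build_report(day):
        return {"day": day, "rows": []}   # callers get a private copy each time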



