
I actually once tracked this claim down in the case of stable diffusion.

I concluded that it was just completely impossible for a properly trained stable diffusion model to reproduce the works it was trained on.

The SD model easily fits on a typical USB stick, and comfortably in the memory of a modern consumer GPU.

The training corpus for SD is a pretty large chunk of image data on the internet. That absolutely does not fit in GPU memory - by several orders of magnitude.

No form of compression known to man would be able to get it that small. People smarter than me say it's mathematically not even possible.
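
To put rough numbers on that, here's a quick Python sketch. Both figures are assumptions rather than exact values (an SD 1.x checkpoint of about 4 GB, a LAION-scale training set of a couple of billion images):

  # How many bytes per training image could an SD checkpoint possibly hold?
  # Both numbers below are rough assumptions, not exact figures.
  checkpoint_bytes = 4 * 1024**3      # ~4 GB for an SD 1.x checkpoint
  training_images = 2_000_000_000     # LAION-scale: a couple of billion images

  bytes_per_image = checkpoint_bytes / training_images
  print(f"{bytes_per_image:.1f} bytes per training image")   # ~2.1 bytes
  # versus tens of kilobytes for even a heavily compressed JPEG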

Now for closed models, you might be able to argue something else is going on and they're sneakily not training neural nets or something. But the open models we can inspect? Definitely not.

Modern ML/AI models are doing Something Else. We can argue what that Something Else is, but it's not (normally) holding copies of all the things used to train them.



I think this argument starts to break down for the (gigantic) GPTs where the model size is a lot closer to the size of the training corpus.

Thinking in terms of compression, the compression in generative AI models is lossy, and the hard mathematical bounds only apply to lossless compression. Keeping in mind that a small fraction of the training corpus is presented to the training algorithm multiple times, it's not absurd to suggest that those works exist inside the model in a recallable form. Hence the NYT's lawyers being able to write prompts that recall large chunks of NYT articles verbatim.
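
As a sketch of what "recallable" could mean in practice: prompt a model with the opening of a known text and see how much of the true continuation comes back. Everything below is a placeholder; a small model like gpt2 is just a stand-in and will almost certainly not reproduce anything.

  # Crude memorization probe: prompt with a known prefix, compare the greedy
  # continuation to the true continuation. Model name and texts are placeholders.
  from transformers import AutoModelForCausalLM, AutoTokenizer

  model_name = "gpt2"   # stand-in; the argument is about much larger models
  tok = AutoTokenizer.from_pretrained(model_name)
  model = AutoModelForCausalLM.from_pretrained(model_name)

  prefix = "The opening sentences of some article the model may have seen..."
  true_continuation = "...the actual next sentences of that article."

  inputs = tok(prefix, return_tensors="pt")
  out = model.generate(**inputs, max_new_tokens=60, do_sample=False)
  generated = tok.decode(out[0][inputs["input_ids"].shape[1]:])

  # Rough token-overlap score as a memorization signal (verbatim match is the
  # interesting case, but overlap is enough for a first pass).
  gen_tokens, true_tokens = set(generated.split()), set(true_continuation.split())
  print(f"overlap: {len(gen_tokens & true_tokens) / max(len(true_tokens), 1):.0%}")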


Well, at least up to GPT-3 that would seem a little odd. Models of somewhat similar capability are not THAT big, really. E.g.:

  $ ollama list                            
  NAME                    ID              SIZE    MODIFIED     
  yi:34b                  ff94bc7c1b7a    19 GB   7 days ago  
  mistral:latest          61e88e884507    4.1 GB  2 months ago
  mixtral:8x22b           bf88270436ed    79 GB   2 months ago
  llama3:70b              be39eb53a197    39 GB   2 months ago
  phi3:latest             a2c89ceaed85    2.3 GB  2 months ago
  dolphin-mistral:latest  5dc8c5a2be65    4.1 GB  2 months ago
  yarn-mistral:7b-128k    6511b83c33d5    4.1 GB  2 months ago
  yarn-mistral:latest     8e9c368a0ae4    4.1 GB  2 months ago
  llama3:latest           a6990ed6be41    4.7 GB  2 months ago

For comparison, here are some Stable Diffusion checkpoints.

  ComfyUI/models/checkpoints $ du -h *
  6.5G    breakdomainxl_v03d.safetensors
  6.5G    dreamshaperXL10_alpha2Xl10.safetensors 
  6.5G    sd_xl_base_1.0.safetensors
  5.7G    sd_xl_refiner_1.0.safetensors
  ...

And I seem to recall there are some theoretical lower bounds on even lossy compression. Some quick back-of-the-envelope Fermi estimation gets me a hard lower bound of about 5 TB for "all the images on the internet", but I'm not confident enough in my math to back that up right here and now.
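
For what it's worth, the shape of the arithmetic I have in mind looks like the sketch below. Every number is an assumption you can swap out, and with these ones it lands in the same single-digit-TB range:

  # Fermi sketch: a floor on storing "all the images on the internet",
  # even very lossily. Every number here is a rough assumption.
  num_images = 1e12        # order-of-magnitude guess at distinct images online
  bits_per_image = 64      # an aggressively lossy budget, a few pixels' worth

  total_bytes = num_images * bits_per_image / 8
  print(f"{total_bytes / 1e12:.0f} TB")   # => 8 TB with these assumptions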


> And I seem to recall there are some theoretical lower bounds on even lossy compression.

I'm not sure where your math is coming from, and it seems trivially wrong. A single black pixel is a very lossy compression of every image on the internet. A picture of the Facebook logo is a slightly less lossy compression of every picture on the internet (the Facebook logo shows up on a lot of websites). I would only believe you can get a bound on lossy compression of a given quality (whatever "quality" means) if you assume there is some balance of the images in the compressed representation. There are a lot of assumptions there, and we know for a fact that the text fed to the GPTs during training was presented in an unbalanced way.
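
To make the point concrete: without fixing an acceptable quality level, the "rate" of a lossy code can be pushed to essentially zero. A toy sketch of the black-pixel decoder and its (enormous) distortion:

  # Toy "compressor" that stores nothing and always decodes to an all-black
  # image: rate ~0 bits, distortion enormous. Lower bounds on lossy compression
  # only bite once you fix an acceptable distortion level.
  import numpy as np

  rng = np.random.default_rng(0)
  original = rng.integers(0, 256, size=(256, 256, 3))   # stand-in for any image
  reconstruction = np.zeros_like(original)              # "a single black pixel"

  mse = np.mean((original - reconstruction) ** 2)
  print(f"rate: ~0 bits, MSE: {mse:.0f}")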

In fact, if you look at the paper "textbooks are all you need" (https://arxiv.org/pdf/2306.11644) you can see that presenting a very limited set of information to an LLM gets a decent result. The remaining 6 trillion tokens in the training set are sort of icing on the cake.


Ok, that's a really low lower bound.

I think you'll agree that it would be a bit absurd to threaten legal action against someone for storing a single black pixel.

OTOH, someone might be tempted to start a lawsuit if they believe their image is somehow actually stored in a particular data file.

For this to be a viable class action lawsuit to pursue, I think you'd have to subscribe to the belief that it's a form of compression where if you store n images, you're also able to get n images back. Else very few people would have actual standing to sue.


I think that when you speak in terms of images, for a viable lawsuit, you need to have a form of compression that can recall n (n >= 1) images from compressing m (m >= n) images. Presumably n is very large for LLMs or image models, even though m is orders of magnitude larger. I do not think that your form of compression needs to be able to get all m images back. By forcing m = n in your argument, you are forcing some idea of uniformity of treatment in the compression, which we know is not the case.

The black pixel won't get you sued, but the Facebook logo example I used could get you sued. Specifically by Facebook. There is an image (n = 1) that is substantially similar to the output of your compression algorithm.

That is sort of what Getty's lawsuit alleges. Not that every picture is recallable from an LLM, but that several images that are substantially similar to Getty's images are recallable. The same goes with the NYT's lawsuit and OpenAI.


Thank you for talking with me!

I do realize the benefits of the 'compression' model of ML. Sometimes you can even use compression directly, like here: https://arxiv.org/abs/cs/0312044 .
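
In that spirit, here's a minimal normalized-compression-distance sketch with an off-the-shelf compressor. It's not the paper's exact setup, just the flavour of using a compressor directly as a similarity measure:

  # Normalized compression distance with an off-the-shelf compressor,
  # in the spirit of compression-based clustering (not the paper's exact setup).
  import zlib

  def C(data: bytes) -> int:
      return len(zlib.compress(data, 9))

  def ncd(x: bytes, y: bytes) -> float:
      cx, cy, cxy = C(x), C(y), C(x + y)
      return (cxy - min(cx, cy)) / max(cx, cy)

  s1 = b"the cat sat on the mat " * 20
  s2 = b"the cat sat on the hat " * 20
  s3 = b"completely unrelated bytes, more or less " * 20
  print(ncd(s1, s2), ncd(s1, s3))   # the similar pair should score lower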

I suppose you're right that you only need a few substantially similar outputs to potentially get sued (depending on who's scrutinizing you).

While talking with you, it occurred to me that so far we've ignored the output set o, which is the set of all images output by, say, stable diffusion. Treating m and o as sets of images, n can then be defined as n = m ∩ o.

And we know m is much larger than n, and o is, for practical purposes, unbounded [1] (you can generate as many unique images as you like), so o >> m >> n. [2]
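
A toy version of that set picture, just to pin down the notation (the sets and sizes here are obviously made up):

  # Toy stand-ins for the sets in the argument above; sizes are made up.
  m = {f"train_{i}" for i in range(1000)}                          # training images
  o = {f"gen_{i}" for i in range(10**6)} | {"train_3", "train_7"}  # model outputs
  n = m & o                                                        # memorized overlap

  assert len(o) > len(m) > len(n)    # o >> m >> n with these toy numbers
  print(len(o), len(m), len(n))      # 1000002 1000 2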

Really, already at this point I think calling SD a compression algorithm is a little odd. It doesn't look like the goal is compression at all, especially when the authors treat n like a bug ('overfitting') and keep trying to shrink it.

That's before even looking at the "compression ratio" and "loss ratio" of this algorithm, so maybe I can save myself some maths. It's an interesting approach to the argument that I might try more in future. (Thank you for helping me think in this direction.)

* I think in the case of the Getty lawsuit they might have a bit of a point, if the model might have been overfitted on some of their images. Though I wonder if in some cases the model merely added Getty watermarks to novel images. I'm pretty sure that will have had something to do with setting Getty off.

* I am deeply suspicious of the NYT case. There's a large chunk of examples where they used ChatGPT to browse their own website. This makes me wonder if the rest of the examples are only slightly more subtle. IIRC I couldn't replicate them trivially. (YMMV, we can revisit if you're really interested)

[1] Though in practice the number of distinct outputs is limited by floating-point precision.

[2] I'm using >> as "much greater than"



