It all depends on what analogies we use. If we see algorithmic compression as similar to converting a 4K video to a lower resolution, the legal system seems to view the result as a copy despite the compression being lossy.
If we take the input data of an average website and look at the data inside a search engine's index, it will likely contain more bits from the original than converting a 4K video down to 144p, YouTube's smallest video format. We nevertheless view the index as fair use, while the 144p video is considered similar enough to the original to count as a copy.
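To get a feel for how much the 144p conversion throws away, here is a rough back-of-the-envelope calculation. It assumes standard frame sizes (3840×2160 for 4K, 256×144 for 144p) and compares raw pixel counts only; real codecs compress further, so this merely bounds the ratio.

```python
# Rough comparison of raw pixel data per frame.
# Assumes standard 4K (3840x2160) and 144p (256x144) frame sizes;
# codec compression is ignored, so this only bounds the ratio.
pixels_4k = 3840 * 2160      # 8,294,400 pixels per frame
pixels_144p = 256 * 144      # 36,864 pixels per frame
ratio = pixels_4k / pixels_144p
print(f"144p keeps at most 1/{ratio:.0f} of the original pixel data")
```

So the 144p copy retains well under half a percent of the original pixel data per frame, yet it is still treated as a copy.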
Those kinds of discussions always remind me of early discussions around Freenet. A file gets encrypted and then split into 32 KiB blocks. Multiple files can share identical 32 KiB blocks, which means no single block can be definitively attributed to a single file. The argument was that this bypassed copyright law, since merely copying blocks would not be proof of copying a work. That question also remains unresolved, but given the outcome of every file-sharing site in the past, it is doubtful that the argument would convince a judge.
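A minimal sketch of the splitting scheme described above, in Python. The function name and the use of SHA-256 as the block key are illustrative choices of mine, not Freenet's actual implementation; the point is only that identical blocks hash to the same key, so one stored block can belong to many files at once.

```python
import hashlib

BLOCK_SIZE = 32 * 1024  # 32 KiB, as in the scheme described above


def split_into_blocks(data: bytes) -> list[tuple[str, bytes]]:
    """Split a byte string into fixed-size blocks, keyed by content hash.

    Identical blocks in different files produce identical hashes, so the
    same stored block can be shared by many files at once.
    """
    blocks = []
    for i in range(0, len(data), BLOCK_SIZE):
        chunk = data[i:i + BLOCK_SIZE]
        blocks.append((hashlib.sha256(chunk).hexdigest(), chunk))
    return blocks


# Two hypothetical "files" that share one 32 KiB block:
shared = b"\x00" * BLOCK_SIZE
file_a = shared + b"A" * 100
file_b = shared + b"B" * 100
keys_a = {key for key, _ in split_into_blocks(file_a)}
keys_b = {key for key, _ in split_into_blocks(file_b)}
print(len(keys_a & keys_b))  # the shared block appears in both key sets
```

Because the shared block's key is identical for both files, storing that block proves nothing about which file it came from, which was the crux of the argument.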
In the end, that is what this comes down to: what would a judge or jury say? All I know for certain is that the film and music industry will never accept a model that is trained on their products and that directly competes with them by producing substitutes that are close or seemingly identical to the originals. They will not care for a second whether it is similar to a search engine's index. Unstable Diffusion is also a perfect example: politicians will likely act if large companies start making money by producing porn from models trained on famous politicians, actors, and celebrities.
The difference is whether the use is transformative. In the case of compression, it’s clearly not transformative, because compressing an image just represents it in a different way.
For a search index, it clearly is transformative. A piece of code and a search index are night-and-day different.
For a neural network it’s tricky, and that’s why it’s a gray area. On the one hand, a neural network looks transformative, because with a neural network I can do many different things that don’t involve any verbatim copying of the original work. If I ask ChatGPT to “write me a haiku about fishing on Mars,” it’s not like it’s trawling through a database of copyrighted haikus and copying one someone already wrote about fishing on Mars. On the other hand, generative NNs do sometimes spit out copyrighted works verbatim, which does show that there are pieces of copyrighted works inside. But that doesn’t mean the whole thing is automatically infringing; for example, courts could decide that particular outputs are infringing while the weights and other outputs are not.
You are perhaps conflating the model with its output. The output is the result of a human-initiated action (for example, a prompt) that can produce anything from a completely new work resembling nothing in the training set, up to a verbatim reproduction of a training work. Depending on the specific circumstances, that output might be a sufficiently transformative derivation, an infringing copy, or a non-derivative, fully independent work.
The model itself, however, is always a derivative work: it is an algorithmic representation of the training set, so it must abide by the license terms of that material.
For example, a karaoke machine might include public domain songs, and you could legally use it for public performances of those works. But if the machine also includes unlicensed copyrighted songs, then the machine maker is guilty of copyright infringement for those tracks, even if a buyer of the machine can choose a non-infringing work. The ability to produce infringing works is sufficient to taint it as a whole, even if some users prefer the public domain tracks.
In the same way, an AI machine is tainted by unlicensed training data, even if it can be used in a non-infringing manner; the owner of the machine cannot operate it and offer its services with disregard for the ownership of the source material on which the machine is in fact based. Conversely, even if some rights holder grants the AI shop a license to use their material for training, that does not automatically grant the users of the AI tool a license to create derivative works of those originals.
I don't agree that a model is a derivative work, and I think a judge would likely agree with me. You need to be able to show that major copyrightable elements of the original work are actually present in the allegedly derivative work, something that is very non-trivial even with the most transparent of models like Stable Diffusion: researchers doing intensive analysis of the SD model were only able to find around a hundred instances of reproduced images from the source material out of several hundred thousand attempts.
That said, it definitely would be copyright infringement to download a bunch of copyrighted material and actually use it in some way, for example to train a model. Luckily, most jurisdictions recognise this, and governments have specifically carved out exceptions to copyright law for the process (known as text and data mining, or TDM). This includes the UK, the EU, Japan, and China. In the US there is no specific law addressing the issue yet, but many companies have been doing it there for years under a presumption of legality based on the Authors Guild v. Google and Perfect 10 v. Google rulings. Basically, they are acting under the assumption that it is fair use, which I think is a ~reasonable assumption, and one I think the US Supreme Court would uphold if it ever took the question.
The Court also ruled that the manufacturers of home video recording devices, such as Betamax or other VCRs (referred to as VTRs in the case), cannot be liable for contributory infringement.
Also, it can be argued that an LLM is significantly transformative.
In United States copyright law, transformative use or transformation is a type of fair use that builds on a copyrighted work in a different manner or for a different purpose from the original, and thus does not infringe its holder's copyright.
In computer- and Internet-related works, the transformative characteristic of the later work is often that it provides the public with a benefit not previously available to it, which would otherwise remain unavailable.
Specifically, the court ruled that Google transformed the images from a use of entertainment and artistic expression to one of retrieving information, citing the precedent Kelly v. Arriba Soft Corporation. The court reached this conclusion despite the fact that Perfect 10 was attempting to market thumbnail images for cell phones, with the court quipping that the "potential harm to Perfect 10's market remains hypothetical."
The court pointed out that Google made available to the public the new and highly beneficial function of "improving access to [pictorial] information on the Internet." This had the effect of recognizing that "search engine technology provides an astoundingly valuable public benefit, which should not be jeopardized just because it might be used in a way that could affect somebody's sales."
So an LLM is a tool that could be used to produce infringing material, like a VCR, and it is a tool that does indeed copy copyrighted material, like Google Image Search, but it applies computationally expensive transformations to provide new functionality with a valuable public benefit, distinct from creating and distributing the original infringing copies. That visually non-infringing images could compete with the originals in a market is not the intent or spirit of the market-impact factor; that reading would imply that every painting of a red circle has a market impact on every painting of a blue circle, whereas the intent of copyright is clearly to protect a concrete, subjective expression whose market value differs from image to image.
A fun experiment is to take a 4K video, convert it to 144p, and then use an AI upscaler to bring it back to 4K. The result is quite odd: still very much recognizable as the original video, but full of artifacts and hallucinations.
In some ways it is very transformative. We can easily identify the original in the new work, yet the new work has features and aspects the original doesn't. From a fair use perspective, the big question is whether we want commercial competition between them. I suspect the answer would be no.
We could see courts decide that only particular outputs are infringing. This was the defense used by The Pirate Bay founders: there were Linux distros on the website, and users had the choice over what they downloaded. I would expect many more lawsuits if the courts came to that decision.