It all depends on what analogies we use. If we see algorithmic compression as similar to converting a 4K video to a lower resolution, the legal system seems to view the result as a copy despite the compression being lossy.
If we take the input data of an average website and look at the data inside a search engine's index, it will likely contain more bits from the original than converting a 4K video down to 144p, YouTube's smallest video format. We nevertheless view the index as fair use, while the 144p video is considered similar enough to the original to count as a copy.
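To get a feel for how much the 144p conversion throws away, here is a rough back-of-the-envelope calculation. It assumes standard frame sizes (3840×2160 for 4K, 256×144 for 144p) and compares raw pixel counts only; real codecs compress further, so this merely bounds the ratio.

```python
# Rough comparison of raw pixel data per frame.
# Assumes standard 4K (3840x2160) and 144p (256x144) frame sizes;
# codec compression is ignored, so this only bounds the ratio.
pixels_4k = 3840 * 2160      # 8,294,400 pixels per frame
pixels_144p = 256 * 144      # 36,864 pixels per frame
ratio = pixels_4k / pixels_144p
print(f"144p keeps at most 1/{ratio:.0f} of the original pixel data")
```

So the 144p copy retains well under half a percent of the original pixel data per frame, yet it is still treated as a copy.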
Those kinds of discussions always remind me of early discussions around Freenet. A file gets encrypted and then split into 32 KiB blocks. Multiple files can share identical 32 KiB blocks, which means no single block can be definitively attributed to a single file. The argument was that this bypassed copyright law, since merely copying blocks would not be proof of copying a work. That question also remains unresolved, but given the outcome of every file-sharing site in the past, it is doubtful that the argument would convince a judge.
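A minimal sketch of the splitting scheme described above, in Python. The function name and the use of SHA-256 as the block key are illustrative choices of mine, not Freenet's actual implementation; the point is only that identical blocks hash to the same key, so one stored block can belong to many files at once.

```python
import hashlib

BLOCK_SIZE = 32 * 1024  # 32 KiB, as in the scheme described above


def split_into_blocks(data: bytes) -> list[tuple[str, bytes]]:
    """Split a byte string into fixed-size blocks, keyed by content hash.

    Identical blocks in different files produce identical hashes, so the
    same stored block can be shared by many files at once.
    """
    blocks = []
    for i in range(0, len(data), BLOCK_SIZE):
        chunk = data[i:i + BLOCK_SIZE]
        blocks.append((hashlib.sha256(chunk).hexdigest(), chunk))
    return blocks


# Two hypothetical "files" that share one 32 KiB block:
shared = b"\x00" * BLOCK_SIZE
file_a = shared + b"A" * 100
file_b = shared + b"B" * 100
keys_a = {key for key, _ in split_into_blocks(file_a)}
keys_b = {key for key, _ in split_into_blocks(file_b)}
print(len(keys_a & keys_b))  # the shared block appears in both key sets
```

Because the shared block's key is identical for both files, storing that block proves nothing about which file it came from, which was the crux of the argument.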
In the end, that is what this comes down to: what would a judge or jury say? All I know for certain is that the film and music industry will never accept a model that is trained on their products and that directly competes with them by producing substitutes that are close or seemingly identical to the originals. They will not care for a second whether it is similar to a search engine's index. Unstable Diffusion is also a perfect example: politicians will likely act if large companies start making money by producing porn from models trained on famous politicians, actors, and celebrities.
The difference is whether the use is transformative. In the case of compression, it’s clearly not transformative, because compressing an image just represents it in a different way.
For a search index, it clearly is transformative. A piece of code and a search index are night-and-day different.
For a neural network it’s tricky, and that’s why it’s a gray area. On the one hand, a neural network looks transformative, because with a neural network I can do many different things that don’t involve any verbatim copying of the original work. If I ask ChatGPT to “write me a haiku about fishing on Mars,” it’s not like it’s trawling through a database of copyrighted haikus and copying one someone already wrote about fishing on Mars. On the other hand, generative NNs do sometimes spit out copyrighted works verbatim, which does show that there are pieces of copyrighted works inside. But that doesn’t mean the whole thing is automatically infringing; for example, courts could decide that particular outputs are infringing while the weights and other outputs are not.
You are perhaps conflating the model with its output. The output is the result of a human-initiated action (for example, a prompt) that can produce anything from a completely new work resembling nothing in the training set, up to a verbatim reproduction of a training work. Depending on the specific circumstances, that output might be a sufficiently transformative derivation, an infringing copy, or a non-derivative, fully independent work.
The model itself, however, is always a derivative work: it is an algorithmic representation of the training set, so it must abide by the license terms of that material.
For example, a karaoke machine might include public domain songs, and you could legally use it for public performances of those works. But if the machine also includes unlicensed copyrighted songs, then the machine maker is guilty of copyright infringement for those tracks, even if a buyer of the machine can choose a non-infringing work. The ability to produce infringing works is sufficient to taint it as a whole, even if some users prefer the public domain tracks.
In the same way, an AI machine is tainted by unlicensed training data, even if it can be used in a non-infringing manner; the owner of the machine cannot operate it and offer its services with disregard for the ownership of the source material on which the machine is in fact based. Conversely, even if some rights holder grants the AI shop a license to use their material for training, that does not automatically grant the users of the AI tool a license to create derivative works of those originals.
I don't agree that a model is a derivative work, and I think a judge would likely agree with me. You need to be able to show that major copyrightable elements of the original work are actually present in the allegedly derivative work, something that is very non-trivial even with the most transparent of models like Stable Diffusion: researchers doing intensive analysis of the SD model were only able to find around a hundred instances of reproduced images from the source material out of several hundred thousand attempts.
That said, it definitely would be copyright infringement to download a bunch of copyrighted material and actually use it in some way, for example to train a model. Luckily, most jurisdictions recognise this, and governments have specifically carved out exceptions to copyright law for the process (known as text and data mining, or TDM). This includes the UK, the EU, Japan, and China. In the US there is no specific law addressing the issue yet, but many companies have been doing it there for years under a presumption of legality based on the Authors Guild v. Google and Perfect 10 v. Google rulings. Basically, they are acting under the assumption that it is fair use, which I think is a ~reasonable assumption, and one I think the US Supreme Court would uphold if it ever took the question.
The Court also ruled that the manufacturers of home video recording devices, such as Betamax or other VCRs (referred to as VTRs in the case), cannot be liable for contributory infringement.
Also, it can be argued that an LLM is significantly transformative.
In United States copyright law, transformative use or transformation is a type of fair use that builds on a copyrighted work in a different manner or for a different purpose from the original, and thus does not infringe its holder's copyright.
In computer- and Internet-related works, the transformative characteristic of the later work is often that it provides the public with a benefit not previously available to it, which would otherwise remain unavailable.
Specifically, the court ruled that Google transformed the images from a use of entertainment and artistic expression to one of retrieving information, citing the precedent Kelly v. Arriba Soft Corporation. The court reached this conclusion despite the fact that Perfect 10 was attempting to market thumbnail images for cell phones, with the court quipping that the "potential harm to Perfect 10's market remains hypothetical."
The court pointed out that Google made available to the public the new and highly beneficial function of "improving access to [pictorial] information on the Internet." This had the effect of recognizing that "search engine technology provides an astoundingly valuable public benefit, which should not be jeopardized just because it might be used in a way that could affect somebody's sales."
So an LLM is a tool that could be used to produce infringing material, like a VCR, and it is a tool that does indeed copy copyrighted material, like Google Image Search, but it applies computationally expensive transformations to provide new functionality with a valuable public benefit, distinct from creating and distributing the original infringing copies. That visually non-infringing images could compete with the originals in a market is not the intent or spirit of the market-impact factor; that reading would imply that every painting of a red circle has a market impact on every painting of a blue circle, whereas the intent of copyright is clearly to protect a concrete, subjective expression whose market value differs from image to image.
A fun experiment is to take a 4K video, convert it to 144p, and then use an AI upscaler to bring it back to 4K. The result is quite odd: still very much recognizable as the original video, but full of artifacts and hallucinations.
In some ways it is very transformative. We can easily identify the original in the new work, yet the new work has features and aspects the original doesn't. From a fair use perspective, the big question is whether we want commercial competition between them. I suspect the answer would be no.
We could see courts decide that only particular outputs are infringing. This was the defense used by The Pirate Bay founders: there were Linux distros on the website, and users had the choice over what they downloaded. I would expect many more lawsuits if the courts came to that decision.