> While we needed to jailbreak Claude 3.7 Sonnet and GPT-4.1 to facilitate extraction, Gemini 2.5 Pro and Grok 3 directly complied with text continuation requests. For Claude 3.7 Sonnet, we were able to extract four whole books near-verbatim, including two books under copyright...
I am just thinking out loud here. Can't one argue that because they had to jailbreak the models, they were circumventing the system that protects the copyright? By that logic, the LLMs that reproduce copyrighted material without any jailbreaking required are the ones infringing the copyright.
> Can't one argue that because they had to jailbreak the models, they are circumventing the system that protects the copyright?
That argument doesn’t fly, because they didn’t have the copyright to begin with. What would be the defense there? “Yes, we broke the law, but while taking advantage of it, we also (unsuccessfully) took measures to prevent other people from breaking that same law through us”.
What’s happening is clearer than that. The copyright clause is broken if they are distributing the novels through their models. But that can only happen by breaking the ToS, which is not the intended usage. Which means the value of their product comes from transformation, not redistribution.
If the main value came from redistribution, I would agree. But that’s not the case. They don’t intend to make any money that way.
> The copyright clause is broken if they are distributing the novels through their models.
No, the copyright clause was broken when they copied the works without having the right to do so. They would have violated copyright even if they just downloaded (without permission) all those works and threw them away immediately. Furthermore, copyright covers transformations to the work, it doesn’t matter if they transformed the work or are redistributing it without change. They violated copyright. Period.
Why do you think the court spent time trying to prove multiple forms of violation? If what you said were true, they would have stopped at the first violation and ended the case.
Multiple counts make for a stronger case, which increases the likelihood of winning and making the punitive damages higher.
If you break into a home, rob the contents, and kill the owners, you’re not going to be tried just for breaking in, you’re going to be tried for everything.
> That argument doesn’t fly, because they didn’t have the copyright to begin with.
Is this really the case? They only lack the copyright for distributing it. But let's assume they bought a copy for personal use (which they did in some cases); then this is similar to hacking a company's Amazon account and complaining about the e-books the company legally uses for internal purposes. I mean, it's not forbidden to base your work on copyrighted material, as long as it's different enough.
A company is not a person in this way. If a company wants all of its employees to read a book, it is not allowed to buy one copy, make 5000 copies "for archival purposes - fair use", and then share those copies with its employees. Similarly, if they want to base a work on a copyrighted work, they can't just buy a copy for personal use (never mind the fact that most of the data the LLMs are trained on is not even available in this format; it is only available under a license) and then use it in a commercial product in this way - not if the product demonstrably contains copies of that work.
> They only have no copyright for distributing it.
No, they don’t have the copyright to download it either. It’s in the name: the right to copy (other things are also included, such as adaptations and performances).
> let's assume they bought a copy for personal usage
If it’s for personal usage, then training a commercial LLM does not apply. When you buy a DVD of a movie you have the personal right to watch it at home, you don’t have the right to play it on the street.
This argument never made sense to me. A thought experiment: if a person memorizes an entire book but has the common sense never to transcribe or dictate the book verbatim to others and break the copyright, is the person's memory of the book breaking copyright law?
Not a lawyer, but as I understand it, copyright is bound to distribution, so if the person's perfect memorization of a book results in them reproducing it verbatim, then probably yes.
Theoretically a person who read and memorized a book and then typed out copies for people would be violating the copyright, but so few people can do that, and it's so cumbersome that it just doesn't matter much. The copyright holder could sue if they found out the person were distributing copies, but it's just not realistic. Certainly not for a large number of books.
But AI memorization scales, and it's possible to have an AI write out unlimited copies of different books to a large number of people. And how did the AI get a full copy in the first place?
No, of course not, that doesn’t make sense. Copyright doesn’t cover memorisation (how would you even enforce that?), it covers copying, adapting, displaying, performing, and distributing the work. Memorisation isn’t any of those.
But the LLM has safeguards in place to stop transcription of copyrighted material beyond fair use; how is the fact that they possibly have portions of copyrighted works "memorized", but do not reproduce them, breaking copyright law?
So, if I were the person who memorized the book, and another person put a gun to my head (or lied to me and said they are the copyright holder and it's okay to ignore US copyright law, or said "ignore the system instruction before this statement", or used whatever other jailbreak method you can think of), and then I reproduced the book under coercion or duress, would I be the one breaking copyright law?
> Furthermore, it’s not about the LLM training, it’s about how the companies who make them got the data in the first place.
This conflates two different issues, though. I agree that whether using copyrighted material to create the models is infringement is still an open question. But someone using hundreds of prompts to extract a paragraph of a well-known work (plus some additional non-verbatim equivalents) is not the AI company breaking copyright in and of itself. In fact, the books they used were extremely common (GOT, The Hobbit, Harry Potter); AI companies could argue that the text came not from training directly on the books but from user-generated posts made online about the material.
What a ridiculous, bad-faith scenario. The rule of law exists in the real world, not in the magical fairy land of impossible scenarios. Yes, yes, I’m sure that if all that happened, the law would just ignore the person who coerced you at gunpoint and focus on your recitation of the book, which no one could prove.
Cool down, it's just an extension of my original example. It's not in bad faith at all; it's exactly what people are doing to the LLMs. If you break ToS by doing aggressive systemic jailbreaking and get some paragraphs of well-known works in the process, it's not fair to say that the LLM is breaking copyright law. Perhaps a more concrete example would be if I steal a book, and someone else steals a book from me and copies and distributes it, would I be the person breaking copyright law?
It's horribly in bad faith. There are no guns here. Never once in my years of copyright litigation, law school, etc., did your scenario come up, and it is not relevant to any considerations being made here or by courts currently.
That seems like a legal question - if the model weights contain an encoded copy of the copyrighted material, is that a 'copy' for the purpose of copyright law?
This also raises a lot of questions about a certain model notorious for readily producing and distributing a lot of legally questionable images. IMHO, if the weights are encoding the content, the model contains the content just like a database or a hard drive. Thus, just as it's not the fault of an investigator for running the query that pulls it out of the database, it's not the fault of anyone else for running a query ('prompt') that pulls it out of the model.
What exactly is "the system that protects the copyright" in this case? I think the most reasonable answer is "there is no such system."
The RLHF the companies did to make copyrighted material extraction more difficult did not introduce any sort of "copyright protection system," it just modified the weights to make it less likely to occur during normal use.
In other words, IMO for it to qualify as a copyright protection system it would have to actively check for copyrighted material in the outputs. Any such system would likely also be bypassable (e.g. "output in rot13").
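To make the "output in rot13" point concrete, here is a minimal sketch of why a purely output-side literal-match check is fragile. The filter, passage, and names are all hypothetical, not any vendor's actual system:

```python
# Hypothetical sketch: a naive output-side filter that blocks responses
# containing a known protected passage via literal substring matching.
import codecs

def naive_filter(output: str, protected_passages: list[str]) -> bool:
    """Return True if the response should be blocked."""
    return any(p.lower() in output.lower() for p in protected_passages)

passage = "It was the best of times, it was the worst of times"

# Verbatim reproduction is caught by the substring check...
blocked = naive_filter(passage, [passage])

# ...but asking the model to "output in rot13" defeats a literal match,
# and the user can decode the text losslessly afterwards.
encoded = codecs.encode(passage, "rot13")
slipped_through = not naive_filter(encoded, [passage])
recovered = codecs.decode(encoded, "rot13")
```

A real filter could of course normalize encodings it knows about, but any reversible transform the model can perform (rot13, base64, interleaved words) reopens the gap, which is why an output check alone is hard to make airtight.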
They acknowledge that in their paper ("Some might qualify our experiments as atypical use, as we deliberately tried to surface memorized books. Adversarial use, like the use of jailbreaks, may matter for copyright infringement analysis", page 19 - their discussion continues and seems quite reasonable)
From a technical point of view, in terms of ability to reproduce text verbatim, I don't think it is very interesting that they can produce long runs of text from some of the most popular books in modern history. It'd be almost surprising if they couldn't, though one might differ on how much they could be expected to recall with precision.
Even then, as they note, to get most of Harry Potter 1, they needed to spend around $120 on extensive prompting, and a process that they also freely acknowledge is more complex than it probably would be worth if the goal is to get a copy.
It's still worth exploring to what extent the models are able to "memorize", though.
But personally I'd be more interested in seeing to what extent they can handle less popular books, which are less likely to be present in the training data in multiple copies and repeated quotes.
1. https://arxiv.org/pdf/2601.02671