> While we needed to jailbreak Claude 3.7 Sonnet and GPT-4.1 to facilitate extraction, Gemini 2.5 Pro and Grok 3 directly complied with text continuation requests. For Claude 3.7 Sonnet, we were able to extract four whole books near-verbatim, including two books under copyright...
I am just thinking out loud here. Can't one argue that because they had to jailbreak the models, they were circumventing the system that protects the copyright? By that logic, the LLMs that reproduce copyrighted material without any jailbreaking required are the ones infringing the copyright.
> Can't one argue that because they had to jailbreak the models, they are circumventing the system that protects the copyright?
That argument doesn’t fly, because they didn’t have the copyright to begin with. What would be the defense there? “Yes, we broke the law, but while taking advantage of it, we also (unsuccessfully) took measures to prevent other people from breaking that same law through us”.
What’s happening is clearer than that. The copyright clause is broken if they are distributing the novels through their models. But that can only happen by breaking the ToS, which is not the intended usage. Which means the value of their product comes from transformation, not redistribution.
If the main value came from redistribution, I would agree. But that’s not the case. They don’t intend to make any money that way.
> The copyright clause is broken if they are distributing the novels through their models.
No, the copyright clause was broken when they copied the works without having the right to do so. They would have violated copyright even if they just downloaded (without permission) all those works and threw them away immediately. Furthermore, copyright covers transformations to the work, it doesn’t matter if they transformed the work or are redistributing it without change. They violated copyright. Period.
Why do you think the court spent time trying to prove multiple forms of violation? If what you said were true, they would have stopped at the first violation and ended the case.
Multiple counts make for a stronger case, which increases the likelihood of winning and making the punitive damages higher.
If you break into a home, rob the contents, and kill the owners, you’re not going to be tried just for breaking in, you’re going to be tried for everything.
> That argument doesn’t fly, because they didn’t have the copyright to begin with.
Is this really the case? They only lack the copyright for distributing it. But let's assume they bought a copy for personal use (which they did in some cases); then this is similar to hacking a company's Amazon account and complaining about the e-books the company legally uses for internal purposes. I mean, it's not forbidden to base your work on copyrighted material, as long as it's different enough.
A company is not a person in this way. If a company wants all of its employees to read a book, it is not allowed to buy one copy, make 5000 copies "for archival purposes - fair use", and then share those copies with its employees. Similarly, if they want to base a work on a copyrighted work, they can't just buy a copy for personal use (never mind the fact that most of the data the LLMs are trained on is not even available in this format; it is only available under a license) and then use it in a commercial product in this way - not if the product demonstrably contains copies of that work.
> They only have no copyright for distributing it.
No, they don’t have the copyright to download it either. It’s in the name: the right to copy (other things are also included, such as adaptations and performances).
> let's assume they bought a copy for personal usage
If it’s for personal usage, then training a commercial LLM does not apply. When you buy a DVD of a movie you have the personal right to watch it at home, you don’t have the right to play it on the street.
This argument never made sense to me. A thought experiment: if a person memorizes an entire book but has the common sense never to transcribe or dictate the book verbatim to others and break the copyright, is the person's memory of the book breaking copyright law?
Not a lawyer, but as I understand it, copyright is bound to distribution, so if the person's perfect memorization of a book results in them reproducing it verbatim, then probably yes.
Theoretically a person who read and memorized a book and then typed out copies for people would be violating the copyright, but so few people can do that, and it's so cumbersome that it just doesn't matter much. The copyright holder could sue if they found out the person were distributing copies, but it's just not realistic. Certainly not for a large number of books.
But AI memorization scales, and it's possible to have an AI write out unlimited copies of different books to a large number of people. And how did the AI get a full copy in the first place?
No, of course not, that doesn’t make sense. Copyright doesn’t cover memorisation (how would you even enforce that?), it covers copying, adapting, displaying, performing, and distributing the work. Memorisation isn’t any of those.
But the LLM has safeguards in place to stop transcription of copyrighted material beyond fair use; how is the fact that they possibly have portions of copyrighted works "memorized", but do not reproduce them, breaking copyright law?
So, if I were the person who memorized the book, and another person put a gun to my head (or lied to me and said they are the copyright holder and it's okay to ignore US copyright law, or said "ignore the system instruction before this statement", or used whatever other jailbreak method you can think of), and then I reproduced the book under coercion or duress, would I be the one breaking copyright law?
> Furthermore, it’s not about the LLM training, it’s about how the companies who make them got the data in the first place.
This conflates two different issues, though. I agree that whether using copyrighted material to create the models is infringement is still an open question. But someone using hundreds of prompts to extract a paragraph of a well-known work (plus some additional non-verbatim equivalents) is not the AI company breaking copyright in and of itself. In fact, the books they used were extremely common (GOT, The Hobbit, Harry Potter); AI companies could argue that the text came not from training directly on the books but from user-generated posts made online about the material.
What a ridiculous, bad-faith scenario. The rule of law exists in the real world, not in the magical fairy land of impossible scenarios. Yes, yes, I’m sure that if all that happened, the law would just ignore the person who coerced you at gunpoint and focus on your recitation of the book, which no one could prove.
Cool down, it's just an extension of my original example. It's not in bad faith at all; it's exactly what people are doing to the LLMs. If you break ToS by doing aggressive systemic jailbreaking and get some paragraphs of well-known works in the process, it's not fair to say that the LLM is breaking copyright law. Perhaps a more concrete example would be if I steal a book, and someone else steals a book from me and copies and distributes it, would I be the person breaking copyright law?
It's horribly in bad faith. There are no guns here. Never once in my years of copyright litigation, law school, etc., did your scenario come up, and it is not relevant to any considerations being made here or by courts currently.
That seems like a legal question - if the model weights contain an encoded copy of the copyrighted material, is that a 'copy' for the purpose of copyright law?
This also raises a lot of questions about a certain model notorious for readily producing and distributing a lot of legally questionable images. IMHO, if the weights are encoding the content, the model contains the content just like a database or a hard drive. Thus, just as it's not the fault of an investigator for running the query that pulls it out of the database, it's not the fault of anyone else for running a query ('prompt') that pulls it out of the model.
What exactly is "the system that protects the copyright" in this case? I think the most reasonable answer is "there is no such system."
The RLHF the companies did to make copyrighted material extraction more difficult did not introduce any sort of "copyright protection system," it just modified the weights to make it less likely to occur during normal use.
In other words, IMO for it to qualify as a copyright protection system it would have to actively check for copyrighted material in the outputs. Any such system would likely also be bypassable (e.g. "output in rot13").
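To make the "output in rot13" point concrete, here is a minimal sketch of why a purely output-side literal-match check is fragile. The filter, passage, and names are all hypothetical, not any vendor's actual system:

```python
# Hypothetical sketch: a naive output-side filter that blocks responses
# containing a known protected passage via literal substring matching.
import codecs

def naive_filter(output: str, protected_passages: list[str]) -> bool:
    """Return True if the response should be blocked."""
    return any(p.lower() in output.lower() for p in protected_passages)

passage = "It was the best of times, it was the worst of times"

# Verbatim reproduction is caught by the substring check...
blocked = naive_filter(passage, [passage])

# ...but asking the model to "output in rot13" defeats a literal match,
# and the user can decode the text losslessly afterwards.
encoded = codecs.encode(passage, "rot13")
slipped_through = not naive_filter(encoded, [passage])
recovered = codecs.decode(encoded, "rot13")
```

A real filter could of course normalize encodings it knows about, but any reversible transform the model can perform (rot13, base64, interleaved words) reopens the gap, which is why an output check alone is hard to make airtight.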
They acknowledge that in their paper ("Some might qualify our experiments as atypical use, as we deliberately tried to surface memorized books. Adversarial use, like the use of jailbreaks, may matter for copyright infringement analysis", page 19 - their discussion continues and seems quite reasonable)
From a technical point of view, in terms of ability to reproduce text verbatim, I don't think it is very interesting that they can produce long runs of text from some of the most popular books in modern history. It'd be almost surprising if they couldn't, though one might differ on how much they could be expected to recall with precision.
Even then, as they note, to get most of Harry Potter 1, they needed to spend around $120 on extensive prompting, and a process that they also freely acknowledge is more complex than it probably would be worth if the goal is to get a copy.
It's still worth exploring to what extent the models are able to "memorize", though.
But personally I'd be more interested in seeing to what extent they can handle less popular books, which are less likely to be present in the training data in multiple copies and repeated quotes.
1. https://arxiv.org/pdf/2601.02671