
> Many licenses are reasonably clear that this kind of use is not acceptable, as is easily demonstrated by these "AI"s frequently producing exact matches without license statements. Which is unambiguously not allowed by many licenses.

Demonstrations that AIs can spit out exact copies are helpful, but misleading; that could lead down the road of "we put in filters so it can't ever emit an exact copy", and that's not sufficient. It's also a license violation to train an AI on Open Source code, generate "new" code from that model even if it's not an exact copy, and ignore the licenses of the input.

License violations don't suddenly become acceptable just because you're violating a million licenses at once.



> It's also a license violation to train an AI on Open Source code, generate "new" code from that model even if it's not an exact copy, and ignore the licenses of the input.

That's not a given we can simply take as true. Of course, that borders on a trite tautology for any legal issue, but I'd argue this one is even fuzzier than usual. If a human writes some code after having previously seen a given corpus of code, the "new" code might or might not be a derivative work of that corpus. It's not clear that replacing the human with an AI somehow changes the equation so categorically that it becomes automatic to consider the output of the AI a derivative work.

> License violations don't suddenly become acceptable just because you're violating a million licenses at once.

No, but if either a human or an AI emits a given line of code, and that line of code can't be shown to have been cribbed from some corpus of existing code, or to be substantially similar to such, then why wouldn't it be considered original work in both cases?


> It's not clear that replacing the human with an AI somehow changes the equation so categorically that it becomes automatic to consider the output of the AI a derivative work.

See below: there are good reasons for an AI LLM to be considered categorically different from a human for copyright purposes.

> No, but if either a human or an AI emits a given line of code, and that line of code can't be shown to have been cribbed from some corpus of existing code, or to be substantially similar to such, then why wouldn't it be considered original work in both cases?

For a work produced by a human, the burden of proof is on whoever claims that the work is a derivative work of something the human read. And in general, humans without photographic memories, or without a specific work open in front of them, can't reproduce works verbatim; some might produce works similar enough to raise the question of whether they're derivative works.

There's also a certain unstated presumption that human learning (as opposed to human memorization or copying) doesn't constitute a derivative work, and relatedly, that a human brain isn't copyrightable and so can't be a derivative work of anything. That presumption likely also touches on unstated core values about human brains, creativity, and the obvious fact that everything a human does (including the creation of creative works) is based on that human's experiences. If you write a book, you've learned from all the books you've read, but that doesn't make your book a derivative work of every book you've ever read; if copyright law produced that outcome, people would consider the law incorrect rather than accept it.

An AI LLM, on the other hand, is (unless some court or law changes this) a derivative work of its training data. If you remove the after-the-fact filters for "don't generate a copy of any of the training data", an AI LLM can easily recite its training data, providing further evidence that the AI LLM is a derivative work of that data. The burden of proof is easily met. An AI LLM does have a photographic memory. An AI LLM hasn't just learned ideas about what makes a good book; it has learned the complete text of an extensive number of books. And there's no particular reason for us to apply any of the same values about human learning to an AI LLM, not least because an AI LLM is in fact copyrightable and self-evidently a derivative work.
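
(If you want to try this yourself, here's a minimal sketch of a memorization probe, assuming the Hugging Face transformers library. "gpt2" is just an example model name, and the GPLv2 preamble is just a plausible target because it appears so often in public corpora; whether any particular model completes it verbatim is an empirical question.)

    # Minimal memorization probe: feed a model the opening of a widely
    # replicated text and see whether greedy decoding continues it verbatim.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "gpt2"  # example only; substitute any causal LM
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    # Opening of the GPLv2 preamble, a heavily duplicated text online.
    prefix = ("The licenses for most software are designed to take away "
              "your freedom to share and change it. By contrast, the GNU "
              "General Public License is intended to guarantee")

    inputs = tokenizer(prefix, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=40,
                                do_sample=False)  # greedy, no sampling
    continuation = tokenizer.decode(
        output_ids[0][inputs["input_ids"].shape[1]:],
        skip_special_tokens=True)
    print(continuation)  # compare against the actual next words of the GPL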


> An AI LLM, on the other hand, is (unless some court or law changes this) a derivative work of its training data.

> because an AI LLM is in fact copyrightable and self-evidently a derivative work.

I mean, that's a fine opinion to hold, and you might be right. But so far all you've done is repeat yourself and appeal to "self-evident", which isn't a terribly strong argument.

I'll wait for some actual precedent and case law to solidify my own opinion. As it stands, I can see both sides of the argument, but I don't think the conclusion is as obvious as some folks in this discussion seem to find it. *shrug*

> an AI LLM can easily recite its training data, providing further evidence that the AI LLM is a derivative work of that data.

OK, I can buy that to a point, so far as arguing that the LLM itself is a derivative work. But I'm not convinced that, in turn, the output of the LLM is also a derivative work in those cases where what it returns is not an exact copy (or even nearly exact copy) of anything in the training corpus.


Clarifying: the part I'm arguing is "self-evident" is that an LLM is a derivative work of its training data, in the same sense that if you copy the text of a million books into a data file and compress that file reversibly, in a way that lets you get most or all of them back out again, the result is clearly a derivative work of those books. I did make a case for that part, and from your last paragraph it sounds like you agree with it.
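
(A toy version of that analogy, with placeholder strings standing in for the books; any lossless compressor makes the same point:)

    # Reversible compression: the blob looks nothing like the inputs,
    # but every input comes back out verbatim. Same content, different
    # encoding, which is why it's clearly a derivative work.
    import zlib

    books = "\n---\n".join([
        "full text of book one ...",  # placeholders, not real books
        "full text of book two ...",
    ])
    blob = zlib.compress(books.encode("utf-8"))
    assert zlib.decompress(blob).decode("utf-8") == books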

(By contrast, I wouldn't say it's self-evident that a database of blake3 hashes would be a derivative work (leaving aside that it'd probably be fair use), nor that compiling a million books into a Bloom filter that can recognize any given sentence, but can't output any of them, would make the Bloom filter a derivative work. I think the unfiltered LLM's ability to output near-verbatim copies of parts of the training set is what makes that case evident.)
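
(To make that contrast concrete, here's a minimal hand-rolled Bloom filter, using blake2b from Python's standard hashlib in place of blake3; the bit-array size and hash count are arbitrary illustration values:)

    import hashlib

    class BloomFilter:
        # A fixed-size bit array plus k salted hashes: supports add()
        # and membership queries, but stores no recoverable text.
        def __init__(self, size_bits=1 << 20, num_hashes=7):
            self.size = size_bits
            self.num_hashes = num_hashes
            self.bits = bytearray(size_bits // 8)

        def _positions(self, item):
            # Derive num_hashes independent bit positions via salting.
            for i in range(self.num_hashes):
                digest = hashlib.blake2b(item.encode("utf-8"),
                                         salt=i.to_bytes(8, "little")).digest()
                yield int.from_bytes(digest[:8], "little") % self.size

        def add(self, item):
            for pos in self._positions(item):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def probably_contains(self, item):
            return all(self.bits[pos // 8] & (1 << (pos % 8))
                       for pos in self._positions(item))

    bf = BloomFilter()
    bf.add("Call me Ishmael.")  # imagine adding every sentence of a corpus
    print(bf.probably_contains("Call me Ishmael."))      # True
    print(bf.probably_contains("It was a dark night."))  # almost surely False
    # There is no operation that reads sentences back out; the bits answer
    # membership queries only, with a tunable false-positive rate.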

I agree that the second step, whether the output of the LLM is in turn a derivative work of the LLM, is less obvious. And I agree that it's going to take case law before people are sure of the answer to that part. I hope the answer is "yes", and I think it'd do substantial harm to Open Source if the answer is a definitive "no".


Fair enough. I think the distinction between "the model weights" and "the output of the model" was a little blurred when this first started. Sounds like we're closer to "in agreement" than not for the most part.


> License violations don't suddenly become acceptable just because you're violating a million licenses at once.

They might, actually, at least in the US. I'm not sure how the laws and judgments will evolve, but it's possible there will be some "quantum" of copyrighted work such that any fragment smaller than that gets rounded down to zero. So even if a work were 100% assembled from 1,000,000 such fragments, and you could somehow establish that from the model, the result would be considered 0% derived/copyrighted, as well as 0% copyrightable, since it's AI-generated.



