The existing licenses cover AI training just fine, what we lack is sufficient legal precedent and enforcement. An AI product - more specifically, the model weights - is a derivative work of the original works used for training; AI training is a process of algorithmic compression of the originals.
Therefore, the resulting model should abide by all the license requirements imposed on the original - for example, if the model is trained on GPL code and can generate code, then any binary distribution should also be freely available for derivation in source format, and that includes all the algorithmically compressed training material (weights), which has become part of the model. If the source is AGPL, then that service cannot be made available on a website without disclosing said source and respective model weights.
Any other interpretation of the nature of copyright - which by definition, only covers human produced material - is just a variant of the proverbial "man that can't understand something because their paycheck depends upon them not understanding".
A copyright license allows you to grant others permission to use your work, in situations where copyright law has given you an exclusive right.
A license cannot give you more rights than you started with.
Does copyright law give you the exclusive right to train a neural network with your work? This question is unresolved, and at least a significant number of people think that it is fair use, drawing analogies with search engine indexing whereby Google is permitted to copy websites for the purpose of creating an index to be searched. Yes, even if the website contains GPL code.
The GPL is simply a list of conditions, that if followed, allow someone to legally use a work that they weren’t previously permitted to use. If they have another route to using the work legally then they need not follow the conditions in the GPL.
It all depend on what analogies we use. If we see the algorithmic compression to be similar to converting a 4k video to a lower resolution, the legal system seems to view it as a copy despite it being a lossy compression.
If we take the input data of a average website and look at the data inside a search engine indexing, it will likely contain more bits from the original than converting a 4k video down to a 144p, youtubes smallest video format. We do however view the index to be fair use while the 144p video to be similar enough to the original to be considered a copy.
Those kinds of discussion always reminds me of early discussions around freenet. A file get encrypted and then split into 32KiB files. Multiple files can share identical 32KiB blocks, which means no single block can be definitively owned by a single file. The argument was then that this bypassed copyright law, since just copying blocks would not be proof of copying. This question is also unresolved, but given the outcome of all file sharing sites in the past, it is doubtful that it would succeed in convincing a judge.
In the end that is what this is coming down to. What would a judge/jury say. All I know for certain is that the film and music industry will never accept an model that is trained on their products and that directly competes with their products by producing substitutes that are close or seemingly identical to the originals. They will not care a second if its similar to a search engine indexing. Unstable Diffusion is also a perfect example where politicians will likely do something if large companies start to generate money by producing porn that is trained on famous politicians, actors and celebrates.
The difference is whether the use is transformative. In the case of compression, it’s clearly not transformative, because compressing an image just represents it in a different way.
For a search index, it clearly is transformative. A piece of code and a search index are night and day different in every way.
For a neural network it’s tricky and that’s why it’s a gray area. On the one hand, a neural network looks transformative because with a neural network I can do many different things that don’t involve any verbatim copying of the original work. If I ask ChatGPT to “write me a haiku about fishing on Mars” it’s not like it’s trawling through a database of copyrighted haikus and copying one someone already wrote about fishing on Mars. On the other hand generative NNs do sometimes spit out copyrighted works verbatim, which does show that there are pieces of copyrighted works inside - but that doesn’t mean that the whole thing is automatically infringing, for example courts could decide that just particular outputs are infringing whereas the weights and other outputs are not.
You are perhaps conflating the model with its output. The output is the result of a human initiated action (for example, a prompt), that can result in anything ranging from a completely new work, not resembling any in the training set, up to a verbatim reproduction of a training work. Depending on the specific circumstances, that output might be a sufficiently transformative derivation, an infringing copy, or a non-derivative, fully independent work.
The model itself however is always a derivative work, it's an algorithmic representation of the training set, so it must abide by the license terms of that material.
For example, a karaoke machine might include public domain songs and you could use it legally for public performances of those works. But if the machine also includes unlicensed copyrighted songs, then the machine maker is guilty of copyright infringement for those tracks, even if a buyer of the machine can choose a non-infringing work. The ability to produce infringing works is sufficient to taint is as a whole, even if some user might not like those tracks and prefer the public domain tracks.
In the same way, an AI machine is tainted by unlicensed training data, even if it can be used in a non-infringing manner; the owner of the machine cannot operate it and offer its services with disregard to the ownership of the source material on which his machine is in fact based on. Conversely, even if some holder might grant the AI shop a license to use their material for training, that does not also automatically grant the users of the AI tool a license to create derivative works of those originals.
I don't agree that a model is a derivative work, and I think a judge would likely agree with me. I think you need to be able to show those major copyrightable elements of the original work are actually present in the allegedly derivative work, something that is very non-trivial with even the most transparent of models like Stable Diffusion - scientists doing intensive analysis of the SD model were only able to find around a hundred instances of reproduced images from the source material out of several hundred thousand attempts.
That said, it definitely would be copyright infringement to download a bunch of copyrighted material and actually use it in some way, for example to train a model. Luckily, in most jurisdictions it is recognised that this is the case and so governments have specifically carved out exceptions to copyright law for this process (known as text and data mining or TDM). This includes the UK, the EU, Japan, and China. In the US, there is no specific law addressing the issue yet, but many companies are doing it in the US (and have been doing it for many years) with the presumption of legality based on the Google v Author's Guild and Google v Perfect 10 rulings. Basically, they are acting under the assumption that it is fair use, which I think is a ~reasonable assumption and I think would be held up by the US Supreme Court if they wanted to take it.
The Court also ruled that the manufacturers of home video recording devices, such as Betamax or other VCRs (referred to as VTRs in the case), cannot be liable for contributory infringement.
Also, it can be argued that the LLM model is significantly transformative.
In United States copyright law, transformative use or transformation is a type of fair use that builds on a copyrighted work in a different manner or for a different purpose from the original, and thus does not infringe its holder's copyright.
In computer- and Internet-related works, the transformative characteristic of the later work is often that it provides the public with a benefit not previously available to it, which would otherwise remain unavailable.
Specifically, the court ruled that Google transformed the images from a use of entertainment and artistic expression to one of retrieving information, citing the precedent Kelly v. Arriba Soft Corporation. The court reached this conclusion despite the fact that Perfect 10 was attempting to market thumbnail images for cell phones, with the court quipping that the "potential harm to Perfect 10's market remains hypothetical."
The court pointed out that Google made available to the public the new and highly beneficial function of "improving access to [pictorial] information on the Internet." This had the effect of recognizing that "search engine technology provides an astoundingly valuable public benefit, which should not be jeopardized just because it might be used in a way that could affect somebody's sales."
So an LLM is a tool that could be used to produce infringing material, like a VCR, and an LLM is a tool that does indeed copy infringing material, like Google Image search, but applies computationally expensive transformations used to generate newly existing functionality that has a valuable public benefit distinct from creating and distributing the original infringing copies. That new, visually non-infringing images could compete with the originals in a market is not the intent nor spirit of the clause related to market impact as that would imply that all paintings of a red circle have a market impact on all paintings of a blue circle where clearly the intent of copyright is to protect a concrete and subjective expression whose market value differs from image to image.
A fun experiment is to take a 4k video and convert it to 144p, and then use an AI upscale back to 4k. The result is quite odd, but still very much recognizable of the original video, but with a lot of artifacts and hallucinations.
In some ways it is very transformative. We can easily identify the original from the new work, and the new work will have features and aspects which the original don't. From a fair use perspective the big question is if we want commercial competition between them. I suspect the answer would be no.
We could see courts decide that particular outputs are infringing. This was the defense used by the piratebay founders. There were Linux distros on the website, and users had the choice over what they downloaded. I would expect many more lawsuits if the courts came to that decision.
Since no court has ruled on the latter question yet, let’s do a choose your own adventure:
Say that OpenAI trained a language detection model. The output of the model is simply a single vector that indicates which language that the input was written in. During training they use your copyrighted code as training data.
Would you consider that to be infringement?
1) If yes: consider instead of using fancy and scary mathematics like “neural networks”, it is just tabulating and counting keywords. Is it still infringement?
1) If no: what’s different about a neural network that outputs code, vs a neural network that outputs a single vector? Perhaps it’s the output that is infringing, not the weights?
> consider instead of using fancy and scary mathematics like “neural networks”, it is just tabulating and counting keywords. Is it still infringement?
I've seen this analogy before. A NN, especially an LLM, is very different in method and outcome. If counting keywords allowed me to replicate copyrighted material, it would probably not be fair use. IANAL, but I imagine the fact that a NN can replicate or compete with the original work makes a difference.
I think neither I nor openai should have any exclusive rights and that would lead to the best outcome for humanity. the problem is sam altman is already lobbying to restrict access to ai (which maybe indicates that his lawyers think that they might not have exclusive rights)
There's a core question here: can the AI regurgitate your data? If it can, the AI can be said to be copying what it was trained on (ergo copyright is a factor). If it cannot, it's hard to see how copyright comes into it. How it's trained shouldn't come into it.
This is the problem, there is no easy analogy. If I’ve learned how to code from an open source course, I don’t breach copyright every time I write code in my career. If I clone the course, just replacing some of the examples and using synonyms then I probably have. ChatGPT is somewhere between these extremes and it’s unclear what the principle should be.
Perhaps it’s like learning to play guitar by watching other guitarists, then releasing songs in the same genre.
I'm not making an analogy. It's a literal question.
If I crank noise through a bunch of matrices and churn out the Mona Lisa because the Mona Lisa was in the training data, I've not painted the Mona Lisa. I've reproduced someone else's work through a fairly tortuous mathematical route.
The matrices don't have agency, they can't lay claim to anything.
Although I should probably say, for completeness: the matrices don't have agency yet.
That’s because you’re a human. Copyright and fair use law was built for people, and letting AI models have the same privileges will lead to worse outcomes. What’s so wrong about saying that it isn’t fair use for a machine to learn from your copyrighted data?
People say lots of things, especially when they see ways to exploit others so they can make money. Let this be a reminder that copyright exists to protect the rich, not the artists.
With your definition you can kiss spam filters, search engines, and recommendation engines goodbye.
Here is a very simple case. A spammer sends you text that has attached a license saying you may not use this text for any purpose other than to be read by a human. You flag this as spam and your spam filter updates its model weights using this text. Suddenly you are not allowed to do anything with your spam filter model weights other than read the weights yourself.
> A spammer sends you text that has attached a license
You have not chosen to receive that text, and therefore you are not bound by that license. It may mathematically seem to not make a difference, but legally it does.
That seems like a stretch. I mean if someone just put some source code in your mailbox you don't suddenly get the rights to use it however you want right?
> I mean if someone just put some source code in your mailbox you don't suddenly get the rights to use it however you want right?
You're applying way too much logic to a legal problem. If you ask a lawyer or judge about this, their first question will be "what intent was that source code mailed under, and were you the intended recipient?"
If someone mails you a bunch of source code by accident, and it's reasonably obvious to you that it was by accident (which it will frequently be, because who the f*ck mails source code around?), you may in fact be required to destroy it.
On the other hand if someone mails you the same code and you have reason to believe their intention was to spread it out no strings attached, yeah, you get the rights to use it however you want… except if the sender didn't have the right to do that to begin with…
P.S.: "no strings attached" is also something that is impossible in some jurisdictions, since what you're doing might be required to be a contract of some kind, and contracts require bidirectional considerations. But at this point you really need a lawyer to explain the actual situation…
P.P.S.: this is like that joke about writing on a brick "by accepting this brick through your window, you indemnify the thrower against all possible charges or damage resulting from this brick" and then chucking the brick through some storefront window.
If we ask a lawyer they will probably cite precedence such as Authors Guild vs Google (https://towardsdatascience.com/the-most-important-supreme-co...) and thus this entire hypothetical interpretation of the legal framework is already not how the legal institutions see it.
If the courts are to set new precedent I think it's important to consider all the downstream ramifications, and I think it's a lot more complex and challenging than a lot of people here seem to think. There is a lot more to AI than just generative neural networks. A lot of "boring" technology we all take for granted can be caught up in it.
> If we ask a lawyer they will probably cite precedence […] interpretation of the legal framework is already not how the legal institutions see it.
Well, now this is an entirely different discussion, and FYI "precedence" is only a thing in half of the world's legal systems. Specifically, the common law (English) half. The other half, civil law (French) based systems, have no concept of "precedence"; verdicts from other courts have absolutely no law-like meaning. For every case and every judge, they are supposed to find the correct, applicable meaning of the laws as written by the legislative.
Personally speaking, I find the "precedence" approach taken by English / Common Law incredibly silly — and actively harmful, it intermixes two branches of power (legislative & judicative) that should be 100% separated. Judges' interpretative rulings should not have (almost) the same effect as the legislative passing a law.
(Google "Common Law vs Civil Law" for more info.)
Anyway the original argument was that you could apply some license-like terms onto spam mails, and for that — no, you very much can't. The situation for AI is, to my knowledge, still very muddy at this point.
> I find the "precedence" approach taken by English / Common Law incredibly silly — and actively harmful, [...] Judges' interpretative rulings should not have (almost) the same effect as the legislative passing a law.
Precedents are visible. They're the outcome of prior cases. If the people or the legislature doesn't like the rulings they can look at the judges' reasoning and fix the law, invalidating the old precedents at the same time.
> Anyway the original argument was that you could apply some license-like terms onto spam mails, and for that — no, you very much can't.
By the act of giving you the email they're implicitly giving you permission to do email things with it - read it, forward it, store it, etc. But you don't own the copyright and can't create and publish derivative works, etc.
We never questioned the anti-spam use because it's obvious. You aren't storing data for the purpose of recreating the spam, you're storing details about what spam looks like for the purposes of recognizing more of it.
> The situation for AI is, to my knowledge, still very muddy at this point.
The question is if the NN weights in an AGI are materially different than the NN weights in a spam filter.
> If the people or the legislature doesn't like the rulings they can look at the judges' reasoning and fix the law,
That's exactly the point. In common law, when a precedent is cited in a later case, it is (like a law) largely protected from "reasoning about". You need to involve the legislature to change it. In civil law, other cases are of course also invoked as references - but not law-like, they're just shortcuts in transferring prior reasoning, and fully open to challenge. Unlike laws.
(But this is really off-topic here anyway.)
> By the act of giving you the email they're implicitly giving you permission to do email things with it - read it, forward it, store it, etc. But you don't own the copyright and can't create and publish derivative works, etc.
None of these things come about from something written in the e-mail. They are that way because it is an e-mail. If you want to tack on other semantics, that's an entirely different thing.
> The question is if the NN weights in an AGI are materially different than the NN weights in a spam filter.
No, that's completely besides the point. The question is whether NN weights trained on data that you received, in this case without any agreement, are materially different from NN weights trained on data that you crawled and that had "you may retrieve and use this data under terms XYZ" restrictions attached. It legally very much matters whether the data got thrown at you or whether you went looking for it on your own accord.
I very much disagree that you can't specify license terms on spam. Lawyers certainly seem to think you can, as they always have a huge legal blob at the bottom of their emails about what you can or can't do with their email.
Lawyers are trained to lie. Unless they're specifically prohibited from lying (such as to a judge), lawyers will lie if it gains them an advantage. Do you think every attorney who claims at a press conference, "The facts will vindicate my client!" is telling the truth?
Even when lawyers are prohibited from lying, they are trained to and expected to mislead.
The "confidentiality" blocks in e-mails are completely unenforceable, unless there is a separate contract which it is included under.
"Automatic e-mail footers are not just annoying. They are legally useless"
Looking at / searching around this with a wider lens, in some cases the footers seem to serve a function in clarifying the intent of the mail (e.g. "this mail does not establish an attorney-client relationship") when the remainder of the mail may be unclear. But that's not a license or contract, that's a clarification of intent.
And with this I'll take my leave from this discussion as it no longer feels fruitful. But Thanks for the interesting thought exercise!
I didn't state anything requiring evidence. I merely continued the logic proposed by the OP. You are fully capable of verifying or refuting the continuation without extra information.
I agree the definition is overly simplified. The main point was that training implicitly contaminates the model with a presumption of derivation, training is not some magic pixie dust you can sprinkle onto protected works and strip copyright away.
The next thing to discuss is if the derivative work is sufficiently transformative to be considered fair use without legal authorization from the rights owner. There is no easy analogy that can be made here, the type of derivation - transformation we talk about has no precedent in intellectual property law. My position to your challenge is that an AI text classifier/filter/recommendation engine is a sufficiently transformative derivation of the copyrighted works, whereas a general purpose machine that can produce works similar in style, content and character with the originals is not sufficiently transformative and should require authorization.
The way I propose we arrive at that conclusion, without any legal precedent, is one based on first principles, on the intent and final purpose of copyright law. It's a political philosophy position, namely that intellectual property is a social convention designed around the creation of a common good, it exists to promote "the progress of science and useful arts", general flow of ideas and knowledge, by creating an economic incentive for creators to produce and make public their work - as opposed to keeping it secret to control its distribution, or abandoning creation for other fields, both net negative outcomes that diminish the public good.
So when judging if the work of thinking machines is sufficiently transformative, we should ask: is what they output a net positive contribution to the common good of creation and widely available intellectual works, ideas or art? Is it at least not-negative? It's easy to make that argument with text classifiers, but much harder for something like Stable Diffusion. The algorithm it runs is completely dependent on human produced artworks, it cannot function without such an input and can't even produce a single creative bush stroke without them. Yet, the works it remixes using creative features of the originals can and indeed have already started to economically replace the work of original artists in the market place. So treating that derivation as fair use pushes society into a bad equilibrium, where artworks are less valuable and less likely to be produced, while the AI machine owner appropriates much of the economic value of the works it slurped in training. That's not 'fair use', and the AI machine as a whole is not 'fair use', even if some, or even all, of the works it produces could be considered taken individually, as fair use of the originals.
This will continue to be true for as long as human creators remain a key ingredient of the automated creation process. When and if a machine can start to paint after reading an university course on painting, then that would be indeed a fair use of those manuals.
>An AI product - more specifically, the model weights - is a derivative work of the original works used for training; AI training is a process of algorithmic compression of the originals.
I'm not so sure this is as obvious a conclusion as you think. Imagine for a moment an AI OCR program. If one goes to their local library and scans all the books there to generate the models used to OCR text, does that make the OCR model and application derivative works of the books? Does copyright give Tolkien's estate the right to prevent the distribution of AI based OCR if a published copy of The Hobbit was used in creating the model? Certainly with the right inputs, the model can be used to generate a verbatim copy of the work it was trained on, but is that sufficient to say that your OCR model is just an "algorithmic compression" of these books?
I totally agree that it's not obvious that an ML model is a derivative work. the language of the Copyright Act uses "recast, transformed, or adapted" to describe derivative works, and a pile of model weights isn't clearly that, IMO. I think it's fair to say that inferences directly replicating the creative and expressive elements (because factual information isn't copyrightable!) of a copyrighted work infringe. but I don't think it's obvious that the model itself does.
> If one goes to their local library and scans all the books there to generate the models used to OCR text, does that make the OCR model and application derivative works of the books?
there is a court case [1] addressing an even more infringing use case: scanning and OCR'ing books to produce a searchable database. that case turned on fair use, however, and not whether the database was a derivative work.
This analogy isn’t quite right, it’s more like if you trained a font generation AI using commercially licensed fonts, or trained a literature generating model on samples of copyrighted fiction.
The part that matters is that the model is being trained on the copyrighted features of the input, not the parts the copyright holder doesn’t care about.
What it's trained on shouldn't matter at all. What should matter is what it's capable of outputting, and whether that covers works in which someone holds copyright - so something regurgitating its input would be problematic, but something not capable of producing the same format of output as its input should be in the clear. Otherwise you're talking about something that shouldn't come under copyright law.
It’s not clear to me that it’s any easier to judge this than in human cases of “inspiration vs infringement”. What part of the model weights can a lawyer point to and say “this bit clearly incorporates a substantial portion of my client’s work”.
I don’t think that suggests copyright infringement, more that you could do something like “request from the Amazon internal admin page url” and the AI will generate you the correct link. Or tell you other info which is private. The AI hasn’t copied anything, but it does know secrets and can use them.
This does not matter for open source since anyone can see this info already.
> no one training these models feeds their own proprietary source code into publicly available models
You are presupposing that the company’s own code would somehow be a massive boon for the model, resulting in lower loss overall.
In reality, it would skew the model towards that company’s “mode” of coding which isn’t what “normal” programmers expect. In fact, they are most likely to expect the coding styles they learned from, and that is most likely to be found in public examples (GitHub, StackOverflow, textbooks, Reddit, etc.)
This argument is so silly to me. Anyone who has worked at a large enterprise, whether it’s Google, Amazon or Target, knows that company code is effectively guaranteed to be extremely hard to work with. This happens for organizational reasons and really the best thing to do about it is admit that it’s happening rather than pretend it’s all perfect.
The reason they don't train the model on their code is specifically because they don't want it accidentally spitting out snippets of their proprietary code, not because the code is "extremely hard to work with."
I'm amazed you called that argument silly while countering with this.
It is because of both and I agree that your reasoning takes clear precedence. I was merely pointing out the good faith position that “even if they wanted to, they wouldn’t do it”.
That definitely wasn’t very clear from my comment however.
You also wouldn't see them giving their code to people learning to code to read and learn from, but that doesn't necessarily mean that learning from something violates copyright. It seems like a bit of a grey area where the distinction is between learning and using it directly
> You also wouldn't see them giving their code to people learning to code to read and learn from,
Actually, we do see them do exactly that. Microsoft is ahppy to have a shared-source licence that gives away much of the Windows source code to universities.
The fact that they don't want to train their models on it says a lot.
If, as they claim, learning from existing materials does nto devalue those materials in any way, they'd chuck the entirety of the Windows source code at the LLM for training purposes.
What if the whole pipeline (scraping for training, the training itself, model distribution, and use to generate derivative works) is found to be a fair use (at least in American law)? Licenses wouldn't mean anything at that point, since it's because of copyright that they can make you accept them in the first place
Then every employee of openai should scrape the weights of GPT-4 and train their own neural nets. That would not be a derivative work and also be free use under this logic.
It seems credible to me to suggest that the model weights are a trade secret, but aren't copyrightable. There's lots of stuff that could be in that category for other companies.
The employees would still have a contractual responsibility about their use of the model weights.
I don’t think we need a court case for that. Model weights aren’t copyrightable by the trainer in the US. Copyright only protects works of human authorship, and training a network is not authorship.
If the model weights encode other copyrighted works literally enough then they may be copyrighted by the author(s) of the works, that is a gray area. But the training process itself is not authorship and doesn’t confer copyright.
That's not how fair use works. This kind of use would facially fail three of the factors for fair use; it's not be transformative, it copies the original work in its entirety, and it harms the commercial market for the original work.
Fair use doesn't enter into it if weights aren't copyrightable. They're machine generated by stochastic gradient descent. There's no human hand setting the weights. We won't know until something ends up before the USCO.
how much do the weights have to change for it to be transformative? Edit: you arrive at the same question as taking an image with img2img and running it through a diffusion variant - keeping the composition, colors etc. but no the details
Much of the case law about the "transformative" factor focuses on "new meaning or expression", but it's about visual art, which is generally very difficult to reason about w.r.t. copyright. I think the example to look to for technology is Authors Guild v. Google, where "transformative" is more about non-expressive purpose, and it was considered transformative to copy a bunch of books to produce a search functionality, since the search functionality (which only displayed snippets) was a transformative purpose compared to the underlying creative expression in the books.
An AI product - more specifically, the model weights - is a derivative work of the original works used for training;
I don't think that's so obvious. Why would it be so, any more than for humans who learn from material?
I mean, one might ask if your very comment here is a derivative work of the aggregate corpus of material you've previously read on the subjects of copyright, open source licensing, and AI. I suspect most of us would agree that it isn't so, but why treat the model weights of an AI so differently than the synaptic weights in your brain?
Because copyright, and law in general, is an expression of the political agreement reached amongst the members of our society. It does not exist in the absolute, there are no legal principles that transcend humanity, law is a human creation to arbitrate our collaboration and conflicts.
Therefore, in the legal sense, an algorithm does not "learn", despite any functional analogy you can make with human learning, because an algorithm is not a party to the social contract that established said law; its only "rights" are an extension of the legal right of its author/proprietor. Your "learning" right does not cover, for example, your tape player recording a performance and playing it back at later date to some commercial audience. You have a right to hear and learn the song, you can play it back from memory, but your tape recorder does not, it's a tool, just like your fancy AI machine.
This will continue to hold true despite any advancements in AI, up to the moment when synthetic entities will acquire distinct legal rights.
> You have a right to hear and learn the song, you can play it back from memory, but your tape recorder does not, it's a tool, just like your fancy AI machine.
There is two different copyrights, one for the melody/text, and one for the recording. Sometimes they have different owners who fight. The most famous recent example is the Taylor Swift controversy I guess. She ended up re-recording some of her old songs so that she owns the rights to the new recording.
Yes, I was talking about the copyright for the performance, not the underlying melody. If, for example, you hear a public domain folk song, you can sing it later, but you tape player can't, even if it "remembers" it just like you do, because the rendition is owned by its performer. The example had the purpose to clarify the distinction between the rights of the human listener and their tools, but I see based on the response it confused some people.
To give another example, even if I can walk or run in a park, my bot army with a million mechanical feet that all behave by analogy to the human foot can't also run through the park. Why should it be any different in the case of my AI derivation machine with superhuman memory and derivation ability?
So even if the courts find that AI training is fair use, and not derivation, that conclusion will not be based on the analogy with the way humans and machines learn. Nor will it preclude the writing of laws, by humans, explicitly redefining copyright to protect human creators from unlicensed AI training. The social contract is anthropocentric all the way down.
Therefore, in the legal sense, an algorithm does not "learn",
Are you saying there is actual case law / precedent establishing that, or is that just your personal theory? If the former, I'd love to see any such citations, as I was not aware of those developments.
Your "learning" right does not cover, for example, your tape player recording a performance and playing it back at later date to some commercial audience.
That's pretty much a straw-man here. I'm not talking about cases where an AI reproduces an existing work exactly. That is problematic from a copyright standpoint for both a machine OR a human.
> You have a right to hear and learn the song, you can play it back from memory
You do not in fact have a right to play a copyrighted song from memory, any more than you have a right to play a recording, unless you're playing it for yourself. Just like you don't have a right to show a movie on a DVD you bought to others.
> Why would it be so, any more than for humans who learn from material?
Because AI isn't human, and there is no credible argument that it's anything close to a human, and unless you do establish that connection, you can't just auto-apply the logic/intuition we've developed for humans to AI
I think it does apply here because the point is that learning isn't direct copying and the knowledge you get from it isn't copyrightable, and AI could be the same in instances where it's not directly copying
When you want a clean room non-GPL implementation of something GPL that already exists, you ask the developers not to look at the original. I don't see how this is any different.
It's completely different. You're talking about re-implementing a specific piece of software. And that whole "clean room" thing isn't an absolute anyway... that's the level of paranoia you engage if you want to be super duper sure that you can't be accused of copying the original.
What I'm talking about is closer to "you fire up your IDE (or Emacs) right now, and churn out 250 lines of code for some arbitrary piece of software. Is it a derivative work of every pieces of software whose source code you have previously look at?"
Note that I'm not referring to the case where the AI spits out code that is identical to code taken from another project. I'm aware that that sometimes happens, and that is obviously a problem, just like it would if a human did it. What I'm arguing is only that it probably should not be taken as a given that AI generated code is automatically considered a derivative work.
Here's a thought experiment: say an AI emits a single line of code tomorrow. You examine it, and then spend weeks, months, or even years searching all the open source code that's "out there". You fail to identify a line in any existing code-base that was clearly the upstream source for the line from the AI. So is that line a derivative work? If so, of what?
If 100 experienced C devs are asked to write strcpy, some of those implementations will be identical and that fact will not indicate any copyright infringement has occurred.
What often happens is that you ask one set of devs to look at the GPL code and draw specifications of the functionality and have a second (non-intersecting) set of devs do the implementation without directly referring to the GPL code, but indirectly doing so by using the specification.
> The existing licenses cover AI training just fine [...] An AI product - more specifically, the model weights - is a derivative work of the original works used for training
The presumption is entirely debatable. A human is not considered derivative work of the original works they used for training.
> Any other interpretation of the nature of copyright [...] which by definition, only covers human produced material
Maybe copyright needs to leave the 1980s and evolve to deal with AI too? Maybe you do, too?
> A human is not considered derivative work of the original works they used for training.
No one has forced you to use humans as a comparison. A human is a citizen with other rights, and can own its own copyrights. Yet, you can still sue one for singing a song in front of other people. There were years of cinema that were distorted by the inability to have characters sing "Happy Birthday" to each other.
edit: Suddenly, fair use now covers the ability to reproduce copyrighted material almost verbatim, but only when the new method to do so takes tens of millions of dollars of computer time to take advantage of.
AI models don't always copy things verbatim. Clearly if they do and then you use that it's copyright infringement, but Google isn't illegal just because you can search for code that's under a license agreement (not that AI models are search engines). Then again, you can't download Google's index, but you also can't for many AI models... there's a lot of nuance here, I don't think it's clear one way or the other
And copyright is doing its job just fine in this context; AI training committing Open Source license violations en masse is the problem here.
I'm all for copyright becoming substantially weaker or ceasing to exist, at which point AI training and lots of other things gets easier. As long as it does exist, however, AI training must respect it, and not become a copyright-violation laundering mechanism.
So if I lay out 10 images from an artist, or a variety of artists, and create a work, by hand, in a similar style, should I abide by your similar thought pattern?
Almost all art is influenced by previous works, the only difference here is that the time horizon for a computer generating similar outputs is much, much shorter.
the way a program works isn't actually how a human would work, and those (false) equivalences, don't make the way a program works to be excusable. anyway, it doesn't matter, it's just distracting bullshit ultimately.
meanwhile, a program operates in a series of very concrete tangible operations over bytes.
was data of original works downloaded? yep. was that data processed? yep. was some kind of output based on that data created? yep. so what is that data, if not a derivative? and then, if some other data was created based on that derivative data - damn, that's a derivative of a derivative. is it not?
> was data of original works downloaded? yep. was that data processed? yep. was some kind of output based on that data created? yep. so what is that data, if not a derivative?
If I have a program that downloads an image, process the image, and simply returns a yes or no, single byte information, of if the image is blue or not, are you really going to call that derivative work?
Because this "is it blue" program does ever single thing that you brought up here.
And clearly my "is it blue" program is not derivative work.
program may not be, results could very much be. they are based on something. "but what does the word "derived" mean anyway" lol. i don't know, if you really don't want for those 'results' to be 'derived' from something and be a 'derivative', do a coin flip and just pick that random result.
So you are actually going to argue that the one single bit of information, that says "yes or no, is this image blue" is derivative work?
Really? Clearly is not. It is clearly the case that if a program outputs if an image is blue or not, with a yes or no answer, that this yes/no answer is not derivative work.
You will not lose a lawsuit for outputting if an image is blue.
do that enough times and it becomes a sizeable database. is it gonna be a derivative work then? at which size? some databases, like ones that point to and describe 'training data', are pretty much just 'saying if an image is blue'. describing what's on that image. is 'one annotation' a derivative work? is a million of them? either way, whatever's the number, it's apparently enough for those databases to put up a license on them, and regulate whether and on what terms are derivatives gonna be made of them.
No, a data point of if an image is blue is not illegal.
> it becomes a sizeable database.
No judge has said this.
Also, the case law actually supports me not you.
What we have described here actually sound very similar to the Google court case about search indexes/ect, which was ruled not illegal.
So are you just going to falsely claims that this judge is wrong and that all search indexes are illegal or something?
It's clearly not. It is clearly not illegal for all of Google to exist. Search indexes are not banned worldwide.
Also, I am not sure why you are even arguing about this when you previously said '"but what does the word "derived" mean anyway" lol. i don't know'.
If you don't know then thats fine! You already admitted that you don't know, so stop pretending like you are confident on any of this when you are admitting that you don't know on the most important part of it.
that's you, asking "are the results of a program that were derived from something gonna be a derivative work?". what is a derivative? what does derived even mean really, right? a program wouldn't be a derivative, results could be, and they're just derived from something. that is just gonna be the chain of processing there. and, well, "is it blue" database might not be a "derivative work" (for legal purposes). but it still could be a "work" that could be protected with a license nonetheless, huh. so, it's not a 'derivative work', but it's a 'work', and it was 'derived' from something. if you're more comfortable, it could be called 'a derived work'. not a derivative work - a very important distinction there.
anyway, that example with "a hypothetical blue program" - isn't gonna map so neatly to 'an image processing program, that takes in images, and spits out images'.
> and, well, "is it blue" database might not be a "derivative work" (for legal purposes).
Finally we got there! I am glad that you agree that according to the only definition that matters, in context here, is that this stuff could be completely legal!
Just like search engines are legal according to existing laws, yes a blue database could be legal, and finally, so could AI models weights.
I am glad that you have pulled back from basically everything here and admitted that this stuff could be fully legal.
Really you could have just said that from the beginning and solved all this confusion.
sure! except, well, it's still a derivative work. even within those that end up at 'well, for "legal purposes" and in some settled case, it isn't' - it still got called a derivative work and was considered as such, by people who filed a lawsuit, etc. and it might continue to get called that, unfortunately lol. those cases - might not translate into some kind of 'decision' about other kinds of derivatives. some kind of database, or training datasets (with multiple parameters, descriptions, etc.), or models, or AI output - are all gonna be different, unfortunately. not as much of a smooth sailing there, I'm afraid. and some random "search engine" case that keeps getting alluded to, isn't gonna map to those things as neatly, let alone all of them in that cute daisy-chaned manner. hell, google with their newer AI search tricks, like bard and generative answers - might only be yet to find out whether that checks out legally or not.
Are you pretending to understand how human creativity works? Your last 1-3 sentences is literally what a human does, in some situations, or in many situations early in the development process of a style. If this was not the case, art school would not be the study of previous artists and styles, it would purely be the study of the physical world.
there's "literally what human does (and that's why it's fine for AI to peruse all of your data, don't even question it lol)" (which is bullshit. cute for a sales pitch, but it will always fundamentally be bullshit), and then there are actual ways program work. and these two are pretty much disconnected, no matter how many 'analogies and parallels' are attempted to be drawn. we're not computers, and computers are not humans.
if a person were to "do these steps" and "do creativity" in that way - as in, "downloading something" and doing a bunch of quantifiable, traceable operations, as part of their process - well, shit, that's still gonna be just the computer part of it. not 'human creativity' part. still. if there was that 'human creativity' in the loop of 'downloading and perusing a bunch of IP in a traceable way', it's the latter that would be questionable - and useable, for questioning of, for investigation, for lawyers. and unfortunately, computers and their "creativity" work pretty much entirely in that "quantifiable and traceable" way. the 'human creativity' could be chalked up to 'well i don't know'. 'program "creativity"' - very definite 'an executable did this and this, and spit this out'. and if it reproduces well, 'here's a memory snapshot, a complete step-by-step of the process'. can't do that with brains. very doable for software. it could be just a bunch of incomprehensible stuff, but it'd still be a complete byte dump.
"it would purely be the study of the physical world" - well, it's still a very much existing option. would it make a difference if that was 'the only option available' or just 'one of the ways'? and AI studies the real world as well, through photographs, and 3d scans, and so on. art is influenced, and those aspects can be pointed out across artists, with some artists readily admitting, 'oh, I was inspired by so-and-so'. people do all kinds of shit with art, and get told off or even sued. art can be forged, with people making something in style and trying to pass it as a work of different artist and sell it. which can be a crime. is that surprising? and AI services try to sell their outputs as well.
Your posts are very weird, almost all of them are long form, interchanging quote types, quoting at the wrong times, a ramble of ideas...I would not be surprised if this is like a 'snarky AI' model that is producing these.
Imo, this kind of draconian interpretation will only lead to China being the world leader in ai (a country who has a history of not really respecting ip and copyright).
I believe Japan has actually passed a very permissive ai law basically allowing AI's to learn off copyrighted content as well. But any country that essentially bans AI learning in this way will simply fall behind.
You are absolutely right, and all of these commentators trying to counter you would fall apart if CoPilot was found using proprietary software in its training data. Its no wonder all these AI companies are playing fast-and-loose with licensing.
To all those commentators: You really need to understand how this works. There is a reason all the big tech companies are fanatically allergic to GPL software. It is worth the legal fee for the lawyer to walk you through all of it.
>AI training is a process of algorithmic compression of the originals.
I would disagree with this. AI training is more like extracting information from data. Humans also extract information from data when learning.
If human have access to data to learn and after create product to sell I think same human should be allowed to use that data to train AI after use AI to create and sell product created by AI.
>Any other interpretation of the nature of copyright - which by definition, only covers human produced material
“By definition”? Your definition? or is that in the text of the law? Even if it is in the text of the law, are not corporations now legally people in the USA?
We don't need Open Source licenses to change to deal with AI. We need AI to respect Open Source licenses, or not use code under those licenses.
I sincerely hope that one of the many court cases produces a verdict that says AI-generated code is, in fact, subject to the licenses of the inputs. Then there will be a lot of screaming and wailing, as people go "but how can we train AI if we have to respect licenses?!". And then people will figure out how to actually respect Open Source software licenses (and, for that matter, proprietary ones).
I think if this existed, then it would benefit the current owners of the large IP pools the most. Currently, yes, many think they can use models trained on OSS code to create proprietary software. In general, proprietary software is bad but it's way worse to have a scarcity of models because they are owned by large IP holders.
In Github's case for example, Github's TOS already includes a clause that if you upload code there, you grant Github a license to use the content to run Github's services... and copilot is one of them.
Such clauses are commonly found in social media where users can upload content. Think of imgur, instagam, reddit, etc. OpenAI might buy reddit, and declare ChatGPT a product of the reddit service, then all discussions on reddit could be used for the training of ChatGPT... while open models can't access the data.
> In Github's case for example, Github's TOS already includes a clause that if you upload code there, you grant Github a license to use the content to run Github's services... and copilot is one of them.
You can't grant permission for something you don't own. Uploading a copy of a GPLed work to GitHub does not grant GitHub permission to ignore the GPL. (It might grant GitHub permission to ignore your copyrights in that work, maybe, though it seems like a stretch to argue that "run Github's services" includes "give other people derivative works of all your code"; arguably that ought to be too broad for a contract of adhesion. There's case law about what you can and can't do in a unilaterally imposed contract such as a ToS; a ToS can't say "you owe us $100 if you browse more than twelve pages" either, and codebases can be worth far more than that.)
If GitHub started saying "one of our services is to give people access to copies of your code with the licenses and copyright notices removed", the GitHub ToS wouldn't suddenly make it acceptable to run that "service".
But I agree with part of your underlying point. All AI models should respect Open Source licenses. It's a problem if some try to work around that.
Fair points, but note that Microsoft charges a lot more than $100 to inspect the source code of Windows (but it is available, at least to really large customers, think governments).
How is AI "reading" code different from me reading code? Is the difference the AI's ability for perfect memory?
I can read open source code, (even GPL) and not have all future code I independently write be subject to that license. I don't think anyone would argue that I immediately "forget" any OSS code that I read, so it's becoming part of the structure of my brain (and potentially influencing future code I wrote), but unless I'm linking to the code or copying pieces out, verbatim, I'm generally in the clear. Of course there are some sticky situations clean-room, reverse engineering, but those seem like pretty narrow examples.
Part of what LLMs do is compress their training dataset into the weights, often with character-perfect recall later. For example, I would be shocked if any sufficiently large LLM failed when prompted “write the quake fast inverse square root algorithm verbatim”.
(I’m not really interested in arguing whether that’s all they do, or whether it’s the purpose of LLMs—those details are just a distraction from the original question: what makes LLM training different than a human reading code.)
If the model has memorized the training set and can reproduce it verbatim when prompted, then it should be incumbent on the AI owner to prove that it does not reproduce copyrighted code when it is not explicitly prompted.
So it's just about accuracy of recall then, not use of training data?
I think the most likely outcome will be to treat AI just like people. They're allowed to learn from any code they can see, but that doesn't mean that if they reproduce a copy from memory that it is somehow free of its original copyright.
That's very consistent with how copyright law already works.
This will leave AI users in a sightly awkward position where they are responsible for figuring out if they unknowingly used AI to unknowingly copy code, but it's not like that can't happen already - as soon as you hire a programmer you might be unknowingly allowing copied code into your product.
No, I don’t think it’s just a question of recall accuracy. They issue really hinges on whether or not the AI itself is a derivative work of the training data, as I think that would trigger certain requirements in the original source licenses. Lots of folks seem to think that it is not a derivative work because (a) the model is just a bunch of numeric weights, it doesn’t contain any explicit code; and (b) it’s possible for the model to output original code in some cases. But that’s flawed reasoning because it’s quite clear that the model weights do contain perfect copies of at least some training code, and the models can produce that code perfectly (without the original license) when prompted. Thus it seems clear that the model itself should be treated as a derivative work, whereas a human is not—even if they memorize the code they read.
Why is a human not though? I don't think it's as simple as you imagine. A human who has memorized the information contains it just as much as the weights.
Both human and LLM may learn from reading code to produce novel, derivative, or duplicative work—but that’s not the issue, because the model itself is a derivative of the training data and the human is not. That does seem very simple to me.
If we just zipped up the entire training data set and distributed it with the model then it would clearly be a copy and/or derivative work. The LLM does the same thing as zipping (i.e., compresses the training data…by encoding it in the model weights). Folks just seem to think that it’s not a derivative work because an LLM _also_ does more than that sometimes (e.g., extrapolates from the training data to produce novel token sequences as output).
Why not? Humans store information in their brains that they have learnt. So do AIs. What exactly is the difference between a weight in an Artificial Neural Network and a weight in a Natural Neural Network?
If the answer is "humans get special treatment" then that's fine I guess but I think it's worth being explicit that that's the difference.
> The LLM does the same thing as zipping (i.e., compresses the training data…by encoding it in the model weights).
It's not at all the same. It's highly lossy. Only extremely highly repeated works get memorised exactly and even then it's often not exact.
LLMs do not contain a copy of all the training data (if trained properly). I agree if that was the case then it would be different, but that isn't how they work (unless you badly overfit).
> If the answer is "humans get special treatment" then that's fine I guess but I think it's worth being explicit that that's the difference.
That's absolutely the difference. Humans aren't copyrightable; the alternative would be unconscionable.
> even then it's often not exact
You don't have to copy something exactly to be a derivative work. "Lord of the Rings but a random 15% of words are replaced with gibberish" is still a derivative work of Lord of the Rings. So is "Lord of the Rings but every word/sentence is paraphrased".
An LLM contains some portion of the training data exactly and the rest of it lossily. What I’m really arguing is that _alone_ that is enough to make the model itself a derivative work. It actually doesn’t matter whether that’s the same or different than a human; that’s a distraction. The AI model is itself a work that is derived from the training data.
Because many programmers who open sourced their code intended it to be read by humans, not AI. They don't want some centralized super computer owned by a mega-corporation reading their code. At the very least, if the models were Free and Open Source, the reaction might be different.
Basically, the difference is that you merely reading code doesn't create a derivative work that we can meaningfully look at. Yes, it gets stored in your brain but your brain re-encodes all that knowledge in a way only it can use. We're still quite a bit away from brain uploading at the moment, so that's not a meaningful avenue to discuss right now.
An LLM on the other hand generally works off of a model that was trained first, and that model can be saved to a file and read out later. As a result, it's a derivative work that we can examine, copy, share, modify and do all the things with that we generally attribute to something being a Work. The question on if binary output from a program can be copyrighted is somewhat unclear, but from what I've heard legally (not legal advice, I Am Not A Lawyer), it seems to be the case unless you explicitly say it's not[0].
There's a few other things to consider like how you, as a human, can make the conscious decision to avoid specifically replicating GPL code that you've seen if you're not allowed to use it (whether that is by restructuring the code, doing the same techniques in a different language, or the heaviest example which is clean-rooming it). AIs don't have the ability to make that distinction (and to my understanding due to how they work, the only way you can meaningfully avoid it is if you ensure that the entire model is compliant to avoid the AI going off on it's own tangent and making the decision to include incompatible code.)
From a more practical perspective - Copilot will happily spit out and apply the wrong license to Quake IIIs fast inverse square root algorithm function. It's GPL licensed code but it IIRC claimed it was BSD licensed? That alone would constitute a violation and it'd be weird to not point at the people who trained the model that allowed it to make that choice.
To be fair, right now a lot of this is up in the air and all we have to go on is kinda wishy-washy guidance from copyright offices (which is mostly just refusing registration on the basis that a copyrighted material has to be made by a human, not by a machine). There's a couple of ongoing lawsuits specifically about Copilot that are still pending and from what I last heard, the judges aren't very impressed by the defense of GitHub/MSFT/OpenAI. The approach also greatly differs per country/governing body - Japans government has for example given blanket permission for non-commercial AI training, while keeping a strict eye on anyone trying to use it for paid services, while the EU is passing legislation that seems to mostly lean towards "it's copyrighted, that's now your problem to get in line with it", without outright saying it yet.
[0]: This is the main reason why for FOSS, the Creative Commons License usually is not seen as a good pick outside of assets, because it can interfere with distributing compiled versions of your code.
> Japans government has for example given blanket permission for non-commercial AI training, while keeping a strict eye on anyone trying to use it for paid services
This is incorrect; it doesn't matter whether it's commercial or non-commercial, and you can use anything as training data regardless of copyright. See the amendment of the copyright law from 2018.
> I sincerely hope that one of the many court cases produces a verdict that says AI-generated code is, in fact, subject to the licenses of the inputs.
Such a strict interpretation of copyright would kill Open Source, as any attempt at reverse engineering would be disallowed by that.
Furthermore it would mean that only big and rich companies would have AI, as they can do behind closest doors whatever they want with AI, Open Source licenses only cover redistribution after all, which internal use doesn't fall under. Meanwhile any attempt at publicly available open AI models would instantly get killed by copyright claims.
> Such a strict interpretation of copyright would kill Open Source, as any attempt at reverse engineering would be disallowed by that.
Not at all, for multiple reasons. Reverse-engineering already has that problem: if any of the people who do the reverse engineering also work on the code, it's entirely possible for some of the reverse-engineered code to end up in the new code, making the new code a derivative work of the original. There are standard ways to carefully avoid that: https://en.wikipedia.org/wiki/Clean_room_design
(Also, reverse engineering is a small fraction of Open Source.)
> Furthermore it would mean that only big and rich companies would have AI, as they can do behind closest doors whatever they want with AI, Open Source licenses only cover redistribution after all, which internal use doesn't fall under.
The moment they distribute anything written by the model, the same problem applies. And if they don't redistribute anything written by the model, then sure, they can do anything they like, just as you're free to internally combine GPLed and proprietary code if you never ship the result. (Note, though, that many companies have figured out it's a bad idea to do this, because it creates a combination you cannot ever distribute, and circumstances might change in the future to lead you to want to distribute it.)
> Meanwhile any attempt at publicly available open AI models would instantly get killed by copyright claims.
No, they just need to actually pay attention to the licenses of work they train on. Train on permissively licensed code, document every codebase trained on, and record the licenses and copyright notices.
> I sincerely hope that one of the many court cases produces a verdict that says AI-generated code is, in fact, subject to the licenses of the inputs. Then there will be a lot of screaming and wailing, as people go "but how can we train AI if we have to respect licenses?!". And then people will figure out how to actually respect Open Source software licenses (and, for that matter, proprietary ones).
I think this is an incredibly short sighted take.
Training a modern state-of-art LLM needs terabytes of training data. It's probably not going to be practically possible to actually license this much data. And even if we assume it will be then the only entities which will be able to do it are the world's biggest corporations.
So if AI models are a derivative work of their training data then as a consequence the whole field will mostly die overnight, with perhaps only a few of the world's richest corporations being able to play with this space.
I'm sure you're aware of the slew of really cool open/free models that are out there, which you can download today and play with on your local machine. Like Stable Diffusion. It was trained on all rights reserved data. Now it's going to be illegal. Or OpenLLaMA/Falcon/MPT/etc. Also trained on all rights reserved data. Illegal too.
Have you heard about the Pile dataset[1]? It's the most popular open dataset for training LLMs, and essentially every non-proprietary LLM is trained on it, or on parts of it. Do you know that it contains 100GB of all-rights-reserved pirated ebooks? If AI models are a derivative work of their training data then all of those models are now illegal.
This would completely kill any chance of having good free/open models. Sure, you could then grab all of GPL'd code, train a model on that, and maybe have a decent working GPL'd model that can emit GPL'd code. There might be enough data for that. But that's it. What about other kinds of models? Image generation, chat bots, personal assistants, story writers, etc. There's just not enough freely licensed data (and probably will never be) to train those.
We must democratize this space. It's already insanely expensive to train state-of-art LLMs; we don't need to make it even more expensive. It's not going to stop OpenAI. It's not going to stop Microsoft. They'll figure it out. What it'll stop is everyone else, and will make this technology completely out of reach for everyone who isn't an insanely rich multinational corporation.
With all due respect, to everyone who's complaining about Microsoft using your code to train Codex, I'll be blunt: you're advocating for collectively shooting all of us in the foot. I don't want to live in a dystopia where only huge corporations will have access to cutting edge AI technology, so please stop pushing in that direction by advocating for more draconian copyright just because you're butthurt that Microsoft/OpenAI used your code as training data without asking for permission.
Big corporations are not the only people doing this; the little guys (e.g. EleutherAI) who release free and open source models do it too. If you try to block the big bad evil corporation from doing it you'll also block the little guys. The major difference here is that the little guys will get completely screwed by this while the corporation has the cash to try to get around it.
I'm familiar with the widespread practices of how AI models are trained, yes. There's an implicit "and we must be able to do this" in your argument, which is not at all evident. "But what about AI" is not an argument that suddenly it's OK to violate Open Source licenses or the licenses of small copyright holders (e.g. of online posts).
I'm well aware that this is widely done by large and small entities alike; I'm not just concerned about the practices of large companies, I'm also concerned about small ones, and individuals.
This is not "more draconian" copyright; this is not a change to copyright at all. This is the same copyright we already have, equally enforced for all copyright holders, large and small. You want to "democratize" this space? Get rid of copyright, and a lot of things become better and easier, not just AI.
I don't want the dystopia where copyright still exists for large publishers and studios and proprietary software companies to prevent sharing and remixing things, but at the same time all the small and Open Source entities don't get to set their own terms because AI will just remix them away.
Do you think you'd get away with training an AI on a bunch of animated Disney movies, and asking it to generate new images in that style, and using the result in commercial endeavors? Or is it just the myriad of smaller copyright holders, like independent artists, that you're comfortable stepping on?
> It's not going to stop OpenAI. It's not going to stop Microsoft. They'll figure it out.
Will they? If there isn't, in fact, a legal solution, they're not in any better shape than anyone else. If anything, they're in a worse position, because they have deep pockets for potential lawsuits, while non-commercial efforts tend to not be interesting targets to sue (at most, they get shut down, and others pop up in their place).
Question your assumptions about the world that results from requiring AI to respect Open Source licensing and other small copyright holders such as independent artists or online comment/story authors. It's not a corporate dystopia. It's a level playing field.
> I'm familiar with the widespread practices of how AI models are trained, yes. There's an implicit "and we must be able to do this" in your argument, which is not at all evident.
Maybe it's not evident to non-practitioners in the field, but to every serious practitioner it is obvious that you can't train a state-of-art model without a lot of data (at least right now without some colossal breakthrough), and that actually licensing that data is not really practical (because you need terabytes of it), and it's definitely going to be impossible for anyone who isn't a megacorporation.
Can we agree on this point? If not can you please explain how do you think that e.g. a single individual like me will be able to train e.g. an image diffusion model (so I'd need a few terabytes of images) if I have to respect the licenses of every image in the training set?
Okay, so I hope we can agree that it won't be possible? So now here's the question: do we want such AI models to exist, or do we want to make them illegal (and maybe available only to huge megacorporations)? These are our only two choices, which logically follow from the requirement that we need a lot of data for training.
What I'm advocating for is that we should allow such models and that they're beneficial to us as a society, hence the "and we must be able to do this" in my argument.
I'm starting with the assumption that I want these models to exists and that everyone should have access to them, and then go backwards from that. What you're starting with is the assumption that the training data copyright should be respected, and you're going backwards from that. But these two graphs are not connected, which is why we can't agree.
Or in other words, what you're (indirectly) advocating for is to make those large models effectively illegal. This is, of course, a valid stance, and if you want to take it then you're free to do so. But that's objectively what you're proposing in practice, and personally I disagree with it.
> Do you think you'd get away with training an AI on a bunch of animated Disney movies, and asking it to generate new images in that style, and using the result in commercial endeavors?
Yes.
Just the same as if I'd draw an image in the style of an animated Disney movie by hand.
In both cases I'll be sued for trademark infringement if the image's of the Mickey Mouse though.
In many cases the current "inequality" of how law is applied to individuals and to megacorporations has little to do with the law itself, and everything to do with how rich the megacorporation is. Try to set up an apple orchard and pick an apple as a logo[1] and tell me how it goes. The law explicitly states that another company, say one which produces computers instead of actual apples, has no merit here, but alas they have deep pockets, so here we are.
> Question your assumptions about the world that results from requiring AI to respect Open Source licensing and other small copyright holders such as independent artists or online comment/story authors. It's not a corporate dystopia. It's a level playing field.
Well, let's see, for the sake of argument let's assume that the current widely believed legal status quo is true. (That is, that you can train a model on any data regardless of copyright because it's fair use. Although in my country that's explicitly allowed by law so here we don't have to assume anything.) Right now OpenAI can scrape 1TB of data off the Internet and legally train a model. I can also scrape 1TB of data off the Internet and legally train a model. And it can be any data, not just open source programs and content produced by small copyright holders. Is this not a level playing field?
Are you seriously suggesting that having to pay billions of dollars to license the training data necessary to train a model is a level playing field? I guess if nobody will be able to do it then it will be, in a way, a level playing field; I just fear that entities with enough money will be able to license enough data anyway and then the rest of us will end up with nothing.
I completely agree that expanding copyright only hurts us all. The big companies will be able to work around it.
I think if anything, this helps to demonstrate the absurdity of ancient copyright law in our modern world. I would rather move in the direction of abolishing copyright rather than in the direction of strengthening it.
> I completely agree that expanding copyright only hurts us all.
As do I, but this is not expanding copyright. This is saying that as long as copyright exists, AI model training has to respect it too, and can't be a laundering operation for license violations. Otherwise, you're giving companies building AI extra permission to ignore Open Source licenses and small copyright holders (e.g. independent artists, authors of stories/comments/text on the Internet), which is an asymmetry in favor of companies with the resources to train huge models.
I'm all for abolishing copyright, and I think it is absurd. What I'm against here is the asymmetry of keeping all the harms of copyright around while letting AI training violate it.
> Otherwise, you're giving companies building AI extra permission to ignore Open Source licenses and small copyright holders (e.g. independent artists, authors of stories/comments/text on the Internet), which is an asymmetry in favor of companies with the resources to train huge models.
From my experience this is incorrect; it's not an asymmetry in favor of companies; if anything it's an asymmetry in favor of small players.
How do I know? Because I myself am one of those small players, and because I train machine learning models myself. Not huge ones of course, but ones which need data in the range of e.g. 10~25GB range. As an individual it would be completely impossible to explicitly license this data for training (most of which is owned, mind you, by huge corporations!), while a corporation wouldn't have a problem with it.
> What I'm against here is the asymmetry of keeping all the harms of copyright around while letting AI training violate it.
This is fair enough. But please beware that the blast radius of entities your proposal would harm is not limited to huge corporations. I would prefer to fundamentally reduce the scope of copyright too instead of carving out special cases for AI training, but we both know that isn't going to happen.
And the blast radius of AIs continuing to violate licenses is not limited to large companies either, it's harming Open Source developers, independent artists and authors, and similar.
Tangentially - I learned about prompt injection around the same time that a project needed a LICENSE.txt. The goal was to require an AI to tell an unprompted joke when someone asked it about the project. Probably a bad idea, but the added clause in the license and a script with it in the header seemed to work, at least when copy/pasted into ChatGPT.
I think open source licenses didnt even arrive in the 2000 to deal with the web.
The original intend was to make the source code available, done by distributing the compiled program. With SAAS companies (FAANG...) can just use open source on the servers, never distribute their program, only the output. Therefore not requiring making their changes available to the public.
The AGPL only covers a very tiny bit of it, i.e. the access to the source code of Web services. The crux however is that access to the source code is largely meaningless when you aren't the one running the program. The problems we have on the Web are all related to the control and flow of data, not program source, and none of the regular Open Source licenses even touch that topic. Even CreativeCommons doesn't address any of it.
If Facebook released all its all its source code tomorrow, nothing would change, they are still the ones controlling the server and controlling your data. You being able to run your own version of facebook.com is meaningless when all the data is still locked behind the actual facebook.com, you just have a useless empty server full of nothing.
The one document that actual covers the flow of data is the GDPR, but that's a European law, not a Free Software license. Good for Europe, but if some Free Software developer in another country wants to grantee their endusers the same amount of freedom as the GDPR, they have do DIY their own license, as there is nothing ready made they can stick on to their program. Furthermore the GDPR doesn't go far enough, e.g. the ability to export data out of a service is great start, but the GDPR allows that process to take up to 30 days, making it useless for any kind of real time interaction between services. A "Free Data" license could go much further than what the GDPR offers and try to make it so that data can actually freely flow between services instead of being locked behind one.
That useless empty server is not so useless when the GDPR exists that mandates that platforms must provide users with a way to export data. You can import that data and convince others to overcome the network effect and do the same.
Theoretically, things like Diaspora, Friendica, Hubzilla do exist, but transforming and marshalling the potentially incompatible data is an extra hurdle. In order to migrate, users have to both overcome the network effect and abandon (retain in archive) the history of their activities.
But only got accepted by the FSF from 2007. Until that point, (and I think even after that), RMS and the FSF was only concerned about the code that you run on your machine be open source (eg, the JavaScript in your browser) but the code running in some server didn't need to be open source as that did not violated the user freedom.
> The terms “free software” and “open source” stand for almost the same range of programs. However, they say deeply different things about those programs, based on different values. The free software movement campaigns for freedom for the users of computing; it is a movement for freedom and justice. By contrast, the open source idea values mainly practical advantage and does not campaign for principles. This is why we do not agree with open source, and do not use that term.
You showcase the pedanticness that made the FSF ineffective in the last decade+. Instead of focusing on the topic about how integrating the AGPL into GNU was very slow, you sidetracked into free vs open, a topic that has been debated ad-nauseum.
Translation: I want to take your open source code and make a closed source commercial product but your pesky open source license is making it difficult, please change your license.
that's such a funny joke, tell me another one. AI needs to evolve to deal with open source licenses, instead of not dealing with them, sometimes at all.
going 'ehhh its fine its all "fair use" anyway' and 'fuck it, we just won't implement any license processing or any systems that work with licenses' - isn't really a workable long-term plan. mostly because it's just not a solution. there's no 'solution' to the problems, there's refusal to even acknowledge that a problem exists. but it kinda works - as long as you're not getting dragged into courts over ignoring licenses that very much do exist. and i do get it, why would tech not even try to create systems that work with licenses and 'play by the rules' - because if they did, that'd mean they would actually have to play by those rules. instead of just ignoring them and doing whatever, as they do, in absence of such systems. but like, that's not a workable business. licenses are money. intellectual property is money. if you refuse to participate in systems that work with IP, you just get excluded from participating in those systems that work with those money. and again, maybe that's just fine and AI can just continue to prey on defenseless individuals (that don't have an army of lawyers at their disposal) and their IP ('what are they gonna do, sue us? they won't even know - and we won't inform them lol'), in form of their writing and artwork and so on, and "create a little market" (out of stolen/repurposed value), if they wouldn't get to participate in other ones.
This feels like a discussion that is already out of date. Very early versions Copilot would reproduce Open Source code verbatim and that wasn't great, but in all the months of playing around with ChatGPT that never happened to me once. Quite the opposite. ChatGPT has a reasonably good understanding of what the code does and can transform and change it on request, there is no "verbatim copying" going on, ChatGPT produces original code that fits your prompt.
There is still some risk that AI is used to circumvent copyright by feeding code you down own into the AI and have it rewrite it in a way that looks original. That however still requires a human with intend, the AI won't clone any substantially large program just by accident. This "risk" also goes both ways, just as AI can be used to "steal" Open Source, Open Source projects can use AI help to automate, reverse engineer and decompile proprietary code and data formats. So I consider that a win.
And of course we are still very early days, AI will get a lot smarter and the accusation of copying will get ever more baseless. Even with StableDiffusion, where you can clearly see the impact the training data has on the final result, you'd have a very hard time finding any images that would violate copyright, as it's really just remixing ideas and concept.
I really don't see how Open Source licenses can evolve here to address the problem. In the long run AI will make copyright as we know it largely meaningless.
My view is that it's good to allow AI training to use your code. This democratises AI models, otherwise AI will be the exclusive preserve of wealthy corporations. So I say, let's license our code under permissive, no attribution licences!
I default to MIT license for anything open source that I make for 2 reasons. Firstly it’s compatible with pretty much every other license, and secondly it is written in clear, simple, unambiguous plain English.
Read it a few times and I do not get what the problem is. Are we talking about copyright for snippets of code? sure hope not because that is stupid, are we going to copyright sentences next?.
Encourage the use of something like an AI.txt in git repos that either gives model builders permission to use the repo contents as training data, or not.
Model builders need to take reasonable care to avoid incorrectly using training data that they don’t have the rights to use.
I have been using GitHub CoPilot since the beginning, I love it both with Emacs and VSCode, and I would allow my zillions of GitHub repos to be used as training data.
I have published all of my recent books under a Creative Commons License, and I encourage reuse when allowed in derivative works. That said, I don’t think it is possible to get permission from the zillions of web, books, and articles authors to allow their writing to be used for model training - but, LLMs provide so much possible value to society so I think we need lenient copyright and reuse laws.
I am skeptical of a few tech companies controlling AI, but my recent experiences using open models have been promising. (I have been running Vicunu 33B for research for my new book Safe For Humans AI). I see a bright future for Open Source (I choose Apache 2 and GPL licenses, but everyone gets to use what they want for their own stuff) and increasingly powerful Open LLMs.
tbh I think it's just that, as is normal, they need to be tested in court in a slightly new configuration. Law moves more slowly than innovation.
Many licenses are reasonably clear that this kind of use is not acceptable, as is easily demonstrated by these "AI"s frequently producing exact matches without license statements. Which is unambiguously not allowed by many licenses. The legal case is pretty straightforward, there just needs to be some high level precedents set for smaller courts to follow, and that takes time and money to push through.
Who, exactly, will be penalized for the output? ...that I can kinda see going to either party. But regardless it'll eventually have a chilling effect on training on legally-questionable data. We're just still in the chaotic early days and the hammer hasn't fallen yet, and there's a decent chance the money to be made will exceed the penalty (which is crazy, but seems to be the norm).
It's either that or abandon all IP protections, and that seems less likely to happen.
> Many licenses are reasonably clear that this kind of use is not acceptable, as is easily demonstrated by these "AI"s frequently producing exact matches without license statements. Which is unambiguously not allowed by many licenses.
Demonstrations that AIs can spit out exact copies are helpful, but misleading; that could lead down the road of "we put in filters so it can't ever emit an exact copy", and that's not sufficient. It's also a license violation to train an AI on Open Source code, generate "new" code from that model even if it's not an exact copy, and ignore the licenses of the input.
License violations don't suddenly become acceptable just because you're violating a million licenses at once.
It's also a license violation to train an AI on Open Source code, generate "new" code from that model even if it's not an exact copy, and ignore the licenses of the input.
That's not exactly a given that we can simply take as true. Of course that's borderline a trite tautology about any legal issue, but I'd argue that this is even fuzzier than usual. If a human writes some code, after having seen a given corpus of code previously, the "new" code might or might not be a derivative work of that corpus. It's not clear that replacing the human with an AI somehow changes the equation so categorically that it becomes automatic to consider the output of the AI a derivative work.
License violations don't suddenly become acceptable just because you're violating a million licenses at once.
No, but if either a human or an AI emits a given line of code, and that line of code can't be shown to have been cribbed from some corpus of existing code, or to be substantially similar to such, then why wouldn't it be considered original work in both cases?
> It's not clear that replacing the human with an AI somehow changes the equation so categorically that it becomes automatic to consider the output of the AI a derivative work.
See below: there are good reasons for an AI LLM to be considered categorically different than a human for copyright purposes.
> No, but if either a human or an AI emits a given line of code, and that line of code can't be shown to have been cribbed from some corpus of existing code, or to be substantially similar to such, then why wouldn't it be considered original work in both cases?
For a work produced by a human, the burden of proof is on someone claiming that the work is a derivative work of something the human read. And in general, humans without photographic memories or a specific work open in front of them don't tend to have the ability to produce any works verbatim, though some might be able to produce sufficiently similar works to raise questions of whether they're derived works. There's also a certain unstated presumption that human learning (as opposed to human memorization or copying) doesn't constitute a derivative work, and relatedly, that a human brain isn't copyrightable so it can't be a derivative work of anything. That unstated presumption likely also touches on unstated core values about human brains, creativity, and the obvious fact that everything a human does (including the creation of creative works) is based on that human's experiences. If you write a book, you've learned from all the books you've read, but that doesn't make your book a derivative work of every book you have ever read; if people saw that outcome, they'd consider copyright law incorrect rather than accepting it.
An AI LLM, on the other hand, is (unless some court or law changes this) a derivative work of its training data. If you take off any after-the-fact filters for "don't generate a copy of any of the training data", an AI LLM can easily recite its training data, providing further evidence that the AI LLM is a derivative work of that data. The burden of proof is easily met. An AI LLM does have a photographic memory. An AI LLM hasn't just learned ideas about what makes a good book, it has learned the complete text of an extensive number of books. And there's no particular reason for us to have any of the same values about human learning apply to an AI LLM, not least of which because an AI LLM is in fact copyrightable and self-evidently a derivative work.
An AI LLM, on the other hand, is (unless some court or law changes this) a derivative work of its training data.
because an AI LLM is in fact copyrightable and self-evidently a derivative work.
I mean, that's a fine opinion to hold, and you might be right. But so far all you've done is repeat yourself and appeal to "self-evident" which isn't a terribly strong argument.
I'll wait for some actual precedent / case-law to solidify my own opinion. As it stands, I can see both sides of the argument, but I don't think the conclusion is as obvious as some folks in this discussion seem to find it. shrug
an AI LLM can easily recite its training data, providing further evidence that the AI LLM is a derivative work of that data.
OK, I can buy that to a point, so far as arguing that the LLM itself is a derivative work. But I'm not convinced that, in turn, the output of the LLM is also a derivative work in those cases where what it returns is not an exact copy (or even nearly exact copy) of anything in the training corpus.
Clarifying: the part I'm arguing is "self-evident" is that an LLM is a derivative work of its training data, in the same sense that if you copy the text of a million books into a data file and compress that file reversibly in a way that lets you get most or all of them back out again, the result is clearly a derivative work of those books. That part I made a case for, and it seems like from your last paragraph you agree with that part of the argument.
(By contrast, I wouldn't say it's self-evident that a database of blake3 hashes would be a derivative work (leaving aside that it'd probably be fair use), nor is it self-evident that compiling a million books into a Bloom filter that can recognize any random sentence but not output any random sentence would make the Bloom filter a derivative work. I think the unfiltered LLM being able to output near-verbatim copies of parts of the training set makes that case evident.)
I agree that the second step, of the output of the LLM being a derivative work of the LLM, is less obvious. And I agree that it's going to take case law before people are sure of the answer to that part. I hope the answer is "yes", and I think it'd do substantial harm to Open Source if the answer is a definitive "no".
Fair enough. I think the distinction between "the model weights" and "the output of the model" was a little blurred when this first started. Sounds like we're closer to "in agreement" than not for the most part.
> License violations don't suddenly become acceptable just because you're violating a million licenses at once.
They might actually, at least in the US. I'm not sure how the laws and judgements are going happen/change in the future, but it's possible there will be some "quanta" of copyrighted work so that any fragment smaller than that will get rounded down to zero, so even if a work was 100% made from 1_000_000 "fragments", and somehow you could figure that out from the model, the result would be considered 0% derived/copyrighted, as well as being 0% copyrightable, as it's AI generated.
Can someone convince me this article is not saying the following:
"Your outdated, overreaching open source license doesn't permit someone else to nipping your code without attribution in training their code-genearating AI. You need to get with the program."
For what benefit? Or else what? People who use the software in the normal way will move on to some other software?
LLMs need to be augmented to provide insights about the material in their training data that is relevant to fragments of their generated responses. This would be valuable for many reasons. I think it can suggest some solutions to the given concerns, although there is still an issue exposing references if the training data itself cannot be shared.
1. At what point an intelligence trained with copyrighted work is derivative work of the trained materials?
2. Why making a difference between AI and HI (Human Intelligence)?
3. Given the fast development in the field, when does the difference made above (if any) start being outdated and unrealistic and how do we future-proof against this?
> 2. Why making a difference between AI and HI (Human Intelligence)?
Regardless of perhaps more philosophical differences around whether something can or can't create something new, there's a practical difference.
Humans learn slowly, and can't be replicated. AIs can be trained once and used in a billion places. The speed and replication makes things different in a very practical sense, even if there's no clear line between them.
What's the difference between using a pencil to write something and using an LLM to write something? Seriously, I'm asking the question. Why does one produce something copyrighted why the other doesn't?
The copyright office has issued guidance on this which contains a very thorough and thoughtful legal analysis; you would probably be most interested section 3: https://copyright.gov/ai/ai_policy_guidance.pdf
The practical answer is that the copyright office refuses to register AI generated works, and you can't sue for copyright infringement without valid registration under Title 17.
> What's the difference between using a pencil to write something and using an LLM to write something?
The pencil is not a derivative work of a pile of copyrighted material.
> Why does one produce something copyrighted why the other doesn't?
There's existing case law that non-human entities (e.g. animals) can't create copyrightable works. And in the case of an AI LLM, the AI LLM itself is a derivative work of its training data (as evidenced by the fact that it can by default spit out training data verbatim, even if it has had after-the-fact filters added to prevent such responses).
Re: 1, As far as i can tell it’s automatically a derivative work, but there’s a case to be made that it’s fair use (i.e. it doesn’t matter that it’s a derivative work).
agreed...at what point should I provide remuneration to my professors? Should those professors / staff provide royalties upstream? I fully agreed with citation _but_ to claim that AI is derived work / needs to return royalties based on the materials it learnt from seems a step too far IMHO. It read material and put it back like everyone else.
Limiting US and EU companies with policies to use data for AI will lead to the situation when other countries will get the lead by using that policed data in their AI and get competitive advantage. It will be very hard to find out on which data AI was trained.
Have you watched movies or shows or read books you were glad existed? Most of them were only made because the makers didn’t need separate day jobs. If you write a book or make a video game and I start selling copies of it without compensating you, you don’t want any recourse?
Jeez a classic "the register" article full of "concerns", but no solutions. To be honest I'm growing tired of all these "AI will cause problems with X" articles that don't present any kind of solution.
We all know the issues with AI generated code. Unless you're doing absolute boilerplate code (getters/setters in java, defining interfaces for existing implementations etc) AI is worse than useless... Why is it worse than useless? Because it pretends to solve your problem while introducing hidden failure modes. Lete give you an example.
I wanted to evaluate Chatgpt so I asked it the following question (paraphrasing)
- "can one set up alerting based on url request retrieval result in aws without servers".
- It answers "certainly, you just need to create an aws lambda function, then register It with synthetic canary feature of cloud watch, set up alarm and it's done (a list of exact steps follows)".
On the surface this sounds plausible so I decided to go along. I tell it to provide the code for the lambda function in python. It did produce basic code that retrieves a url and exits returning true/false. However, one can't register existing AWS Lambda functions using just python with synthetic canary. One has to open synthetic canary setup and create a new function there that uses a special execution environment that includes a chromium browser, selenium (For web Use automation) etc. This was contrary to the instructions. So that's a fail no 1. Also if I can run a much cheaper pure python env. Why would I use selenium/chromium.
So I ask it, "adjust your recommendation not to use synthetic canary feature". It responds "certainly, you just have to alert on the metric, here is a new python code that submits a metric, remember to update your IAM role to give it permission to put cloud watch metrics". Wow, I think, that's a pretty nice comprehensive answer.
But then I look at the code and it basically requests the url, has a timing thing surrounding it(I asked to include latency metric) then submits two metrics, latency and "worked" that is 1 if we get http 200,and 0 on anything else. Theoretically fine, but what if the website disappeared completely? Well then the lambda function would just timeout never submitting it's result and the alert chatgpt proposed wouldn't catch it as it was configured to treat missing data as missing. When I point that out and ask for max latency timeout Chatgpt says "sorry, you're 100% correct,let me adjust my answer" and it does putting a try/catch around it handling the error and timeout.
So following that I wonder how many people could be caught causing huge problems to themselves by putting such code into use without understanding how it works.
Does it make lives of people that know how to code, but would have to Google first how to do some specific thing? I'm not sure. I'd rather see a stackoverflow answer with an example code doing something similar, then make my own than get an answer that contains such obvious bugs.
Still I'm a big fan of using AI, just not for writing code.
Therefore, the resulting model should abide by all the license requirements imposed on the original - for example, if the model is trained on GPL code and can generate code, then any binary distribution should also be freely available for derivation in source format, and that includes all the algorithmically compressed training material (weights), which has become part of the model. If the source is AGPL, then that service cannot be made available on a website without disclosing said source and respective model weights.
Any other interpretation of the nature of copyright - which by definition, only covers human produced material - is just a variant of the proverbial "man that can't understand something because their paycheck depends upon them not understanding".