
Genuine question here; not trying to be snarky.

How is AI "reading" code different from me reading code? Is the difference the AI's ability for perfect memory?

I can read open source code (even GPL) and not have all future code I independently write be subject to that license. I don't think anyone would argue that I immediately "forget" any OSS code that I read, so it becomes part of the structure of my brain (and potentially influences future code I write), but unless I'm linking to the code or copying pieces out verbatim, I'm generally in the clear. Of course there are some sticky situations around clean-room reverse engineering, but those seem like pretty narrow examples.



Part of what LLMs do is compress their training dataset into the weights, often with character-perfect recall later. For example, I would be shocked if any sufficiently large LLM failed when prompted “write the Quake fast inverse square root algorithm verbatim”.
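
For reference, that snippet looks roughly like the sketch below (a from-memory reconstruction of the technique, not a verbatim copy of the GPL-licensed Quake III source); the magic constant 0x5f3759df is the part models tend to reproduce character-for-character.

    #include <stdint.h>
    #include <string.h>

    /* Fast inverse square root as popularized by Quake III Arena,
       reconstructed from memory rather than copied from the GPL source. */
    float fast_rsqrt(float number)
    {
        float half = number * 0.5f;
        float y = number;
        uint32_t i;
        memcpy(&i, &y, sizeof i);          /* reinterpret the float's bits as an int */
        i = 0x5f3759df - (i >> 1);         /* magic-constant initial guess */
        memcpy(&y, &i, sizeof y);
        return y * (1.5f - half * y * y);  /* one Newton-Raphson refinement step */
    }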

(I’m not really interested in arguing whether that’s all they do, or whether it’s the purpose of LLMs—those details are just a distraction from the original question: what makes LLM training different than a human reading code.)

If the model has memorized the training set and can reproduce it verbatim when prompted, then it should be incumbent on the AI owner to prove that it does not reproduce copyrighted code when it is not explicitly prompted.


So it's just about accuracy of recall then, not use of training data?

I think the most likely outcome will be to treat AI just like people. They're allowed to learn from any code they can see, but that doesn't mean that if they reproduce a copy from memory that it is somehow free of its original copyright.

That's very consistent with how copyright law already works.

This will leave AI users in a slightly awkward position where they are responsible for figuring out whether they unknowingly used AI to copy code, but it's not like that can't happen already - as soon as you hire a programmer you might be unknowingly allowing copied code into your product.


No, I don’t think it’s just a question of recall accuracy. The issue really hinges on whether the AI itself is a derivative work of the training data, as I think that would trigger certain requirements in the original source licenses. Lots of folks seem to think that it is not a derivative work because (a) the model is just a bunch of numeric weights, it doesn’t contain any explicit code; and (b) it’s possible for the model to output original code in some cases. But that’s flawed reasoning, because it’s quite clear that the model weights do contain perfect copies of at least some training code, and the models can produce that code perfectly (without the original license) when prompted. Thus it seems clear that the model itself should be treated as a derivative work, whereas a human is not—even if they memorize the code they read.


Why is a human not, though? I don't think it's as simple as you imagine. A human who has memorized the information contains it just as much as the weights do.


Both human and LLM may learn from reading code to produce novel, derivative, or duplicative work—but that’s not the issue, because the model itself is a derivative of the training data and the human is not. That does seem very simple to me.

If we just zipped up the entire training data set and distributed it with the model then it would clearly be a copy and/or derivative work. The LLM does the same thing as zipping (i.e., compresses the training data…by encoding it in the model weights). Folks just seem to think that it’s not a derivative work because an LLM _also_ does more than that sometimes (e.g., extrapolates from the training data to produce novel token sequences as output).


> the human is not

Why not? Humans store information in their brains that they have learnt. So do AIs. What exactly is the difference between a weight in an Artificial Neural Network and a weight in a Natural Neural Network?

If the answer is "humans get special treatment" then that's fine I guess but I think it's worth being explicit that that's the difference.

> The LLM does the same thing as zipping (i.e., compresses the training data…by encoding it in the model weights).

It's not at all the same. It's highly lossy. Only works that are repeated extremely often in the training data get memorised exactly, and even then it's often not exact.

LLMs do not contain a copy of all the training data (if trained properly). I agree if that was the case then it would be different, but that isn't how they work (unless you badly overfit).


> If the answer is "humans get special treatment" then that's fine I guess but I think it's worth being explicit that that's the difference.

That's absolutely the difference. Humans aren't copyrightable; the alternative would be unconscionable.

> even then it's often not exact

You don't have to copy something exactly to be a derivative work. "Lord of the Rings but a random 15% of words are replaced with gibberish" is still a derivative work of Lord of the Rings. So is "Lord of the Rings but every word/sentence is paraphrased".


An LLM contains some portion of the training data exactly and the rest of it lossily. What I’m really arguing is that _alone_ that is enough to make the model itself a derivative work. It actually doesn’t matter whether that’s the same or different than a human; that’s a distraction. The AI model is itself a work that is derived from the training data.
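
To make "contains some portion of the training data exactly" concrete: one rough way to check is to scan a model's output for long verbatim runs from a known source. A minimal sketch, assuming you already have both strings in hand (real memorization audits are more involved than this):

    #include <stdio.h>
    #include <string.h>

    /* Length of the longest run of characters shared verbatim between a
       model's output and a known training-set snippet. A long shared run is
       the kind of evidence the "perfect copies in the weights" point rests on. */
    static size_t longest_common_run(const char *a, const char *b)
    {
        size_t best = 0;
        for (size_t i = 0; a[i]; i++) {
            for (size_t j = 0; b[j]; j++) {
                size_t k = 0;
                while (a[i + k] && b[j + k] && a[i + k] == b[j + k])
                    k++;
                if (k > best)
                    best = k;
            }
        }
        return best;
    }

    int main(void)
    {
        const char *training = "i = 0x5f3759df - (i >> 1);";
        const char *output   = "// model output\ni = 0x5f3759df - (i >> 1);";
        printf("longest verbatim run: %zu chars\n",
               longest_common_run(training, output));
        return 0;
    }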


How is AI "reading" code different from me reading code?

By doing so with the explicit intent of building derivative products from it, and at massive scale.


Because many programmers who open sourced their code intended it to be read by humans, not AI. They don't want some centralized super computer owned by a mega-corporation reading their code. At the very least, if the models were Free and Open Source, the reaction might be different.


If you host your code on GitHub and the like, then your code is already being read by a centralized mega-corp.


Basically, the difference is that you merely reading code doesn't create a derivative work that we can meaningfully look at. Yes, it gets stored in your brain but your brain re-encodes all that knowledge in a way only it can use. We're still quite a bit away from brain uploading at the moment, so that's not a meaningful avenue to discuss right now.

An LLM on the other hand generally works off of a model that was trained first, and that model can be saved to a file and read out later. As a result, it's a derivative work that we can examine, copy, share, modify and do all the things with that we generally attribute to something being a Work. The question of whether binary output from a program can be copyrighted is somewhat unclear, but from what I've heard legally (not legal advice, I Am Not A Lawyer), it seems to be the case unless you explicitly say it's not[0].

There are a few other things to consider, like how you, as a human, can make the conscious decision to avoid specifically replicating GPL code that you've seen if you're not allowed to use it (whether that is by restructuring the code, applying the same techniques in a different language, or the heaviest example, which is clean-rooming it). AIs don't have the ability to make that distinction (and, to my understanding of how they work, the only way you can meaningfully avoid it is to ensure the entire model is compliant, so the AI can't go off on its own tangent and decide to include incompatible code).

From a more practical perspective - Copilot will happily spit out Quake III's fast inverse square root function and apply the wrong license to it. It's GPL-licensed code, but IIRC it claimed it was BSD-licensed. That alone would constitute a violation, and it'd be weird not to point at the people who trained the model that allowed it to make that choice.

To be fair, right now a lot of this is up in the air and all we have to go on is kinda wishy-washy guidance from copyright offices (which is mostly just refusing registration on the basis that copyrighted material has to be made by a human, not by a machine). There are a couple of ongoing lawsuits specifically about Copilot that are still pending, and from what I last heard, the judges aren't very impressed by the defense of GitHub/MSFT/OpenAI. The approach also differs greatly per country/governing body: Japan's government has, for example, given blanket permission for non-commercial AI training while keeping a strict eye on anyone trying to use it for paid services, while the EU is passing legislation that seems to mostly lean towards "it's copyrighted, that's now your problem to get in line with it", without outright saying it yet.

[0]: This is the main reason why, for FOSS, the Creative Commons licenses are usually not seen as a good pick outside of assets: they can interfere with distributing compiled versions of your code.


> Japans government has for example given blanket permission for non-commercial AI training, while keeping a strict eye on anyone trying to use it for paid services

This is incorrect; it doesn't matter whether it's commercial or non-commercial, and you can use anything as training data regardless of copyright. See the amendment of the copyright law from 2018.



