Both a human and an LLM may learn from reading code to produce novel, derivative, or duplicative work, but that's not the issue, because the model itself is a derivative of the training data and the human is not. That does seem very simple to me.

If we just zipped up the entire training data set and distributed it with the model, then it would clearly be a copy and/or derivative work. The LLM does the same thing as zipping (i.e., compresses the training data by encoding it in the model weights). Folks just seem to think that it's not a derivative work because an LLM _also_ does more than that sometimes (e.g., extrapolates from the training data to produce novel token sequences as output).
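
To make the compression point concrete, here's a minimal sketch of the standard model-as-compressor equivalence: a model's cross-entropy on a text is, up to rounding, the number of bits an arithmetic coder driven by that model would need to encode it. This assumes the Hugging Face transformers library and the public GPT-2 weights, both just illustrative choices.

    # A model's log-loss on a text equals (up to rounding) the size in bits
    # an arithmetic coder driven by that model would need to encode it.
    # Assumes Hugging Face `transformers` and GPT-2 (illustrative choices).
    import math
    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

    text = "It was the best of times, it was the worst of times"
    ids = tokenizer(text, return_tensors="pt").input_ids

    with torch.no_grad():
        # With labels=input_ids the model returns mean cross-entropy in nats.
        nats_per_token = model(ids, labels=ids).loss.item()

    n_predicted = ids.shape[1] - 1              # first token isn't predicted
    total_bits = nats_per_token * n_predicted / math.log(2)
    raw_bits = len(text.encode("utf-8")) * 8
    print(f"{total_bits:.0f} bits vs {raw_bits} bits raw")

The better the weights "know" a passage, the fewer bits are needed, which is the precise sense in which training encodes the data into the model.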



> the human is not

Why not? Humans store information in their brains that they have learnt. So do AIs. What exactly is the difference between a weight in an Artificial Neural Network and a weight in a Natural Neural Network?

If the answer is "humans get special treatment" then that's fine I guess but I think it's worth being explicit that that's the difference.

> The LLM does the same thing as zipping (i.e., compresses the training data…by encoding it in the model weights).

It's not at all the same: it's highly lossy. Only works repeated many times in the training data get memorised, and even then it's often not exact.

LLMs do not contain a copy of all the training data (if trained properly). I agree that if that were the case it would be different, but that isn't how they work (unless you badly overfit).
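
This is easy to probe directly: prompt with the opening of a heavily repeated passage and greedy-decode. A rough sketch, again assuming the Hugging Face transformers library and GPT-2, with an illustrative prompt:

    # Rough memorisation probe: prompt with a famous prefix, greedy-decode,
    # and check whether the continuation comes back verbatim.
    # Assumes Hugging Face `transformers` and GPT-2 (illustrative choices).
    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

    prefix = "We hold these truths to be self-evident, that all men"
    ids = tokenizer(prefix, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model.generate(ids, max_new_tokens=20, do_sample=False,
                             pad_token_id=tokenizer.eos_token_id)
    print(tokenizer.decode(out[0]))
    # Heavily repeated passages tend to come back verbatim; obscure text
    # generally does not.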


> If the answer is "humans get special treatment" then that's fine I guess but I think it's worth being explicit that that's the difference.

That's absolutely the difference. Humans aren't copyrightable; the alternative would be unconscionable.

> even then it's often not exact

You don't have to copy something exactly for it to be a derivative work. "Lord of the Rings but a random 15% of words are replaced with gibberish" is still a derivative work of Lord of the Rings. So is "Lord of the Rings but every word/sentence is paraphrased".
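
A toy illustration of that point, using only the Python standard library (the sentence is just an example): a mechanically garbled copy is lossy, yet still plainly derived from its source.

    # Replace roughly 15% of words with gibberish; the result is lossy but
    # still recognisably a copy of the original sentence.
    import random
    import string

    def degrade(text: str, rate: float = 0.15, seed: int = 0) -> str:
        rng = random.Random(seed)
        words = text.split()
        for i, w in enumerate(words):
            if rng.random() < rate:
                words[i] = "".join(rng.choices(string.ascii_lowercase, k=len(w)))
        return " ".join(words)

    print(degrade("In a hole in the ground there lived a hobbit."))
    # prints the sentence with a few words replaced by random letters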


An LLM contains some portion of the training data exactly and the rest of it lossily. What I'm really arguing is that that _alone_ is enough to make the model itself a derivative work. It doesn't actually matter whether that's the same as or different from what a human does; that's a distraction. The AI model is itself a work that is derived from the training data.



