Why should the money go to the code authors in the first place? All training dat...

Macha · on June 21, 2022

Unless something has changed, the training data also includes copyleft code, not just permissively licensed code.

visarga · on June 21, 2022

Regarding the training of the model - I don't think copyright can restrict reading, and training is reading, not distributing any original data.

About deploying the model - it just needs to filter out verbatim exact snippets so it only outputs original, unattributable code. That can be done by hashing n-grams of the training code and checking generated output against a Bloom filter of those hashes. The vast majority of code generated by Codex is original anyway.
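
A rough sketch of what that n-gram plus Bloom filter check could look like, in Python. This is only an illustration, not how Codex or Copilot actually filters output. The names (`BloomFilter`, `ngrams`, `index_corpus`, `looks_verbatim`), the whitespace tokenization, the filter size, and the 50% threshold are all assumptions made up for the example.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: no false negatives, a small chance of false positives."""

    def __init__(self, num_bits=1 << 20, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive several bit positions from one SHA-256 digest (double hashing).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def ngrams(code, n=8):
    # Assumption: split on whitespace; a real system would use a proper tokenizer.
    tokens = code.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def index_corpus(snippets, n=8):
    """Add every n-gram of every training snippet to the filter."""
    bloom = BloomFilter()
    for snippet in snippets:
        for gram in ngrams(snippet, n):
            bloom.add(gram)
    return bloom


def looks_verbatim(generated, bloom, n=8, threshold=0.5):
    """Flag output when at least `threshold` of its n-grams appear in the filter."""
    grams = ngrams(generated, n)
    if not grams:
        return False  # too short to judge
    hits = sum(1 for gram in grams if gram in bloom)
    return hits / len(grams) >= threshold
```

Usage would be something like `bloom = index_corpus(training_snippets)` once, then `looks_verbatim(candidate, bloom)` on each completion before showing it. The Bloom filter keeps memory bounded at the cost of occasionally flagging original code as a match, which is the safer direction for this use.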

By the way, Codex is good for many other tasks, like parsing the fields of a receipt, extracting the summary of an email, or generating baby names. It's an all-purpose NLP tool; just call it like a function. Code completion is just one thing it does. It writes pretty good English and can compose poems.

CryZe · on June 21, 2022

> it just needs to filter out verbatim exact snippets so it only outputs original, unattributable code.

That's a setting now.

mtlynch · on June 21, 2022

>All training data is available under permissive licenses. Assuming you're not overfitting on specific code sequences (which would require attribution - and yes, I'm aware Copilot is not immune to this problem and it needs fixing), I'd say this is fair play.

Copilot isn't honoring the license, so why does it matter whether it was under a restrictive or permissive license?