> The model is trained in bfloat16 on 1T tokens of code (~200B tokens over 5 epochs, including linear cooldown) for 30 programming languages, drawn from a subset of permissively licensed code in BigCode's Stack Dedup V2 dataset and dev-oriented samples from StackExchange.
Following the link to the "Stack Dedup V2" page: https://huggingface.co/datasets/bigcode/the-stack-dedup
> The Stack contains over 6TB of permissively-licensed source code files covering 358 programming languages. The full list can be found here.
https://huggingface.co/datasets/bigcode/the-stack-dedup/blob...
Viewing the JSON file requires logging in.
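For anyone who wants to poke at the data itself, here's a rough sketch of streaming one language subset with the `datasets` library. It assumes you've accepted the gated-dataset terms on the Hub and logged in (e.g. via `huggingface-cli login`); the `data_dir` layout and the `content` field are from my reading of the dataset card, so treat them as assumptions:

    # Sketch: stream the Python subset of the-stack-dedup.
    # Assumes the gated-dataset terms were accepted on huggingface.co
    # and a token is available from a prior `huggingface-cli login`.
    from datasets import load_dataset

    ds = load_dataset(
        "bigcode/the-stack-dedup",
        data_dir="data/python",  # per-language subsets live under data/<lang>
        split="train",
        streaming=True,          # avoids downloading the multi-TB dataset
    )

    for example in ds:
        print(example["content"][:200])  # "content" holds the raw source file
        break

Streaming is the practical option here, since the full dataset is over 6TB.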