> The model is trained in bfloat16 on 1T tokens of code (~200B tokens over 5 epochs, including linear cooldown) for 30 programming languages, drawn from a subset of permissively licensed code in BigCode's Stack Dedup V2 dataset and dev-oriented samples from StackExchange.
Following the link to the "Stack Dedup V2" page: https://huggingface.co/datasets/bigcode/the-stack-dedup
> The Stack contains over 6TB of permissively-licensed source code files covering 358 programming languages. The full list can be found here.
https://huggingface.co/datasets/bigcode/the-stack-dedup/blob...
Viewing the JSON file requires logging in.
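For anyone who wants to poke at the data itself, here's a rough sketch of streaming one language subset with the `datasets` library. It assumes you've accepted the gated-dataset terms on the Hub and logged in (e.g. via `huggingface-cli login`); the `data_dir` layout and the `content` field are from my reading of the dataset card, so treat them as assumptions:

    # Sketch: stream the Python subset of the-stack-dedup.
    # Assumes the gated-dataset terms were accepted on huggingface.co
    # and a token is available from a prior `huggingface-cli login`.
    from datasets import load_dataset

    ds = load_dataset(
        "bigcode/the-stack-dedup",
        data_dir="data/python",  # per-language subsets live under data/<lang>
        split="train",
        streaming=True,          # avoids downloading the multi-TB dataset
    )

    for example in ds:
        print(example["content"][:200])  # "content" holds the raw source file
        break

Streaming is the practical option here, since the full dataset is over 6TB.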