
Commenting to follow, curious about the answer.

From what I've found through Google (with no real understanding of LLMs), 2^16 is the max tokens per minute for fine-tuning OpenAI's models via their platform. I don't believe this is the same as the training token count.

Then there's the context token limit, which is 16k for GPT-3.5 Turbo, but I don't think that's relevant here.

Somebody please tell me why I'm wrong, though; I'm still trying to wrap my head around the training side.



You are right to be curious. The encoding used by both GPT-3.5 and GPT-4 is called `cl100k_base`, and the name correctly suggests that the vocabulary contains roughly 100K tokens.
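If you want to verify that yourself, tiktoken exposes the vocabulary size directly. A minimal sketch, assuming the tiktoken package is installed:

    # Check the cl100k_base vocabulary size with OpenAI's tiktoken package.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")  # encoding used by GPT-3.5 and GPT-4
    print(enc.n_vocab)                  # roughly 100K entries (including special tokens)
    print(enc.encode("Hello, world!"))  # token IDs for a sample string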


Amazing, thanks for the reply. I'm finding some good resources after a quick search for `cl100k_base`.

If you have any other resources (for anything AI-related), please share!


Their tokenizer is open source: https://github.com/openai/tiktoken

The data files that contain the vocabularies are listed here: https://github.com/openai/tiktoken/blob/9e79899bc248d5313c7d...
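If you're curious what the vocabulary entries actually look like, you can decode individual token IDs back to raw bytes. A small sketch, again assuming tiktoken is installed and using a few arbitrary IDs:

    # Peek at a few cl100k_base vocabulary entries by decoding token IDs to bytes.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    for token_id in [0, 1000, 50000, 100000]:  # arbitrary sample IDs
        print(token_id, enc.decode_single_token_bytes(token_id))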


GPT-2 and GPT-3 used p50k, right? Then GPT-4 used cl100k?
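One way to check is to ask tiktoken which encoding it maps each model name to. A rough sketch; the model names below are just examples:

    # Print which encoding tiktoken associates with a few model names.
    import tiktoken

    for model in ["gpt2", "text-davinci-003", "gpt-3.5-turbo", "gpt-4"]:
        enc = tiktoken.encoding_for_model(model)
        print(f"{model} -> {enc.name}")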




