From what I've found through Google (with no real understanding of LLMs), 2^16 is the max tokens per minute for fine-tuning OpenAI's models via their platform. I don't believe this is the same as the training token count.
Then there's the context token limit, which is 16k for 3.5 Turbo, but I don't think that's relevant here.
If I've got that wrong, somebody please tell me why; I'm still trying to wrap my head around the training side.
You are right to be curious. The encoding used by both GPT-3.5 and GPT-4 is called `cl100k_base`, and the name correctly suggests a vocabulary of roughly 100K tokens.
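If you want to verify this yourself, here's a minimal sketch using OpenAI's `tiktoken` library (the exact counts and token IDs shown in the comments are what I'd expect, but check the output on your own machine):

```python
import tiktoken  # pip install tiktoken

# Load the tokenizer used by GPT-3.5 and GPT-4.
enc = tiktoken.get_encoding("cl100k_base")

# Vocabulary size: roughly 100K, hence the name.
print(enc.n_vocab)  # 100277 (includes special tokens like <|endoftext|>)

# Encoding a string returns token IDs drawn from that vocabulary.
print(enc.encode("Hello, world!"))  # e.g. [9906, 11, 1917, 0]
```

Note that this vocabulary size is a property of the tokenizer, entirely separate from the context window (16k for 3.5 Turbo) and from any rate limits on the fine-tuning API.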