
Kind of, but character-based tokens make it a lot harder and more expensive to learn semantics.


Source: Generative Deep Learning by David Foster, 2nd edition, published in 2023. From “Tokenization” on page 134.

“If you use word tokens: …. will never be able to predict words outside of the training vocabulary.”

“If you use character tokens: The model may generate sequences of characters that form words outside the training vocabulary.”
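
To make the tradeoff in those two quotes concrete, here is a minimal sketch in plain Python. It is not from the book: the toy corpus, the `<unk>` placeholder, and the helper names `word_tokens` and `char_tokens` are all made up for illustration, and real tokenizers are more involved.

```python
# Toy "training" corpus (an assumption for illustration only).
corpus = "the cat sat on the mat"

# Vocabularies are built only from the training text.
word_vocab = set(corpus.split())
char_vocab = set(corpus)

def word_tokens(text):
    # Any word missing from the training vocabulary collapses to <unk>.
    return [w if w in word_vocab else "<unk>" for w in text.split()]

def char_tokens(text):
    # Any character missing from the training vocabulary collapses to <unk>.
    return [c if c in char_vocab else "<unk>" for c in text]

unseen = "the cat ate"  # "ate" never appears in the corpus
word_result = word_tokens(unseen)
char_result = char_tokens(unseen)

print(word_result)  # ['the', 'cat', '<unk>']
print("".join(char_result))  # the cat ate
print(len(word_tokens(corpus)), len(char_tokens(corpus)))  # 6 22
```

The word-level vocabulary cannot represent "ate" at all, while the character-level one spells it out from letters it has already seen. The cost is sequence length: the same sentence is 6 word tokens but 22 character tokens, and each character carries far less meaning on its own, which is roughly why semantics gets harder and more expensive to learn.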



