Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

Yeah this is a shortcoming of the tokenizer. It splits things up in ways that are not 1:1 mappable back to their source unfortunately.

I did a bit of post-processing to get it formatted a bit better (re-combining the “would“ and “n’t” tokens and changing html tags to markdown for example) but there’s still room for improvement.

Spacing specifically is different based on the context. Outside of code blocks you want a space after a period. Inside you probably don’t. But since the tokenizer has one in both places there’s no opportunity for the neural net to learn this (it can’t see any difference). And my naive formatted doesn’t know the difference either. (If you’re curious you can find it in the JS file)



I updated my regexes to clean up some of the tokenizer noise last night. So many of the formatting in the code snippets should look a bit more natural now.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: