
Not really.

UTF-8 is self-synchronizing, which means you can treat it as a byte string for most operations, including finding substrings. You don't need to convert UTF-8 to a sequence of code points for most tasks (particularly if you drop the insistence on working at character boundaries). When you do have to decode, you're usually applying a complex Unicode algorithm like case conversion, so the branch misprediction overhead of decoding characters is likely small compared to the cost of the algorithm itself.
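
To make that concrete, here's a rough sketch in Python (my choice of language; the helper names `is_continuation`, `find_utf8` and `sync_to_boundary` are made up for illustration, not from any library). It assumes both the haystack and the needle are valid UTF-8. The search is a plain byte search with no decoding, and the match still lands on a code point boundary because lead bytes and continuation bytes (`10xxxxxx`) never look alike.

```python
def is_continuation(byte: int) -> bool:
    # UTF-8 continuation bytes always have the bit pattern 10xxxxxx.
    return (byte & 0xC0) == 0x80


def find_utf8(haystack: bytes, needle: bytes) -> int:
    # Plain byte-level substring search; nothing is decoded to code points.
    # Returns the byte offset of the first match, or -1 if there is none.
    return haystack.find(needle)


def sync_to_boundary(data: bytes, i: int) -> int:
    # From an arbitrary byte offset, step back to the start of the
    # code point that contains it (at most 3 steps for valid UTF-8).
    while 0 < i < len(data) and is_continuation(data[i]):
        i -= 1
    return i


text = "naïve café, 日本語".encode("utf-8")
needle = "café".encode("utf-8")

pos = find_utf8(text, needle)
# The match starts on a lead byte, so slicing and decoding it is safe.
match = text[pos:pos + len(needle)].decode("utf-8")

# Offset 15 falls in the middle of a 3-byte character; resync backwards.
resynced = sync_to_boundary(text, 15)
```

If you drop into the middle of a buffer (say, after a seek), `sync_to_boundary` is all you need to get back onto a character start, which is the practical payoff of self-synchronization.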


