In non-Latin text, if most characters encode as 2 bytes but a large minority encode as 1 byte, the branch predictor in charge of guessing between the different code point lengths gets trained to expect 2 bytes and mispredicts very often. Speculative execution (counting in two or three ways simultaneously) might mitigate the performance hit.
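To make the failure mode concrete, here's a sketch of the kind of per-character length dispatch being described (hypothetical code, not from any particular library; continuation-byte validation is omitted):

    #include <stddef.h>
    #include <stdint.h>

    /* Length of a UTF-8 sequence, chosen by branching on the lead byte. */
    static size_t utf8_seq_len(uint8_t lead) {
        if (lead < 0x80)           return 1;  /* ASCII           */
        if ((lead & 0xE0) == 0xC0) return 2;  /* 2-byte sequence */
        if ((lead & 0xF0) == 0xE0) return 3;  /* 3-byte sequence */
        return 4;                             /* 4-byte sequence */
    }

    /* Counts code points one at a time; when the input mixes 1- and
       2-byte characters unpredictably, the branches above mispredict. */
    size_t count_codepoints_branchy(const uint8_t *s, size_t n) {
        size_t count = 0;
        for (size_t i = 0; i < n; i += utf8_seq_len(s[i]))
            count++;
        return count;
    }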


> In non-Latin text, if most characters encode as 2 bytes but a large minority encode as 1 byte, the branch predictor in charge of guessing between the different code point lengths gets trained to expect 2 bytes and mispredicts very often

You wouldn't want to process a single code point (or code unit) at a time anyway; you'd process 16, 32, or 64 code units (or bytes) at once.

That UTF-8 strlen I wrote had no mispredicts, because it was vectorized.
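For reference, a minimal sketch of the standard branchless way to do this (not the actual code referred to above): every code point contains exactly one non-continuation byte, so counting those needs no length branch, and compilers auto-vectorize the loop.

    #include <stddef.h>
    #include <stdint.h>

    /* Code points = bytes whose top two bits are not 10xxxxxx
       (continuation bytes). No data-dependent branch, so nothing
       to mispredict, and the loop vectorizes. */
    size_t utf8_codepoint_count(const uint8_t *s, size_t n) {
        size_t count = 0;
        for (size_t i = 0; i < n; i++)
            count += (s[i] & 0xC0) != 0x80;
        return count;
    }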

Indexing by code point is slow, but the difference from UTF-16 is not significant.

I guess locale-based comparisons or case-insensitive operations could be slow, but then again, they'll need a slow table lookup anyway.

Which string operation(s) are you talking about?


You don't need to check the representation when doing anything specific with spaces or newlines. Every 0x0A byte in UTF-8 is a newline character and every 0x20 byte is a space; those byte values never occur inside a multi-byte sequence, because continuation bytes are all 0x80 or above.
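For example, counting lines can stay a raw byte scan with no decoding (a sketch; the function name is made up):

    #include <stddef.h>
    #include <string.h>

    /* 0x0A only ever appears as an ASCII newline in valid UTF-8,
       so the scan never needs to decode a code point. */
    size_t count_lines(const char *buf, size_t n) {
        size_t lines = 0;
        const char *p = buf, *end = buf + n;
        while ((p = memchr(p, '\n', (size_t)(end - p))) != NULL) {
            lines++;
            p++;
        }
        return lines;
    }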

The only place you really need to decode UTF-8 is when you convert the text to another encoding (which you hopefully won't need to do anymore in the far future) or display it (where the decoding is a minuscule factor in performance).



