
Anything in the range U+0800 to U+FFFF takes three bytes per character in UTF-8 and two in UTF-16 (http://en.wikipedia.org/wiki/Comparison_of_Unicode_encodings...):

"Therefore if there are more characters in the range U+0000 to U+007F than there are in the range U+0800 to U+FFFF then UTF-8 is more efficient, while if there are fewer then UTF-16 is more efficient. "

That same page also states: "A surprising result is that real-world documents written in languages that use characters only in the high range are still often shorter in UTF-8, due to the extensive use of spaces, digits, newlines, html markup, and embedded English words", but I think the "[citation needed]" tag there was rightly added (though the sizes may well be close for many texts).
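For anyone who wants to check the break-even rule on their own text, here is a rough Python sketch. The helper name encoded_sizes and the sample strings are made up for illustration, and "utf-16-le" is used so the byte order mark is not counted. Characters in U+0080 to U+07FF take two bytes in both encodings, so they do not shift the comparison either way.

```python
def encoded_sizes(text):
    """Return (UTF-8 byte count, UTF-16 byte count without a BOM)."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


# Hypothetical samples, chosen only to illustrate the ranges discussed above.
ascii_only = "hello"                # U+0000..U+007F: 1 byte in UTF-8, 2 in UTF-16
cjk_only = "日本語"                  # U+0800..U+FFFF: 3 bytes in UTF-8, 2 in UTF-16
mixed = '<p class="x">日本語</p>'    # 17 ASCII markup characters plus 3 high-range ones

for label, sample in [("ascii", ascii_only), ("cjk", cjk_only), ("mixed", mixed)]:
    utf8_bytes, utf16_bytes = encoded_sizes(sample)
    print(f"{label}: UTF-8 = {utf8_bytes} bytes, UTF-16 = {utf16_bytes} bytes")
```

The pure CJK string is smaller in UTF-16 (6 bytes against 9), but once the markup is wrapped around it UTF-8 wins (26 bytes against 40). That is only a toy case, not evidence about real-world documents; running the helper over an actual page would be the way to test the quoted claim.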


