
That's path dependence [0]. When all of those languages were conceived in the nineties, 2-byte UCS-2 seemed to be enough to store all Unicode code points.

UTF-16 came only later, once it was clear that 65,536 code points were too few.

Had those languages been designed in the last 10 years, all of them would have picked UTF-8 as their internal string encoding.
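To illustrate the trade-off behind that choice, here's a quick Python sketch comparing the byte cost of the same text under UTF-8 and UTF-16 (the sample strings are my own, not from the thread):

```python
# Byte cost of identical text under UTF-8 vs. UTF-16.
ascii_text = "hello world" * 100              # typical source code / markup
cjk_text = "\u4f60\u597d\u4e16\u754c" * 100   # "hello world" in Chinese

for label, text in [("ASCII", ascii_text), ("CJK", cjk_text)]:
    utf8 = len(text.encode("utf-8"))
    utf16 = len(text.encode("utf-16-le"))  # -le variant: no BOM prepended
    print(f"{label}: utf-8={utf8} bytes, utf-16={utf16} bytes")

# ASCII-heavy text: UTF-8 is half the size of UTF-16.
# CJK text: UTF-8 costs 3 bytes/char vs. UTF-16's 2, so it's 1.5x larger.
```

For the web (mostly ASCII markup wrapping any human language), the UTF-8 side of this trade usually wins, which is part of why newer languages default to it.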

[0]: https://en.wikipedia.org/wiki/Path_dependence



Some JavaScript runtimes (Firefox's SpiderMonkey, for one) have an optimization that stores some strings in a single-byte format where possible, to mitigate the cost of the awful original choice to use UCS-2 for JS strings. I expect some other runtimes do this too, but I don't know any off-hand.

IIRC this was motivated by Firefox OS (strings eat up a lot of RAM on memory-starved $50 smartphones) but it pays off on desktops too.


Python as of 3.3 uses any of three different internal storage mechanisms for strings: 1-byte (latin-1), 2-byte (UCS-2) or 4-byte (UCS-4) depending on the width of the highest code point in the string. This allows the internal storage to always be fixed-width, while still saving space for strings which contain, say, only code points representable in a single byte.
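You can observe this from sys.getsizeof: widening the widest code point in a string bumps the per-character storage. (Exact sizes vary by CPython version; the relative growth is the point.)

```python
import sys

ascii_s  = "a" * 1000            # fits in 1 byte per char (latin-1 storage)
bmp_s    = "\u0101" * 1000       # needs 2 bytes per char (UCS-2 storage)
astral_s = "\U0001F600" * 1000   # needs 4 bytes per char (UCS-4 storage)

# All three have len() == 1000, but very different footprints:
print(sys.getsizeof(ascii_s))    # ~1000 bytes + object header
print(sys.getsizeof(bmp_s))      # ~2000 bytes + object header
print(sys.getsizeof(astral_s))   # ~4000 bytes + object header
```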

Prior to 3.3, the internal storage of Unicode was determined by a flag during compilation of the interpreter; a "narrow" compiled interpreter would use 2-byte strings with surrogate pairs for non-BMP code points, and a "wide" compiled interpreter would use 4-byte strings.
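For reference, the surrogate-pair arithmetic a narrow build applied to non-BMP code points (the same mapping UTF-16 itself uses) can be sketched as:

```python
def to_surrogate_pair(cp):
    """Split a non-BMP code point (>= 0x10000) into a UTF-16 surrogate pair."""
    assert cp >= 0x10000
    cp -= 0x10000                   # 20 significant bits remain
    high = 0xD800 + (cp >> 10)      # top 10 bits -> high (lead) surrogate
    low = 0xDC00 + (cp & 0x3FF)     # bottom 10 bits -> low (trail) surrogate
    return high, low

# U+1F600 (an emoji) splits into D83D DE00 -- on a narrow build,
# len('\U0001F600') was therefore 2.
print([hex(u) for u in to_surrogate_pair(0x1F600)])
```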


V8 and JavaScriptCore do it, I believe.


Java has taken a couple different shots at this, going back a decade or more, and the newer option is currently enabled in Java 9.

Some background: https://stackoverflow.com/q/8833385/149138



