He has written and uses (at least for the last edition) a Racket library / langu...

gwern · on April 5, 2021

Incidentally, this is no longer necessary. Gwern.net used the same trick but we removed it last week (to much rejoicing).

The main reason to auto-add soft hyphens is that, unlike almost all other browsers, desktop Chrome went for over a decade without supporting hyphenation (even though mobile Chrome does!). But they finally shipped support last year, and now market support for hyphenation is 95%+ according to CanIUse. So you can just drop the soft hyphen pass and rely on normal CSS for hyphenation and justification.

This is good because it simplifies HTML creation, makes the HTML noticeably smaller and better-compressing, makes it more readable, makes search/replace more reliable, stops buggy screen readers from pronouncing soft hyphens (another real WTF moment for me), and doesn't require hacks like the JS copy listener to strip them out.
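To make the search/replace and size points concrete, here is a minimal Python sketch. The `strip_soft_hyphens` helper is a hypothetical name for illustration, not anything from Gwern.net's code; it just shows what a soft-hyphen pass does to a plain substring search and what a cleanup step has to undo.

```python
SOFT_HYPHEN = "\u00ad"  # U+00AD, invisible unless the browser breaks the line there


def strip_soft_hyphens(text):
    """Remove every soft hyphen, e.g. before searching or copying text."""
    return text.replace(SOFT_HYPHEN, "")


# What a soft-hyphen pass would emit for one word (break points assumed).
marked_up = SOFT_HYPHEN.join(["hy", "phen", "ation"])

# A naive search no longer finds the word until the hyphens are stripped.
print("hyphenation" in marked_up)                      # False
print("hyphenation" in strip_soft_hyphens(marked_up))  # True

# Each soft hyphen costs two bytes in UTF-8, so this word grows by 4 bytes.
print(len(marked_up.encode("utf-8")) - len("hyphenation".encode("utf-8")))  # 4
```

Every tool that touches the text (search, diff, copy/paste) needs the equivalent of that strip step, which is the overhead that goes away with native hyphenation.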

Anyway, long story short: if you've been using the soft hyphen trick, I suggest revisiting the decision now.

gnicholas · on April 5, 2021

How does it decide where a break is appropriate? Is it based on the length of the potential fragments, the length of the word, some sort of ratio, or something else?

I'd rather have a hyphenation algorithm err on the side of fewer hyphenations, but perhaps that's just me. My understanding is that a ragged right edge can improve visual tracking ability because it makes the paragraph less visually uniform. No one wants things to be too ragged, but in this case splitting off two chars doesn't seem worth the tradeoff.

RobertKerans · on April 5, 2021

It uses the Knuth/Liang hyphenation algorithm, which prioritises avoiding incorrect hyphens over finding every valid one (perfect output is not practical -- dictionary size has to be limited). It's never going to be completely correct; the trade-off is a few weird hyphenations in exchange for output that's generally OK overall, and yes, the alternative is ragged right. By default the minimum fragment length is two, and I assume the output is using that (haven't checked).
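For anyone wondering how it decides, here is a rough Python sketch of Liang-style pattern matching. This is not the Racket library's code. `TOY_PATTERNS` is a hand-picked illustrative list that only covers the word "hyphenation" (real pattern files have thousands of entries), and the `min_left` / `min_right` parameters are assumed names for the minimum fragment lengths. Digits in a pattern score the gap between letters; the highest score at each gap wins, and odd scores allow a break while even scores forbid one.

```python
import re

# Illustrative patterns only; a real pattern file is far larger.
TOY_PATTERNS = ["hy3ph", "he2n", "hena4", "hen5at", "1na", "n2at", "1tio", "2io", "o2n"]


def parse_pattern(pattern):
    """Split 'hy3ph' into letters 'hyph' and gap scores [0, 0, 3, 0, 0]."""
    letters = re.sub(r"\d", "", pattern)
    values = [0] * (len(letters) + 1)
    pos = 0
    for ch in pattern:
        if ch.isdigit():
            values[pos] = int(ch)  # score for the gap before letter `pos`
        else:
            pos += 1
    return letters, values


def hyphenate(word, patterns=TOY_PATTERNS, min_left=2, min_right=2):
    """Return the fragments of `word` under the given patterns."""
    parsed = dict(parse_pattern(p) for p in patterns)
    padded = "." + word.lower() + "."  # dots mark the word boundaries
    points = [0] * (len(padded) + 1)
    # Try every substring; keep the maximum score seen at each gap.
    for start in range(len(padded)):
        for end in range(start + 1, len(padded) + 1):
            values = parsed.get(padded[start:end])
            if values:
                for offset, value in enumerate(values):
                    points[start + offset] = max(points[start + offset], value)
    # Odd score = break allowed, subject to the minimum fragment lengths.
    pieces, last = [], 0
    for k in range(min_left, len(word) - min_right + 1):
        if points[k + 1] % 2 == 1:
            pieces.append(word[last:k])
            last = k
    pieces.append(word[last:])
    return pieces


print(hyphenate("hyphenation"))              # ['hy', 'phen', 'ation']
print(hyphenate("hyphenation", min_left=4))  # ['hyphen', 'ation']
```

So it isn't a ratio: the word's letters are matched against patterns, and the minimum lengths only veto breaks too close to either end. Raising the left minimum to 4 removes the two-character "hy-" fragment.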

Edit: found the overview from the docs -- https://docs.racket-lang.org/hyphenate/index.html

And Liang's thesis (PDF): https://tug.org/docs/liang/liang-thesis.pdf

Edit edit: from memory (and it's been about 9 years since I last worked in print, so a bit foggy), for print it's generally best to set the minimum to either 2 or 3: 4 is pushing it and tends to cause issues. In print, you set the text using the hyphenation algo, then work forward from the start of the book fixing any spacing issues. On the web you can't do the latter and the width is changeable, so there are always going to be issues with justification. I can't imagine any situation where the browser could operate fast enough to render accurately hyphenated text without using a huge amount of resources.