
That's the (a) way that TeX would hyphenate it:

  $ tex
  This is TeX, Version 3.14159265 (TeX Live 2016) (preloaded format=tex)
  **\relax
  *\showhyphens{typographer}
  Underfull \hbox (badness 10000) detected at line 0
  [] \tenrm ty-pog-ra-pher


Thanks, that’s interesting! So it breaks all of the morphemes? I’d prefer typo-grapher or typograph-er to the above options.

I wonder if people who are more or less visual would have different preferences. I’m visual, so perhaps I optimize for the visual appearance of morphemes as opposed to the syllable breaks?


> So it breaks all of the morphemes?

Not really. TeX uses the Knuth-Liang algorithm, which does not do any morphological or semantic analysis.

You basically feed it a corpus of words, it learns substrings that contain word breaks more often, condenses this information, and uses that to guess where an arbitrary word could be hyphenated.

In particular, TeX's hyphenation data contains lines

    y3po
    5po4g
which strongly suggest that when TeX sees substrings “ypo” and “?pog” they can be hyphenated as “y-po” and “?-pog”. (Odd weights allow hyphenation, even weights discourage it.)
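
To make the weight rule concrete, here is a minimal sketch of the pattern-matching step in Python. It is not TeX's implementation (which stores patterns in a packed trie); the function names `parse_pattern` and `hyphenate` are made up for illustration, and it is fed only the two patterns quoted above rather than the full pattern file, so it finds only one of the breaks TeX reports.

```python
import re

def parse_pattern(pattern):
    """Split a pattern such as '5po4g' into its letters ('pog') and the
    weight of each gap around them ([5, 0, 4, 0])."""
    letters = re.sub(r"\d", "", pattern)
    weights = [0] * (len(letters) + 1)
    gap = 0
    for ch in pattern:
        if ch.isdigit():
            weights[gap] = int(ch)   # weight of the gap before letters[gap]
        else:
            gap += 1
    return letters, weights

def hyphenate(word, patterns, left_min=2, right_min=3):
    """Apply Liang-style patterns to one word (illustrative helper)."""
    padded = "." + word.lower() + "."          # dots mark the word edges
    scores = [0] * (len(padded) + 1)           # scores[i]: gap before padded[i]
    for letters, weights in map(parse_pattern, patterns):
        start = padded.find(letters)
        while start != -1:
            for k, w in enumerate(weights):
                # Where several patterns overlap, the highest weight wins.
                scores[start + k] = max(scores[start + k], w)
            start = padded.find(letters, start + 1)
    pieces, last = [], 0
    # Respect the minimum number of letters before and after a break.
    for i in range(left_min, len(word) - right_min + 1):
        if scores[i + 1] % 2 == 1:             # odd weight: break allowed
            pieces.append(word[last:i])
            last = i
    pieces.append(word[last:])
    return "-".join(pieces)

print(hyphenate("typographer", ["y3po", "5po4g"]))  # ty-pographer
```

With just these two patterns the only break found is “ty-pographer”: the 5 from “5po4g” overrides the 3 from “y3po” at the same gap, and the even 4 blocks a break between “o” and “g”. The other breaks in TeX's output come from patterns not quoted here.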

That's why “typographer” is hyphenated like that, since it's not in the exception list.


One of my minor claims to fame is contributing the word Wertherian to the hyphenation exception list. I only knew it was an issue because I saw it mis-hyphenated in a Trollope novel.


Is there a morpheme-aware version?


Yes, with different patterns it's possible and the en-gb patterns do this (see https://www.tug.org/tex-hyphen/). Morpheme-aware hyphenation does have some oddities, though, like hyphenating helicopter as helico-pter.


I saw your webpage on programming language choice, but I couldn't figure out how to use the comment box. So here's my comment.

Rust seems like a good candidate. The runebender/druid/xi project has an active chatroom [1] full of people [2,3,...] with deep expertise in typesetting, text rendering, etc.

[1] https://xi.zulipchat.com/ [2] https://raphlinus.github.io/ [3] https://joy.recurse.com/~cmyr


I'm not sure why there was a problem with comments for you—it looks like it's correct for me, although no one else has commented so there might be an issue.

Rust is intriguing and it looks like some of the issues that are of concern to me are addressed by it (cross-platform, ability to target iOS and JNI)... I'll have to investigate a bit more. If it's got a good PDF writing library I may have a winner, although C++ does offer some really solid abilities with ICU and harfbuzz that appeal to me.


I couldn't figure out what all the boxes were for or why the big box truncated my comment at around 20 characters.


Certainly, the pain around library usage in C++ is something that would be nice to be able to avoid. A quick Google search has turned up Rust bindings for harfbuzz and the printpdf library (the latter would be superior to dealing with the scant open source C/C++ options, most of which seem to have lost their primary maintainers).


TeX's default language (US English) uses hyphenation patterns derived from a certain American dictionary (Liang's thesis mentions Webster’s Third New International Dictionary; probably that's the one). American hyphenation tradition tends, more often than British, to hyphenate at phonological boundaries (though there are cases where different American dictionaries don't agree). This has some interesting consequences: the word “record” is hyphenated “re-cord” as a verb and “rec-ord” as a noun.

British English tradition tends (more often than American) to hyphenate at morphological/etymological boundaries. For instance, if you run xetex with `\uselanguage{ukenglish}` followed by `\showhyphens{typographer}` then you get:

    ty-po-grapher
meaning that in the word “typographer”, a hyphen is allowed in those two places, instead of

    ty-pog-ra-pher
in the US English version. (Both the US English and UK English hyphenation patterns set \lefthyphenmin=2 and \righthyphenmin=3, i.e. enforce at least two letters before a hyphenation point and at least three after.)
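
As a rough illustration of what those two minimums do, here is a small Python sketch. The helpers `allowed_breaks` and `show` are hypothetical, not anything TeX provides; break positions are counted as the number of letters before the hyphen, and the extra candidates 1 and 9 are invented just to show something being rejected.

```python
def allowed_breaks(word, candidates, left_min=2, right_min=3):
    """Keep only break positions that leave at least left_min letters
    before the hyphen and right_min letters after it."""
    return [i for i in candidates if left_min <= i <= len(word) - right_min]

def show(word, breaks):
    """Render a word with hyphens at the given positions."""
    pieces, last = [], 0
    for i in sorted(breaks):
        pieces.append(word[last:i])
        last = i
    pieces.append(word[last:])
    return "-".join(pieces)

word = "typographer"
# 2, 5, 7 are the US English breaks shown above; 1 and 9 are made-up
# candidates that the minimums would reject ("t-" and "-er").
us = allowed_breaks(word, [1, 2, 5, 7, 9])
print(show(word, us))       # ty-pog-ra-pher
print(show(word, [2, 4]))   # ty-po-grapher (the UK English breaks)
```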

So the hyphenation patterns have done the “right” thing in the US case, in the sense that they produced the same hyphenation as given in US dictionaries: https://www.merriam-webster.com/dictionary/typographer https://www.ahdictionary.com/word/search.html?q=typographer = https://www.thefreedictionary.com/typographer https://www.wordsmyth.net/?ent=typographer https://www.infoplease.com/dictionary/typographer (I don't have access to any UK English dictionary that shows hyphenation, as far as I can tell, but I imagine it's doing the right thing for the UK case too.)

You can see more at https://www.tug.org/tex-hyphen/ and in particular Liang's thesis https://www.tug.org/docs/liang/ is very readable (both the English and CS/data-structure parts).


Something I learnt a while back is that the dots that punctuate headwords in dictionary definitions aren't there to show you morpheme or syllable boundaries, they're there to indicate hyphenation points.



