Hacker Newsnew | past | comments | ask | show | jobs | submitlogin
Implementing Levenshtein Automata (fulmicoton.com)
100 points by fulmicoton on Sept 20, 2015 | hide | past | favorite | 8 comments


FWIW, my Dictomaton Java library also has support for Levenshtein automata (besides finite state dictionaries, perfect hash automata, and String mappings based on perfect hash automata):

https://github.com/danieldk/dictomaton


> besides finite state dictionaries, ..., String mappings based on perfect hash automata

I'm implementing a similar library in Rust. One of the key things I've found missing in Daciuk's great work on the topic is the handling of the registry. In your code[1], it looks like you're using a straight a hash table. I thought you'd like to know that I've had great success sacrificing complete DFA minimization in exchange for a small constant memory overhead by using an LRU cache of only some classes of states.

My code still needs a good deal of polish, and I intend to write a blog post, but I bet you can make sense of the general strategy from the code and reuse it.[2]

N.B. Lucene's implementation permits sacrificing complete minimization, but they don't use an LRU cache and I think still permit the registry to grow without bound.

[1] - https://github.com/danieldk/dictomaton/blob/master/src/main/...

[2] - https://github.com/BurntSushi/fst/blob/master/src/fst/regist... (The data type is a map from node hash to LRU cache. That is, there is a small LRU cache for each bin in the map.)


Interesting, thanks for the pointers! I'll look at it when we are done moving :).

BTW, for those who are interested, Jan's code is also available online:

http://galaxy.eti.pg.gda.pl/katedry/kiw/pracownicy/Jan.Daciu...

http://galaxy.eti.pg.gda.pl/katedry/kiw/pracownicy/Jan.Daciu...

His FSA utilities and the fadd library are used in the Alpino parser:

http://www.let.rug.nl/vannoord/alp/Alpino/


Care to share any experiences you had implementing them?


So what's the best OSS search autocomplete these days? Solr's, Elasticsearch's, Linkedin CLEO?


You can add Duck Duck go's implementation to your list https://github.com/duckduckgo/cpp-libface.


Great write-up :-) I hope you'll considering explaining that last bit of code some more later? I think there may be something missing from the parenthesis here:

     Well the trick here is to normalize our states into two parts

    * a global offset that is the minimum offset of the states
    * the shape of the shifted states (we already saw that there was around )

    With this parametric DFA, transition will tell you, given a “shape”, what shape to transition two, as well as how much you should add to the global offset.


This is a great implementation, thanks. For those of you who need this to be fast and achieve the theoretical bounds, you may want to do something like this using a finite-state transducer toolkit. Here's an example with OpenFst: http://www.openfst.org/twiki/bin/view/FST/FstExamples (halfway down under "Edit Distance"). (And OpenFst has a Python wrapper now.)




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: