Implementing Levenshtein Automata

microtonal · on Sept 20, 2015

FWIW, my Dictomaton Java library also has support for Levenshtein automata (besides finite state dictionaries, perfect hash automata, and String mappings based on perfect hash automata):

https://github.com/danieldk/dictomaton

burntsushi · on Sept 20, 2015

> besides finite state dictionaries, ..., String mappings based on perfect hash automata

I'm implementing a similar library in Rust. One of the key things I've found missing in Daciuk's great work on the topic is the handling of the registry. In your code[1], it looks like you're using a straight a hash table. I thought you'd like to know that I've had great success sacrificing complete DFA minimization in exchange for a small constant memory overhead by using an LRU cache of only some classes of states.

My code still needs a good deal of polish, and I intend to write a blog post, but I bet you can make sense of the general strategy from the code and reuse it.[2]

N.B. Lucene's implementation permits sacrificing complete minimization, but they don't use an LRU cache and I think still permit the registry to grow without bound.

[1] - https://github.com/danieldk/dictomaton/blob/master/src/main/...

[2] - https://github.com/BurntSushi/fst/blob/master/src/fst/regist... (The data type is a map from node hash to LRU cache. That is, there is a small LRU cache for each bin in the map.)

microtonal · on Sept 21, 2015

Interesting, thanks for the pointers! I'll look at it when we are done moving :).

BTW, for those who are interested, Jan's code is also available online:

http://galaxy.eti.pg.gda.pl/katedry/kiw/pracownicy/Jan.Daciu...

His FSA utilities and the fadd library are used in the Alpino parser:

http://www.let.rug.nl/vannoord/alp/Alpino/

joe_the_user · on Sept 20, 2015

Care to share any experiences you had implementing them?

nxb · on Sept 20, 2015

So what's the best OSS search autocomplete these days? Solr's, Elasticsearch's, Linkedin CLEO?

fulmicoton · on Sept 20, 2015

You can add Duck Duck go's implementation to your list https://github.com/duckduckgo/cpp-libface.

_pvxk · on Sept 21, 2015

Great write-up :-) I hope you'll considering explaining that last bit of code some more later? I think there may be something missing from the parenthesis here:

     Well the trick here is to normalize our states into two parts

    * a global offset that is the minimum offset of the states
    * the shape of the shifted states (we already saw that there was around )

    With this parametric DFA, transition will tell you, given a “shape”, what shape to transition two, as well as how much you should add to the global offset.

kylebgorman · on Sept 21, 2015

This is a great implementation, thanks. For those of you who need this to be fast and achieve the theoretical bounds, you may want to do something like this using a finite-state transducer toolkit. Here's an example with OpenFst: http://www.openfst.org/twiki/bin/view/FST/FstExamples (halfway down under "Edit Distance"). (And OpenFst has a Python wrapper now.)