FWIW, my Dictomaton Java library also has support for Levenshtein automata (besides finite state dictionaries, perfect hash automata, and String mappings based on perfect hash automata):
> besides finite state dictionaries, ..., String mappings based on perfect hash automata
I'm implementing a similar library in Rust. One of the key things I've found missing in Daciuk's great work on the topic is the handling of the registry. In your code[1], it looks like you're using a straight a hash table. I thought you'd like to know that I've had great success sacrificing complete DFA minimization in exchange for a small constant memory overhead by using an LRU cache of only some classes of states.
My code still needs a good deal of polish, and I intend to write a blog post, but I bet you can make sense of the general strategy from the code and reuse it.[2]
N.B. Lucene's implementation permits sacrificing complete minimization, but they don't use an LRU cache and I think still permit the registry to grow without bound.
Great write-up :-) I hope you'll considering explaining that last bit of code some more later? I think there may be something missing from the parenthesis here:
Well the trick here is to normalize our states into two parts
* a global offset that is the minimum offset of the states
* the shape of the shifted states (we already saw that there was around )
With this parametric DFA, transition will tell you, given a “shape”, what shape to transition two, as well as how much you should add to the global offset.
This is a great implementation, thanks. For those of you who need this to be fast and achieve the theoretical bounds, you may want to do something like this using a finite-state transducer toolkit. Here's an example with OpenFst: http://www.openfst.org/twiki/bin/view/FST/FstExamples (halfway down under "Edit Distance"). (And OpenFst has a Python wrapper now.)
https://github.com/danieldk/dictomaton