You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by "Kevin Lawson (JIRA)" <ji...@apache.org> on 2013/04/22 16:55:16 UTC

[jira] [Comment Edited] (LUCENE-4947) Java implementation (and improvement) of Levenshtein & associated lexicon automata

    [ https://issues.apache.org/jira/browse/LUCENE-4947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13638050#comment-13638050 ] 

Kevin Lawson edited comment on LUCENE-4947 at 4/22/13 2:53 PM:
---------------------------------------------------------------

bq. Alternatively, can the code be ported away from MDAC to Bricks-automaton, so it can interact with Lucene's Automaton library? If this is not the case, we can no longer easily combine wildcards/prefix/fuzzy anymore.

Of course! If you take a look at the tableFuzzySearch() method [here|https://github.com/klawson88/LevenshteinAutomaton/blob/master/final/src/com/BoxOfC/LevenshteinAutomaton/LevenshteinAutomaton.java], you'll see that it takes an MDAG (which is equivalent in structure to the automatons implemented in Brics) and simply transitions it in step with the LevenshteinAutomaton. The method can be modified easily to accept a Brics automaton, which i'm assuming has methods that implement typical automaton actions (namely transitioning and accept state determination).

The main reason one might want to consider using MDAG is that typically libraries that implement the data structure (which is more widely known as a DAWG) only support creation with sorted input (and thus, do not allow for modification). I believe Brics is  [no exception|http://www.brics.dk/automaton/doc/index.html?dk/brics/automaton/StringUnionOperations.html]. My MDAG library supports unsorted input and run-time modification of the structure. (The minor drawback concerning this has been addressed in the original post).
                
      was (Author: klawson88):
    bq. Alternatively, can the code be ported away from MDAC to Bricks-automaton, so it can interact with Lucene's Automaton library? If this is not the case, we can no longer easily combine wildcards/prefix/fuzzy anymore.

Of course! If you take a look at the tableFuzzySearch() method [here|https://github.com/klawson88/LevenshteinAutomaton/blob/master/final/src/com/BoxOfC/LevenshteinAutomaton/LevenshteinAutomaton.java], you'll see that it takes an MDAG (which is equivalent in structure to the automatons implemented in Brics) and simply transitions it in step with the LevenshteinAutomaton. The method can be modified easily to accept a Brics automaton, which i'm assuming has methods that implement typical automaton actions (namely transitioning and accept state determination).

The main reason one might want to consider using MDAG is that typically libraries that implement the data structure (which is more widely known as a DAWG) only support creation with sorted input (and thus, do not allow for modification). I believe Brics is  [no exception|http://www.brics.dk/automaton/doc/index.html?dk/brics/automaton/Automaton.html]. My MDAG library supports unsorted input and run-time modification of the structure. (The minor drawback concerning this has been addressed in the original post).
                  
> Java implementation (and improvement) of Levenshtein & associated lexicon automata
> ----------------------------------------------------------------------------------
>
>                 Key: LUCENE-4947
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4947
>             Project: Lucene - Core
>          Issue Type: Improvement
>    Affects Versions: 4.0-ALPHA, 4.0-BETA, 4.0, 4.1, 4.2, 4.2.1
>            Reporter: Kevin Lawson
>
> I was encouraged by Mike McCandless to open an issue concerning this after I contacted him privately about it. Thanks Mike!
> I'd like to submit my Java implementation of the Levenshtein Automaton as a homogenous replacement for the current heterogenous, multi-component implementation in Lucene.
> Benefits of upgrading include 
> - Reduced code complexity
> - Better performance from components that were previously implemented in Python
> - Support for on-the-fly dictionary-automaton manipulation (if you wish to use my dictionary-automaton implementation)
> The code for all the components is well structured, easy to follow, and extensively commented. It has also been fully tested for correct functionality and performance.
> The levenshtein automaton implementation (along with the required MDAG reference) can be found in my LevenshteinAutomaton Java library here: https://github.com/klawson88/LevenshteinAutomaton.
> The minimalistic directed acyclic graph (MDAG) which the automaton code uses to store and step through word sets can be found here: https://github.com/klawson88/MDAG
> *Transpositions aren't currently implemented. I hope the comment filled, editing-friendly code combined with the fact that the section in the Mihov paper detailing transpositions is only 2 pages makes adding the functionality trivial.
> *As a result of support for on-the-fly manipulation, the MDAG (dictionary-automaton) creation process incurs a slight speed penalty. In order to have the best of both worlds, i'd recommend the addition of a constructor which only takes sorted input. The complete, easy to follow pseudo-code for the simple procedure can be found in the first article I linked under the references section in the MDAG repository)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org