You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by "Kevin Lawson (JIRA)" <ji...@apache.org> on 2013/05/03 18:12:18 UTC

[jira] [Comment Edited] (LUCENE-4947) Java implementation (and improvement) of Levenshtein & associated lexicon automata

    [ https://issues.apache.org/jira/browse/LUCENE-4947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13648534#comment-13648534 ] 

Kevin Lawson edited comment on LUCENE-4947 at 5/3/13 4:11 PM:
--------------------------------------------------------------

I've attached an archive of the most recent version of LevenshteinAutomaton, which includes transpositions in the edit distance calculations.

LevenshteinAutomaton-master-update.zip MD5 checksum: 8187978b438435ac2d23981bb0133d34
                
      was (Author: klawson88):
    I've attached an archive of the most recent version of LevenshteinAutomaton, which includes transpositions in the edit distance calculations.

The MD5 sum of LevenshteinAutomaton-master-update.zip is 8187978b438435ac2d23981bb0133d34
                  
> Java implementation (and improvement) of Levenshtein & associated lexicon automata
> ----------------------------------------------------------------------------------
>
>                 Key: LUCENE-4947
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4947
>             Project: Lucene - Core
>          Issue Type: Improvement
>    Affects Versions: 4.0-ALPHA, 4.0-BETA, 4.0, 4.1, 4.2, 4.2.1
>            Reporter: Kevin Lawson
>         Attachments: LevenshteinAutomaton-master-update.zip, LevenshteinAutomaton-master.zip, MDAG-master.zip
>
>
> I was encouraged by Mike McCandless to open an issue concerning this after I contacted him privately about it. Thanks Mike!
> I'd like to submit my Java implementation of the Levenshtein Automaton as a homogenous replacement for the current heterogenous, multi-component implementation in Lucene.
> Benefits of upgrading include 
> - Reduced code complexity
> - Better performance from components that were previously implemented in Python
> - Support for on-the-fly dictionary-automaton manipulation (if you wish to use my dictionary-automaton implementation)
> The code for all the components is well structured, easy to follow, and extensively commented. It has also been fully tested for correct functionality and performance.
> The levenshtein automaton implementation (along with the required MDAG reference) can be found in my LevenshteinAutomaton Java library here: https://github.com/klawson88/LevenshteinAutomaton.
> The minimalistic directed acyclic graph (MDAG) which the automaton code uses to store and step through word sets can be found here: https://github.com/klawson88/MDAG
> *-Transpositions aren't currently implemented. I hope the comment filled, editing-friendly code combined with the fact that the section in the Mihov paper detailing transpositions is only 2 pages makes adding the functionality trivial.- Update introduces transposition inclusion in edit distance calculations!
> *As a result of support for on-the-fly manipulation, the MDAG (dictionary-automaton) creation process incurs a slight speed penalty. In order to have the best of both worlds, i'd recommend the addition of a constructor which only takes sorted input. The complete, easy to follow pseudo-code for the simple procedure can be found in the first article I linked under the references section in the MDAG repository)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org