You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by "Karl Wright (JIRA)" <ji...@apache.org> on 2018/03/22 14:46:00 UTC

[jira] [Assigned] (LUCENE-8219) LevenshteinAutomata should estimate the number of states and transitions

     [ https://issues.apache.org/jira/browse/LUCENE-8219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karl Wright reassigned LUCENE-8219:
-----------------------------------

    Assignee: Karl Wright

> LevenshteinAutomata should estimate the number of states and transitions
> ------------------------------------------------------------------------
>
>                 Key: LUCENE-8219
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8219
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Christian Ziech
>            Assignee: Karl Wright
>            Priority: Major
>
> Currently the toAutomaton() method of the LevenshteinAutomata class uses the default constructor of the Automaton although it exactly knows how many states the automaton will have and can do a reasonable estimation on how many transitions it will need as well.
> I suggest changing the lines {code:language=java|firstline=154|linenumbers=true}
> // the number of states is based on the length of the word and n
> int numStates = description.size();
> Automaton a = new Automaton();
> int lastState;
> {code}
> to
> {code:language=java|firstline=154|linenumbers=true}
> // the number of states is based on the length of the word and n
> final int numStates = description.size();
> final int numTransitions = numStates * Math.min(1 + 2 * n, alphabet.length);
> final int prefixStates = prefix != null ? prefix.codePointCount(0, prefix.length()) : 0;
> final Automaton a = new Automaton(numStates + prefixStates, numTransitions);
> int lastState;
> {code}
> For my test data this cut down on the total amount of memory needed for int arrays by factor 4. The estimation of "1 + 2 * editDistance" should maybe rather be replaced by a value coming from the ParametricDescription used.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org