You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by "Thomas Poppe (JIRA)" <ji...@apache.org> on 2017/08/09 07:07:00 UTC
[jira] [Updated] (LUCENE-7921) More efficient way to transform a
RegExp to an Automaton
[ https://issues.apache.org/jira/browse/LUCENE-7921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Thomas Poppe updated LUCENE-7921:
---------------------------------
Description:
Consider the following example:
{code:title=ToAutomatonExample.java|borderStyle=solid}
public static void main(String[] args) {
org.apache.lucene.util.automaton.RegExp regExp =
new org.apache.lucene.util.automaton.RegExp("[a-z]{1,13}x[a-z][a-z]?[a-z]?[a-z]?[a-z]?[a-z]{0,8}");
org.apache.lucene.util.automaton.Automaton automaton = regExp.toAutomaton();
System.out.println("states: " + automaton.getNumStates());
System.out.println("transitions: " + automaton.getNumTransitions());
System.out.println("-------------------------------");
try {
regExp = new org.apache.lucene.util.automaton.RegExp("[a-z]{1,13}x[a-z]{1,13}");
automaton = regExp.toAutomaton();
System.out.println("Will not happen...");
} catch (org.apache.lucene.util.automaton.TooComplexToDeterminizeException e) {
automaton = regExp.toAutomaton(1_000_000);
System.out.println("states: " + automaton.getNumStates());
System.out.println("transitions: " + automaton.getNumTransitions());
System.out.println("-------------------------------");
}
}
{code}
Both regular expressions are equivalent, but it's much more efficient to "unroll" the repetition. It might be possible to optimize the Regex#toAutomaton() method to handle this repetition without going over the default number of determinized states, and using less memory and CPU?
was:
Consider the following example:
public static void main(String[] args) {
org.apache.lucene.util.automaton.RegExp regExp =
new org.apache.lucene.util.automaton.RegExp("[a-z]{1,13}x[a-z][a-z]?[a-z]?[a-z]?[a-z]?[a-z]{0,8}");
org.apache.lucene.util.automaton.Automaton automaton = regExp.toAutomaton();
System.out.println("states: " + automaton.getNumStates());
System.out.println("transitions: " + automaton.getNumTransitions());
System.out.println("-------------------------------");
try {
regExp = new org.apache.lucene.util.automaton.RegExp("[a-z]{1,13}x[a-z]{1,13}");
automaton = regExp.toAutomaton();
System.out.println("Will not happen...");
} catch (org.apache.lucene.util.automaton.TooComplexToDeterminizeException e) {
automaton = regExp.toAutomaton(1_000_000);
System.out.println("states: " + automaton.getNumStates());
System.out.println("transitions: " + automaton.getNumTransitions());
System.out.println("-------------------------------");
}
}
Both regular expressions are equivalent, but it's much more efficient to "unroll" the repetition. It might be possible to optimize the Regex#toAutomaton() method to handle this repetition without going over the default number of determinized states, and using less memory and CPU?
> More efficient way to transform a RegExp to an Automaton
> --------------------------------------------------------
>
> Key: LUCENE-7921
> URL: https://issues.apache.org/jira/browse/LUCENE-7921
> Project: Lucene - Core
> Issue Type: Improvement
> Affects Versions: 6.5.1
> Reporter: Thomas Poppe
> Priority: Minor
>
> Consider the following example:
> {code:title=ToAutomatonExample.java|borderStyle=solid}
> public static void main(String[] args) {
> org.apache.lucene.util.automaton.RegExp regExp =
> new org.apache.lucene.util.automaton.RegExp("[a-z]{1,13}x[a-z][a-z]?[a-z]?[a-z]?[a-z]?[a-z]{0,8}");
> org.apache.lucene.util.automaton.Automaton automaton = regExp.toAutomaton();
> System.out.println("states: " + automaton.getNumStates());
> System.out.println("transitions: " + automaton.getNumTransitions());
> System.out.println("-------------------------------");
> try {
> regExp = new org.apache.lucene.util.automaton.RegExp("[a-z]{1,13}x[a-z]{1,13}");
> automaton = regExp.toAutomaton();
> System.out.println("Will not happen...");
> } catch (org.apache.lucene.util.automaton.TooComplexToDeterminizeException e) {
> automaton = regExp.toAutomaton(1_000_000);
> System.out.println("states: " + automaton.getNumStates());
> System.out.println("transitions: " + automaton.getNumTransitions());
> System.out.println("-------------------------------");
> }
> }
> {code}
> Both regular expressions are equivalent, but it's much more efficient to "unroll" the repetition. It might be possible to optimize the Regex#toAutomaton() method to handle this repetition without going over the default number of determinized states, and using less memory and CPU?
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org