Posted to issues@lucene.apache.org by "Robert Muir (Jira)" <ji...@apache.org> on 2020/02/17 20:33:00 UTC
[jira] [Created] (LUCENE-9231) fix algorithmic worst-case in regeneration of URL tokenizer
Robert Muir created LUCENE-9231:
-----------------------------------
Summary: fix algorithmic worst-case in regeneration of URL tokenizer
Key: LUCENE-9231
URL: https://issues.apache.org/jira/browse/LUCENE-9231
Project: Lucene - Core
Issue Type: Wish
Reporter: Robert Muir
For the UAX29URLEmailTokenizer, the regeneration task is slow. It also requires a very large amount of heap space (I just increased mine after seeing it struggle under GC).
Maybe we can dig into the worst case and figure out what is happening. It seems to be an automaton issue:
{noformat}
"main" #1 prio=5 os_prio=0 cpu=132097.25ms elapsed=135.75s tid=0x00007fb1d4018000 nid=0x19706 runnable [0x00007fb1db3df000]
java.lang.Thread.State: RUNNABLE
at jflex.StateSet.add(StateSet.java:218)
at jflex.NFA.closure(NFA.java:387)
at jflex.NFA.epsilonFill(NFA.java:410)
at jflex.NFA.complement(NFA.java:737)
at jflex.NFA.insertNFA(NFA.java:1029)
at jflex.NFA.insertNFA(NFA.java:971)
at jflex.NFA.insertNFA(NFA.java:1029)
at jflex.NFA.insertNFA(NFA.java:972)
at jflex.NFA.insertNFA(NFA.java:987)
at jflex.NFA.insertNFA(NFA.java:988)
at jflex.NFA.insertNFA(NFA.java:987)
at jflex.NFA.insertNFA(NFA.java:971)
at jflex.NFA.insertNFA(NFA.java:1041)
at jflex.NFA.insertNFA(NFA.java:987)
at jflex.NFA.insertNFA(NFA.java:971)
at jflex.NFA.insertNFA(NFA.java:971)
at jflex.NFA.addRegExp(NFA.java:151)
at jflex.LexParse$CUP$LexParse$actions.CUP$LexParse$do_action_part00000000(LexParse.java:1401)
at jflex.LexParse$CUP$LexParse$actions.CUP$LexParse$do_action(LexParse.java:3415)
at jflex.LexParse.do_action(LexParse.java:939)
at java_cup.runtime.lr_parser.parse(lr_parser.java:699)
at jflex.Main.generate(Main.java:73)
at jflex.anttask.JFlexTask.execute(JFlexTask.java:72)
{noformat}
Stacks are typically in {{jflex.StateSet.add(StateSet.java:218)}}, {{jflex.StateSet.complement(StateSet.java:173)}}, and a variety of other operations, but they always come from the {{addRegExp}} .. {{insertNFA}} .. {{complement}} code path.
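As background on why a {{complement}} step can be costly: complementing a nondeterministic automaton generally requires determinizing it first, and subset construction can produce exponentially many states. Whether that is what happens with this grammar is an assumption and is not confirmed by the trace. The sketch below does not use JFlex at all. It uses the textbook NFA for (a|b)*a(a|b)^(n-1), with the hypothetical names {{SubsetBlowup}}, {{step}} and {{countDfaStates}}, to show n+1 NFA states turning into 2^n DFA states.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Set;

// Illustration only: not JFlex code, and not a diagnosis of this issue.
final class SubsetBlowup {

    // NFA for (a|b)*a(a|b)^(n-1): states 0..n, state n is accepting.
    // A set of NFA states is encoded as a bitmask (bit s set = state s present).
    static long step(long set, char c, int n) {
        long next = 0L;
        for (int s = 0; s <= n; s++) {
            if ((set & (1L << s)) == 0) {
                continue;
            }
            if (s == 0) {
                next |= 1L;            // state 0 loops on both 'a' and 'b'
                if (c == 'a') {
                    next |= 2L;        // guess that this 'a' is the n-th symbol from the end
                }
            } else if (s < n) {
                next |= 1L << (s + 1); // states 1..n-1 advance on any symbol
            }
            // state n has no outgoing transitions
        }
        return next;
    }

    // Subset construction: count the DFA states reachable from {0}.
    static int countDfaStates(int n) {
        Set<Long> seen = new HashSet<>();
        ArrayDeque<Long> work = new ArrayDeque<>();
        seen.add(1L);
        work.add(1L);
        while (!work.isEmpty()) {
            long current = work.poll();
            for (char c : new char[] {'a', 'b'}) {
                long next = step(current, c, n);
                if (seen.add(next)) {
                    work.add(next);
                }
            }
        }
        return seen.size();
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 16; n++) {
            System.out.println("NFA states: " + (n + 1) + ", DFA states: " + countDfaStates(n));
        }
    }
}
```

Each additional NFA state doubles the number of reachable DFA states here, which is the kind of growth that would explain both long runtimes and heavy heap use if the generated automaton hits a similar pattern.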
It feels like something has bad runtime complexity. I wonder if we can fix it, or at least make it better, e.g. check for a minimum heap size of some GB, print a warning about how long it will take, etc.
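A minimal sketch of the "check for a heap minimum and print a warning" idea, assuming it would run in the same JVM as the generator before the task starts. The class name {{HeapPreflight}}, the method names, and the 2 GB threshold are all hypothetical placeholders; the issue does not say how much heap is actually needed. {{Runtime.getRuntime().maxMemory()}} is the standard JDK call for the maximum heap the JVM will attempt to use.

```java
import java.util.Locale;

// Hypothetical pre-flight check; names and threshold are illustrative only.
final class HeapPreflight {

    // Placeholder value: the real requirement is not stated in the issue.
    static final long MIN_HEAP_BYTES = 2L * 1024 * 1024 * 1024;

    // Returns a warning message, or null when the heap looks large enough.
    static String warningFor(long maxHeapBytes, long requiredBytes) {
        if (maxHeapBytes >= requiredBytes) {
            return null;
        }
        long mb = 1024L * 1024L;
        return String.format(
            Locale.ROOT,
            "WARNING: max heap is %d MB but regeneration may need at least %d MB; "
                + "expect heavy GC and a long runtime.",
            maxHeapBytes / mb,
            requiredBytes / mb);
    }

    public static void main(String[] args) {
        // maxMemory() returns Long.MAX_VALUE when the JVM has no inherent limit.
        long maxHeap = Runtime.getRuntime().maxMemory();
        String warning = warningFor(maxHeap, MIN_HEAP_BYTES);
        if (warning != null) {
            System.err.println(warning);
        }
    }
}
```

Keeping the comparison in a method that takes the heap size as a parameter makes the check testable without changing JVM flags.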