Posted to issues@lucene.apache.org by "Robert Muir (Jira)" <ji...@apache.org> on 2020/02/17 20:33:00 UTC

[jira] [Created] (LUCENE-9231) fix algorithmic worst-case in regeneration of URL tokenizer

Robert Muir created LUCENE-9231:
-----------------------------------

             Summary: fix algorithmic worst-case in regeneration of URL tokenizer
                 Key: LUCENE-9231
                 URL: https://issues.apache.org/jira/browse/LUCENE-9231
             Project: Lucene - Core
          Issue Type: Wish
            Reporter: Robert Muir


For the UAX29URLEmailTokenizer, the regeneration task is slow. It also requires a very large amount of heap space (I just increased mine after seeing it struggle under GC).

Maybe we can dig into the worst case and figure out what is happening. It looks like an automaton issue:

{noformat}
"main" #1 prio=5 os_prio=0 cpu=132097.25ms elapsed=135.75s tid=0x00007fb1d4018000 nid=0x19706 runnable  [0x00007fb1db3df000]
   java.lang.Thread.State: RUNNABLE
	at jflex.StateSet.add(StateSet.java:218)
	at jflex.NFA.closure(NFA.java:387)
	at jflex.NFA.epsilonFill(NFA.java:410)
	at jflex.NFA.complement(NFA.java:737)
	at jflex.NFA.insertNFA(NFA.java:1029)
	at jflex.NFA.insertNFA(NFA.java:971)
	at jflex.NFA.insertNFA(NFA.java:1029)
	at jflex.NFA.insertNFA(NFA.java:972)
	at jflex.NFA.insertNFA(NFA.java:987)
	at jflex.NFA.insertNFA(NFA.java:988)
	at jflex.NFA.insertNFA(NFA.java:987)
	at jflex.NFA.insertNFA(NFA.java:971)
	at jflex.NFA.insertNFA(NFA.java:1041)
	at jflex.NFA.insertNFA(NFA.java:987)
	at jflex.NFA.insertNFA(NFA.java:971)
	at jflex.NFA.insertNFA(NFA.java:971)
	at jflex.NFA.addRegExp(NFA.java:151)
	at jflex.LexParse$CUP$LexParse$actions.CUP$LexParse$do_action_part00000000(LexParse.java:1401)
	at jflex.LexParse$CUP$LexParse$actions.CUP$LexParse$do_action(LexParse.java:3415)
	at jflex.LexParse.do_action(LexParse.java:939)
	at java_cup.runtime.lr_parser.parse(lr_parser.java:699)
	at jflex.Main.generate(Main.java:73)
	at jflex.anttask.JFlexTask.execute(JFlexTask.java:72)
{noformat}

Thread dumps typically show the top frame in {{jflex.StateSet.add(StateSet.java:218)}} or {{jflex.StateSet.complement(StateSet.java:173)}}, among various other operations, but they always come through the {{addRegExp}} .. {{insertNFA}} .. {{complement}} codepath.
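
One plausible explanation, not confirmed by this trace alone: complementing an NFA generally means determinizing it first, and subset construction is exponential in the worst case. The sketch below is not JFlex code. It uses a hypothetical class {{SubsetBlowup}} and the textbook NFA over {a, b} that accepts strings whose n-th symbol from the end is 'a' (n+1 NFA states), and counts how many DFA states subset construction reaches.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Set;

// Illustration only: a hypothetical class, unrelated to the JFlex sources.
final class SubsetBlowup {

    // NFA states are 0..n, stored as bits of a long (so keep n below 63).
    // State 0 is initial and loops on both symbols; 0 -a-> 1;
    // states 1..n-1 advance on either symbol; state n is accepting.
    public static long step(long states, boolean isA, int n) {
        long next = 0L;
        if ((states & 1L) != 0) {
            next |= 1L;              // state 0 loops on 'a' and 'b'
            if (isA) {
                next |= 2L;          // state 0 moves to state 1 on 'a'
            }
        }
        // Drop state 0 and state n (no outgoing edges), shift the rest forward.
        long middle = states & ~1L & ~(1L << n);
        return next | (middle << 1);
    }

    // Breadth-first subset construction, returning the number of reachable DFA states.
    public static int reachableDfaStates(int n) {
        Set<Long> seen = new HashSet<>();
        ArrayDeque<Long> work = new ArrayDeque<>();
        seen.add(1L);                // start set is {0}
        work.add(1L);
        while (!work.isEmpty()) {
            long current = work.poll();
            for (boolean isA : new boolean[] {true, false}) {
                long target = step(current, isA, n);
                if (seen.add(target)) {
                    work.add(target);
                }
            }
        }
        return seen.size();
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 16; n++) {
            System.out.println("NFA states: " + (n + 1)
                + ", reachable DFA states: " + reachableDfaStates(n));
        }
    }
}
```

For this family the DFA state count doubles each time n grows by one. Whether the URL/email grammar actually hits a case like this inside {{NFA.complement}} would need profiling to confirm.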

It feels like something here has bad runtime complexity. I wonder if we can fix it, or at least make the experience better, e.g. check for a minimum heap size of some number of GB, print a warning about how long it will take, etc.
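
A minimal sketch of what such a pre-flight check could look like, assuming it runs in the same JVM as the generator. The class name {{HeapPreflight}} and the 4 GB threshold are hypothetical, since the actual heap requirement has not been measured here. Only {{Runtime.maxMemory()}} is a real JDK call; nothing below is part of the Lucene build or of JFlex.

```java
import java.util.Locale;

// Illustration only: a hypothetical helper, not part of the Lucene build or JFlex.
final class HeapPreflight {

    // Assumed threshold; the real requirement for the regeneration task is unknown.
    public static final long MIN_HEAP_BYTES = 4L * 1024 * 1024 * 1024;

    private static final long MB = 1024L * 1024L;

    // True if the JVM's maximum heap is at least the required size.
    public static boolean hasEnoughHeap(long maxHeapBytes, long requiredBytes) {
        return maxHeapBytes >= requiredBytes;
    }

    // Builds the warning text shown when the heap looks too small.
    public static String warning(long maxHeapBytes, long requiredBytes) {
        return String.format(Locale.ROOT,
            "WARNING: max heap is %d MB, but regeneration may need at least %d MB; "
                + "consider raising -Xmx.",
            maxHeapBytes / MB, requiredBytes / MB);
    }

    public static void main(String[] args) {
        // maxMemory() reports the most heap the JVM will try to use (-Xmx or the default).
        long maxHeap = Runtime.getRuntime().maxMemory();
        if (!hasEnoughHeap(maxHeap, MIN_HEAP_BYTES)) {
            System.err.println(warning(maxHeap, MIN_HEAP_BYTES));
        }
        System.err.println("Regenerating the URL tokenizer can take several minutes.");
    }
}
```

This only warns and does not abort, so a machine with less heap can still attempt the regeneration.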




--
This message was sent by Atlassian Jira
(v8.3.4#803005)