Posted to issues@lucene.apache.org by "Robert Muir (Jira)" <ji...@apache.org> on 2020/02/17 20:37:00 UTC
[jira] [Commented] (LUCENE-9231) fix algorithmic worst-case in
regeneration of URL tokenizer
[ https://issues.apache.org/jira/browse/LUCENE-9231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17038616#comment-17038616 ]
Robert Muir commented on LUCENE-9231:
-------------------------------------
cc [~dweiss] [~sarowe]. I haven't looked at the code or dug into this much so far. I'm only wondering whether this is a situation where we can sort things first so that it runs faster (similar to the Daciuk/Mihov builder and FST.Builder in Lucene).
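To make the "sort first" idea concrete, here is a minimal sketch of incremental construction of a minimal acyclic automaton from sorted input, in the style of Daciuk/Mihov. It is not JFlex's NFA code and not Lucene's FST.Builder; the class and method names (SortedDawgBuilder, replaceOrRegister, countStates) are made up for illustration. Whether anything like this applies to the JFlex complement path is exactly the open question.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.IdentityHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical illustration only: builds a minimal DAWG from words added in sorted order.
final class SortedDawgBuilder {
    static final class State {
        final TreeMap<Character, State> next = new TreeMap<>();
        boolean accept;

        // Two states are equivalent if they agree on acceptance and have the
        // same labels leading to the identical (already canonical) targets.
        @Override public boolean equals(Object o) {
            if (!(o instanceof State)) return false;
            State other = (State) o;
            if (accept != other.accept || !next.keySet().equals(other.next.keySet())) return false;
            for (Map.Entry<Character, State> e : next.entrySet()) {
                if (other.next.get(e.getKey()) != e.getValue()) return false;
            }
            return true;
        }

        @Override public int hashCode() {
            int h = accept ? 1 : 0;
            for (Map.Entry<Character, State> e : next.entrySet()) {
                h = 31 * h + 17 * e.getKey() + System.identityHashCode(e.getValue());
            }
            return h;
        }
    }

    private final State root = new State();
    private final Map<State, State> register = new HashMap<>();
    private String previous = "";

    void add(String word) {
        if (word.compareTo(previous) < 0) {
            throw new IllegalArgumentException("input must be sorted: " + word);
        }
        // Follow the prefix shared with the previous word; because the input is
        // sorted, these states are the only ones that can still change.
        State s = root;
        int i = 0;
        while (i < word.length() && s.next.containsKey(word.charAt(i))) {
            s = s.next.get(word.charAt(i));
            i++;
        }
        // The previous word's remaining suffix is now final: minimize it.
        if (!s.next.isEmpty()) replaceOrRegister(s);
        for (; i < word.length(); i++) {
            State n = new State();
            s.next.put(word.charAt(i), n);
            s = n;
        }
        s.accept = true;
        previous = word;
    }

    State finish() {
        if (!root.next.isEmpty()) replaceOrRegister(root);
        return root;
    }

    // Bottom-up along the most recently added path: reuse an equivalent
    // registered state if one exists, otherwise register this one.
    private void replaceOrRegister(State s) {
        Map.Entry<Character, State> last = s.next.lastEntry();
        State child = last.getValue();
        if (!child.next.isEmpty()) replaceOrRegister(child);
        State canonical = register.get(child);
        if (canonical != null) {
            s.next.put(last.getKey(), canonical);
        } else {
            register.put(child, child);
        }
    }

    static boolean accepts(State root, String word) {
        State s = root;
        for (int i = 0; i < word.length() && s != null; i++) s = s.next.get(word.charAt(i));
        return s != null && s.accept;
    }

    // Counts distinct reachable states by identity (not by structural equality).
    static int countStates(State root) {
        Set<State> seen = Collections.newSetFromMap(new IdentityHashMap<State, Boolean>());
        Deque<State> todo = new ArrayDeque<>();
        seen.add(root);
        todo.push(root);
        while (!todo.isEmpty()) {
            for (State t : todo.pop().next.values()) if (seen.add(t)) todo.push(t);
        }
        return seen.size();
    }
}
```

The point of the sketch is that sorted input lets each state be finalized and deduplicated exactly once, as soon as the next word diverges from it, so no global minimization pass over a large intermediate automaton is needed.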
> fix algorithmic worst-case in regeneration of URL tokenizer
> -----------------------------------------------------------
>
> Key: LUCENE-9231
> URL: https://issues.apache.org/jira/browse/LUCENE-9231
> Project: Lucene - Core
> Issue Type: Wish
> Reporter: Robert Muir
> Priority: Major
>
> For the UAX29URLEmailTokenizer, the regeneration task is slow. It also requires a very large amount of heap space (I just increased mine after seeing it struggle under GC).
> Maybe we can dig into the worst case and figure out what is happening. It seems to be an automaton issue:
> {noformat}
> "main" #1 prio=5 os_prio=0 cpu=132097.25ms elapsed=135.75s tid=0x00007fb1d4018000 nid=0x19706 runnable [0x00007fb1db3df000]
> java.lang.Thread.State: RUNNABLE
> at jflex.StateSet.add(StateSet.java:218)
> at jflex.NFA.closure(NFA.java:387)
> at jflex.NFA.epsilonFill(NFA.java:410)
> at jflex.NFA.complement(NFA.java:737)
> at jflex.NFA.insertNFA(NFA.java:1029)
> at jflex.NFA.insertNFA(NFA.java:971)
> at jflex.NFA.insertNFA(NFA.java:1029)
> at jflex.NFA.insertNFA(NFA.java:972)
> at jflex.NFA.insertNFA(NFA.java:987)
> at jflex.NFA.insertNFA(NFA.java:988)
> at jflex.NFA.insertNFA(NFA.java:987)
> at jflex.NFA.insertNFA(NFA.java:971)
> at jflex.NFA.insertNFA(NFA.java:1041)
> at jflex.NFA.insertNFA(NFA.java:987)
> at jflex.NFA.insertNFA(NFA.java:971)
> at jflex.NFA.insertNFA(NFA.java:971)
> at jflex.NFA.addRegExp(NFA.java:151)
> at jflex.LexParse$CUP$LexParse$actions.CUP$LexParse$do_action_part00000000(LexParse.java:1401)
> at jflex.LexParse$CUP$LexParse$actions.CUP$LexParse$do_action(LexParse.java:3415)
> at jflex.LexParse.do_action(LexParse.java:939)
> at java_cup.runtime.lr_parser.parse(lr_parser.java:699)
> at jflex.Main.generate(Main.java:73)
> at jflex.anttask.JFlexTask.execute(JFlexTask.java:72)
> {noformat}
> Stack samples are typically in {{jflex.StateSet.add(StateSet.java:218)}} and {{jflex.StateSet.complement(StateSet.java:173)}}, among many other operations, but they always come from the {{addRegExp}} .. {{insertNFA}} .. {{complement}} code path.
> It feels like something has a bad runtime. I wonder if we can fix it (or at least make it better, e.g. check for a minimum heap size of some GB of RAM, print a warning about how long it will take, etc.)
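As a rough sketch of the "check for a heap minimum and print a warning" fallback, something like the following could run before the regeneration task. The 2 GB threshold, the class name RegenerationHeapCheck, and the message wording are all assumptions for illustration; the issue does not state how much heap is actually needed. Only Runtime.getRuntime().maxMemory() is a real JDK call here.

```java
// Hypothetical pre-flight check; not part of the Lucene build or of JFlex.
final class RegenerationHeapCheck {
    // Assumed threshold for illustration; the real requirement is not given in the issue.
    static final long ASSUMED_MIN_HEAP_BYTES = 2L * 1024 * 1024 * 1024;

    // Returns a warning message if the heap limit is below the minimum, else null.
    static String heapWarning(long maxHeapBytes, long minHeapBytes) {
        if (maxHeapBytes >= minHeapBytes) {
            return null;
        }
        long mb = 1024L * 1024L;
        return "WARNING: max heap is " + (maxHeapBytes / mb) + " MB, below the assumed minimum of "
            + (minHeapBytes / mb) + " MB; regeneration may be very slow or run out of memory.";
    }

    public static void main(String[] args) {
        // maxMemory() reports the JVM's heap limit (roughly -Xmx).
        String warning = heapWarning(Runtime.getRuntime().maxMemory(), ASSUMED_MIN_HEAP_BYTES);
        if (warning != null) {
            System.err.println(warning);
        }
    }
}
```

This only makes the failure mode visible up front; it does nothing about the underlying runtime of the complement path.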