Posted to issues@lucene.apache.org by "Robert Muir (Jira)" <ji...@apache.org> on 2020/02/17 20:37:00 UTC
[jira] [Commented] (LUCENE-9231) fix algorithmic worst-case in regeneration of URL tokenizer

    [ https://issues.apache.org/jira/browse/LUCENE-9231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17038616#comment-17038616 ] 

Robert Muir commented on LUCENE-9231:
-------------------------------------

cc [~dweiss] [~sarowe]. I haven't looked at the code or dug in much so far. I'm only wondering whether this is a situation where we can sort things first so that it runs faster (similar to the Daciuk/Mihov builder and FST.Builder in Lucene).
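
For readers unfamiliar with the "sort first" idea, here is a minimal sketch of the Daciuk/Mihov-style incremental construction of a minimal acyclic automaton from sorted input. It is an illustration only: the class and method names (SortedDfaBuilder, add, finish, countStates) are made up for this example and are neither Lucene's nor JFlex's API, and nothing in this thread establishes that the technique carries over to JFlex's NFA complement path. The point is that sorted input lets every finished subtree be merged with an equivalent registered state immediately, so the full trie never has to be held in memory.

```java
import java.util.*;

public class SortedDfaBuilder {
    /** One automaton state; the name and layout are illustrative only. */
    static final class State {
        final TreeMap<Character, State> edges = new TreeMap<>();
        boolean accept;

        // Equivalent = same acceptance and same labels leading to the *same* target objects.
        // Identity comparison is enough because targets are already canonical when we compare.
        @Override public boolean equals(Object o) {
            if (!(o instanceof State)) return false;
            State s = (State) o;
            if (accept != s.accept || edges.size() != s.edges.size()) return false;
            Iterator<Map.Entry<Character, State>> a = edges.entrySet().iterator();
            Iterator<Map.Entry<Character, State>> b = s.edges.entrySet().iterator();
            while (a.hasNext()) {
                Map.Entry<Character, State> x = a.next(), y = b.next();
                if (!x.getKey().equals(y.getKey()) || x.getValue() != y.getValue()) return false;
            }
            return true;
        }

        @Override public int hashCode() {
            int h = accept ? 1 : 0;
            for (Map.Entry<Character, State> e : edges.entrySet()) {
                h = 31 * h + e.getKey() * 17 + System.identityHashCode(e.getValue());
            }
            return h;
        }
    }

    private final State root = new State();
    private final Map<State, State> register = new HashMap<>();
    private String previous = null;

    /** Adds a word; words must arrive in strictly increasing String.compareTo order. */
    void add(String word) {
        if (previous != null && word.compareTo(previous) <= 0) {
            throw new IllegalArgumentException("input must be strictly sorted");
        }
        previous = word;
        State s = root;
        int i = 0;
        // Walk the prefix shared with what has been added so far.
        while (i < word.length()) {
            State next = s.edges.get(word.charAt(i));
            if (next == null) break;
            s = next;
            i++;
        }
        // Everything below s on the previous word's path is now final: minimize it.
        if (!s.edges.isEmpty()) replaceOrRegister(s);
        // Append the remaining suffix as fresh states.
        for (; i < word.length(); i++) {
            State n = new State();
            s.edges.put(word.charAt(i), n);
            s = n;
        }
        s.accept = true;
    }

    /** Minimizes the last remaining path and returns the root of the automaton. */
    State finish() {
        if (!root.edges.isEmpty()) replaceOrRegister(root);
        return root;
    }

    // Bottom-up: replace the last child by an equivalent registered state, or register it.
    private void replaceOrRegister(State s) {
        Map.Entry<Character, State> last = s.edges.lastEntry();
        State child = last.getValue();
        if (!child.edges.isEmpty()) replaceOrRegister(child);
        State canonical = register.putIfAbsent(child, child);
        if (canonical != null) s.edges.put(last.getKey(), canonical);
    }

    static boolean accepts(State root, String word) {
        State s = root;
        for (char c : word.toCharArray()) {
            s = s.edges.get(c);
            if (s == null) return false;
        }
        return s.accept;
    }

    /** Counts distinct state objects reachable from root. */
    static int countStates(State root) {
        Set<State> seen = Collections.newSetFromMap(new IdentityHashMap<>());
        Deque<State> todo = new ArrayDeque<>();
        seen.add(root);
        todo.push(root);
        while (!todo.isEmpty()) {
            for (State t : todo.pop().edges.values()) {
                if (seen.add(t)) todo.push(t);
            }
        }
        return seen.size();
    }

    public static void main(String[] args) {
        SortedDfaBuilder b = new SortedDfaBuilder();
        for (String w : new String[] {"tap", "taps", "top", "tops"}) b.add(w);
        System.out.println("states: " + countStates(b.finish()));
    }
}
```

For the four words in main, the shared "p", "ps" tails are merged, so the result has 5 states where a plain trie would need 8.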

> fix algorithmic worst-case in regeneration of URL tokenizer
> -----------------------------------------------------------
>
>                 Key: LUCENE-9231
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9231
>             Project: Lucene - Core
>          Issue Type: Wish
>            Reporter: Robert Muir
>            Priority: Major
>
> For the UAX29URLEmailTokenizer, the regeneration task is slow. It also requires a very large amount of heap space (I just increased mine after seeing it struggle under GC).
> Maybe we can dig into the worst case and figure out what is happening. It seems to be an automaton issue:
> {noformat}
> "main" #1 prio=5 os_prio=0 cpu=132097.25ms elapsed=135.75s tid=0x00007fb1d4018000 nid=0x19706 runnable  [0x00007fb1db3df000]
>    java.lang.Thread.State: RUNNABLE
> 	at jflex.StateSet.add(StateSet.java:218)
> 	at jflex.NFA.closure(NFA.java:387)
> 	at jflex.NFA.epsilonFill(NFA.java:410)
> 	at jflex.NFA.complement(NFA.java:737)
> 	at jflex.NFA.insertNFA(NFA.java:1029)
> 	at jflex.NFA.insertNFA(NFA.java:971)
> 	at jflex.NFA.insertNFA(NFA.java:1029)
> 	at jflex.NFA.insertNFA(NFA.java:972)
> 	at jflex.NFA.insertNFA(NFA.java:987)
> 	at jflex.NFA.insertNFA(NFA.java:988)
> 	at jflex.NFA.insertNFA(NFA.java:987)
> 	at jflex.NFA.insertNFA(NFA.java:971)
> 	at jflex.NFA.insertNFA(NFA.java:1041)
> 	at jflex.NFA.insertNFA(NFA.java:987)
> 	at jflex.NFA.insertNFA(NFA.java:971)
> 	at jflex.NFA.insertNFA(NFA.java:971)
> 	at jflex.NFA.addRegExp(NFA.java:151)
> 	at jflex.LexParse$CUP$LexParse$actions.CUP$LexParse$do_action_part00000000(LexParse.java:1401)
> 	at jflex.LexParse$CUP$LexParse$actions.CUP$LexParse$do_action(LexParse.java:3415)
> 	at jflex.LexParse.do_action(LexParse.java:939)
> 	at java_cup.runtime.lr_parser.parse(lr_parser.java:699)
> 	at jflex.Main.generate(Main.java:73)
> 	at jflex.anttask.JFlexTask.execute(JFlexTask.java:72)
> {noformat}
> Stack samples are typically in {{jflex.StateSet.add(StateSet.java:218)}} or {{jflex.StateSet.complement(StateSet.java:173)}}, among various other operations, but they always come through the {{addRegExp}} .. {{insertNFA}} .. {{complement}} code path.
> It feels like something here has a bad worst-case runtime. I wonder if we can fix it, or at least make it better, e.g. check for a minimum heap size of some GB, print a warning about how long it will take, etc.
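
The "check for a minimum heap and print a warning" fallback suggested above could look roughly like the sketch below. The class name HeapPrecheck and the 2 GB threshold are assumptions made for illustration; the issue does not say how much heap the regeneration actually needs, and this is not code from the Lucene build.

```java
public class HeapPrecheck {
    // Hypothetical threshold: the issue only says "some GB", so 2 GB is a placeholder.
    static final long ASSUMED_MIN_HEAP_BYTES = 2L * 1024 * 1024 * 1024;

    /** Returns a warning message if the max heap is below the minimum, otherwise null. */
    static String heapWarning(long maxHeapBytes, long minHeapBytes) {
        if (maxHeapBytes >= minHeapBytes) {
            return null;
        }
        return String.format(
            "WARNING: max heap is %d MB but at least %d MB is recommended; "
                + "regeneration may be very slow or run out of memory.",
            maxHeapBytes >> 20, minHeapBytes >> 20);
    }

    public static void main(String[] args) {
        // Runtime.maxMemory() reports the most memory the JVM will attempt to use (e.g. -Xmx).
        long maxHeap = Runtime.getRuntime().maxMemory();
        String warning = heapWarning(maxHeap, ASSUMED_MIN_HEAP_BYTES);
        if (warning != null) {
            System.err.println(warning);
        }
    }
}
```

A check like this would run before the JFlex task starts, so the user sees the warning up front instead of discovering the problem through GC pressure.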
