Posted to dev@lucene.apache.org by "Nik Everett (JIRA)" <ji...@apache.org> on 2014/11/03 17:36:34 UTC

[jira] [Updated] (LUCENE-6046) RegExp.toAutomaton high memory use

     [ https://issues.apache.org/jira/browse/LUCENE-6046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nik Everett updated LUCENE-6046:
--------------------------------
    Attachment: LUCENE-6046.patch

First cut at a patch.  Adds maxDeterminizedStates to Operations.determinize and pipes it through to tons of places.  I think it's important never to hide when determinize is called, because of how heavy it can be.  Forcing callers of MinimizationOperations.minimize, Operations.reverse, Operations.minus, etc. to specify maxDeterminizedStates makes it pretty clear that the automaton might be determinized during those operations.

I added an unchecked exception for when the Automaton can't be determinized within the specified number of states, but I'm really tempted to change it to a checked exception to make it super duper obvious when determinization might occur.
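
To make the idea concrete, here is a rough, standalone sketch of a subset construction that gives up once it has built more than a caller-supplied number of states. It is not the code in the patch and does not use Lucene's Automaton classes: the class name BoundedDeterminize, the exception TooManyStatesException, the array-based NFA encoding and the helper nthFromLast are all made up for illustration. Only the parameter name maxDeterminizedStates comes from the description above.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

/** Illustrative only: a toy bounded determinizer, not Lucene's Operations.determinize. */
public class BoundedDeterminize {

  /** Hypothetical unchecked exception, thrown when the state budget is exhausted. */
  public static class TooManyStatesException extends RuntimeException {
    public TooManyStatesException(int maxDeterminizedStates) {
      super("determinizing would need more than " + maxDeterminizedStates + " states");
    }
  }

  /**
   * Subset construction over an NFA encoded as nfa[state][symbol] = target states,
   * with state 0 as the start state. Returns the number of DFA states built, or
   * throws as soon as one more state would exceed maxDeterminizedStates.
   */
  public static int determinize(int[][][] nfa, int alphabetSize, int maxDeterminizedStates) {
    Set<Set<Integer>> dfaStates = new HashSet<>();
    Deque<Set<Integer>> worklist = new ArrayDeque<>();
    Set<Integer> start = new TreeSet<>();
    start.add(0);
    dfaStates.add(start);
    worklist.add(start);
    while (!worklist.isEmpty()) {
      Set<Integer> current = worklist.remove();
      for (int symbol = 0; symbol < alphabetSize; symbol++) {
        // Union of the targets of every NFA state in the current subset.
        Set<Integer> next = new TreeSet<>();
        for (int s : current) {
          for (int t : nfa[s][symbol]) {
            next.add(t);
          }
        }
        if (next.isEmpty() || dfaStates.contains(next)) {
          continue; // dead end, or a subset we have already numbered
        }
        if (dfaStates.size() >= maxDeterminizedStates) {
          // Fail fast instead of growing until the heap runs out.
          throw new TooManyStatesException(maxDeterminizedStates);
        }
        dfaStates.add(next);
        worklist.add(next);
      }
    }
    return dfaStates.size();
  }

  /**
   * NFA for (a|b)*a(a|b){n} over the alphabet {a=0, b=1}. This is the textbook
   * case where the NFA has n + 2 states but the DFA needs 2^(n+1).
   */
  public static int[][][] nthFromLast(int n) {
    int[][][] nfa = new int[n + 2][2][];
    nfa[0][0] = new int[] {0, 1}; // on 'a': stay, or guess this is the marked 'a'
    nfa[0][1] = new int[] {0};    // on 'b': stay
    for (int s = 1; s <= n; s++) {
      nfa[s][0] = new int[] {s + 1};
      nfa[s][1] = new int[] {s + 1};
    }
    nfa[n + 1][0] = new int[0];   // last state has no outgoing transitions
    nfa[n + 1][1] = new int[0];
    return nfa;
  }
}
```

With this sketch, determinize(nthFromLast(3), 2, 1000) builds 16 states and returns normally, while determinize(nthFromLast(30), 2, 1000) throws after 1000 states instead of trying to build 2^31 of them. The counted repetition in the regex from the issue looks like it blows up in a similar way, though I haven't checked that against the real automaton. Switching TooManyStatesException to extend Exception would give the checked variant, which forces every caller to either handle it or declare it.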

> RegExp.toAutomaton high memory use
> ----------------------------------
>
>                 Key: LUCENE-6046
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6046
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/queryparser
>    Affects Versions: 4.10.1
>            Reporter: Lee Hinman
>            Assignee: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-6046.patch
>
>
> When creating an automaton from an org.apache.lucene.util.automaton.RegExp, it's possible for the automaton to use so much memory it exceeds the maximum array size for java.
> The following caused an OutOfMemoryError with a 32gb heap:
> {noformat}
> new RegExp("\\[\\[(Datei|File|Bild|Image):[^]]*alt=[^]|}]{50,200}").toAutomaton();
> {noformat}
> When increased to a 60gb heap, the following exception is thrown:
> {noformat}
>   1> java.lang.IllegalArgumentException: requested array size 2147483624 exceeds maximum array in java (2147483623)
>   1>     __randomizedtesting.SeedInfo.seed([7BE81EF678615C32:95C8057A4ABA5B52]:0)
>   1>     org.apache.lucene.util.ArrayUtil.oversize(ArrayUtil.java:168)
>   1>     org.apache.lucene.util.ArrayUtil.grow(ArrayUtil.java:295)
>   1>     org.apache.lucene.util.automaton.Automaton$Builder.addTransition(Automaton.java:639)
>   1>     org.apache.lucene.util.automaton.Operations.determinize(Operations.java:741)
>   1>     org.apache.lucene.util.automaton.MinimizationOperations.minimizeHopcroft(MinimizationOperations.java:62)
>   1>     org.apache.lucene.util.automaton.MinimizationOperations.minimize(MinimizationOperations.java:51)
>   1>     org.apache.lucene.util.automaton.RegExp.toAutomaton(RegExp.java:477)
>   1>     org.apache.lucene.util.automaton.RegExp.toAutomaton(RegExp.java:426)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
