Posted to dev@lucene.apache.org by "Jack Krupansky (JIRA)" <ji...@apache.org> on 2014/11/27 14:48:12 UTC
[jira] [Commented] (LUCENE-6079) PatternReplaceCharFilter crashes JVM with OutOfMemoryError

    [ https://issues.apache.org/jira/browse/LUCENE-6079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14227671#comment-14227671 ] 

Jack Krupansky commented on LUCENE-6079:
----------------------------------------

But the pattern might in fact need the entire input, such as to match the end of the input with "$".
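
To illustrate the point, here is a minimal sketch (plain java.util.regex, not Lucene code; the class and method names are made up for this example) of how a pattern ending in "$" gives a different result when each chunk is matched in isolation, because "$" then matches at the end of every chunk rather than only at the end of the input:

```java
import java.util.regex.Pattern;

// Hypothetical illustration; not part of Lucene.
final class EndAnchorSketch {

    // Strips trailing whitespace at the end of the input.
    static final Pattern TRAILING_WS = Pattern.compile("\\s+$");

    // Applies the pattern once to the whole input.
    static String wholeInput(String input) {
        return TRAILING_WS.matcher(input).replaceAll("");
    }

    // Naively applies the pattern to each fixed-size chunk in isolation.
    static String perChunk(String input, int chunkSize) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < input.length(); i += chunkSize) {
            String chunk = input.substring(i, Math.min(input.length(), i + chunkSize));
            out.append(TRAILING_WS.matcher(chunk).replaceAll(""));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String input = "a  b  ";
        System.out.println("[" + wholeInput(input) + "]");  // prints [a  b]
        System.out.println("[" + perChunk(input, 3) + "]"); // prints [ab]
    }
}
```

The whitespace between "a" and "b" is lost in the per-chunk case only because it happens to sit at the end of a chunk.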

Still, it would be nice to have an optional "chunked mode" for cases such as this (assuming that the pattern doesn't end with "$"), for example input which is the full text of a multi-MB PDF file. I would suggest that such a mode be the default, with a reasonable chunk size such as 100K. There should also be an "overlap" size so that matches could be made across chunk boundaries: reading the next chunk would start matching with an overlap from the end of the previous chunk, and a match would not be allowed to extend into the overlap area at the end of a chunk unless it is the last chunk.
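
A rough sketch of what such a chunked mode with overlap might look like, using plain java.util.regex rather than the actual Lucene classes. The class name, method name and parameters (chunkSize, overlap) are hypothetical, and this is not how PatternReplaceCharFilter is implemented. It also does no offset correction, and it is only expected to agree with whole-input matching when no match is longer than the overlap and the pattern does not rely on anchors ("^", "$"), lookbehind, or empty matches:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; not part of Lucene.
final class ChunkedReplaceSketch {

    static void replaceChunked(Reader in, Pattern pattern, String replacement,
                               int chunkSize, int overlap, Appendable out)
            throws IOException {
        if (chunkSize <= 0 || overlap < 0) {
            throw new IllegalArgumentException("chunkSize must be > 0 and overlap >= 0");
        }
        StringBuilder buf = new StringBuilder(); // carried-over text plus the current chunk
        char[] chunk = new char[chunkSize];
        boolean last = false;
        while (!last) {
            int n = in.read(chunk, 0, chunkSize);
            if (n < 0) {
                last = true; // end of input: flush whatever is still buffered
            } else {
                buf.append(chunk, 0, n);
            }
            // Matches may not extend past this position unless this is the last chunk.
            int limit = last ? buf.length() : Math.max(0, buf.length() - overlap);
            StringBuffer piece = new StringBuffer();
            Matcher m = pattern.matcher(buf);
            int consumed = 0;      // end of the last accepted match
            int carryFrom = limit; // start of the text kept for the next round
            while (m.find()) {
                if (m.end() > limit) {
                    // The match reaches into the overlap area: retry it with more input.
                    carryFrom = Math.min(m.start(), limit);
                    break;
                }
                m.appendReplacement(piece, replacement);
                consumed = m.end();
            }
            piece.append(buf, consumed, carryFrom); // unmatched text before the carry point
            out.append(piece);
            buf.delete(0, carryFrom);
        }
    }

    public static void main(String[] args) throws IOException {
        StringBuilder out = new StringBuilder();
        replaceChunked(new StringReader("foo bar foo bar foo"),
                Pattern.compile("foo bar"), "X", 5, 6, out);
        System.out.println(out); // prints X X foo
    }
}
```

Memory use is bounded by roughly the chunk size plus the overlap as long as matches stay shorter than the overlap. A match that keeps reaching into the overlap area makes the buffer grow, so a pattern like ".*" would still pull in the entire input.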

Actually, it turns out that there was such a feature, with a "maxBlockChars" parameter, but it was deprecated long ago, with no mention in CHANGES.TXT. The factory code still accepts the parameter, with only a "TODO" comment suggesting that a warning would be appropriate, but the actual Lucene filter constructor simply ignores it.



> PatternReplaceCharFilter crashes JVM with OutOfMemoryError
> ----------------------------------------------------------
>
>                 Key: LUCENE-6079
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6079
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>    Affects Versions: 4.10.2
>         Environment: Microsoft Windows, x86_64, 32 GB main memory
>            Reporter: Alexander Veit
>            Priority: Critical
>
> PatternReplaceCharFilter fills memory with input data until an OutOfMemoryError is thrown.
> java.lang.OutOfMemoryError: Java heap space
> 	at java.util.Arrays.copyOf(Arrays.java:3332)
> 	at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:137)
> 	at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:121)
> 	at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:569)
> 	at java.lang.StringBuilder.append(StringBuilder.java:190)
> 	at org.apache.lucene.analysis.pattern.PatternReplaceCharFilter.fill(PatternReplaceCharFilter.java:84)
> 	at org.apache.lucene.analysis.pattern.PatternReplaceCharFilter.read(PatternReplaceCharFilter.java:74)
>     ...
> PatternReplaceCharFilter should read data chunk-wise and pass the transformed output chunk-wise to the caller.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
