Posted to dev@poi.apache.org by bu...@apache.org on 2022/08/02 12:01:57 UTC

[Bug 66197] New: OutOfMemoryError occurs while parsing doc file using tika-app which contains poi of the above version

https://bz.apache.org/bugzilla/show_bug.cgi?id=66197

            Bug ID: 66197
           Summary: OutOfMemoryError occurs while parsing doc file using
                    tika-app which contains poi of the above version
           Product: POI
           Version: 5.2.2-FINAL
          Hardware: PC
            Status: NEW
          Severity: blocker
          Priority: P2
         Component: HWPF
          Assignee: dev@poi.apache.org
          Reporter: arjun.cse97@gmail.com
  Target Milestone: ---

-- 
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


[Bug 66197] OutOfMemoryError occurs while parsing doc file using tika-app which contains poi of the above version. When tried to use command line Caused by: org.apache.poi.util.RecordFormatException: Tried to allocate an array of length 102,853,589

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=66197

earl <ar...@gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|OutOfMemoryError occurs     |OutOfMemoryError occurs
                   |while parsing doc file      |while parsing doc file
                   |using tika-app which        |using tika-app which
                   |contains poi of the above   |contains poi of the above
                   |version                     |version. When tried to use
                   |                            |command line Caused by:
                   |                            |org.apache.poi.util.RecordF
                   |                            |ormatException: Tried to
                   |                            |allocate an array of length
                   |                            |102,853,589
                 OS|                            |All

--- Comment #1 from earl <ar...@gmail.com> ---
Increasing the limit with IOUtils.setByteArrayMaxOverride() would not help, since
we cannot predict the maximum file size on the customer's end.
Full stack trace:
Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@662706a7
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:312)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:180)
        at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:1086)
        at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:510)
        at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:259)
Caused by: org.apache.poi.util.RecordFormatException: Tried to allocate an array of length 102,853,589, but the maximum length for this record type is 100,000,000.
If the file is not corrupt or large, please open an issue on bugzilla to request increasing the maximum allowable size for this record type.
As a temporary workaround, consider setting a higher override value with IOUtils.setByteArrayMaxOverride()
        at org.apache.poi.util.IOUtils.throwRFE(IOUtils.java:599)
        at org.apache.poi.util.IOUtils.checkLength(IOUtils.java:276)
        at org.apache.poi.util.IOUtils.safelyAllocateCheck(IOUtils.java:561)
        at org.apache.poi.util.IOUtils.safelyClone(IOUtils.java:575)
        at org.apache.poi.hwpf.model.TextPieceTable.<init>(TextPieceTable.java:118)
        at org.apache.poi.hwpf.model.ComplexFileTable.newTextPieceTable(ComplexFileTable.java:111)
        at org.apache.poi.hwpf.model.ComplexFileTable.<init>(ComplexFileTable.java:72)
        at org.apache.poi.hwpf.model.ComplexFileTable.<init>(ComplexFileTable.java:77)
        at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:283)
        at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:152)
        at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:218)
        at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:175)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
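
The trace above shows the POI exception arriving as the cause of a TikaException rather than at the top level. A caller that wants to report this case distinctly could walk the cause chain. The following is a minimal sketch using only the JDK; the class `CauseChain` and the helper `findCause` are hypothetical names, and the POI class is matched by its name (copied from the trace) so that the example compiles without POI on the classpath.

```java
public class CauseChain {
    // Class name copied from the stack trace; compared as a string so this
    // sketch has no compile-time dependency on POI.
    public static final String RFE = "org.apache.poi.util.RecordFormatException";

    /**
     * Returns the first throwable in the cause chain whose exact class name
     * matches, or null if there is none. Subclasses are not matched.
     */
    public static Throwable findCause(Throwable t, String className) {
        // Bound the walk so a cyclic cause chain cannot loop forever.
        for (int depth = 0; t != null && depth < 32; depth++, t = t.getCause()) {
            if (t.getClass().getName().equals(className)) {
                return t;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Stand-in for the shape seen in the trace: an outer exception
        // wrapping the one of interest.
        Exception inner = new IllegalStateException("Tried to allocate an array of length 102,853,589");
        Exception outer = new RuntimeException("Unexpected RuntimeException from parser", inner);

        System.out.println(findCause(outer, "java.lang.IllegalStateException") == inner); // true
        System.out.println(findCause(outer, RFE) == null);                                // true
    }
}
```

In an application that does have POI on the classpath, an `instanceof` check against the real class would be the more natural choice.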

--- Comment #4 from Nick Burch <ap...@gagravarr.org> ---
There's no chance a 460 MB file will load into a 1024 MB heap. I'd suggest trying
something like 8 GB as a minimum; as a ballpark, expect 10-20x expansion when a
document is loaded into memory.
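
The rule of thumb above can be turned into a quick estimate. The following is a rough sketch, not a measurement: it takes the roughly 460 MB file size given by the reporter and the 10-20x expansion range, and compares the result with the heap the JVM was started with. The class and method names are hypothetical.

```java
public class HeapEstimate {
    public static final long MB = 1024L * 1024L;

    /** Estimated heap in bytes needed to load a file, for a given expansion factor. */
    public static long estimateHeapBytes(long fileBytes, int expansionFactor) {
        // multiplyExact throws ArithmeticException instead of overflowing silently.
        return Math.multiplyExact(fileBytes, (long) expansionFactor);
    }

    /** True if the JVM's maximum heap is at least the estimate for minFactor. */
    public static boolean heapLooksSufficient(long fileBytes, int minFactor) {
        return Runtime.getRuntime().maxMemory() >= estimateHeapBytes(fileBytes, minFactor);
    }

    public static void main(String[] args) {
        long fileBytes = 460 * MB; // approximate file size from this thread

        System.out.println("low estimate (MB):  " + estimateHeapBytes(fileBytes, 10) / MB); // 4600
        System.out.println("high estimate (MB): " + estimateHeapBytes(fileBytes, 20) / MB); // 9200
        System.out.println("max heap (MB):      " + Runtime.getRuntime().maxMemory() / MB);
        System.out.println("sufficient at 10x:  " + heapLooksSufficient(fileBytes, 10));
    }
}
```

With these numbers the estimate is roughly 4.6 to 9.2 GB, which is consistent with the 8 GB suggestion and far above a 1024 MB heap.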

--- Comment #5 from earl <ar...@gmail.com> ---
Can we be sure about this? Since we are shipping this fix to customers, we may
not be able to verify it ourselves; instead, we may receive an escalation mail
stating that it's not working!

--- Comment #2 from PJ Fanning <fa...@yahoo.com> ---
This doesn't look like an OutOfMemoryError - could you adjust the issue
description?

You could use IOUtils.setByteArrayMaxOverride() and set it to a large number,
but note that this setting affects a lot of different parts of the POI code.

There is also TextPieceTable.setMaxRecordLength(), which only affects the part
of the POI code where you are getting this error.
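
The difference between the two settings can be illustrated with a small model. This is a self-contained stand-in, not POI's code: the fields `globalOverride` and `textPieceLimit` and the method `checkLength` are hypothetical, and the sketch assumes that a positive global override replaces every per-record limit, as the description above suggests.

```java
public class LengthLimitSketch {
    // Hypothetical stand-ins for the two knobs: a global override (zero or
    // negative means "not set") and the default limit of one record type.
    public static int globalOverride = -1;
    public static int textPieceLimit = 100_000_000;

    /** Throws if the requested length exceeds the limit currently in force. */
    public static void checkLength(long requested, int recordLimit) {
        int effective = globalOverride > 0 ? globalOverride : recordLimit;
        if (requested > effective) {
            throw new IllegalStateException(String.format(
                "Tried to allocate an array of length %,d, but the maximum is %,d",
                requested, effective));
        }
    }

    /** Convenience wrapper: true if checkLength would not throw. */
    public static boolean allowed(long requested, int recordLimit) {
        try {
            checkLength(requested, recordLimit);
            return true;
        } catch (IllegalStateException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        long requested = 102_853_589L; // length from the report

        System.out.println(allowed(requested, textPieceLimit)); // false: default limit

        textPieceLimit = 150_000_000;  // narrow change: one record type only
        System.out.println(allowed(requested, textPieceLimit)); // true
        System.out.println(allowed(requested, 100_000_000));    // false: other limits unchanged

        globalOverride = 150_000_000;  // broad change: applies to every check
        System.out.println(allowed(requested, 100_000_000));    // true
    }
}
```

In this model the narrower setting leaves the guards for other record types in place, which is the reason to prefer it when only one code path is hitting its limit.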

--- Comment #3 from earl <ar...@gmail.com> ---
The above error occurred during command-line execution. We actually use the Tika
parser (OfficeParser in this case) to parse documents. While parsing a .doc file
of around 460 MB with a heap size of around 1024 MB, an OutOfMemoryError
occurred. I'll attach that stack trace too.
Stack trace:
  at java.lang.OutOfMemoryError.<init>()V (OutOfMemoryError.java:48)
  at java.util.Arrays.copyOf([BI)[B (Arrays.java:3236)
  at java.io.ByteArrayOutputStream.toByteArray()[B (ByteArrayOutputStream.java:191)
  at org.apache.poi.util.IOUtils.toByteArray(Ljava/io/InputStream;JI)[B (IOUtils.java:199)
  at org.apache.poi.util.IOUtils.toByteArray(Ljava/io/InputStream;I)[B (IOUtils.java:149)
  at org.apache.poi.hwpf.HWPFDocumentCore.getDocumentEntryBytes(Ljava/lang/String;II)[B (HWPFDocumentCore.java:331)
  at org.apache.poi.hwpf.HWPFDocumentCore.<init>(Lorg/apache/poi/poifs/filesystem/DirectoryNode;)V (HWPFDocumentCore.java:169)
  at org.apache.poi.hwpf.HWPFDocument.<init>(Lorg/apache/poi/poifs/filesystem/DirectoryNode;)V (HWPFDocument.java:193)
  at org.apache.tika.parser.microsoft.WordExtractor.parse(Lorg/apache/poi/poifs/filesystem/DirectoryNode;Lorg/apache/tika/sax/XHTMLContentHandler;)V (WordExtractor.java:152)
  at org.apache.tika.parser.microsoft.OfficeParser.parse(Lorg/apache/poi/poifs/filesystem/DirectoryNode;Lorg/apache/tika/parser/ParseContext;Lorg/apache/tika/metadata/Metadata;Lorg/apache/tika/sax/XHTMLContentHandler;)V (OfficeParser.java:216)
  at org.apache.tika.parser.microsoft.OfficeParser.parse(Ljava/io/InputStream;Lorg/xml/sax/ContentHandler;Lorg/apache/tika/metadata/Metadata;Lorg/apache/tika/parser/ParseContext;)V (OfficeParser.java:173)
  at org.apache.tika.parser.CompositeParser.parse(Ljava/io/InputStream;Lorg/xml/sax/ContentHandler;Lorg/apache/tika/metadata/Metadata;Lorg/apache/tika/parser/ParseContext;)V (CompositeParser.java:289)
  at org.apache.tika.parser.CompositeParser.parse(Ljava/io/InputStream;Lorg/xml/sax/ContentHandler;Lorg/apache/tika/metadata/Metadata;Lorg/apache/tika/parser/ParseContext;)V (CompositeParser.java:289)
  at org.apache.tika.parser.AutoDetectParser.parse(Ljava/io/InputStream;Lorg/xml/sax/ContentHandler;Lorg/apache/tika/metadata/Metadata;Lorg/apache/tika/parser/ParseContext;)V (AutoDetectParser.java:150)

In the dominator tree, the thread that occupies the most memory holds a byte
array (size=173960244) with the following data:
.............................s!...bjbjS)S)......................4l^.1C.g1C.g.k!.......................................................................................................................................................................8...L...,...x0..................R...4.......4.......4.......4.......4.......#.......#.......#...........................................................$...Y...........X...................................#.......................#.......#.......#.......#.......................................4...............4...............C.......C.......C.......#...............4...............4.......................C.......................................................#.......................C.......C...........t...........................................................................d.......4..................FQ...................3.......T...............r...........0...........\.......g.......C.......g.......d.....................................................................

I'm sorry I didn't ask the question clearly initially.
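
The top frames of the stack trace above (ByteArrayOutputStream.toByteArray calling Arrays.copyOf) indicate that the error was raised while copying buffered stream contents into a new array. Per the JDK documentation, toByteArray() returns a newly allocated array, so for a moment both the stream's internal buffer and the copy have to fit in the heap. A small JDK-only sketch of that behaviour follows; the class and method names are hypothetical, and the size used is deliberately tiny.

```java
import java.io.ByteArrayOutputStream;

public class BufferCopyDemo {
    /** Writes n zero bytes into the stream, then returns the snapshot from toByteArray(). */
    public static byte[] bufferAndCopy(ByteArrayOutputStream out, int n) {
        byte[] chunk = new byte[4096];
        int remaining = n;
        while (remaining > 0) {
            int len = Math.min(chunk.length, remaining);
            out.write(chunk, 0, len); // the internal buffer grows as needed
            remaining -= len;
        }
        return out.toByteArray();     // allocates a second array holding n bytes
    }

    public static void main(String[] args) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] copy = bufferAndCopy(out, 1_000_000);

        System.out.println(copy.length); // 1000000
        System.out.println(out.size());  // 1000000: the stream still holds its own buffer

        // Changing the copy does not affect the stream's contents,
        // which shows that the two arrays are distinct.
        copy[0] = 1;
        System.out.println(out.toByteArray()[0]); // 0
    }
}
```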

Dominik Stadler <do...@gmx.at> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |WORKSFORME
           Priority|P2                          |P3
             Status|NEW                         |RESOLVED
           Severity|blocker                     |normal

--- Comment #6 from Dominik Stadler <do...@gmx.at> ---
This is exactly the expected behavior. The default settings of Apache POI
prevent a full OOM, and you are free to increase this limit in your application
if you increase your heap settings along the way.

If you want to allow your customers to process such an incredibly huge file,
you will need to provide a correspondingly large Java heap.

I don't see what we could fix in Apache POI here at all.
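
One way to keep the limit and the heap in step, as suggested above, is to derive the limit from the heap size at startup instead of hard-coding it. The sketch below is only an illustration of that idea: the class `LimitFromHeap`, the method `suggestLimit`, the divisor of 10, and the cap are all hypothetical choices, not POI recommendations.

```java
public class LimitFromHeap {
    // Conservative upper bound for a Java array length (hypothetical policy choice).
    public static final int HARD_CAP = Integer.MAX_VALUE - 8;

    /**
     * Suggests a byte-array limit as a fraction of the maximum heap.
     * The divisor is an assumed safety factor chosen by the application.
     */
    public static int suggestLimit(long maxHeapBytes, int divisor) {
        if (maxHeapBytes <= 0 || divisor <= 0) {
            throw new IllegalArgumentException("heap size and divisor must be positive");
        }
        return (int) Math.min((long) HARD_CAP, maxHeapBytes / divisor);
    }

    public static void main(String[] args) {
        long gib = 1024L * 1024L * 1024L;

        System.out.println(suggestLimit(1 * gib, 10));  // 107374182
        System.out.println(suggestLimit(8 * gib, 10));  // 858993459
        System.out.println(suggestLimit(64 * gib, 10)); // 2147483639 (capped)

        // Value for the JVM this sketch is running in.
        System.out.println(suggestLimit(Runtime.getRuntime().maxMemory(), 10));
    }
}
```

The resulting value would then be passed to whichever limit setter the application already uses, so that a deployment with a larger heap automatically accepts larger records.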
