Posted to dev@tika.apache.org by "Tim Allison (Jira)" <ji...@apache.org> on 2022/07/26 13:22:00 UTC

[jira] [Commented] (TIKA-3823) OutOfMemoryError occurs while parsing a doc file

    [ https://issues.apache.org/jira/browse/TIKA-3823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17571417#comment-17571417 ] 

Tim Allison commented on TIKA-3823:
-----------------------------------

bq. not sure about the uncompressed size.

This is not the same as the uncompressed size, but to get a rough sense of how much text the file holds, you can try opening it in MS Word or OpenOffice/LibreOffice and saving it as text.

There's not much we can do without the test file.  

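I would recommend trying to run a more recent version of Tika against the file, say {{java -Xmx2g -jar tika-app-2.4.1.jar yourfile.doc}}, and seeing if you still get an OOM. Note that JVM options such as {{-Xmx2g}} have to come before {{-jar}}; anything after the jar name is passed to the application as an argument.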
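
If it is unclear whether a heap setting such as {{-Xmx2g}} actually reached the JVM, a small check along these lines may help. This is only a minimal sketch using the standard {{Runtime}} API; the class name {{HeapCheck}} and the helper {{maxHeapMiB}} are hypothetical names chosen for illustration and are not part of Tika.

```java
public class HeapCheck {

    /**
     * Returns the maximum amount of memory the JVM will attempt to use,
     * converted to mebibytes. HeapCheck and maxHeapMiB are illustrative
     * names, not part of Tika or POI.
     */
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        // Print the limit so it can be compared with the -Xmx value that was passed.
        System.out.println("Max heap: " + maxHeapMiB() + " MiB");
    }
}
```

Running it with the same JVM options, for example {{java -Xmx2g HeapCheck}}, should report a value close to 2048 MiB; the exact figure can be somewhat lower depending on the garbage collector in use.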

> OutOfMemoryError occurs while parsing a doc file
> ------------------------------------------------
>
>                 Key: TIKA-3823
>                 URL: https://issues.apache.org/jira/browse/TIKA-3823
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.23
>            Reporter: earl
>            Priority: Blocker
>
> An OutOfMemoryError occurs while parsing a doc file of size 450 MB; I am not sure about the uncompressed size. In the heap dump, the thread that parses the file holds a byte array of around 450 MB. The heap size is set to 2 GB, but the issue persists.
> Stack trace:
> {code:java}
>   at java.lang.OutOfMemoryError.<init>()V (OutOfMemoryError.java:48)
>   at java.util.Arrays.copyOf([BI)[B (Arrays.java:3236)
>   at java.io.ByteArrayOutputStream.toByteArray()[B (ByteArrayOutputStream.java:191)
>   at org.apache.poi.hwpf.HWPFDocumentCore.getDocumentEntryBytes(Ljava/lang/String;II)[B (HWPFDocumentCore.java:353)
>   at org.apache.poi.hwpf.HWPFDocument.<init>(Lorg/apache/poi/poifs/filesystem/DirectoryNode;)V (HWPFDocument.java:214)
>   at org.apache.tika.parser.microsoft.WordExtractor.parse(Lorg/apache/poi/poifs/filesystem/DirectoryNode;Lorg/apache/tika/sax/XHTMLContentHandler;)V (WordExtractor.java:156)
>   at org.apache.tika.parser.microsoft.OfficeParser.parse(Lorg/apache/poi/poifs/filesystem/DirectoryNode;Lorg/apache/tika/parser/ParseContext;Lorg/apache/tika/metadata/Metadata;Lorg/apache/tika/sax/XHTMLContentHandler;)V (OfficeParser.java:175)
>   at org.apache.tika.parser.microsoft.OfficeParser.parse(Ljava/io/InputStream;Lorg/xml/sax/ContentHandler;Lorg/apache/tika/metadata/Metadata;Lorg/apache/tika/parser/ParseContext;)V (OfficeParser.java:131)
>   at org.apache.tika.parser.CompositeParser.parse(Ljava/io/InputStream;Lorg/xml/sax/ContentHandler;Lorg/apache/tika/metadata/Metadata;Lorg/apache/tika/parser/ParseContext;)V (CompositeParser.java:280)
>   at org.apache.tika.parser.CompositeParser.parse(Ljava/io/InputStream;Lorg/xml/sax/ContentHandler;Lorg/apache/tika/metadata/Metadata;Lorg/apache/tika/parser/ParseContext;)V (CompositeParser.java:280)
>   at org.apache.tika.parser.AutoDetectParser.parse(Ljava/io/InputStream;Lorg/xml/sax/ContentHandler;Lorg/apache/tika/metadata/Metadata;Lorg/apache/tika/parser/ParseContext;)V (AutoDetectParser.java:143)
> {code}
> The byte array contains something like "....D.d.....................|...L.P.....................................h.." followed by some XML data. Please let me know what the issue is and what this means.


