Posted to dev@tika.apache.org by "Tim Allison (Jira)" <ji...@apache.org> on 2022/07/26 13:23:00 UTC

[jira] [Comment Edited] (TIKA-3823) OutOfMemoryError occurs while parsing a doc file

    [ https://issues.apache.org/jira/browse/TIKA-3823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17571417#comment-17571417 ] 

Tim Allison edited comment on TIKA-3823 at 7/26/22 1:22 PM:
------------------------------------------------------------

bq. not sure about the uncompressed size.

This is not the same as uncompressed size, but you can try opening the file in MSWord or Open/LibreOffice and saving as text. Clearly, though, the problem occurs while loading the file, not while buffering the extracted text in memory during parsing.

There's not much we can do without the test file.  

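I would recommend running a more recent version of Tika against the file, say {{java -Xmx2g -jar tika-app-2.4.1.jar yourfile.doc}} (JVM options such as {{-Xmx}} have to come before {{-jar}}, otherwise they are passed to the application as arguments), and seeing whether you still get an OOM.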
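
To confirm which heap limit the JVM actually picked up, the maximum heap can be printed from inside Java. This is only a minimal sketch; the class name {{HeapCheck}} is hypothetical and not part of Tika.

```java
public class HeapCheck {

    /** Returns the maximum heap the JVM will attempt to use, in MiB. */
    public static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        // Prints the effective limit, e.g. the value derived from -Xmx.
        System.out.println("Max heap: " + maxHeapMiB() + " MiB");
    }
}
```

Saved as {{HeapCheck.java}}, this should run on Java 11 or later with {{java -Xmx2g HeapCheck.java}}. The reported value is typically at or slightly below the requested limit, depending on the garbage collector.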


was (Author: tallison@mitre.org):
bq. not sure about the uncompressed size.

This is not the same as uncompressed size, but you can try opening the file in MSWord or Open/LibreOffice and saving as text.

There's not much we can do without the test file.  

I would recommend trying to run a more recent version of tika against the file, say {{java -Xmx2g -jar tika-app-2.4.1.jar yourfile.doc}} and seeing if you still get an oom.

> OutOfMemoryError occurs while parsing a doc file
> ------------------------------------------------
>
>                 Key: TIKA-3823
>                 URL: https://issues.apache.org/jira/browse/TIKA-3823
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.23
>            Reporter: earl
>            Priority: Blocker
>
> OutOfMemoryError occurs while parsing a doc file of size 450 MB, not sure about the uncompressed size. While analyzing the heap dump, the thread that parses that file has a byte array of size around 450 MB. The heap size is set to 2 GB, but the issue still persists.
> Stacktrace
> {code:java}
>   at java.lang.OutOfMemoryError.<init>()V (OutOfMemoryError.java:48)
>   at java.util.Arrays.copyOf([BI)[B (Arrays.java:3236)
>   at java.io.ByteArrayOutputStream.toByteArray()[B (ByteArrayOutputStream.java:191)
>   at org.apache.poi.hwpf.HWPFDocumentCore.getDocumentEntryBytes(Ljava/lang/String;II)[B (HWPFDocumentCore.java:353)
>   at org.apache.poi.hwpf.HWPFDocument.<init>(Lorg/apache/poi/poifs/filesystem/DirectoryNode;)V (HWPFDocument.java:214)
>   at org.apache.tika.parser.microsoft.WordExtractor.parse(Lorg/apache/poi/poifs/filesystem/DirectoryNode;Lorg/apache/tika/sax/XHTMLContentHandler;)V (WordExtractor.java:156)
>   at org.apache.tika.parser.microsoft.OfficeParser.parse(Lorg/apache/poi/poifs/filesystem/DirectoryNode;Lorg/apache/tika/parser/ParseContext;Lorg/apache/tika/metadata/Metadata;Lorg/apache/tika/sax/XHTMLContentHandler;)V (OfficeParser.java:175)
>   at org.apache.tika.parser.microsoft.OfficeParser.parse(Ljava/io/InputStream;Lorg/xml/sax/ContentHandler;Lorg/apache/tika/metadata/Metadata;Lorg/apache/tika/parser/ParseContext;)V (OfficeParser.java:131)
>   at org.apache.tika.parser.CompositeParser.parse(Ljava/io/InputStream;Lorg/xml/sax/ContentHandler;Lorg/apache/tika/metadata/Metadata;Lorg/apache/tika/parser/ParseContext;)V (CompositeParser.java:280)
>   at org.apache.tika.parser.CompositeParser.parse(Ljava/io/InputStream;Lorg/xml/sax/ContentHandler;Lorg/apache/tika/metadata/Metadata;Lorg/apache/tika/parser/ParseContext;)V (CompositeParser.java:280)
>   at org.apache.tika.parser.AutoDetectParser.parse(Ljava/io/InputStream;Lorg/xml/sax/ContentHandler;Lorg/apache/tika/metadata/Metadata;Lorg/apache/tika/parser/ParseContext;)V (AutoDetectParser.java:143)
> {code}
> The byte array contains something like "....D.d.....................|...L.P.....................................h.." followed by some XML data. Please let me know what the issue is and what this means.
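
The stack trace above fails in {{Arrays.copyOf}}, called from {{ByteArrayOutputStream.toByteArray()}}. That method returns a newly allocated copy of the stream's internal buffer, so both arrays are live at the same time. The sketch below illustrates this with the plain JDK only; the class and method names are hypothetical, and the estimate is a lower bound, since the internal buffer can be larger than the data it holds.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

public class ToByteArrayCopyDemo {

    /**
     * Lower bound, in bytes, on the heap held while toByteArray() runs
     * for n buffered bytes: the internal buffer plus the returned copy.
     */
    public static long minTransientBytes(long n) {
        return 2L * n;
    }

    /** Checks that each toByteArray() call returns a distinct array with equal content. */
    public static boolean returnsFreshCopy() {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(new byte[] {1, 2, 3}, 0, 3);
        byte[] first = out.toByteArray();
        byte[] second = out.toByteArray();
        return first != second && Arrays.equals(first, second);
    }

    public static void main(String[] args) {
        long fileBytes = 450L * 1024 * 1024; // size reported in the issue
        System.out.println("Fresh copy per call: " + returnsFreshCopy());
        System.out.println("Minimum transient MiB: "
                + minTransientBytes(fileBytes) / (1024 * 1024));
    }
}
```

Under these assumptions, a 450 MB entry needs at least roughly 900 MB of heap at the moment of the copy, before counting anything else the parser holds, which may explain why a 2 GB heap is not enough.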



--
This message was sent by Atlassian Jira
(v8.20.10#820010)