Posted to dev@tika.apache.org by "Tim Allison (Jira)" <ji...@apache.org> on 2020/06/17 10:18:00 UTC
[jira] [Comment Edited] (TIKA-3097) Out of memory while parsing docx

    [ https://issues.apache.org/jira/browse/TIKA-3097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17138315#comment-17138315 ] 

Tim Allison edited comment on TIKA-3097 at 6/17/20, 10:17 AM:
--------------------------------------------------------------

Yes, even if the file is read as a stream. IIRC, some parsers only work with files because they need random access to the content. For example, if the xlsx parser hits sheet1.xml before hitting sharedStrings.xml as it streams the zip entries, it’d be out of luck, because the cells in the sheet refer to the shared strings by index.
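
To illustrate the ordering problem, here is a minimal sketch that uses only java.util.zip and is not Tika's actual code. The class and method names (ZipOrderSketch, buildZip, streamedOrder, readByName) are hypothetical; the entry names follow the usual xlsx layout. A ZipInputStream can only hand back entries in the order they were stored, while a ZipFile, which needs a real file, can look an entry up by name regardless of order. Assumes Java 9+ for readAllBytes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical illustration, not part of Tika or POI.
public class ZipOrderSketch {

    // Build a tiny zip in memory whose entries are stored in the given order.
    public static byte[] buildZip(String... names) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (String name : names) {
                zip.putNextEntry(new ZipEntry(name));
                zip.write(("content of " + name).getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        }
        return bytes.toByteArray();
    }

    // Streaming access: entries arrive in storage order, with no way to skip ahead.
    public static List<String> streamedOrder(byte[] zipBytes) throws IOException {
        List<String> seen = new ArrayList<>();
        try (ZipInputStream in = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry entry;
            while ((entry = in.getNextEntry()) != null) {
                seen.add(entry.getName());
            }
        }
        return seen;
    }

    // Random access: with a file on disk, an entry can be fetched by name first.
    public static String readByName(Path zipPath, String name) throws IOException {
        try (ZipFile zip = new ZipFile(zipPath.toFile())) {
            ZipEntry entry = zip.getEntry(name);
            if (entry == null) {
                return null;
            }
            try (InputStream in = zip.getInputStream(entry)) {
                return new String(in.readAllBytes(), StandardCharsets.UTF_8);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // The sheet is deliberately stored before the shared strings.
        byte[] zip = buildZip("xl/worksheets/sheet1.xml", "xl/sharedStrings.xml");
        System.out.println("streamed order: " + streamedOrder(zip));

        Path tmp = Files.createTempFile("zip-order-sketch", ".zip");
        try {
            Files.write(tmp, zip);
            System.out.println(readByName(tmp, "xl/sharedStrings.xml"));
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

A streaming parser that meets the sheet first has to either buffer it until the shared strings show up or spool the whole stream to a temporary file, which is why reading from a stream does not by itself reduce memory use.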

Even without needing random access, some parsers may choose to build the document components in memory for various reasons before we can extract text.
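
As a rough sketch of that trade-off, using only the JDK's XML APIs rather than any parser Tika actually depends on: the DOM version builds the whole tree in memory before any text comes out, while the SAX version appends text as events arrive. The class and method names here are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration, not part of Tika or its dependencies.
public class InMemoryVsStreamingSketch {

    // DOM: the full document tree is materialized before text is extracted.
    public static String textViaDom(byte[] xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml));
        return doc.getDocumentElement().getTextContent();
    }

    // SAX: text is collected event by event; no tree is kept around.
    public static String textViaSax(byte[] xml) throws Exception {
        StringBuilder text = new StringBuilder();
        SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xml),
                new DefaultHandler() {
                    @Override
                    public void characters(char[] ch, int start, int length) {
                        text.append(ch, start, length);
                    }
                });
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        byte[] xml = "<doc><p>Hello</p><p>World</p></doc>".getBytes(StandardCharsets.UTF_8);
        System.out.println(textViaDom(xml));
        System.out.println(textViaSax(xml));
    }
}
```

Both methods return the same text for this input; the difference is in peak memory, which for the tree-building approach grows with the size of the document rather than with the size of the extracted text.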

We try to stream where we can, but some file formats are less than helpful for streaming, and some of the parsers in our dependencies are not optimized for text extraction.

If you find obvious areas for improvement, let us know.


was (Author: tallison@mitre.org):
Yes, even if the file is read as a stream. IIRC, some files only work with files because they need random access to the stream. For example, if the xlsx parser hits  sheet1.xml before hitting the sharedstrings.xml as it streams the zip entries, it’d be out of luck.

Even without needing random access, some parsers may choose to build the document components in memory for various reasons before we can extract text.

We try to stream as we can, but some file formats are less than helpful for streaming and some of the parsers in our dependencies are not optimized for text extraction.

If you find obvious areas for improvements, let us know.

> Out of memory while parsing docx
> --------------------------------
>
>                 Key: TIKA-3097
>                 URL: https://issues.apache.org/jira/browse/TIKA-3097
>             Project: Tika
>          Issue Type: Bug
>          Components: core, parser
>    Affects Versions: 1.24
>            Reporter: suchendra
>            Priority: Major
>         Attachments: Screenshot from 2020-05-07 08-14-25.png, samplefile.txt, test.docx
>
>
> I have written simple Scala code to extract the content from an uploaded docx file. The JVM goes OOM when Tika tries to parse the file. I configured the JVM heap to 1GB and also tried 2GB; the same issue occurs, both with the jar and in my code.
> Attached the file for reference.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)