Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2017/01/18 15:02:26 UTC

[jira] [Commented] (TIKA-2109) OutOfMemory when parsing 5MB word document

    [ https://issues.apache.org/jira/browse/TIKA-2109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15828228#comment-15828228 ] 

Tim Allison commented on TIKA-2109:
-----------------------------------

One feature of this document is that it is highly repetitive, which means that it compresses dramatically.  If you decompress the document.xml file, it weighs in at 70MB.  Our classic docx parser relies on XMLBeans and DOM, which means that it loads the entire document into memory and adds per-object overhead for the beans.  I was able to parse it with 1.5g of heap, but yes, that is bloated.
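
As a rough illustration of why a small .docx can hide a large document.xml, here is a minimal sketch that builds a repetitive XML string and measures how far DEFLATE shrinks it. The class name, method names, and sample markup are hypothetical and only stand in for the attached file; the sketch uses the zlib wrapper from java.util.zip rather than an actual zip archive.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.DeflaterOutputStream;

// Hypothetical helper, not part of Tika or POI.
class RepetitiveXmlCompression {

    /** Builds a repetitive WordprocessingML-like payload: one paragraph repeated n times. */
    static byte[] repetitiveXml(int paragraphs) {
        StringBuilder sb = new StringBuilder("<w:body>");
        for (int i = 0; i < paragraphs; i++) {
            sb.append("<w:p><w:r><w:t>The same sentence again.</w:t></w:r></w:p>");
        }
        sb.append("</w:body>");
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    /** Returns the size in bytes of the input after DEFLATE compression (zlib framing). */
    static int deflatedSize(byte[] raw) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // Closing the stream flushes the final compressed block into 'out'.
        try (DeflaterOutputStream dos = new DeflaterOutputStream(out)) {
            dos.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.size();
    }

    public static void main(String[] args) {
        byte[] raw = repetitiveXml(100_000);
        int packed = deflatedSize(raw);
        System.out.printf("raw=%d bytes, deflated=%d bytes, ratio=%.0fx%n",
                raw.length, packed, (double) raw.length / packed);
    }
}
```

The on-disk size of a .docx therefore says little about how much XML a DOM-based parser has to hold in memory once the entry is inflated.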

If you switch to the new experimental SAX parser for docx (TIKA-1321), the file parses with a far smaller memory footprint:

{noformat}
java -Xmx64m -jar tika-app-1.15-SNAPSHOT.jar --config=tika_config.xml TIKA-2109.docx > testout.txt
{noformat}

where tika_config.xml is:
{noformat}
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <!--<parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>-->
    </parser>
    <parser class="org.apache.tika.parser.microsoft.ooxml.OOXMLParser">
      <params>
        <param name="useSAXDocxExtractor" type="bool">true</param>
        <param name="includeDeletedContent" type="bool">true</param>
        <param name="includeMoveFromContent" type="bool">true</param>
      </params>
    </parser>
  </parsers>
</properties>
{noformat}
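
To show why a SAX-based approach needs so much less heap than DOM, here is a minimal sketch of streaming text extraction with the JDK's own SAX parser. It is not Tika's actual SAX docx extractor; the class name, the method name, and the simplification of matching only w:t and w:p by qualified name are assumptions made for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch, not Tika's implementation.
class StreamingTextSketch {

    /** Collects the text inside <w:t> elements as the parser streams past them. */
    static String extractText(InputStream xml) {
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            private boolean inText;

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if ("w:t".equals(qName)) {
                    inText = true;
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("w:t".equals(qName)) {
                    inText = false;
                } else if ("w:p".equals(qName)) {
                    text.append('\n'); // one output line per paragraph
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // Only the current event is visible here; no element tree is retained.
                if (inText) {
                    text.append(ch, start, length);
                }
            }
        };
        try {
            // The default factory is not namespace-aware, so qName carries the "w:" prefix.
            SAXParserFactory.newInstance().newSAXParser().parse(xml, handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
        return text.toString();
    }
}
```

The handler keeps only the extracted text and a single flag, so the parser's working memory does not grow with the size of the element tree the way a DOM or XMLBeans model does. A real extractor would normally write to an output handler instead of accumulating a string.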

> OutOfMemory when parsing 5MB word document
> ------------------------------------------
>
>                 Key: TIKA-2109
>                 URL: https://issues.apache.org/jira/browse/TIKA-2109
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.13
>         Environment: openjdk version "1.8.0_91"
> OpenJDK Runtime Environment (build 1.8.0_91-8u91-b14-0ubuntu4~14.04-b14)
> OpenJDK 64-Bit Server VM (build 25.91-b14, mixed mode)
>            Reporter: Julian
>         Attachments: zafar-bug-9.docx
>
>
> When I run the following command to extract text from the attached 5MB word document, I get the OOM error below.
> java -jar tika-app-1.13.jar --text '/vagrant/zafar-bug-9.docx'
> The problem goes away if I set -Xms2G -Xmx2G, but I'm reluctant to specify such a high setting for my use case, given what seems like a small file. Also, I don't see this error with other files of similar size.
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
> 	at com.sun.org.apache.xerces.internal.dom.DeferredDocumentImpl.getNodeObject(DeferredDocumentImpl.java:972)
> 	at com.sun.org.apache.xerces.internal.dom.DeferredElementNSImpl.synchronizeData(DeferredElementNSImpl.java:126)
> 	at com.sun.org.apache.xerces.internal.dom.ElementNSImpl.getNamespaceURI(ElementNSImpl.java:250)
> 	at org.apache.xmlbeans.impl.store.Locale.loadNode(Locale.java:1420)
> 	at org.apache.xmlbeans.impl.store.Locale.loadNodeChildren(Locale.java:1403)
> 	at org.apache.xmlbeans.impl.store.Locale.loadNode(Locale.java:1445)
> 	at org.apache.xmlbeans.impl.store.Locale.loadNodeChildren(Locale.java:1403)
> 	at org.apache.xmlbeans.impl.store.Locale.loadNode(Locale.java:1445)
> 	at org.apache.xmlbeans.impl.store.Locale.loadNodeChildren(Locale.java:1403)
> 	at org.apache.xmlbeans.impl.store.Locale.loadNode(Locale.java:1445)
> 	at org.apache.xmlbeans.impl.store.Locale.loadNodeChildren(Locale.java:1403)
> 	at org.apache.xmlbeans.impl.store.Locale.loadNode(Locale.java:1445)
> 	at org.apache.xmlbeans.impl.store.Locale.loadNodeChildren(Locale.java:1403)
> 	at org.apache.xmlbeans.impl.store.Locale.loadNode(Locale.java:1445)
> 	at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1385)
> 	at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1370)
> 	at org.apache.xmlbeans.impl.schema.SchemaTypeLoaderBase.parse(SchemaTypeLoaderBase.java:370)
> 	at org.apache.poi.POIXMLTypeLoader.parse(POIXMLTypeLoader.java:117)
> 	at org.openxmlformats.schemas.wordprocessingml.x2006.main.DocumentDocument$Factory.parse(Unknown Source)
> 	at org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:164)
> 	at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:160)
> 	at org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:124)
> 	at org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:58)
> 	at org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:237)
> 	at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:86)
> 	at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
> 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> 	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
> 	at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:191)
> 	at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:480)
> 	at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:145)


