Posted to dev@tika.apache.org by "Tim Allison (Jira)" <ji...@apache.org> on 2019/12/18 20:47:00 UTC
[jira] [Comment Edited] (TIKA-3017) OOM in XSLFSheet.java

    [ https://issues.apache.org/jira/browse/TIKA-3017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16999321#comment-16999321 ] 

Tim Allison edited comment on TIKA-3017 at 12/18/19 8:46 PM:
-------------------------------------------------------------

POI uses a BitSet to track shape IDs. One of the shapeId values is 1,970,148,883, which requires a long[] of about 30,783,576 elements.

I can parse the file with -Xmx2g, but this needs to be fixed at the POI level.
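
To make the arithmetic concrete, here is a minimal sketch of how the backing array size follows from the largest bit index. It assumes OpenJDK's java.util.BitSet layout (one long per 64 bits, zero-based word index); the class and method names are hypothetical and are not part of POI or Tika.

```java
import java.util.BitSet;

// Hypothetical helper for illustration only; not part of POI or Tika.
final class BitSetFootprint {

    // Number of 64-bit words a BitSet needs so that bit `index` is addressable.
    // Assumption: OpenJDK's layout, where the word index is index >> 6 (zero-based).
    static long wordsRequired(int index) {
        return ((long) index >> 6) + 1;
    }

    // Size of the long[] payload in bytes (array header not included).
    static long bytesRequired(int index) {
        return wordsRequired(index) * Long.BYTES;
    }

    public static void main(String[] args) {
        int shapeId = 1_970_148_883; // the shape ID value quoted in the comment above
        System.out.println("words: " + wordsRequired(shapeId));
        System.out.println("bytes: " + bytesRequired(shapeId));

        // A much smaller index shows the same effect without stressing the heap:
        // setting one high bit forces the whole range below it to be allocated.
        BitSet ids = new BitSet();
        ids.set(1_000_000);
        System.out.println("bits allocated for one set bit: " + ids.size());
    }
}
```

Under that assumption, a single shape ID of this size costs roughly 30.8 million longs, or about 246 MB, regardless of how few shapes the slide actually contains.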



> OOM in XSLFSheet.java
> ---------------------
>
>                 Key: TIKA-3017
>                 URL: https://issues.apache.org/jira/browse/TIKA-3017
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.19
>            Reporter: Don
>            Priority: Major
>         Attachments: OOM_Slide_18.pptx
>
>
> When Tika parses the attached PowerPoint slide, it runs out of memory every time. The slide is a scrubbed slide from a Microsoft PowerPoint deck. Unfortunately, I have no idea how the slide was created. When you open the slide it looks totally blank; however, if you select all while it is open in PowerPoint, you will see that the slide contains two items, one inside the other. The person who created the slide deck is no longer available to explain how the slide was created. The two items appear to be text boxes, but I am not sure this is the case, because if either one is removed and replaced with a text box using MS PowerPoint, the OOM no longer happens. Also, if the slide is opened in LibreOffice and then saved, the OOM does not happen. There seems to be something specific about whatever these items really are and how they were created.
> The following is the stack trace of the OOM when the slide is parsed by Tika:
> {noformat}
> Executor task launch worker for task 47360
>  at java.lang.OutOfMemoryError.<init>()V (OutOfMemoryError.java:48)
>  at java.util.Arrays.copyOf([JI)[J (Arrays.java:3308)
>  at java.util.BitSet.ensureCapacity(I)V (BitSet.java:337)
>  at java.util.BitSet.expandTo(I)V (BitSet.java:352)
>  at java.util.BitSet.set(I)V (BitSet.java:447)
>  at org.apache.poi.xslf.usermodel.XSLFSheet.registerShapeId(I)V (XSLFSheet.java:123)
>  at org.apache.poi.xslf.usermodel.XSLFDrawing.<init>(Lorg/apache/poi/xslf/usermodel/XSLFSheet;Lorg/openxmlformats/schemas/presentationml/x2006/main/CTGroupShape;)V (XSLFDrawing.java:47)
>  at org.apache.poi.xslf.usermodel.XSLFSheet.initDrawingAndShapes()V (XSLFSheet.java:214)
>  at org.apache.poi.xslf.usermodel.XSLFSheet.getShapes()Ljava/util/List; (XSLFSheet.java:201)
>  at org.apache.tika.parser.microsoft.ooxml.XSLFPowerPointExtractorDecorator.buildXHTML(Lorg/apache/tika/sax/XHTMLContentHandler;)V (XSLFPowerPointExtractorDecorator.java:110)
>  at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(Lorg/xml/sax/ContentHandler;Lorg/apache/tika/metadata/Metadata;Lorg/apache/tika/parser/ParseContext;)V (AbstractOOXMLExtractor.java:136)
>  at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(Ljava/io/InputStream;Lorg/xml/sax/ContentHandler;Lorg/apache/tika/metadata/Metadata;Lorg/apache/tika/parser/ParseContext;)V (OOXMLExtractorFactory.java:156)
>  at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(Ljava/io/InputStream;Lorg/xml/sax/ContentHandler;Lorg/apache/tika/metadata/Metadata;Lorg/apache/tika/parser/ParseContext;)V (OOXMLParser.java:110)
>  at org.apache.tika.parser.CompositeParser.parse(Ljava/io/InputStream;Lorg/xml/sax/ContentHandler;Lorg/apache/tika/metadata/Metadata;Lorg/apache/tika/parser/ParseContext;)V (CompositeParser.java:280)
>  at org.apache.tika.parser.CompositeParser.parse(Ljava/io/InputStream;Lorg/xml/sax/ContentHandler;Lorg/apache/tika/metadata/Metadata;Lorg/apache/tika/parser/ParseContext;)V (CompositeParser.java:280)
>  at org.apache.tika.parser.AutoDetectParser.parse(Ljava/io/InputStream;Lorg/xml/sax/ContentHandler;Lorg/apache/tika/metadata/Metadata;Lorg/apache/tika/parser/ParseContext;)V (AutoDetectParser.java:143)
>  at
> {noformat}
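
The trace above shows the allocation failing inside BitSet.set, called from XSLFSheet.registerShapeId. As a rough illustration of why a sparse structure avoids the problem, the sketch below tracks used IDs in a TreeSet so that memory grows with the number of shapes and not with the largest ID. It is only a sketch under the assumption that the sheet needs to record used IDs and hand out an unused one; SparseShapeIds and its methods are hypothetical names, and this is not necessarily the fix POI adopted.

```java
import java.util.TreeSet;

// Hypothetical sketch; not POI's API and not necessarily POI's actual fix.
final class SparseShapeIds {

    // Sorted set of IDs in use; memory is proportional to the number of IDs stored.
    private final TreeSet<Integer> used = new TreeSet<>();

    // Record an ID that already appears in the document.
    void register(int shapeId) {
        used.add(shapeId);
    }

    boolean isRegistered(int shapeId) {
        return used.contains(shapeId);
    }

    // Return the smallest ID >= 1 that is not yet in use, and mark it as used.
    int allocate() {
        int candidate = 1;
        for (int id : used.tailSet(1)) {
            if (id > candidate) {
                break; // found a gap below this ID
            }
            if (id == Integer.MAX_VALUE) {
                throw new IllegalStateException("no free shape ID");
            }
            candidate = id + 1;
        }
        used.add(candidate);
        return candidate;
    }

    public static void main(String[] args) {
        SparseShapeIds ids = new SparseShapeIds();
        ids.register(1);
        ids.register(2);
        ids.register(1_970_148_883); // the outlier costs one set entry, not a huge array
        System.out.println("next free id: " + ids.allocate());
    }
}
```

The trade-off is that allocate() scans the used IDs linearly, which a dense BitSet answers faster via nextClearBit; for slide-sized shape counts that cost is likely negligible.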


