Posted to dev@tika.apache.org by "Michael McCandless (JIRA)" <ji...@apache.org> on 2013/02/03 14:40:12 UTC

[jira] [Created] (TIKA-1072) AIOOBE when handling embedded document in .doc file

Michael McCandless created TIKA-1072:
----------------------------------------

             Summary: AIOOBE when handling embedded document in .doc file
                 Key: TIKA-1072
                 URL: https://issues.apache.org/jira/browse/TIKA-1072
             Project: Tika
          Issue Type: Bug
            Reporter: Michael McCandless
             Fix For: 1.4
         Attachments: 20-Force-on-a-current-S00.doc

I have a Word (.doc) document that hits an exception when I run:

{noformat}
java -jar tika-app/target/tika-app-1.4-SNAPSHOT.jar /x/tmp/20-Force-on-a-current-S00.doc 
{noformat}

Here's the exception:

{noformat}
Caused by: java.lang.ArrayIndexOutOfBoundsException: 40
	at org.apache.poi.util.LittleEndian.getShort(LittleEndian.java:225)
	at org.apache.poi.poifs.filesystem.Ole10Native.<init>(Ole10Native.java:139)
	at org.apache.poi.poifs.filesystem.Ole10Native.createFromEmbeddedOleObject(Ole10Native.java:89)
	at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedOfficeDoc(AbstractPOIFSExtractor.java:149)
	at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:135)
	at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:186)
	at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:161)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
{noformat}

It happens when we try to parse an OLE10 embedded object. The code
that does this parsing catches and ignores Ole10NativeException and
skips the entry, so I'm wondering whether we should also catch the
ArrayIndexOutOfBoundsException (AIOOBE) and skip the entry. I.e., maybe
this entry really is not OLE10, and the Ole10Native code is failing to
throw Ole10NativeException for it?
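
To make the proposal concrete, here is a minimal sketch of the
catch-and-skip idea, not the actual Tika or POI code. The class
EmbeddedEntrySkipSketch, the local Ole10NativeException stand-in, the
parseOle10 helper, and the fixed offset 40 (chosen only to echo the
index in the trace) are all assumptions made for illustration; the real
parser is more involved.

```java
import java.util.ArrayList;
import java.util.List;

class EmbeddedEntrySkipSketch {

    /** Local stand-in for POI's checked Ole10NativeException (assumption, not the real class). */
    static class Ole10NativeException extends Exception {
        Ole10NativeException(String message) {
            super(message);
        }
    }

    /** Hypothetical field offset, picked only to mirror the "40" in the stack trace. */
    static final int FLAGS_OFFSET = 40;

    /** Reads a little-endian 16-bit value with no bounds check, so a short buffer triggers an AIOOBE. */
    static short getShort(byte[] data, int offset) {
        return (short) ((data[offset] & 0xFF) | ((data[offset + 1] & 0xFF) << 8));
    }

    /** Hypothetical parser: rejects tiny buffers explicitly, but reads the flags field unchecked. */
    static String parseOle10(byte[] data) throws Ole10NativeException {
        if (data.length < 4) {
            throw new Ole10NativeException("buffer too short for a header");
        }
        return "flags=" + getShort(data, FLAGS_OFFSET);
    }

    /** Parses every entry, skipping the ones that fail either way. */
    static List<String> extractAll(List<byte[]> entries) {
        List<String> parsed = new ArrayList<>();
        for (byte[] entry : entries) {
            try {
                parsed.add(parseOle10(entry));
            } catch (Ole10NativeException e) {
                // Existing behaviour described in the report: not OLE10, skip the entry.
            } catch (ArrayIndexOutOfBoundsException e) {
                // Proposed addition: treat a truncated or non-OLE10 entry the same way.
            }
        }
        return parsed;
    }

    public static void main(String[] args) {
        List<byte[]> entries = new ArrayList<>();
        entries.add(new byte[2]);   // rejected with Ole10NativeException
        entries.add(new byte[40]);  // index 40 is out of bounds, raises AIOOBE
        byte[] valid = new byte[42];
        valid[40] = 1;
        entries.add(valid);         // parses successfully
        System.out.println(extractAll(entries));
    }
}
```

Catching the AIOOBE at the call site only hides the symptom. If the
entry really is not OLE10, a bounds check inside the parser that throws
Ole10NativeException would probably be the cleaner fix.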

