Posted to dev@tika.apache.org by "Michael McCandless (JIRA)" <ji...@apache.org> on 2013/02/04 15:04:14 UTC
[jira] [Commented] (TIKA-1072) AIOOBE when handling embedded document in .doc file

    [ https://issues.apache.org/jira/browse/TIKA-1072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13570208#comment-13570208 ] 

Michael McCandless commented on TIKA-1072:
------------------------------------------

OK, I did some digging on this. The DirectoryNode of this embedded document has these entries:
{noformat}
ent=PICT size=797
ent=ObjInfo size=4
ent=Ole10Native size=40
ent=Ole10FmtProgID size=13
ent=OlePres000 size=40
ent=CompObj size=82
ent=PIC size=100
ent=META size=582
ent=Ole size=20
{noformat}

So I believe it really is an Ole10Native record. Ole10Native then tries to parse it, with plain=false, but runs out of bytes on this line:
{noformat}
      flags2 = LittleEndian.getShort(data, ofs);
{noformat}
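
For illustration, here is a minimal sketch of a bounds-checked little-endian read that reports a truncated record as a format error instead of an AIOOBE. The class name Ole10BoundsSketch, the method getShortChecked, and TruncatedRecordException are all hypothetical; this is not POI's actual code, only one way such a check could look:

```java
// Hypothetical sketch; none of these names exist in POI or Tika.
final class Ole10BoundsSketch {

    /** Hypothetical checked exception, standing in for a format-level error. */
    static final class TruncatedRecordException extends Exception {
        TruncatedRecordException(String message) {
            super(message);
        }
    }

    /**
     * Reads a little-endian 16-bit value at ofs, but first verifies that
     * two bytes are actually available there.
     */
    static short getShortChecked(byte[] data, int ofs) throws TruncatedRecordException {
        if (ofs < 0 || ofs > data.length - 2) {
            throw new TruncatedRecordException(
                "need 2 bytes at offset " + ofs + " but record has only " + data.length + " bytes");
        }
        int lo = data[ofs] & 0xFF;      // low byte comes first in little-endian order
        int hi = data[ofs + 1] & 0xFF;  // high byte follows
        return (short) ((hi << 8) | lo);
    }
}
```

With a 40-byte record, a read at offset 40 would then fail with the checked exception, which a caller that already handles format errors could catch and skip.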

It seems likely that something is corrupt about this entry. Does 40 bytes seem way too small for an Ole10Native entry? If so, I wonder if we could fix AbstractPOIFSExtractor to log the exception, skip this one embedded document, and go on to parse the others? I.e., isolate the exception rather than aborting the entire extraction; in this case the main document extracts fine.
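
The isolation idea can be sketched roughly as follows. This is not the AbstractPOIFSExtractor code; EmbeddedIsolationSketch, EntryParser, and extractAll are hypothetical names, and the entries are modeled as a plain map of name to bytes so the example stays self-contained:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; these names do not exist in Tika or POI.
final class EmbeddedIsolationSketch {

    /** Hypothetical stand-in for the per-entry parsing step. */
    public interface EntryParser {
        String parse(String name, byte[] data) throws Exception;
    }

    /**
     * Parses each embedded entry independently. A failure in one entry is
     * logged and recorded in skipped; the remaining entries are still parsed.
     */
    static List<String> extractAll(Map<String, byte[]> entries,
                                   EntryParser parser,
                                   List<String> skipped) {
        List<String> results = new ArrayList<>();
        for (Map.Entry<String, byte[]> entry : entries.entrySet()) {
            try {
                results.add(parser.parse(entry.getKey(), entry.getValue()));
            } catch (Exception ex) {
                // Also covers runtime exceptions such as ArrayIndexOutOfBoundsException.
                System.err.println("Skipping embedded entry " + entry.getKey() + ": " + ex);
                skipped.add(entry.getKey());
            }
        }
        return results;
    }
}
```

Catching Exception broadly is a deliberate trade-off in this sketch: it keeps one corrupt entry from aborting the whole extraction, at the cost of possibly hiding real bugs, which is why the skipped entry is logged rather than silently dropped.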
                
> AIOOBE when handling embedded document in .doc file
> ---------------------------------------------------
>
>                 Key: TIKA-1072
>                 URL: https://issues.apache.org/jira/browse/TIKA-1072
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Michael McCandless
>             Fix For: 1.4
>
>         Attachments: 20-Force-on-a-current-S00.doc
>
>
> I have a Word (.doc) document that hits an exception when I run:
> {noformat}
> java -jar tika-app/target/tika-app-1.4-SNAPSHOT.jar /x/tmp/20-Force-on-a-current-S00.doc 
> {noformat}
> Here's the exception:
> {noformat}
> Caused by: java.lang.ArrayIndexOutOfBoundsException: 40
> 	at org.apache.poi.util.LittleEndian.getShort(LittleEndian.java:225)
> 	at org.apache.poi.poifs.filesystem.Ole10Native.<init>(Ole10Native.java:139)
> 	at org.apache.poi.poifs.filesystem.Ole10Native.createFromEmbeddedOleObject(Ole10Native.java:89)
> 	at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedOfficeDoc(AbstractPOIFSExtractor.java:149)
> 	at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:135)
> 	at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:186)
> 	at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:161)
> 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> {noformat}
> It happens when we try to parse an OLE10 embedded object. The code
> that does this parsing catches and ignores Ole10NativeException and
> skips the entry, so I'm wondering if we should also catch AIOOBE
> and skip the entry? I.e., maybe this entry really is not OLE10, and the
> Ole10Native code is failing to throw Ole10NativeException for it?
