Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2015/04/21 14:32:59 UTC

[jira] [Commented] (TIKA-1608) RuntimeException on extracting text from Word 97-2004 Document

    [ https://issues.apache.org/jira/browse/TIKA-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14504871#comment-14504871 ] 

Tim Allison commented on TIKA-1608:
-----------------------------------

[~jeremybmerrill], thank you for raising this issue. If you open the "More" menu on the issue, there's an "Attach Files" option. Please attach only files that are OK to share with the public, and please let us know whether the file is "granted" to Apache under ASF 2.0 so that we can use it in unit tests in the future.

I'll take a look at our govdocs1/CommonCrawl exceptions and see if I can find a doc in there that matches your stack trace.

From the stacktrace, it looks like the fix will have to be made at the POI level. I could be wrong, though! If you haven't done so already, please open a ticket on POI's [bugzilla|https://bz.apache.org/bugzilla/buglist.cgi?quicksearch=poi&list_id=123825] and add a hyperlink from there to here and vice versa so that we can track progress over here.
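
For anyone triaging similar reports, here is a minimal sketch, using only the JDK, of one way to check which library the innermost exception originates in. The names RootCauseFrame, rootCause and firstFrameIn are hypothetical helpers, not part of Tika or POI, and the synthetic frames in main simply mirror the two innermost frames of the stack trace quoted in the issue description.

```java
// Hypothetical triage helper; not part of Tika or POI.
final class RootCauseFrame {

    /** Follows getCause() links until the innermost throwable is reached. */
    public static Throwable rootCause(Throwable t) {
        Throwable current = t;
        while (current.getCause() != null) {
            current = current.getCause();
        }
        return current;
    }

    /**
     * Returns "class.method" for the first frame of the root cause whose
     * class name starts with packagePrefix, or null if no frame matches.
     */
    public static String firstFrameIn(Throwable t, String packagePrefix) {
        for (StackTraceElement frame : rootCause(t).getStackTrace()) {
            if (frame.getClassName().startsWith(packagePrefix)) {
                return frame.getClassName() + "." + frame.getMethodName();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Synthetic stand-in for the reported failure: an
        // ArrayIndexOutOfBoundsException wrapped in an outer exception.
        ArrayIndexOutOfBoundsException inner = new ArrayIndexOutOfBoundsException();
        inner.setStackTrace(new StackTraceElement[] {
            // A line number of -2 marks a native method frame.
            new StackTraceElement("java.lang.System", "arraycopy", null, -2),
            new StackTraceElement("org.apache.poi.hwpf.model.PAPFormattedDiskPage",
                    "getGrpprl", "PAPFormattedDiskPage.java", 171)
        });
        Exception outer = new Exception("Unexpected RuntimeException", inner);

        // Prints the first POI frame found in the root cause, if any.
        System.out.println(firstFrameIn(outer, "org.apache.poi."));
    }
}
```

If the first non-JDK frame of the root cause sits under org.apache.poi, as it does in this report, that suggests the fix belongs in POI rather than in Tika's wrapper code, though it does not prove it.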

Thank you, again.

> RuntimeException on extracting text from Word 97-2004 Document
> --------------------------------------------------------------
>
>                 Key: TIKA-1608
>                 URL: https://issues.apache.org/jira/browse/TIKA-1608
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.9
>            Reporter: Jeremy B. Merrill
>
> Extracting text from the Word 97-2004 document located here (https://www.dropbox.com/s/oeu3kp2nhk20naw/1534-attachment.doc?dl=0) fails with the following stacktrace:
> $ java -jar /tika-app/target/tika-app-1.9-SNAPSHOT.jar --text 1534-attachment.doc 
> Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@69af0db6
> 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:283)
> 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
> 	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
> 	at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:180)
> 	at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:477)
> 	at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:134)
> Caused by: java.lang.ArrayIndexOutOfBoundsException
> 	at java.lang.System.arraycopy(Native Method)
> 	at org.apache.poi.hwpf.model.PAPFormattedDiskPage.getGrpprl(PAPFormattedDiskPage.java:171)
> 	at org.apache.poi.hwpf.model.PAPFormattedDiskPage.<init>(PAPFormattedDiskPage.java:101)
> 	at org.apache.poi.hwpf.model.OldPAPBinTable.<init>(OldPAPBinTable.java:49)
> 	at org.apache.poi.hwpf.HWPFOldDocument.<init>(HWPFOldDocument.java:109)
> 	at org.apache.tika.parser.microsoft.WordExtractor.parseWord6(WordExtractor.java:532)
> 	at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:84)
> 	at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:201)
> 	at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:172)
> 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
> 	... 5 more
> I'm using trunk from Github, which I think is a flavor of 1.9. The document opens properly in Word for Mac '11.
> Happy to answer questions; I'm also on the "user" mailing list. If it's relevant, I'm on java 1.7.0_55... (Also let me know if there's a way to put that document here in Jira rather than on my own dropbox.)


