Posted to dev@tika.apache.org by "Ken Krugler (JIRA)" <ji...@apache.org> on 2013/02/26 18:28:13 UTC

[jira] [Commented] (TIKA-1089) Tika conversion failed on following documents

    [ https://issues.apache.org/jira/browse/TIKA-1089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13587284#comment-13587284 ] 

Ken Krugler commented on TIKA-1089:
-----------------------------------

Hi Hong-Thai,

I took a quick look at crawler.log (thanks for attaching that file), and these are failures thrown by the underlying parsing libraries used by Tika. For example:

Caused by: java.lang.ArrayIndexOutOfBoundsException: 140
	at org.apache.poi.hslf.usermodel.SlideShow.buildSlidesAndNotes(SlideShow.java:405)
	at org.apache.poi.hslf.usermodel.SlideShow.<init>(SlideShow.java:109)
	at org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:51)
	at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:189)
	at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:161)

This exception is thrown by what I assume is POI's PowerPoint parser (HSLF).

This means you'd want to file each issue against the project whose library throws the exception, rather than against Tika itself.
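
With 46 documents to triage, it may help to sort the failures by owning project automatically. The following is only a rough sketch using plain JDK classes: the class name RootCauseTriage and the prefix-to-project table are made up for illustration, not anything that ships with Tika. It follows the "Caused by" chain to the innermost exception and reports the first stack frame that belongs to a known package prefix.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Tika or POI.
class RootCauseTriage {

    // Assumed mapping from package prefix to project; extend it for other parsers.
    private static final Map<String, String> PROJECTS = new LinkedHashMap<>();
    static {
        PROJECTS.put("org.apache.poi.", "Apache POI");
        PROJECTS.put("org.apache.pdfbox.", "Apache PDFBox");
        PROJECTS.put("org.apache.tika.", "Apache Tika");
    }

    /** Follows getCause() links down to the innermost exception. */
    static Throwable rootCause(Throwable t) {
        Throwable current = t;
        while (current.getCause() != null && current.getCause() != current) {
            current = current.getCause();
        }
        return current;
    }

    /**
     * Returns the project owning the first root-cause stack frame whose class
     * matches a known prefix, or "unknown" if no frame matches.
     */
    static String owningProject(Throwable t) {
        for (StackTraceElement frame : rootCause(t).getStackTrace()) {
            String className = frame.getClassName();
            for (Map.Entry<String, String> entry : PROJECTS.entrySet()) {
                if (className.startsWith(entry.getKey())) {
                    return entry.getValue();
                }
            }
        }
        return "unknown";
    }
}
```

You would call RootCauseTriage.owningProject(e) from the catch block around your Tika parse call and group the failing documents by the returned name. For the trace above, the first frame is in org.apache.poi.hslf.usermodel.SlideShow, so that failure would be grouped under POI.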

I'll leave it to others on the list who are more familiar with POI, PDFBox, etc. to provide specific guidance.
                
> Tika conversion failed on following documents
> ---------------------------------------------
>
>                 Key: TIKA-1089
>                 URL: https://issues.apache.org/jira/browse/TIKA-1089
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.3
>         Environment: windows, api
>            Reporter: Hong-Thai Nguyen
>              Labels: test
>         Attachments: crawler.log
>
>
> We are using Tika as our main converter of diverse file formats to text and HTML versions in a search engine.
> We've collected some documents (46) which Tika cannot convert: http://www.mediafire.com/?60clr812lerx3gy
