Posted to user@tika.apache.org by Shinichiro Abe <sh...@gmail.com> on 2011/04/05 09:07:47 UTC

Illegal IOException from tika.parser

SEVERE: org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@2ca72c6c
	at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:215)
	at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
	at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
	at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:233)
	at org.apache.solr.core.SolrCore.execute(SolrCore.java:1322)
	at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:341)
	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:244)
	at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
	at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
	at com.rondhuit.servlet.ConvDoubleByteSpaceToHalfFilter.doFilter(ConvDoubleByteSpaceToHalfFilter.java:32)
	at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
	at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
	at com.rondhuit.servlet.SetCharacterEncodingFilter.doFilter(SetCharacterEncodingFilter.java:105)
	at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
	at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
	at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
	at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
	at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
	at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
	at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
	at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
	at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:859)
	at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:588)
	at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
	at java.lang.Thread.run(Thread.java:662)
Caused by: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@2ca72c6c
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:148)
	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:99)
	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:112)
	at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:194)
	... 24 more
Caused by: java.io.IOException: For input string: "00000000-1"
	at org.apache.pdfbox.pdfparser.PDFParser.parseXrefTable(PDFParser.java:709)
	at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:449)
	at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:179)
	at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:847)
	at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:814)
	at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:63)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:142)


Re: Illegal IOException from tika.parser

Posted by Shinichiro Abe <sh...@gmail.com>.
Hello.
Thank you for your reply.
I'll report this to the PDFBox JIRA with a sample PDF.

Thank you.
Shinichiro Abe

On 2011/04/05, Jukka Zitting wrote:

> Hi,
> 
> On 04/05/2011 09:07 AM, Shinichiro Abe wrote:
>> It seems like an error raised at pdfbox, and pdfbox cannot recognize
>> something about XrefTable of the pdf? What kind of error is it?
> 
> The PDF in question might be malformed, or there could be a bug in PDFBox that prevents it from correctly parsing this file.
> 
> The best way to get this fixed is to report the issue to the PDFBox issue tracker at https://issues.apache.org/jira/browse/PDFBOX, ideally with the sample PDF attached.
> 
> Such failures are fairly common when you deal with large numbers of files from varied sources. Usually they aren't too troublesome, as you can often live with not being able to search such documents by their full-text content. For example, in Apache Jackrabbit we simply log such problems and index the document as if it were empty. It is of course still a good idea to report such issues so they can be fixed in future versions.
> 
> -- 
> Jukka Zitting


Re: Illegal IOException from tika.parser

Posted by Jukka Zitting <jz...@adobe.com>.
Hi,

On 04/05/2011 09:07 AM, Shinichiro Abe wrote:
> It seems like an error raised at pdfbox, and pdfbox cannot recognize
> something about XrefTable of the pdf? What kind of error is it?

The PDF in question might be malformed, or there could be a bug in 
PDFBox that prevents it from correctly parsing this file.
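
As a hedged illustration of what the root cause in the trace may be
saying: the text `For input string: "00000000-1"` has the same shape as
the message Java's own number parsing produces for a token that is not
a plain number. That suggests, though the trace alone does not prove it,
that a field in the file's cross-reference table holds `00000000-1`
where digits were expected. The class and method names below are made
up for the example; this is not PDFBox code.

```java
// Illustrative only: shows how the JDK reports a non-numeric token.
class XrefNumberDemo {

    // Returns the NumberFormatException message for a token,
    // or null when the token parses as a decimal long.
    static String parseFailureMessage(String token) {
        try {
            Long.parseLong(token);
            return null;
        } catch (NumberFormatException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // The token from the stack trace: the '-' is not in sign position.
        System.out.println(parseFailureMessage("00000000-1"));
        // A well-formed ten-digit field parses without complaint.
        System.out.println(parseFailureMessage("0000000017"));
    }
}
```

Whether such a value comes from a damaged file or from the parser
reading at the wrong position is exactly what the sample PDF would let
the PDFBox developers determine.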

The best way to get this fixed is to report the issue to the PDFBox
issue tracker at https://issues.apache.org/jira/browse/PDFBOX, ideally
with the sample PDF attached.

Such failures are fairly common when you deal with large numbers of
files from varied sources. Usually they aren't too troublesome, as you
can often live with not being able to search such documents by their
full-text content. For example, in Apache Jackrabbit we simply log such
problems and index the document as if it were empty. It is of course
still a good idea to report such issues so they can be fixed in future
versions.
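
The log-and-index-as-empty approach can be sketched roughly as follows.
This is a minimal sketch under stated assumptions, not the actual
Jackrabbit or Solr code: `TextExtractor` is a hypothetical stand-in for
a real parser such as Tika's, and `SafeExtraction` and `extractOrEmpty`
are names invented for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical stand-in for a real text extractor such as a Tika parser.
interface TextExtractor {
    String extract(InputStream in) throws Exception;
}

class SafeExtraction {
    private static final Logger LOG =
            Logger.getLogger(SafeExtraction.class.getName());

    // Returns the extracted text, or an empty string when extraction
    // throws, so the document is still indexed (without full text).
    static String extractOrEmpty(TextExtractor extractor,
                                 InputStream in, String name) {
        try {
            return extractor.extract(in);
        } catch (Exception e) {
            LOG.log(Level.WARNING,
                    "Text extraction failed for " + name
                    + "; indexing it as empty", e);
            return "";
        }
    }

    public static void main(String[] args) {
        // Simulates a parser that fails the way the trace above does.
        TextExtractor failing = in -> {
            throw new IOException("For input string: \"00000000-1\"");
        };
        InputStream in = new ByteArrayInputStream(new byte[0]);
        String text = extractOrEmpty(failing, in, "sample.pdf");
        System.out.println("Indexed text length: " + text.length());
    }
}
```

The trade-off is that such a document can still be found by its
metadata but not by its contents, so the logged warning is what tells
you which files to follow up on.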

-- 
Jukka Zitting