Posted to user@tika.apache.org by Shinichiro Abe <sh...@gmail.com> on 2011/04/05 09:07:47 UTC
Illegal IOException from tika.parser
SEVERE: org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@2ca72c6c
at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:215)
at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:233)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1322)
at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:341)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:244)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at com.rondhuit.servlet.ConvDoubleByteSpaceToHalfFilter.doFilter(ConvDoubleByteSpaceToHalfFilter.java:32)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at com.rondhuit.servlet.SetCharacterEncodingFilter.doFilter(SetCharacterEncodingFilter.java:105)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:859)
at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:588)
at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
at java.lang.Thread.run(Thread.java:662)
Caused by: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@2ca72c6c
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:148)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:99)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:112)
at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:194)
... 24 more
Caused by: java.io.IOException: For input string: "00000000-1"
at org.apache.pdfbox.pdfparser.PDFParser.parseXrefTable(PDFParser.java:709)
at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:449)
at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:179)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:847)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:814)
at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:63)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:142)
Re: Illegal IOException from tika.parser
Posted by Shinichiro Abe <sh...@gmail.com>.
Hello.
Thank you for your reply.
I'll report this to the PDFBox JIRA with a sample PDF.
Thank you.
Shinichiro Abe
On 2011/04/05, Jukka Zitting wrote:
> Hi,
>
> On 04/05/2011 09:07 AM, Shinichiro Abe wrote:
>> It seems like an error raised at pdfbox, and pdfbox cannot recognize
>> something about XrefTable of the pdf? What kind of error is it?
>
> The PDF in question might be malformed, or there could be a bug in PDFBox that prevents it from correctly parsing this file.
>
> To solve the problem, the best way is to report the issue to the PDFBox issue tracker at https://issues.apache.org/jira/browse/PDFBOX, ideally with the sample PDF as an attachment.
>
> Such troubles are fairly common when you are dealing with large numbers of files from various different sources. Usually they aren't too troublesome, as you often can live with not being able to search such documents based on their full text contents. For example in Apache Jackrabbit we simply log such problems and index the document as if it was empty. It's of course a good idea to report such issues so they can be fixed in future versions.
>
> --
> Jukka Zitting
Re: Illegal IOException from tika.parser
Posted by Jukka Zitting <jz...@adobe.com>.
Hi,
On 04/05/2011 09:07 AM, Shinichiro Abe wrote:
> It seems like an error raised at pdfbox, and pdfbox cannot recognize
> something about XrefTable of the pdf? What kind of error is it?
The PDF in question might be malformed, or there could be a bug in
PDFBox that prevents it from correctly parsing this file.
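As background on the "malformed" possibility: in the classic PDF cross-reference table format, each entry is a 10-digit byte offset, a 5-digit generation number, and the keyword n (in use) or f (free). The token "00000000-1" from the stack trace is 10 characters wide but contains a minus sign, so it is not a valid offset. The sketch below only illustrates that format check. It is not PDFBox's code; the class and method names are hypothetical, and the "00000 n" tail of the bad entry is assumed, since the trace shows only the first token.

```java
import java.util.regex.Pattern;

public class XrefEntryCheck {
    // Classic xref entry layout: 10-digit offset, space, 5-digit generation,
    // space, then 'n' (in use) or 'f' (free).
    private static final Pattern ENTRY = Pattern.compile("\\d{10} \\d{5} [nf]");

    /** Returns true if the line looks like a well-formed classic xref entry. */
    public static boolean isWellFormed(String line) {
        return ENTRY.matcher(line.trim()).matches();
    }

    public static void main(String[] args) {
        // A plausible valid entry.
        System.out.println(isWellFormed("0000000017 00000 n"));
        // The offset token from the stack trace; the rest of the line is assumed.
        System.out.println(isWellFormed("00000000-1 00000 n"));
    }
}
```

If the sample PDF does contain such an entry, that would point at the tool that wrote the file rather than at the parser, although a lenient parser could still choose to skip the entry.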
To solve the problem, the best way is to report the issue to the PDFBox
issue tracker at https://issues.apache.org/jira/browse/PDFBOX, ideally
with the sample PDF as an attachment.
Such troubles are fairly common when you are dealing with large numbers
of files from various different sources. Usually they aren't too
troublesome, as you often can live with not being able to search such
documents based on their full text contents. For example in Apache
Jackrabbit we simply log such problems and index the document as if it
was empty. It's of course a good idea to report such issues so they can
be fixed in future versions.
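A minimal sketch of that log-and-index-as-empty approach follows. It is not Jackrabbit's actual code: TextExtractor and extractOrEmpty are hypothetical names, and the interface merely stands in for whatever call performs the real extraction (for example, a Tika parser invocation).

```java
import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;

public class LenientExtraction {
    private static final Logger LOG =
            Logger.getLogger(LenientExtraction.class.getName());

    /** Hypothetical stand-in for a call into a text extractor such as Tika. */
    @FunctionalInterface
    public interface TextExtractor {
        String extract(String documentName) throws Exception;
    }

    /**
     * Returns the extracted text, or an empty string if extraction fails,
     * so the caller can still index the document without full-text content.
     */
    public static String extractOrEmpty(TextExtractor extractor, String documentName) {
        try {
            String text = extractor.extract(documentName);
            return text == null ? "" : text;
        } catch (Exception e) {
            // Log the failure with its cause so it can be reported upstream later.
            LOG.log(Level.WARNING,
                    "Text extraction failed for " + documentName + "; indexing as empty", e);
            return "";
        }
    }

    public static void main(String[] args) {
        // Simulates a parser failing on a malformed file.
        TextExtractor broken = name -> {
            throw new IOException("For input string: \"00000000-1\"");
        };
        TextExtractor working = name -> "hello";

        System.out.println("[" + extractOrEmpty(broken, "bad.pdf") + "]");
        System.out.println("[" + extractOrEmpty(working, "good.pdf") + "]");
    }
}
```

The trade-off is that a failed document silently becomes unsearchable by content, so keeping the logged exception and the file name makes it possible to file a bug with the sample attached.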
--
Jukka Zitting