Posted to dev@tika.apache.org by Doug Carter <dc...@mercycorps.org> on 2010/01/12 20:37:52 UTC

PDF parser exception

Hi all,

I'm new to Tika and to this mailing list, so I hope this is the right
place to ask this question.

I've just downloaded, built and installed Tika 0.5. I've been able to
translate Microsoft Office documents without any problems. However, when
I try to translate a PDF file, I get a parser exception.

The command line I'm running is:

  % java -jar tika-app/target/tika-app-0.5.jar foo.pdf

The resulting exception output is:

Exception in thread "main" org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@11e1e67
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:126)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:101)
        at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:175)
        at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:62)
Caused by: org.apache.pdfbox.exceptions.WrappedIOException
        at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:237)
        at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:841)
        at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:808)
        at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:53)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:120)
        ... 3 more
Caused by: java.util.NoSuchElementException
        at java.util.AbstractList$Itr.next(AbstractList.java:350)
        at org.apache.pdfbox.pdfparser.PDFXrefStreamParser.parse(PDFXrefStreamParser.java:115)
        at org.apache.pdfbox.cos.COSDocument.parseXrefStreams(COSDocument.java:538)
        at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:203)
        ... 7 more

---

Can someone help point me to a way to solve this problem? I'm familiar
with Java but not the PDF format or how Tika parses a document. 

Please let me know if there is a better forum to ask this question, or
if I need to provide more information.


TIA,

Doug
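
[Editor's sketch] The trace above has three layers: the TikaException, the PDFBox WrappedIOException, and the innermost NoSuchElementException thrown in PDFXrefStreamParser. When debugging from code rather than from the CLI, it can help to walk the cause chain and look at the innermost exception first. The class and method names below (RootCause, rootCauseOf) are made up for illustration, and the exceptions in main are stand-ins built from JDK classes, not the real Tika or PDFBox types.

```java
import java.io.IOException;
import java.util.NoSuchElementException;

public class RootCause {
    /** Follows getCause() links until the innermost exception is reached. */
    public static Throwable rootCauseOf(Throwable t) {
        Throwable current = t;
        // Stop at the end of the chain; also guard against a self-referencing cause.
        while (current.getCause() != null && current.getCause() != current) {
            current = current.getCause();
        }
        return current;
    }

    public static void main(String[] args) {
        // Stand-ins that mimic the three layers of the reported trace.
        Throwable inner = new NoSuchElementException();
        Throwable middle = new IOException("wrapped", inner);
        Throwable outer = new Exception("TIKA-198: Illegal IOException", middle);

        // Report the innermost exception type, which is where the parser gave up.
        System.out.println(rootCauseOf(outer).getClass().getName());
    }
}
```

In a real program, the argument would be whatever exception the Tika parse call throws.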

Re: PDF parser exception

Posted by Ken Krugler <kk...@transpac.com>.

Hi Doug,

> The problem *seems* to be limited to those documents created by
> Acrobat 9. (PDF version 1.5 versus version 1.4) That is, 1.4 documents
> translate OK, where 1.5 documents get this error.
>
> If it matters, the bad file can be opened OK with Acrobat Reader.
>
> Any ideas on how to debug this? Or is Acrobat 9 (version 1.5) a
> known problem for Tika?

Acrobat 9 was a known problem for PDFBox, which is the PDF parser that  
Tika wraps.

But according to http://issues.apache.org/jira/browse/PDFBOX-361, this  
was fixed in 0.8-incubating, which is the release that Tika is using.

However I see http://issues.apache.org/jira/browse/PDFBOX-536, which  
seems to be the same as your issue. That's fixed in PDFBox's trunk,  
but not the 0.8-incubating release.

I've also had to pull/build PDFBox to get a recent (post-0.8) fix, so  
you could do the same.

-- Ken

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g

Re: PDF parser exception

Posted by Doug Carter <dc...@mercycorps.org>.

Hi Ken,

The problem *seems* to be limited to those documents created by
Acrobat 9. (PDF version 1.5 versus version 1.4) That is, 1.4 documents
translate OK, where 1.5 documents get this error.

If it matters, the bad file can be opened OK with Acrobat Reader.

Any ideas on how to debug this? Or is Acrobat 9 (version 1.5) a 
known problem for Tika?

Thanks,

Doug
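
[Editor's sketch] To check the 1.4 versus 1.5 observation across a set of files, one option is to read the version from the "%PDF-x.y" header at the start of each file. Cross-reference streams were introduced in PDF 1.5, which would be consistent with the PDFXrefStreamParser frame in the trace. This is a minimal sketch using only the JDK; the class name PdfHeaderVersion and the method versionOf are hypothetical. It only looks at the header line, so a document whose catalog overrides the version would be reported with its header value.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PdfHeaderVersion {
    // A PDF file begins with a header such as "%PDF-1.5".
    private static final Pattern HEADER = Pattern.compile("%PDF-(\\d+\\.\\d+)");

    /** Returns the version found in the given leading bytes, or null if there is no header. */
    public static String versionOf(byte[] head) {
        // ISO-8859-1 maps every byte to one char, so binary data cannot break decoding.
        String text = new String(head, StandardCharsets.ISO_8859_1);
        Matcher m = HEADER.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) throws IOException {
        for (String name : args) {
            byte[] buf = new byte[1024];
            int n;
            try (InputStream in = Files.newInputStream(Paths.get(name))) {
                // A single read may return fewer than 1024 bytes; the header sits at the very start.
                n = in.read(buf);
            }
            String version = versionOf(Arrays.copyOf(buf, Math.max(n, 0)));
            System.out.println(name + ": " + (version == null ? "no PDF header found" : version));
        }
    }
}
```

Running it as "java PdfHeaderVersion *.pdf" prints one line per file, which makes it easy to compare against the list of files that fail.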

On Tue, Jan 12, 2010 at 02:18:02PM -0800, Ken Krugler wrote:
> Hi Doug,
> 
> On Jan 12, 2010, at 11:37am, Doug Carter wrote:
> 
> >
> >Hi all,
> >
> >I'm new to Tika and to this mailing list, so I hope this is the right
> >place to ask this question.
> >
> >I've just downloaded, built and installed Tika 0.5. I've been able to
> >translate Microsoft Office documents without any problems. However, when
> >I try to translate a PDF file, I get a parser exception.
> 
> Is this the case with any and all PDF files?
> 
> Based on the stack trace below, it sure looks like a busted file, but  
> I've mostly been working with the HTML parser.
> 
> -- Ken

Re: PDF parser exception

Posted by Ken Krugler <kk...@transpac.com>.

Hi Doug,

On Jan 12, 2010, at 11:37am, Doug Carter wrote:

>
> Hi all,
>
> I'm new to Tika and to this mailing list, so I hope this is the right
> place to ask this question.
>
> I've just downloaded, built and installed Tika 0.5. I've been able to
> translate Microsoft Office documents without any problems. However, when
> I try to translate a PDF file, I get a parser exception.

Is this the case with any and all PDF files?

Based on the stack trace below, it sure looks like a busted file, but  
I've mostly been working with the HTML parser.

-- Ken
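
[Editor's sketch] One way to answer the "any and all PDF files?" question is to run the same command line over several files and note which ones fail. The sketch below assumes only the invocation shown in the original report (java -jar <tika-app jar> <file>) and relies on the fact that an uncaught exception in main makes the JVM exit with a non-zero status. The class name TikaBatchCheck and its methods are hypothetical helpers, not part of Tika.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import java.util.List;

public class TikaBatchCheck {
    /** Builds the same command line used in the original report. */
    public static List<String> commandFor(String tikaJar, String pdf) {
        return Arrays.asList("java", "-jar", tikaJar, pdf);
    }

    /** Runs the CLI on one file and reports whether it exited with status 0. */
    public static boolean parsesCleanly(String tikaJar, String pdf)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(commandFor(tikaJar, pdf));
        pb.redirectErrorStream(true); // merge stderr into stdout so one stream is drained
        Process p = pb.start();
        try (InputStream out = p.getInputStream()) {
            byte[] buf = new byte[8192];
            // Drain and discard the output so the child cannot block on a full pipe.
            while (out.read(buf) != -1) {
                // nothing to do with the bytes
            }
        }
        return p.waitFor() == 0;
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("usage: TikaBatchCheck <tika-app.jar> <file.pdf>...");
            return;
        }
        // args[0] is the Tika application jar; the remaining arguments are files to try.
        for (int i = 1; i < args.length; i++) {
            String status = parsesCleanly(args[0], args[i]) ? "OK" : "FAILED";
            System.out.println(status + "  " + args[i]);
        }
    }
}
```

For example: java TikaBatchCheck tika-app/target/tika-app-0.5.jar foo.pdf bar.pdf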

>
> The command line I'm running is:
>
>  % java -jar tika-app/target/tika-app-0.5.jar foo.pdf
>
> The resulting exception output is:
>
> Exception in thread "main" org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@11e1e67
>        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:126)
>        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:101)
>        at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:175)
>        at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:62)
> Caused by: org.apache.pdfbox.exceptions.WrappedIOException
>        at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:237)
>        at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:841)
>        at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:808)
>        at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:53)
>        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:120)
>        ... 3 more
> Caused by: java.util.NoSuchElementException
>        at java.util.AbstractList$Itr.next(AbstractList.java:350)
>        at org.apache.pdfbox.pdfparser.PDFXrefStreamParser.parse(PDFXrefStreamParser.java:115)
>        at org.apache.pdfbox.cos.COSDocument.parseXrefStreams(COSDocument.java:538)
>        at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:203)
>        ... 7 more
>
> ---
>
> Can someone help point me to a way to solve this problem? I'm familiar
> with Java but not the PDF format or how Tika parses a document.
>
> Please let me know if there is a better forum to ask this question, or
> if I need to provide more information.
>
>
> TIA,
>
> Doug
