Posted to user@nutch.apache.org by Godmar Back <go...@gmail.com> on 2010/01/07 04:50:56 UTC

IOException when parsing PDF files

Hi,

on the off-chance any other Nutch users have seen this:
2010-01-06 21:21:35,679 WARN  parse.pdf - General exception in PDF parser: Error: value is not an integer type actual='-'
2010-01-06 21:21:35,679 WARN  parse.pdf - java.io.IOException: Error: value is not an integer type actual='-'
2010-01-06 21:21:35,679 WARN  parse.pdf - at org.pdfbox.cos.COSInteger.<init>(COSInteger.java:85)
2010-01-06 21:21:35,679 WARN  parse.pdf - at org.pdfbox.cos.COSNumber.get(COSNumber.java:110)
2010-01-06 21:21:35,679 WARN  parse.pdf - at org.pdfbox.pdfparser.PDFStreamParser.parseNextToken(PDFStreamParser.java:260)
2010-01-06 21:21:35,679 WARN  parse.pdf - at org.pdfbox.pdfparser.PDFStreamParser.parse(PDFStreamParser.java:115)
2010-01-06 21:21:35,680 WARN  parse.pdf - at org.pdfbox.cos.COSStream.getStreamTokens(COSStream.java:133)
2010-01-06 21:21:35,680 WARN  parse.pdf - at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:206)
2010-01-06 21:21:35,680 WARN  parse.pdf - at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:178)
2010-01-06 21:21:35,680 WARN  parse.pdf - at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:339)
2010-01-06 21:21:35,680 WARN  parse.pdf - at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:263)
2010-01-06 21:21:35,680 WARN  parse.pdf - at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:219)
2010-01-06 21:21:35,680 WARN  parse.pdf - at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:152)
2010-01-06 21:21:35,680 WARN  parse.pdf - at org.apache.nutch.parse.pdf.PdfParser.getParse(PdfParser.java:102)
2010-01-06 21:21:35,680 WARN  parse.pdf - at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:82)
2010-01-06 21:21:35,680 WARN  parse.pdf - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:766)
2010-01-06 21:21:35,680 WARN  parse.pdf - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:552)

This is a problem with PDFBox rather than with Nutch (it occurs with both
0.7.4 and the newer 0.8.0): it can be reproduced by running PDFBox's text
extractor on the same file from the command line.

Interestingly, Poppler (pdfinfo, pdftotext) groks the PDF just fine - is
there a Poppler-based PDF text extractor for Nutch, perhaps?
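
To make the question concrete, something along these lines is what I have
in mind: a minimal sketch that shells out to pdftotext, assuming the Poppler
utilities are installed and on the PATH. The class and method names
(PopplerTextExtractor, buildCommand, extractText) are made up for
illustration, and nothing here is wired into Nutch's parse plugin interface.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper; not part of Nutch, PDFBox, or Poppler.
final class PopplerTextExtractor {

    // Builds the pdftotext command line. Passing "-" as the output file
    // makes pdftotext write the extracted text to stdout.
    static List<String> buildCommand(String pdfPath) {
        return Arrays.asList("pdftotext", "-enc", "UTF-8", pdfPath, "-");
    }

    // Runs pdftotext on the given file and returns its stdout as a String.
    // Assumes the pdftotext binary can be found on the PATH.
    static String extractText(String pdfPath)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(buildCommand(pdfPath));
        // Pass Poppler's warnings through to our own stderr.
        pb.redirectError(ProcessBuilder.Redirect.INHERIT);
        Process p = pb.start();

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (InputStream in = p.getInputStream()) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }

        int exit = p.waitFor();
        if (exit != 0) {
            throw new IOException(
                "pdftotext exited with status " + exit + " for " + pdfPath);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        System.out.print(extractText(args[0]));
    }
}
```

Running it as "java PopplerTextExtractor some.pdf" should print the same text
that pdftotext prints on its own. Spawning a process per document is
probably too slow for a large crawl, so I would only see this as a fallback
for files that PDFBox rejects.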

 - Godmar