Posted to users@pdfbox.apache.org by Godmar Back <go...@gmail.com> on 2010/01/07 04:45:16 UTC
java.io.IOException: Error: value is not an integer type actual='-'
when parsing PDF
Hi,
I'm trying to use PDFBox to index PDF files via the Nutch plugin. Nutch uses
PDFBox 0.7.4, but I also tried PDFBox 0.8.0-incubating, with the same result.
I am unable to parse any PDFs created by ScanSoft PDF Create! 3. I'm seeing
the following error:
In 0.7.4/Nutch:
*2010-01-06 21:21:35,679 WARN parse.pdf - General exception in PDF
parser: Error:
value is not an integer type actual='-'
2010-01-06 21:21:35,679 WARN parse.pdf - java.io.IOException: Error: value
is not an integer type actual='-'
2010-01-06 21:21:35,679 WARN parse.pdf - at
org.pdfbox.cos.COSInteger.<init>(COSInteger.java:85)
2010-01-06 21:21:35,679 WARN parse.pdf - at
org.pdfbox.cos.COSNumber.get(COSNumber.java:110)
2010-01-06 21:21:35,679 WARN parse.pdf - at
org.pdfbox.pdfparser.PDFStreamParser.parseNextToken(PDFStreamParser.java:260)
2010-01-06 21:21:35,679 WARN parse.pdf - at
org.pdfbox.pdfparser.PDFStreamParser.parse(PDFStreamParser.java:115)
2010-01-06 21:21:35,680 WARN parse.pdf - at
org.pdfbox.cos.COSStream.getStreamTokens(COSStream.java:133)
2010-01-06 21:21:35,680 WARN parse.pdf - at
org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:206)
2010-01-06 21:21:35,680 WARN parse.pdf - at
org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:178)
2010-01-06 21:21:35,680 WARN parse.pdf - at
org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:339)
2010-01-06 21:21:35,680 WARN parse.pdf - at
org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:263)
2010-01-06 21:21:35,680 WARN parse.pdf - at
org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:219)
2010-01-06 21:21:35,680 WARN parse.pdf - at
org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:152)
2010-01-06 21:21:35,680 WARN parse.pdf - at
org.apache.nutch.parse.pdf.PdfParser.getParse(PdfParser.java:102)
2010-01-06 21:21:35,680 WARN parse.pdf - at
org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:82)
2010-01-06 21:21:35,680 WARN parse.pdf - at
org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:766)
2010-01-06 21:21:35,680 WARN parse.pdf - at
org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:552)*
when running pdfbox 0.8.0's ExtractText:
*Exception in thread "main" java.io.IOException: Error: value is not an
integer type actual='-'
at org.apache.pdfbox.cos.COSInteger.<init>(COSInteger.java:71)
at org.apache.pdfbox.cos.COSNumber.get(COSNumber.java:96)
at
org.apache.pdfbox.pdfparser.PDFStreamParser.parseNextToken(PDFStreamParser.java:255)
at
org.apache.pdfbox.pdfparser.PDFStreamParser.parse(PDFStreamParser.java:101)
at
org.apache.pdfbox.cos.COSStream.getStreamTokens(COSStream.java:119)
at
org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:216)
at
org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:188)
at
org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:367)
at
org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:291)
at
org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:247)
at org.apache.pdfbox.ExtractText.main(ExtractText.java:229)
*
Apparently, PDFBox attempts to interpret a '-' as a Long.
pdfinfo and pdftotext, part of Poppler, do not have trouble parsing these
files.
I don't want to post the PDF in question, but would be willing to email it
to an interested developer.
The PDF contains:
*/CreationDate (D:20081026134850-05'00')*
Not having read the PDF spec, I'm guessing that PDFBox may have trouble
parsing this date (and misinterprets the '-' as the next token).
Looking at org.apache.pdfbox.util.DateConverter, I see:
private static final SimpleDateFormat[] POTENTIAL_FORMATS =
    new SimpleDateFormat[] {
        new SimpleDateFormat("EEEE, dd MMM yyyy hh:mm:ss a"),
        new SimpleDateFormat("EEEE, MMM dd, yyyy hh:mm:ss a"),
        new SimpleDateFormat("MM/dd/yyyy hh:mm:ss"),
        new SimpleDateFormat("MM/dd/yyyy")};
Perhaps the date format used in these PDF files needs to be added to
POTENTIAL_FORMATS?
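For illustration, here is a minimal sketch of how a date string of the
form D:20081026134850-05'00' could be parsed. It is not PDFBox's
DateConverter; the class name PdfDateSketch and its parse method are
hypothetical, it uses java.time rather than SimpleDateFormat, and it
assumes the string always carries all fourteen date and time digits.

```java
import java.time.LocalDateTime;
import java.time.OffsetDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, not part of PDFBox.
final class PdfDateSketch {
    // Fourteen digits: year, month, day, hour, minute, second.
    private static final DateTimeFormatter LOCAL =
        DateTimeFormatter.ofPattern("uuuuMMddHHmmss");

    static OffsetDateTime parse(String raw) {
        // Drop the optional "D:" prefix.
        String s = raw.startsWith("D:") ? raw.substring(2) : raw;
        if (s.length() < 14) {
            // Assumption: shortened dates are out of scope for this sketch.
            throw new IllegalArgumentException("expected 14 date digits: " + raw);
        }
        LocalDateTime local = LocalDateTime.parse(s.substring(0, 14), LOCAL);

        // The offset looks like -05'00'; removing the apostrophes gives -0500.
        String tz = s.substring(14).replace("'", "");
        ZoneOffset offset = (tz.isEmpty() || tz.equals("Z"))
            ? ZoneOffset.UTC
            : ZoneOffset.of(tz);
        return OffsetDateTime.of(local, offset);
    }

    public static void main(String[] args) {
        // Prints 2008-10-26T13:48:50-05:00
        System.out.println(parse("D:20081026134850-05'00'"));
    }
}
```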
Thanks for any insight you could provide.
This hiccup is preventing me from ingesting several PDFs into Nutch.
- Godmar
Re: java.io.IOException: Error: value is not an integer type
actual='-' when parsing PDF
Posted by Godmar Back <go...@gmail.com>.
update:
my first hunch that this error is related to date parsing was wrong. The
error actually occurs inside a 'stream' element while parsing a number. The
stream has multiple 'Tm' sequences such as
1 0 0 1 - 783 Tm
in it.
According to PDF 1.7 [1], the 'Tm' operator must be preceded by six
numbers, of which the fifth denotes the 'x' component of the translation
(in what I assume are homogeneous coordinates). '-' is not a number in PDF,
so Ben's parser is correct to throw an exception. I'm wondering, though,
whether it's reasonable to substitute a '0' for a '-' where a number is
expected. I made that change to 0.8.0, which lets the parsing and text
extraction complete; now I'm seeing a number of unrelated errors, which I
will report in a separate thread.
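To make the substitution concrete, here is a rough sketch of the idea,
not the actual change to PDFBox. The class LenientOperandSketch and its
methods are hypothetical, and the tokenizing is simplified to whitespace
splitting on a single line, which a real content stream parser cannot
assume.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration, not PDFBox code.
final class LenientOperandSketch {
    // Treat a lone '-' as 0; parse every other token as a number.
    static double parseOperand(String token) {
        if (token.equals("-")) {
            return 0.0;
        }
        return Double.parseDouble(token);
    }

    // Collect the numeric operands that precede the given operator.
    static List<Double> operandsBefore(String line, String operator) {
        List<Double> operands = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (token.equals(operator)) {
                break;
            }
            operands.add(parseOperand(token));
        }
        return operands;
    }

    public static void main(String[] args) {
        // Prints [1.0, 0.0, 0.0, 1.0, 0.0, 783.0]
        System.out.println(operandsBefore("1 0 0 1 - 783 Tm", "Tm"));
    }
}
```

With the substitution, the sample line yields the six operands that 'Tm'
expects, with an 'x' translation of 0.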
- Godmar