Posted to users@pdfbox.apache.org by Godmar Back <go...@gmail.com> on 2010/01/07 04:45:16 UTC

java.io.IOException: Error: value is not an integer type actual='-' when parsing PDF

Hi,

I'm trying to use PDFBox to index PDF files via the Nutch plugin. Nutch uses
PDFBox 0.7.4, but I also tried PDFBox 0.8.0-incubating, with the same effect.

I am unable to parse any PDFs created by ScanSoft PDF Create! 3. I'm seeing
the following error:

In 0.7.4/Nutch:

*2010-01-06 21:21:35,679 WARN  parse.pdf - General exception in PDF
parser: Error: value is not an integer type actual='-'
2010-01-06 21:21:35,679 WARN  parse.pdf - java.io.IOException: Error: value
is not an integer type actual='-'
2010-01-06 21:21:35,679 WARN  parse.pdf - at
org.pdfbox.cos.COSInteger.<init>(COSInteger.java:85)
2010-01-06 21:21:35,679 WARN  parse.pdf - at
org.pdfbox.cos.COSNumber.get(COSNumber.java:110)
2010-01-06 21:21:35,679 WARN  parse.pdf - at
org.pdfbox.pdfparser.PDFStreamParser.parseNextToken(PDFStreamParser.java:260)
2010-01-06 21:21:35,679 WARN  parse.pdf - at
org.pdfbox.pdfparser.PDFStreamParser.parse(PDFStreamParser.java:115)
2010-01-06 21:21:35,680 WARN  parse.pdf - at
org.pdfbox.cos.COSStream.getStreamTokens(COSStream.java:133)
2010-01-06 21:21:35,680 WARN  parse.pdf - at
org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:206)
2010-01-06 21:21:35,680 WARN  parse.pdf - at
org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:178)
2010-01-06 21:21:35,680 WARN  parse.pdf - at
org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:339)
2010-01-06 21:21:35,680 WARN  parse.pdf - at
org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:263)
2010-01-06 21:21:35,680 WARN  parse.pdf - at
org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:219)
2010-01-06 21:21:35,680 WARN  parse.pdf - at
org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:152)
2010-01-06 21:21:35,680 WARN  parse.pdf - at
org.apache.nutch.parse.pdf.PdfParser.getParse(PdfParser.java:102)
2010-01-06 21:21:35,680 WARN  parse.pdf - at
org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:82)
2010-01-06 21:21:35,680 WARN  parse.pdf - at
org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:766)
2010-01-06 21:21:35,680 WARN  parse.pdf - at
org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:552)*

When running PDFBox 0.8.0's ExtractText:

*Exception in thread "main" java.io.IOException: Error: value is not an
integer type actual='-'
       at org.apache.pdfbox.cos.COSInteger.<init>(COSInteger.java:71)
       at org.apache.pdfbox.cos.COSNumber.get(COSNumber.java:96)
       at
org.apache.pdfbox.pdfparser.PDFStreamParser.parseNextToken(PDFStreamParser.java:255)
       at
org.apache.pdfbox.pdfparser.PDFStreamParser.parse(PDFStreamParser.java:101)
       at
org.apache.pdfbox.cos.COSStream.getStreamTokens(COSStream.java:119)
       at
org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:216)
       at
org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:188)
       at
org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:367)
       at
org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:291)
       at
org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:247)
       at org.apache.pdfbox.ExtractText.main(ExtractText.java:229)
*
Apparently, PDFBox attempts to interpret a '-' as a Long.

pdfinfo and pdftotext, part of Poppler, do not have trouble parsing these
files.

I don't want to post the PDF in question, but would be willing to email it
to an interested developer.

The PDF contains:

*/CreationDate (D:20081026134850-05'00')*

Not having read the PDF spec, I'm guessing that PDFBox may have trouble
parsing this date (and misinterprets the '-' as the next token).
Looking at org.apache.pdfbox.util.DateConverter, I see:

    private static final SimpleDateFormat[] POTENTIAL_FORMATS = new SimpleDateFormat[] {
        new SimpleDateFormat("EEEE, dd MMM yyyy hh:mm:ss a"),
        new SimpleDateFormat("EEEE, MMM dd, yyyy hh:mm:ss a"),
        new SimpleDateFormat("MM/dd/yyyy hh:mm:ss"),
        new SimpleDateFormat("MM/dd/yyyy")};

Perhaps the date format used in these PDF files needs to be added to
POTENTIAL_FORMATS?
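
For reference, here is a minimal sketch of how a date string like the
/CreationDate value quoted above could be parsed with SimpleDateFormat. This
is not how DateConverter is implemented; the class name PdfDateSketch and the
method parsePdfDate are made up for illustration, and the sketch only covers
the full form with an explicit UTC offset.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

class PdfDateSketch {
    /** Parses a date of the form D:YYYYMMDDHHmmSS+HH'mm' (hypothetical helper). */
    static Date parsePdfDate(String raw) throws ParseException {
        // Drop the optional "D:" prefix.
        String s = raw.startsWith("D:") ? raw.substring(2) : raw;
        // Turn -05'00' into -0500 so the RFC 822 zone pattern 'Z' can read it.
        s = s.replace("'", "");
        SimpleDateFormat fmt = new SimpleDateFormat("yyyyMMddHHmmssZ");
        fmt.setLenient(false);
        return fmt.parse(s);
    }

    public static void main(String[] args) throws ParseException {
        // Prints the parsed instant in the JVM's default time zone.
        System.out.println(parsePdfDate("D:20081026134850-05'00'"));
    }
}
```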

Thanks for any insight you could provide.

This hiccup is preventing me from ingesting several PDFs into Nutch.

 - Godmar

Re: java.io.IOException: Error: value is not an integer type actual='-' when parsing PDF

Posted by Godmar Back <go...@gmail.com>.

Update:

My first hunch that this error is related to date parsing was wrong. The
error actually occurs inside a 'stream' element while parsing a number. The
stream contains multiple 'Tm' sequences such as

1 0 0 1 - 783  Tm

in it.

According to PDF 1.7 [1], the 'Tm' operator needs to be preceded by six
numbers, of which the fifth denotes the 'x' component of the translation
(in what I assume are homogeneous coordinates). '-' is not a number in PDF,
so Ben's parser is correct to throw an exception. I'm wondering, though,
whether it would be reasonable to substitute a '0' for a '-' where a number
is expected?
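
To make the idea concrete, here is a rough, standalone sketch of the
substitution. It is not the PDFBox code and does not use any PDFBox classes;
the names LenientOperandSketch, parseNumberLenient and operandsOf are invented
for this example, and it assumes operands are separated by whitespace.

```java
import java.util.ArrayList;
import java.util.List;

class LenientOperandSketch {
    /** Parses one token as a number; a lone sign is treated as 0 (assumption). */
    static double parseNumberLenient(String token) {
        if (token.equals("-") || token.equals("+")) {
            return 0.0; // bare sign with no digits: substitute zero instead of failing
        }
        return Double.parseDouble(token);
    }

    /** Collects the numeric operands that precede the given operator on a line. */
    static List<Double> operandsOf(String line, String operator) {
        List<Double> operands = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (token.equals(operator)) {
                break;
            }
            operands.add(parseNumberLenient(token));
        }
        return operands;
    }

    public static void main(String[] args) {
        // prints [1.0, 0.0, 0.0, 1.0, 0.0, 783.0]
        System.out.println(operandsOf("1 0 0 1 - 783  Tm", "Tm"));
    }
}
```

With the substitution, the sequence yields the six operands that 'Tm'
expects, with 0 as the x translation.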

I made that change to 0.8.0, which lets the parsing and text extraction
complete. I'm now seeing a number of unrelated errors, which I will
report in a separate thread.

 - Godmar
