Posted to dev@pdfbox.apache.org by pi...@huttin.com on 2012/06/13 13:56:50 UTC

Problem to parse a PDF document

Hello,

I have trouble with some documents: the library is not able to
retrieve the number of pages and load them into the list using the
PDDocument.getDocumentCatalog().getAllPages() method.

The PDF file and the Java code to retrieve the number of pages are
attached to this mail. It looks like the PDFParser does not correctly
read the /Pages object; the page references are "8 0" and "19 0".

Adobe Reader and iText RUPS both open the document correctly and
report the correct number of pages: 2.

I am running my code with version 1.7.0 of PDFBox.

Thanks in advance for your help.

Best regards

Pierre Huttin

Re: Problem to parse a PDF document

Posted by Dave Smith <da...@candata.com>.

Fine, but because of PDFBOX-1067 it will not render ...

Dave Smith
Candata Ltd.
416-493-9020x2413
Direct: 416-855-2413



On Wed, Jun 13, 2012 at 10:27 AM, Timo Boehme <ti...@ontochem.com> wrote:

> Hi,
>
> On 13.06.2012 14:29, Dave Smith wrote:
>
>> Bug
>> https://issues.apache.org/jira/browse/PDFBOX-1067
>>
>
> As I see it, this bug has nothing to do with PDFBOX-1067 but relates to
> PDFBOX-1099. The PDF in question was modified, so it has 2 XREF tables and
> 2 object streams. The pages object (object number 2) is in both streams
> (the first with 1 page, the second with 2 pages). The first stream is
> parsed first and the second after it, and objects that already exist are
> skipped, which is wrong in this case. For correct handling, the XREF
> information must be used.
>
> However, there is a workaround: use NonSequentialPDFParser. Load your
> document with PDDocument.loadNonSeq() and you are fine.
>
>
> Best regards,
> Timo

Re: Problem to parse a PDF document

Posted by Timo Boehme <ti...@ontochem.com>.

Hi,

On 13.06.2012 14:29, Dave Smith wrote:
> Bug
> https://issues.apache.org/jira/browse/PDFBOX-1067

As I see it, this bug has nothing to do with PDFBOX-1067 but relates to
PDFBOX-1099. The PDF in question was modified, so it has 2 XREF tables
and 2 object streams. The pages object (object number 2) is in both
streams (the first with 1 page, the second with 2 pages). The first
stream is parsed first and the second after it, and objects that already
exist are skipped, which is wrong in this case. For correct handling, the
XREF information must be used.

However, there is a workaround: use NonSequentialPDFParser. Load your
document with PDDocument.loadNonSeq() and you are fine.
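
The effect of skipping objects that already exist can be shown with a toy
model. This is only a sketch and not PDFBox code: the class name, the method
names and the sample object contents are made up for illustration, and
"latest definition wins" merely stands in for what following the XREF
information achieves.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model only: an object stream is reduced to {objectNumber, content} pairs. */
final class ObjectOverrideSketch {

    /** Faulty strategy: an object number that is already known is skipped. */
    static Map<Integer, String> firstDefinitionWins(List<String[][]> streamsInParseOrder) {
        Map<Integer, String> objects = new LinkedHashMap<>();
        for (String[][] stream : streamsInParseOrder) {
            for (String[] obj : stream) {
                objects.putIfAbsent(Integer.valueOf(obj[0]), obj[1]);
            }
        }
        return objects;
    }

    /** Expected strategy: a later definition replaces the earlier one. */
    static Map<Integer, String> latestDefinitionWins(List<String[][]> streamsInParseOrder) {
        Map<Integer, String> objects = new LinkedHashMap<>();
        for (String[][] stream : streamsInParseOrder) {
            for (String[] obj : stream) {
                objects.put(Integer.valueOf(obj[0]), obj[1]);
            }
        }
        return objects;
    }

    /** Hypothetical sample: object 2 (the pages object) appears in both streams. */
    static List<String[][]> sampleStreams() {
        String[][] firstStream = { { "2", "/Type /Pages /Count 1 /Kids [8 0 R]" } };
        String[][] secondStream = { { "2", "/Type /Pages /Count 2 /Kids [8 0 R 19 0 R]" } };
        return Arrays.asList(firstStream, secondStream);
    }

    public static void main(String[] args) {
        System.out.println("skip existing: " + firstDefinitionWins(sampleStreams()).get(2));
        System.out.println("latest wins:   " + latestDefinitionWins(sampleStreams()).get(2));
    }
}
```

With the first strategy the stale one-page version of object 2 survives,
which matches the wrong page count reported in this thread.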

Best regards,
Timo

-- 

  Timo Boehme
  OntoChem GmbH
  H.-Damerow-Str. 4
  06120 Halle/Saale
  T: +49 345 4780474
  F: +49 345 4780471
  timo.boehme@ontochem.com

_____________________________________________________________________

  OntoChem GmbH
  Geschäftsführer: Dr. Lutz Weber
  Sitz: Halle / Saale
  Registergericht: Stendal
  Registernummer: HRB 215461
_____________________________________________________________________

Re: Problem to parse a PDF document

Posted by Dave Smith <da...@candata.com>.

Bug

https://issues.apache.org/jira/browse/PDFBOX-1067

Dave Smith
Candata Ltd.
416-493-9020x2413
Direct: 416-855-2413




Re: Problem to parse a PDF document

Posted by Timo Boehme <ti...@ontochem.com>.

Dear Pierre Huttin,

On 14.06.2012 10:07, pierre@huttin.com wrote:
> Many thanks, I have attached the file to the issue.

Thanks.

> Now it works fine for this kind of document, but I see a side effect
> on other documents that worked fine in the past.
>
> I receive the following error message.
>
> Caused by: java.io.IOException: Error: Expected an integer type, actual='xref'
>     at org.apache.pdfbox.pdfparser.BaseParser.readInt(BaseParser.java:1541)
>     at org.apache.pdfbox.pdfparser.NonSequentialPDFParser.parseXrefObjStream(NonSequentialPDFParser.java:354)
>     at org.apache.pdfbox.pdfparser.NonSequentialPDFParser.initialParse(NonSequentialPDFParser.java:266)
>     at org.apache.pdfbox.pdfparser.NonSequentialPDFParser.parse(NonSequentialPDFParser.java:574)
>     at org.apache.pdfbox.pdmodel.PDDocument.loadNonSeq(PDDocument.java:1124)
>     at org.apache.pdfbox.pdmodel.PDDocument.loadNonSeq(PDDocument.java:1107)
>
> If I use the PDDocument.load() method, I get this warning message:
>
> 14 juin 2012 09:58:30 org.apache.pdfbox.pdfparser.XrefTrailerResolver setStartxref
> ATTENTION: Did not found XRef object at specified startxref position 173
>
> but the document is correctly loaded by PDFBox.

As I see it, the document is broken because the offset specified in
startxref does not point to the start of the xref section. Since
NonSequentialPDFParser currently has only a few options to recover from
parsing problems, it stops and throws an exception. With PDDocument.load()
you use the standard PDFParser, which copes better with a corrupt xref
definition (it ignores it and detects the start of objects by itself) but
has other problems because it does not use xref definitions in some
cases. Thus, to get the best of both, you should first use
PDDocument.loadNonSeq() and, if this fails with an exception, fall back
to PDDocument.load().
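
The fallback can be sketched as follows. To keep the example runnable
without PDFBox on the classpath, the two loading strategies are passed in
through a small Loader interface; that interface, the class name and the
method name are hypothetical. In real code the loaders would presumably be
f -> PDDocument.loadNonSeq(f, null) and f -> PDDocument.load(f), assuming
the two-argument loadNonSeq(File, RandomAccess) overload of the 1.x API.

```java
import java.io.File;
import java.io.IOException;

/** Sketch of "try the strict parser first, then fall back to the lenient one". */
final class FallbackLoadSketch {

    /** Hypothetical stand-in for one loading strategy. */
    interface Loader<T> {
        T load(File file) throws IOException;
    }

    /** Returns the primary result, or the fallback result if the primary loader throws. */
    static <T> T loadWithFallback(File file, Loader<T> primary, Loader<T> fallback)
            throws IOException {
        try {
            return primary.load(file);
        } catch (IOException primaryFailure) {
            try {
                return fallback.load(file);
            } catch (IOException fallbackFailure) {
                // Keep the first error so that neither failure is lost.
                fallbackFailure.addSuppressed(primaryFailure);
                throw fallbackFailure;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Stub loaders that imitate the behaviour described in this thread.
        Loader<String> strict = f -> {
            throw new IOException("Expected an integer type, actual='xref'");
        };
        Loader<String> lenient = f -> "loaded " + f.getName();
        System.out.println(loadWithFallback(new File("sample.pdf"), strict, lenient));
    }
}
```

Only IOException triggers the fallback here; whether a parser can also fail
with unchecked exceptions on a broken file is not covered by this sketch.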

> I have a problem providing the sample file, because it contains
> confidential data.

It is quite clear to me that startxref is wrong. However, you could send
only the tail (which contains the 'startxref' keyword and the following
lines) and the first 220 bytes of the file (according to the warning,
xref is supposed to start at 173). With this information, which shouldn't
contain any confidential data, I could verify the diagnosis.
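
A small stand-alone program could extract exactly these parts and run a
first check. This is a sketch under assumptions: the class and method names
are made up, and it only tests for the 'xref' keyword of a classic
cross-reference table. If the file uses a cross-reference stream, the offset
points to an object header instead, which this check reports as a mismatch.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

/** Sketch: inspect the head and tail of a PDF without sharing its content. */
final class StartxrefCheckSketch {

    private static final String KEYWORD = "startxref";

    /** Index of the last 'startxref' keyword, or -1 if the file has none. */
    static int findStartxrefKeyword(byte[] pdf) {
        // ISO-8859-1 maps each byte to one char, so string indexes equal byte offsets.
        return new String(pdf, StandardCharsets.ISO_8859_1).lastIndexOf(KEYWORD);
    }

    /** Offset written after the last 'startxref' keyword, or -1 if it cannot be read. */
    static long readStartxref(byte[] pdf) {
        int keyword = findStartxrefKeyword(pdf);
        if (keyword < 0) {
            return -1;
        }
        int pos = keyword + KEYWORD.length();
        while (pos < pdf.length && (pdf[pos] == ' ' || pdf[pos] == '\r'
                || pdf[pos] == '\n' || pdf[pos] == '\t')) {
            pos++;
        }
        long value = 0;
        int digits = 0;
        // Cap the digit count so that the accumulated value cannot overflow a long.
        while (pos < pdf.length && pdf[pos] >= '0' && pdf[pos] <= '9' && digits < 18) {
            value = value * 10 + (pdf[pos] - '0');
            pos++;
            digits++;
        }
        return digits > 0 ? value : -1;
    }

    /** True if the bytes at the offset start with the 'xref' keyword of a classic table. */
    static boolean pointsToXrefTable(byte[] pdf, long offset) {
        byte[] xref = "xref".getBytes(StandardCharsets.ISO_8859_1);
        if (offset < 0 || offset + xref.length > pdf.length) {
            return false;
        }
        for (int i = 0; i < xref.length; i++) {
            if (pdf[(int) offset + i] != xref[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java StartxrefCheckSketch <file.pdf>");
            return;
        }
        byte[] pdf = Files.readAllBytes(Paths.get(args[0]));
        long offset = readStartxref(pdf);
        System.out.println("startxref value: " + offset);
        System.out.println("points to 'xref': " + pointsToXrefTable(pdf, offset));

        byte[] head = Arrays.copyOfRange(pdf, 0, Math.min(220, pdf.length));
        System.out.println("--- first " + head.length + " bytes ---");
        System.out.println(new String(head, StandardCharsets.ISO_8859_1));

        int keyword = findStartxrefKeyword(pdf);
        if (keyword >= 0) {
            System.out.println("--- tail from 'startxref' ---");
            byte[] tail = Arrays.copyOfRange(pdf, keyword, pdf.length);
            System.out.println(new String(tail, StandardCharsets.ISO_8859_1));
        }
    }
}
```

The whole file is read into memory, which is fine for a one-off diagnosis
but not meant for very large documents.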


Best regards,
Timo



Re: Problem to parse a PDF document

Posted by pi...@huttin.com.

Many thanks, I have attached the file to the issue.

Now it works fine for this kind of document, but I see a side effect
on other documents that worked fine in the past.

I receive the following error message.

Caused by: java.io.IOException: Error: Expected an integer type, actual='xref'
	at org.apache.pdfbox.pdfparser.BaseParser.readInt(BaseParser.java:1541)
	at org.apache.pdfbox.pdfparser.NonSequentialPDFParser.parseXrefObjStream(NonSequentialPDFParser.java:354)
	at org.apache.pdfbox.pdfparser.NonSequentialPDFParser.initialParse(NonSequentialPDFParser.java:266)
	at org.apache.pdfbox.pdfparser.NonSequentialPDFParser.parse(NonSequentialPDFParser.java:574)
	at org.apache.pdfbox.pdmodel.PDDocument.loadNonSeq(PDDocument.java:1124)
	at org.apache.pdfbox.pdmodel.PDDocument.loadNonSeq(PDDocument.java:1107)

If I use the PDDocument.load() method, I get this warning message:

14 juin 2012 09:58:30 org.apache.pdfbox.pdfparser.XrefTrailerResolver setStartxref
ATTENTION: Did not found XRef object at specified startxref position 173

but the document is correctly loaded by PDFBox.

I have a problem providing the sample file, because it contains
confidential data.

Best regards

Pierre Huttin




Re: Problem to parse a PDF document

Posted by Timo Boehme <ti...@ontochem.com>.

On 13.06.2012 14:02, pierre@huttin.com wrote:
> Sorry,
>
> apparently the PDF was not correctly attached to the previous mail; I
> have just zipped it and re-attached it.
>
> Pierre Huttin

With PDFBOX-1099 resolved
(https://issues.apache.org/jira/browse/PDFBOX-1099), the page count is
correct with both parsers (NonSequentialPDFParser and PDFParser).

For testing purposes it would be helpful to have your example PDF
associated with PDFBOX-1099. Could you upload it to this issue (and
tick the 'Grant license to ASF for inclusion in ASF works (as per the
Apache License §5)' option), or give permission to do so with the file
attached to your previous email, including the license grant?

Best regards,
Timo


Re: Problem to parse a PDF document

Posted by pi...@huttin.com.

Sorry,

Apparently the PDF was not correctly attached to the previous mail; I
have just zipped it and re-attached it.

Pierre Huttin
