Posted to dev@pdfbox.apache.org by pi...@huttin.com on 2012/06/13 13:56:50 UTC
Problem to parse a PDF document
Hello,
I have trouble with some documents: the library is not able to
retrieve the number of pages and load them into the list using the
PDDocument.getDocumentCatalog().getAllPages() method.
The PDF file and the Java code to retrieve the number of pages are
attached to this mail. It looks like the PDFParser does not read the
/Pages object correctly; the page references are "8 0" and "19 0".
The document opens correctly in Adobe Reader and iText RUPS, and both
report the correct number of pages: 2.
I am running my code with version 1.7.0 of PDFBox.
Thanks in advance for your help.
Best regards
Pierre Huttin
Re: Problem to parse a PDF document
Posted by Dave Smith <da...@candata.com>.
Fine, but because of PDFBOX-1067 it will not render ...
Dave Smith
Candata Ltd.
416-493-9020x2413
Direct: 416-855-2413
On Wed, Jun 13, 2012 at 10:27 AM, Timo Boehme <ti...@ontochem.com> wrote:
> Hi,
>
> On 13.06.2012 14:29, Dave Smith wrote:
>
>> Bug
>> https://issues.apache.org/jira/browse/PDFBOX-1067
>>
>
> as I see it this bug has nothing to do with PDFBOX-1067 but relates to
> PDFBOX-1099. The PDF in question was changed and we have 2 XREF tables and
> 2 object streams. The pages object (objnr 2) is in both streams (first with
> 1 page, second with 2 pages) and first stream is parsed first, second after
> it and existing objects are skipped which is wrong in this case. For a
> correct handling XREF information must be used.
>
> However there is a workaround: use NonSequentialPDFParser. Load your
> document with PDDocument.loadNonSeq() and you are fine.
>
>
> Best regards,
> Timo
>
Re: Problem to parse a PDF document
Posted by Timo Boehme <ti...@ontochem.com>.
Hi,
On 13.06.2012 14:29, Dave Smith wrote:
> Bug
> https://issues.apache.org/jira/browse/PDFBOX-1067
As I see it, this bug has nothing to do with PDFBOX-1067 but relates to
PDFBOX-1099. The PDF in question was changed, so we have 2 XREF tables
and 2 object streams. The pages object (object number 2) is in both
streams (the first with 1 page, the second with 2 pages). The first
stream is parsed first and the second after it, and objects that
already exist are skipped, which is wrong in this case. For correct
handling the XREF information must be used.
However, there is a workaround: use NonSequentialPDFParser. Load your
document with PDDocument.loadNonSeq() and you are fine.
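To make the failure mode described above concrete, here is a toy model in Java. It is not PDFBox code: the class and method names are invented for illustration, and each "object stream" is reduced to a map from object number to the page count stored in that revision of the /Pages object. The "latest wins" strategy is a simplification standing in for proper XREF-driven lookup, which is what decides which copy of an object is current.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model only: object number -> page count stored in a /Pages object. */
final class ObjectStreamModel {

    /**
     * Sequential strategy: streams are read in file order and an object
     * number that was already seen is skipped.
     */
    static Map<Integer, Integer> loadSkippingExisting(List<Map<Integer, Integer>> streams) {
        Map<Integer, Integer> objects = new LinkedHashMap<>();
        for (Map<Integer, Integer> stream : streams) {
            for (Map.Entry<Integer, Integer> entry : stream.entrySet()) {
                objects.putIfAbsent(entry.getKey(), entry.getValue());
            }
        }
        return objects;
    }

    /** Simplified XREF-style strategy: a later update overrides an earlier definition. */
    static Map<Integer, Integer> loadLatestWins(List<Map<Integer, Integer>> streams) {
        Map<Integer, Integer> objects = new LinkedHashMap<>();
        for (Map<Integer, Integer> stream : streams) {
            objects.putAll(stream);
        }
        return objects;
    }

    public static void main(String[] args) {
        // Object 2 is the /Pages object: 1 page in the first stream, 2 pages in the second.
        List<Map<Integer, Integer>> streams = List.of(Map.of(2, 1), Map.of(2, 2));
        System.out.println("skip existing: " + loadSkippingExisting(streams).get(2));
        System.out.println("latest wins:   " + loadLatestWins(streams).get(2));
    }
}
```

In this model the skip-existing strategy keeps the stale one-page entry, which matches the symptom reported for the sample document.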
Best regards,
Timo
--
Timo Boehme
OntoChem GmbH
H.-Damerow-Str. 4
06120 Halle/Saale
T: +49 345 4780474
F: +49 345 4780471
timo.boehme@ontochem.com
_____________________________________________________________________
OntoChem GmbH
Geschäftsführer: Dr. Lutz Weber
Sitz: Halle / Saale
Registergericht: Stendal
Registernummer: HRB 215461
_____________________________________________________________________
Re: Problem to parse a PDF document
Posted by Dave Smith <da...@candata.com>.
Bug
https://issues.apache.org/jira/browse/PDFBOX-1067
Dave Smith
Candata Ltd.
416-493-9020x2413
Direct: 416-855-2413
Re: Problem to parse a PDF document
Posted by Timo Boehme <ti...@ontochem.com>.
Dear Pierre Huttin,
On 14.06.2012 10:07, pierre@huttin.com wrote:
> Many thanks, I have attached the file to the issue.
Thanks.
> Now it works fine for this kind of document, but I have a side effect
> on other documents, which worked fine in the past.
>
> I receive the following error message.
>
> Caused by: java.io.IOException: Error: Expected an integer type,
> actual='xref'
> org.apache.pdfbox.pdfparser.BaseParser.readInt(BaseParser.java:1541)
> org.apache.pdfbox.pdfparser.NonSequentialPDFParser.parseXrefObjStream(NonSequentialPDFParser.java:354)
> org.apache.pdfbox.pdfparser.NonSequentialPDFParser.initialParse(NonSequentialPDFParser.java:266)
> org.apache.pdfbox.pdfparser.NonSequentialPDFParser.parse(NonSequentialPDFParser.java:574)
> org.apache.pdfbox.pdmodel.PDDocument.loadNonSeq(PDDocument.java:1124)
> org.apache.pdfbox.pdmodel.PDDocument.loadNonSeq(PDDocument.java:1107)
>
> If I use the PDDocument.load() method I receive this warning message :
>
> 14 juin 2012 09:58:30 org.apache.pdfbox.pdfparser.XrefTrailerResolver
> setStartxref
> ATTENTION: Did not found XRef object at specified startxref position
> 173
>
> but the document is correctly loaded by PDFBox.
As I see it, the document is broken because the offset specified in
startxref does not point to the start of the xref section. Since
NonSequentialPDFParser currently has only a few options to recover from
parsing problems, it stops and throws an exception. With
PDDocument.load() you use the standard PDFParser, which copes better
with a corrupt xref definition (it ignores it and detects the start of
objects by itself) but has other problems because it does not use the
xref definitions in some cases. Thus, to get the best of both, you
should first use PDDocument.loadNonSeq() and, if this fails with an
exception, fall back to PDDocument.load().
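The fallback advice above can be sketched in plain Java. This is only an illustration of the control flow: the FallbackLoader class and its Loader interface are hypothetical helpers, not part of PDFBox, and the two loading steps are passed in as lambdas so that the sketch does not have to assume any particular PDFBox method signature.

```java
import java.io.IOException;

/** Sketch of "try the strict parser first, then fall back to the lenient one". */
final class FallbackLoader {

    /** A loading step that may fail with an IOException (hypothetical helper type). */
    @FunctionalInterface
    interface Loader<T> {
        T load() throws IOException;
    }

    /** Runs the primary loader; if it throws an IOException, runs the fallback instead. */
    static <T> T loadWithFallback(Loader<T> primary, Loader<T> fallback) throws IOException {
        try {
            return primary.load();
        } catch (IOException primaryFailure) {
            try {
                return fallback.load();
            } catch (IOException fallbackFailure) {
                // Keep the first failure attached so both problems show up in the stack trace.
                fallbackFailure.addSuppressed(primaryFailure);
                throw fallbackFailure;
            }
        }
    }
}
```

With PDFBox, the primary lambda would wrap a call to PDDocument.loadNonSeq() and the fallback lambda a call to PDDocument.load(), each with whatever arguments your PDFBox version expects.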
> I have a problem providing the sample file, because it contains
> confidential data.
It is quite clear to me that startxref is wrong. However, you could send
only the tail (which contains the 'startxref' and following lines) and
the first 220 bytes of the file (according to the warning message, xref
is supposed to start at 173). With this information, which shouldn't
contain any confidential data, I could verify the diagnosis.
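The check described above can also be done locally. The following is a minimal sketch using only the Java standard library; StartXrefCheck and its methods are hypothetical names, not PDFBox API. It reads the whole file into memory for simplicity and only recognises a classic cross-reference table: if the file uses a cross-reference stream, startxref points at an object header instead of the keyword "xref", and this sketch reports false for that case as well.

```java
import java.nio.charset.StandardCharsets;

/** Sketch: locate the last "startxref" entry and inspect what it points to. */
final class StartXrefCheck {

    /** Returns the offset written after the last "startxref" keyword, or -1 if none is found. */
    static long findStartXref(byte[] pdf) {
        // ISO-8859-1 maps every byte to exactly one char, so string indexes equal byte offsets.
        String text = new String(pdf, StandardCharsets.ISO_8859_1);
        int keyword = text.lastIndexOf("startxref");
        if (keyword < 0) {
            return -1;
        }
        int pos = keyword + "startxref".length();
        while (pos < text.length() && Character.isWhitespace(text.charAt(pos))) {
            pos++;
        }
        int end = pos;
        while (end < text.length() && Character.isDigit(text.charAt(end))) {
            end++;
        }
        return end > pos ? Long.parseLong(text.substring(pos, end)) : -1;
    }

    /** True if the keyword "xref" starts exactly at the given byte offset. */
    static boolean pointsToXrefTable(byte[] pdf, long offset) {
        byte[] keyword = "xref".getBytes(StandardCharsets.ISO_8859_1);
        if (offset < 0 || offset + keyword.length > pdf.length) {
            return false;
        }
        for (int i = 0; i < keyword.length; i++) {
            if (pdf[(int) offset + i] != keyword[i]) {
                return false;
            }
        }
        return true;
    }
}
```

A typical use would be to read the file with java.nio.file.Files.readAllBytes, pass the bytes to findStartXref, and then call pointsToXrefTable with the returned offset; a false result for a file with a classic table is consistent with the "wrong startxref" diagnosis.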
Best regards,
Timo
Re: Problem to parse a PDF document
Posted by pi...@huttin.com.
Many thanks, I have attached the file to the issue.
Now it works fine for this kind of document, but I have a side effect
on other documents, which worked fine in the past.
I receive the following error message.
Caused by: java.io.IOException: Error: Expected an integer type, actual='xref'
    at org.apache.pdfbox.pdfparser.BaseParser.readInt(BaseParser.java:1541)
    at org.apache.pdfbox.pdfparser.NonSequentialPDFParser.parseXrefObjStream(NonSequentialPDFParser.java:354)
    at org.apache.pdfbox.pdfparser.NonSequentialPDFParser.initialParse(NonSequentialPDFParser.java:266)
    at org.apache.pdfbox.pdfparser.NonSequentialPDFParser.parse(NonSequentialPDFParser.java:574)
    at org.apache.pdfbox.pdmodel.PDDocument.loadNonSeq(PDDocument.java:1124)
    at org.apache.pdfbox.pdmodel.PDDocument.loadNonSeq(PDDocument.java:1107)
If I use the PDDocument.load() method I receive this warning message:
14 juin 2012 09:58:30 org.apache.pdfbox.pdfparser.XrefTrailerResolver setStartxref
ATTENTION: Did not found XRef object at specified startxref position 173
but the document is correctly loaded by PDFBox.
I have a problem providing the sample file, because it contains
confidential data.
Best regards
Pierre Huttin
Re: Problem to parse a PDF document
Posted by Timo Boehme <ti...@ontochem.com>.
On 13.06.2012 14:02, pierre@huttin.com wrote:
> Sorry,
>
> apparently the pdf was not correctly attached to the previous mail, I
> just zip it and re-attach it.
>
> Pierre Huttin
With PDFBOX-1099 resolved
(https://issues.apache.org/jira/browse/PDFBOX-1099), the page count is
correct with both parsers (NonSequentialPDFParser and PDFParser).
For testing purposes it would be helpful to have your example PDF
associated with PDFBOX-1099. Could you upload it to this issue (and
tick 'Grant license to ASF for inclusion in ASF works (as per the
Apache License §5)'), or give permission to attach the file from your
previous email with the license grant?
Best regards,
Timo
Re: Problem to parse a PDF document
Posted by pi...@huttin.com.
Sorry,
apparently the PDF was not correctly attached to the previous mail. I
have zipped it and re-attached it.
Pierre Huttin