Posted to users@pdfbox.apache.org by Vomlel Jan <Ja...@aipsafe.cz> on 2014/10/10 09:09:14 UTC

problem with pdf eof

Hello,
I use the PDFBox 1.8.7 PDDocument.load(InputStream is) method to parse the attached PDF document.
The method returns without an exception, but the document model is incomplete.

The problem is caused by the characters after EOF (offset 22939):
startxref
22449
%%EOF
@
16 0 obj
<<
/Type /Catalog

PDFBox creates an internal IOException and ignores it, with this comment:
                    /*
                     * PDF files may have random data after the EOF marker. Ignore errors if
                     * last object processed is EOF.
                     */

Is this PDF construction valid?
Which parser in PDFBox is correct? I tried ConformingPDParser, but another error occurred.

Jan
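
For files like this one it can help to list where each %%EOF marker sits and what follows it. The class below is a minimal stand-alone sketch for that, not PDFBox code; the class name EofMarkers and its methods are made up for illustration. It scans the raw bytes, so it will also report a %%EOF that happens to occur inside a stream.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of PDFBox): locates %%EOF markers in raw PDF bytes. */
final class EofMarkers {
    private static final byte[] MARKER = "%%EOF".getBytes(StandardCharsets.ISO_8859_1);

    /** Returns the byte offset of every %%EOF marker, in file order. */
    static List<Integer> findEofOffsets(byte[] data) {
        List<Integer> offsets = new ArrayList<>();
        for (int i = 0; i + MARKER.length <= data.length; i++) {
            if (matchesAt(data, i)) {
                offsets.add(i);
                i += MARKER.length - 1; // continue right after this marker
            }
        }
        return offsets;
    }

    /** Returns up to maxBytes bytes that follow the marker at markerOffset, decoded as ISO-8859-1. */
    static String previewAfter(byte[] data, int markerOffset, int maxBytes) {
        int start = Math.min(data.length, markerOffset + MARKER.length);
        int end = Math.min(data.length, start + maxBytes);
        return new String(data, start, end - start, StandardCharsets.ISO_8859_1);
    }

    /** True if the marker bytes occur at position pos; the caller guarantees enough remaining bytes. */
    private static boolean matchesAt(byte[] data, int pos) {
        for (int j = 0; j < MARKER.length; j++) {
            if (data[pos + j] != MARKER[j]) {
                return false;
            }
        }
        return true;
    }
}
```

Calling EofMarkers.previewAfter(bytes, offset, 16) for each reported offset shows whether a marker is followed by stray bytes such as the '@' above or by the start of the next revision.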



________________________________

Neither this e-mail nor any of the attached files constitutes acceptance of an offer to conclude a contract, unless this is expressly stated in them. If it is not, they cannot be regarded as conduct that would give rise to any claims against AiP Safe. This e-mail is intended only for the stated recipient and other persons named as recipients, and its content, including the content of all attached files, is confidential. If you are not an authorized recipient, please refrain from any form of disclosure, reproduction, copying, distribution, or dissemination of its content, including the content of all attached files. If you have received this e-mail in error, please notify the sender immediately and delete the e-mail, including all attached files. All e-mails addressed to, received by, or sent by AiP Safe s.r.o. or employees of AiP Safe s.r.o. are considered to be strictly work e-mails. Accordingly, the sender or recipient of these e-mails agrees that they may be read by employees of AiP Safe s.r.o. other than the given recipient or sender, in order to ensure the continuity of work activities and to enable their oversight.

Re: problem with pdf eof

Posted by Maruan Sahyoun <sa...@fileaffairs.de>.
Signatures are currently only supported with load, not loadNonSeq, AFAIK. Please open an issue in Jira [1] with sample files and code to reproduce the issue so that this is not forgotten.

There are plans to make the nonSeq parser the default one and to ensure that its features match those of the old one, since nonSeq parses in line with the PDF spec. It’s likely that this will be done in stages, as we are looking to get PDFBox 2.0 out as soon as possible (but there are still some major issues open; Jira will give you an overview of these).

BR

Maruan

[1] https://issues.apache.org/jira/browse/PDFBOX
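
The loading advice given in this thread (load() when the document will be signed, otherwise loadNonSeq() with load() as a fallback) can be sketched as a small helper. This is only an illustration: FallbackLoader, Loader, loadWithFallback and loadFor are hypothetical names, and the helper is generic so that it runs without PDFBox on the classpath.

```java
import java.io.IOException;

/** Hypothetical helper (not part of PDFBox): picks a loading strategy and falls back on failure. */
final class FallbackLoader {

    /** A loading step that may fail with an IOException. */
    @FunctionalInterface
    interface Loader<T> {
        T load() throws IOException;
    }

    /** Tries the primary loader first and uses the fallback only if the primary throws. */
    static <T> T loadWithFallback(Loader<T> primary, Loader<T> fallback) throws IOException {
        try {
            return primary.load();
        } catch (IOException primaryFailure) {
            try {
                return fallback.load();
            } catch (IOException fallbackFailure) {
                // Keep the first failure visible for diagnosis.
                fallbackFailure.addSuppressed(primaryFailure);
                throw fallbackFailure;
            }
        }
    }

    /**
     * Uses only the sequential loader when the document will be signed;
     * otherwise tries the non-sequential loader with the sequential one as fallback.
     */
    static <T> T loadFor(boolean willSign, Loader<T> nonSequential, Loader<T> sequential)
            throws IOException {
        if (willSign) {
            return sequential.load();
        }
        return loadWithFallback(nonSequential, sequential);
    }
}
```

With PDFBox 1.8 the two loaders would presumably be passed as lambdas, for example () -> PDDocument.loadNonSeq(file, null) and () -> PDDocument.load(file); that wiring is an assumption and was not tested here.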

On 16.10.2014 at 15:44, Vomlel Jan <Ja...@aipsafe.cz> wrote:

> When I use load instead of loadNonSeq, the signatures are valid in this case.
> 
> But for some documents the load function does not read the complete document. That is why I used loadNonSeq. Some signatures are then missing.
> 
> See:
> http://leteckaposta.cz/831516385
> h1.pdf - original file (signature and timestamp)
> h2.pdf - add first signature by pdfbox (timestamp is missing)
> h3.pdf - add second signature by pdfbox (timestamp and previous signature are missing)
> 
> Jan
> 
> -----Original Message-----
> From: Maruan Sahyoun [mailto:sahyoun@fileaffairs.de] 
> Sent: Thursday, October 16, 2014 2:37 PM
> To: users@pdfbox.apache.org
> Subject: Re: problem with pdf eof
> 
> When signing, please make sure that you load the PDF using PDDocument.load instead of PDDocument.loadNonSeq.
> 
> 
> On 16.10.2014 at 11:57, Vomlel Jan <Ja...@aipsafe.cz> wrote:
> 
>> 
>> 
>> -----Original Message-----
>> From: Maruan Sahyoun [mailto:sahyoun@fileaffairs.de] 
>> Sent: Thursday, October 16, 2014 11:55 AM
>> To: users@pdfbox.apache.org
>> Subject: Re: problem with pdf eof
>> 
>> When you say invalid, do you mean it’s corrupted, or that, e.g., you get a warning sign in Adobe Reader? Would you have a sample PDF?
>> 
>> When you sign a document and then sign it again, the first signature points to a different document revision, because you have changed the document’s content afterwards. So invalid in that context could mean that the warning you might be getting only reflects that fact. I would need to see the document to understand what’s going on.
>> 
>> BR
>> 
>> Maruan
>> 
>> On 16.10.2014 at 11:48, Vomlel Jan <Ja...@aipsafe.cz> wrote:
>> 
>>> Hi Maruan and others,
>>> 
>>> I created a signature and it seems OK.
>>> But when I create a second signature (loadNonSeq, addSignature, saveIncremental again), the first signature becomes invalid.
>>> I think the problem may be that the first page is updated (the signature is invisible), but I don't understand it well enough.
>>> 
>>> Jan
>>> 
>>> -----Original Message-----
>>> From: Maruan Sahyoun [mailto:sahyoun@fileaffairs.de] 
>>> Sent: Monday, October 13, 2014 4:09 PM
>>> To: users@pdfbox.apache.org
>>> Subject: Re: problem with pdf eof
>>> 
>>> Hi Jan,
>>> 
>>> There are samples in the examples package for various ways to sign a document [1]. Signing a document requires incremental saving.
>>> 
>>> OTOH, the choice of the right solution should not be based on whether there is a license fee or not.
>>> 
>>> Maruan Sahyoun
>>> 
>>> [1] http://svn.apache.org/viewvc/pdfbox/trunk/examples/src/main/java/org/apache/pdfbox/examples/signature/
>>> 
>>> 
>>> On 13.10.2014 at 16:02, Vomlel Jan <Ja...@aipsafe.cz> wrote:
>>> 
>>>> Hi Maruan (and others),
>>>> 
>>>> I would like to use PDFBox and Bouncy Castle for managing PDF signatures: parsing, validation, and timestamping (PAdES LTV).
>>>> We used iText for it, but it is under a commercial licence.
>>>> Parsing signatures seems to be working (thanks to your advice), so I will try to create a timestamp.
>>>> Is that possible with PDFBox? I found the save method on PDDocument, but I'm afraid that it can change the byte representation of the PDF, so that signatures become invalid. Is that true? What is the right way to create a signature or timestamp with PDFBox?
>>>> 
>>>> Jan
>>>> 
>>>> 
>>>> -----Original Message-----
>>>> From: Maruan Sahyoun [mailto:sahyoun@fileaffairs.de] 
>>>> Sent: Friday, October 10, 2014 10:44 AM
>>>> To: users@pdfbox.apache.org
>>>> Subject: Re: problem with pdf eof
>>>> 
>>>> Hi Jan,
>>>> 
>>>> Choosing the right technology is very important, so I do understand your concerns. I had to make such a decision about using PDFBox in the past too.
>>>> It can 
>>>> If you have specific issues I can answer I’m happy to try to do so. As a general statement PDFBox is used in production environments today (as an example we ourselves are using it for a banking customer to process account statements, an airline company to preprocess archiving documents and various other customers). 
>>>> 
>>>> PDFBox is continuously enhancing its parsing as we try to deal with real-world PDF files, which are not always in line with the PDF specification. Currently the best approach is to use PDDocument.loadNonSeq (which parses documents according to the Xref information) and, in case of an exception, PDDocument.load (which parses sequentially). The Apache Tika project, which uses PDFBox for parsing PDFs, is running the parsing and text extraction against 50k PDFs made available via http://digitalcorpora.org
>>>> 
>>>> What is the application you would like to use PDFBox for? Text extraction, image conversion, ...? I might be able to give you more specific information for your use case.
>>>> 
>>>> BR
>>>> 
>>>> Maruan
>>>> 
>>>> On 10.10.2014 at 10:10, Vomlel Jan <Ja...@aipsafe.cz> wrote:
>>>> 
>>>>> Thank you, Maruan, this function loads the document.
>>>>> 
>>>>> I have read https://pdfbox.apache.org/ideas.html "Replace/Enhance PDF parsing". I think correct parsing is very important, and I have some doubts about whether I can use PDFBox in production. Can you say something to reassure me? :-)
>>>>> 
>>>>> Jan
>>>>> 
>>>>> -----Original Message-----
>>>>> From: Maruan Sahyoun [mailto:sahyoun@fileaffairs.de] 
>>>>> Sent: Friday, October 10, 2014 9:25 AM
>>>>> To: users@pdfbox.apache.or
>>>>> Subject: Re: problem with pdf eof
>>>>> 
>>>>> Hi 
>>>>> 
>>>>> you can try PDDocument.loadNonSeq(InputStream is, null) 
>>>>> 
>>>>> BR
>>>>> 
>>>>> Maruan
>>>>> 


RE: problem with pdf eof

Posted by Vomlel Jan <Ja...@aipsafe.cz>.
Yes, we work with PDF files that contain two %%EOF markers; after the first %%EOF there are several random bytes followed by a regular object.
PDFBox does not read this object, see bug PDFBOX-2436. So I created a patch for it.
Jan
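
The idea behind the patch quoted below, skipping stray bytes after %%EOF until the next object starts, can be illustrated without PDFBox. The following is a minimal sketch and not the code of the patch or of PDFBox's skipToNextObj(); NextObjectFinder and findNextObjectOffset are hypothetical names, and it assumes that an object header of the form "number generation obj" begins on its own line.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of PDFBox): finds the next "N G obj" header in raw PDF bytes. */
final class NextObjectFinder {
    // Object number, generation number and the keyword "obj" at the start of a line.
    private static final Pattern OBJ_HEADER = Pattern.compile("(?m)^\\d+\\s+\\d+\\s+obj\\b");

    /**
     * Returns the byte offset of the first object header at or after 'from',
     * or -1 if there is none or 'from' is out of range.
     */
    static int findNextObjectOffset(byte[] data, int from) {
        if (from < 0 || from > data.length) {
            return -1;
        }
        // ISO-8859-1 maps every byte to exactly one char, so char index equals byte offset.
        String text = new String(data, StandardCharsets.ISO_8859_1);
        Matcher matcher = OBJ_HEADER.matcher(text);
        return matcher.find(from) ? matcher.start() : -1;
    }
}
```

A parser could resume reading at the returned offset after it has seen %%EOF; whether bytes between the marker and that offset should be ignored silently or logged is a separate design decision.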

 
-----Original Message-----
From: Maruan Sahyoun [mailto:sahyoun@fileaffairs.de] 
Sent: Monday, October 20, 2014 2:39 PM
To: users@pdfbox.apache.org
Subject: Re: problem with pdf eof

If the PDF contains incremental updates then there will be multiple %%EOF - that’s fine.

BR

Maruan

On 20.10.2014 at 13:50, Vomlel Jan <Ja...@aipsafe.cz> wrote:

> Hi Maruan,
> 
> I created a patch for bug PDFBOX-2436.
> 
> After %%EOF, it skips data up to the next object.
> 
> I don't know whether such data is allowed by the specification, but a Czech portal creates such files, and Acrobat has no problem with them.
> 
> I changed org.apache.pdfbox.pdfparser.PDFParser near line 584, branch 1.8. Can you commit it and fix this bug?
> 
> 
>                            pdfSource.unread(eof.getBytes("ISO-8859-1"));
>                        }
>                    }
>                }
>                isEndOfFile = true;
> 
>                //PDFBOX-2436 - some files contain binary data after %%EOF.
>                skipToNextObj();
>            }
>        }
>        // we are going to parse a normal object
>        else
> 
> Thank you, Jan
> 
> -----Original Message-----
> From: Vomlel Jan
> Sent: Friday, October 17, 2014 9:12 AM
> To: users@pdfbox.apache.org; brzrkr@pobox.com
> Subject: RE: problem with pdf eof
> 
> I reported the parsing error in the load function:
> https://issues.apache.org/jira/browse/PDFBOX-2436
> Jan
> 
> -----Original Message-----
> From: Maruan Sahyoun [mailto:sahyoun@fileaffairs.de]
> Sent: Thursday, October 16, 2014 8:23 PM
> To: users@pdfbox.apache.org; brzrkr@pobox.com
> Subject: Re: problem with pdf eof
> 
> Sorry if that was unclear: as of now, if you’d like to sign, you have to use load(); loadNonSeq() is not an option!
> 
> For all other cases, use loadNonSeq() and, if that fails, load() as a fallback.
> 
> We are working on getting the missing signing support into nonSeq() but that will probably be after 2.0.
> 
> Now if you have parsing issues with load() please open an issue in Jira and attach the PDFs together with code to reproduce it. Same if you have parsing issues with loadNonSeq().
> 
> Of course if someone is willing to help getting that in … patches are welcome.
> 
> Maruan
> 
> On 16.10.2014 at 20:13, Brzrk One <br...@gmail.com> wrote:
> 
>> I hear dual advice here...
>> - don't use NonSeq for signatures
>> - but use NonSeq for multiple EOFs
>> Files with both multiple EOFs and signatures will have problems...
>> unless you mean we should parse 2x?
>> 
>> On Thu, Oct 16, 2014 at 12:12 PM, Maruan Sahyoun 
>> <sa...@fileaffairs.de>
>> wrote:
>> 
>>> depends on the parser being used. NonSeq does follow the Xref 
>>> information and handles multiple EOFs (incremental updates) when parsing.
>>> 
>>> BR
>>> Maruan
>>> 
>>> On 16.10.2014 at 17:01, Brzrk One <br...@gmail.com> wrote:
>>> 
>>> I've noticed that when there are multiple EOFs in the file, PDFBox 
>>> parsing is less reliable.
>>> 
>>> 


Re: problem with pdf eof

Posted by Maruan Sahyoun <sa...@fileaffairs.de>.
If the PDF contains incremental updates then there will be multiple %%EOF - that’s fine.

BR

Maruan

Am 20.10.2014 um 13:50 schrieb Vomlel Jan <Ja...@aipsafe.cz>:

> Hi Maruan,
> 
> I create patch for bug PDFBOX-2436.
> 
> After %%EOF it skips data to next object.
> 
> I don´t know, if such data are allowed by specification, but some czech portal create them and acrobat have no problem with them.
> 
> I changed org.apache.pdfbox.pdfparser.PDFParser near line 584, branch 1.8. Can you commit it and fix this bug?
> 
> 
>                            pdfSource.unread(eof.getBytes("ISO-8859-1"));
>                        }
>                    }
>                }
>                isEndOfFile = true;
> 
>                //PDFBOX-2436 - some files contain binary data after %%EOF.
>                skipToNextObj();
>            }
>        }
>        //we are going to parse an normal object
>        Else
> 
> Thank you, Jan
> 
> -----Original Message-----
> From: Vomlel Jan
> Sent: Friday, October 17, 2014 9:12 AM
> To: users@pdfbox.apache.org; brzrkr@pobox.com
> Subject: RE: problem with pdf eof
> 
> I reported parsing error for load function:
> https://issues.apache.org/jira/browse/PDFBOX-2436
> Jan
> 
> -----Original Message-----
> From: Maruan Sahyoun [mailto:sahyoun@fileaffairs.de]
> Sent: Thursday, October 16, 2014 8:23 PM
> To: users@pdfbox.apache.org; brzrkr@pobox.com
> Subject: Re: problem with pdf eof
> 
> sorry if that has been unclear - as of now if you’d like to sign you have to use load() loadNonSeq() is not an option!
> 
> For all other cases use loadNonSeq() and if that fails load() as a fallback.
> 
> We are working on getting the missing signing support into nonSeq() but that will probably be after 2.0.
> 
> Now if you have parsing issues with load() please open an issue in Jira and attach the PDFs together with code to reproduce it. Same if you have parsing issues with loadNonSeq().
> 
> Of course if someone is willing to help getting that in … patches are welcome.
> 
> Maruan
> 
> Am 16.10.2014 um 20:13 schrieb Brzrk One <br...@gmail.com>:
> 
>> I hear dual advice here...
>> - don't use NonSeq for signatures
>> - but use NonSeq for multiple EOFs
>> Files with both multiple EOFs and signatures will have problems...
>> unless you mean we should parse 2x?
>> 
>> On Thu, Oct 16, 2014 at 12:12 PM, Maruan Sahyoun
>> <sa...@fileaffairs.de>
>> wrote:
>> 
>>> depends on the parser being used. NonSeq does follow the Xref
>>> information and handles multiple EOFs (incremental updates) when parsing.
>>> 
>>> BR
>>> Maruan
>>> 
>>> Am 16.10.2014 um 17:01 schrieb Brzrk One <br...@gmail.com>:
>>> 
>>> I've noticed that when there are multiple EOFs in the file, PDFBox
>>> parsing is less reliable.
>>> 
>>> 
>>> On Thu, Oct 16, 2014 at 9:44 AM, Vomlel Jan <Ja...@aipsafe.cz> wrote:
>>> 
>>> When I use load insted of loadNoSeq, signatures are in this case  valid.
>>> 
>>> But for some documents load function doesnot read complete document.
>>> That is why I used loadNoSeq. Some signatures are then missing.
>>> 
>>> Viz:
>>> http://leteckaposta.cz/831516385
>>> h1.pdf - original file (signature and timestamp) h2.pdf - add first
>>> signature by pdfbox (timestamp is missing) h3.pdf - add second
>>> signature by pdfbox (timestamp and previous signature is missing)
>>> 
>>> Jan
>>> 
>>> -----Original Message-----
>>> From: Maruan Sahyoun [mailto:sahyoun@fileaffairs.de]
>>> Sent: Thursday, October 16, 2014 2:37 PM
>>> To: users@pdfbox.apache.org
>>> Subject: Re: problem with pdf eof
>>> 
>>> when signing please make sure that you load the pdf using
>>> PDDocument.load instead of PDDocument.loadNonSeq.
>>> 
>>> 
>>> Am 16.10.2014 um 11:57 schrieb Vomlel Jan <Ja...@aipsafe.cz>:
>>> 
>>> 
>>> 
>>> -----Original Message-----
>>> From: Maruan Sahyoun [mailto:sahyoun@fileaffairs.de]
>>> Sent: Thursday, October 16, 2014 11:55 AM
>>> To: users@pdfbox.apache.org
>>> Subject: Re: problem with pdf eof
>>> 
>>> when you say invalid do you mean it’s corrupted or e.g. you get a
>>> 
>>> warning sign in Adobe Reader? Would you have a sample PDF?
>>> 
>>> 
>>> When you sign a document and sign it again the first signature points
>>> to
>>> 
>>> a different document revision as you have changed the documents
>>> content afterwards. So invalid in that context could mean that the
>>> warning you might be getting is only reflecting that fact. Would need
>>> to see the document to  understand what’s going on.
>>> 
>>> 
>>> BR
>>> 
>>> Maruan
>>> 
>>> Am 16.10.2014 um 11:48 schrieb Vomlel Jan <Ja...@aipsafe.cz>:
>>> 
>>> Hi Maruan and others,
>>> 
>>> I created signature and it seems OK.
>>> But when I create second signature (loadNonSeq, addSignature,
>>> 
>>> saveIncremental again), the first signature becomes invalid.
>>> 
>>> I think that there can be problem, that first page is updated
>>> (signatur
>>> 
>>> is invisible), but I dont understand it enough.
>>> 
>>> 
>>> Jan
>>> 
>>> -----Original Message-----
>>> From: Maruan Sahyoun [mailto:sahyoun@fileaffairs.de]
>>> Sent: Monday, October 13, 2014 4:09 PM
>>> To: users@pdfbox.apache.org
>>> Subject: Re: problem with pdf eof
>>> 
>>> Hi Jan,
>>> 
>>> there are sample in the examples package for various ways to sign a
>>> 
>>> document [1]. Signing a document needs incremental saving.
>>> 
>>> 
>>> OTOH choosing the right solution should not be made on the base if
>>> 
>>> there is a license fee or not.
>>> 
>>> 
>>> Maruan Sahyoun
>>> 
>>> [1]
>>> 
>>> 
>>> http://svn.apache.org/viewvc/pdfbox/trunk/examples/src/main/java/org/
>>> apache/pdfbox/examples/signature/
>>> 
>>> 
>>> 
>>> Am 13.10.2014 um 16:02 schrieb Vomlel Jan <Ja...@aipsafe.cz>:
>>> 
>>> Hi Maruan (and others),
>>> 
>>> I would like to use pdfbox and bouncycastle for managing pdf
>>> 
>>> signatures. Parsing, validation, timestamping (PADES LTV) .
>>> 
>>> We used itext for it, but it is under commercial licence.
>>> Parsing signatures seems to be working (thanks to your advice). So I
>>> 
>>> will try to create timestamp.
>>> 
>>> Is it possible with pdfbox?  I found save method on PDDocument, but
>>> 
>>> Iˇm afraid, that it can change bite representation of pdf, and
>>> signatures become invalid. Is it true? What is right way to create
>>> signature or timestamp with pdfbox?
>>> 
>>> 
>>> Jan
>>> 
>>> 
>>> -----Original Message-----
>>> From: Maruan Sahyoun [mailto:sahyoun@fileaffairs.de]
>>> Sent: Friday, October 10, 2014 10:44 AM
>>> To: users@pdfbox.apache.org
>>> Subject: Re: problem with pdf eof
>>> 
>>> Hi Jan,
>>> 
>>> choosing the right technology is very important so I do understand
>>> 
>>> your concerns. I had to make such decision about using PDFBox in the
>>> past too.
>>> 
>>> It can
>>> If you have specific issues I can answer I’m happy to try to do so.
>>> As
>>> 
>>> a general statement PDFBox is used in production environments today
>>> (as an example we ourselves are using it for a banking customer to
>>> process account statements, an airline company to preprocess
>>> archiving documents and various other customers).
>>> 
>>> 
>>> PDFBox is continuously enhancing the parsing as we try to deal with
>>> 
>>> real world PDF files which are not always inline with the the PDF
>>> specification. Currently the best approach is to use
>>> PDDocument.loadNonSeq (which parses documents according to the Xref
>>> information) and in case of an exception PDDocument.load (which
>>> parses sequentially). The Apache Tika project, which uses PDFBox for
>>> parsing PDF’s, is running the parsing and text extraction against 50k
>>> PDFs being made available via http://digitalcorpora.org
>>> 
>>> 
>>> What is the application you would like to be using PDFBox for? Text
>>> 
>>> Extraction, image conversion …. - I might be able to give you more
>>> specific information for your use case.
>>> 
>>> 
>>> BR
>>> 
>>> Maruan
>>> 
>>> Am 10.10.2014 um 10:10 schrieb Vomlel Jan <Ja...@aipsafe.cz>:
>>> 
>>> Thank you Maruan, this function loads document.
>>> 
>>> I have read https://pdfbox.apache.org/ideas.html "Replace/Enhance
>>> 
>>> PDF parsing". I think correct parsing is very important, and I have
>>> some doubts, if I can use pdfbox in production. Can you say something
>>> to rest me :-).
>>> 
>>> 
>>> Jan
>>> 
>>> -----Original Message-----
>>> From: Maruan Sahyoun [mailto:sahyoun@fileaffairs.de]
>>> Sent: Friday, October 10, 2014 9:25 AM
>>> To: users@pdfbox.apache.or
>>> Subject: Re: problem with pdf eof
>>> 
>>> Hi
>>> 
>>> you can try PDDocument.loadNonSeq(InputStream is, null)
>>> 
>>> BR
>>> 
>>> Maruan
>>> 
>>> Am 10.10.2014 um 09:09 schrieb Vomlel Jan <Ja...@aipsafe.cz>:
>>> 
>>> Hello,
>>> I use the PDFBox 1.8.7 PDDocument.load(InputStream is) method to parse
>>> the attached PDF document.
>>> 
>>> The method returns without an exception, but the document model is incomplete.
>>> 
>>> The problem is the characters after EOF (offset 22939):
>>> startxref
>>> 22449
>>> %%EOF
>>> @
>>> 16 0 obj
>>> <<
>>> /Type /Catalog
>>> 
>>> PDFBox throws an internal IOException and ignores it, with this comment:
>>>              /*
>>>               * PDF files may have random data after the EOF marker. Ignore errors if
>>>               * last object processed is EOF.
>>>               */
>>> 
>>> Is this PDF construction valid?
>>> Which parser in PDFBox is correct? I tried ConformingPDParser, but
>>> another error occurred.
>>> 
>>> 
>>> Jan


RE: problem with pdf eof

Posted by Vomlel Jan <Ja...@aipsafe.cz>.
Hi Maruan,

I created a patch for bug PDFBOX-2436.

After %%EOF, it skips any data up to the next object.

I don't know whether such data is allowed by the specification, but a Czech portal produces such files and Acrobat has no problem with them.

I changed org.apache.pdfbox.pdfparser.PDFParser near line 584 on the 1.8 branch. Can you commit it and fix this bug?


                            pdfSource.unread(eof.getBytes("ISO-8859-1"));
                        }
                    }
                }
                isEndOfFile = true;

                //PDFBOX-2436 - some files contain binary data after %%EOF.
                skipToNextObj();
            }
        }
        //we are going to parse a normal object
        else

Thank you, Jan
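
The idea behind the patch can be illustrated with a small, self-contained sketch. This is not the PDFParser code: the class name EofSkipSketch and the method findNextObjectAfterEof are hypothetical, and the sketch scans a byte array with a regular expression, whereas the parser in the patch reads from pdfSource. It only shows the principle of ignoring junk after %%EOF and resuming at the next "N G obj" header.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EofSkipSketch {
    // An indirect object header: object number, generation number, then "obj".
    private static final Pattern OBJ_HEADER = Pattern.compile("\\d+\\s+\\d+\\s+obj\\b");

    /**
     * Returns the offset of the first object header found after the first
     * %%EOF marker, or -1 if there is no %%EOF or no object header after it.
     */
    static int findNextObjectAfterEof(byte[] pdf) {
        // ISO-8859-1 maps every byte to exactly one char, so string
        // indexes are the same as byte offsets in the file.
        String text = new String(pdf, StandardCharsets.ISO_8859_1);
        int eof = text.indexOf("%%EOF");
        if (eof < 0) {
            return -1;
        }
        Matcher m = OBJ_HEADER.matcher(text);
        // Start searching right behind the marker; anything in between is skipped.
        return m.find(eof + "%%EOF".length()) ? m.start() : -1;
    }

    public static void main(String[] args) {
        // The tail of the file quoted in the original report.
        String tail = "startxref\n22449\n%%EOF\n@\n16 0 obj\n<<\n/Type /Catalog\n";
        int offset = findNextObjectAfterEof(tail.getBytes(StandardCharsets.ISO_8859_1));
        System.out.println("next object header at offset " + offset);
    }
}
```

A regular expression keeps the sketch short; a production parser would more likely scan the stream byte by byte so that large files need not be held in memory.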

-----Original Message-----
From: Vomlel Jan
Sent: Friday, October 17, 2014 9:12 AM
To: users@pdfbox.apache.org; brzrkr@pobox.com
Subject: RE: problem with pdf eof

I reported the parsing error in the load function:
https://issues.apache.org/jira/browse/PDFBOX-2436
Jan

-----Original Message-----
From: Maruan Sahyoun [mailto:sahyoun@fileaffairs.de]
Sent: Thursday, October 16, 2014 8:23 PM
To: users@pdfbox.apache.org; brzrkr@pobox.com
Subject: Re: problem with pdf eof

Sorry if that has been unclear. As of now, if you'd like to sign, you have to use load(); loadNonSeq() is not an option!

For all other cases use loadNonSeq() and if that fails load() as a fallback.

We are working on getting the missing signing support into nonSeq() but that will probably be after 2.0.

Now if you have parsing issues with load() please open an issue in Jira and attach the PDFs together with code to reproduce it. Same if you have parsing issues with loadNonSeq().

Of course if someone is willing to help getting that in … patches are welcome.

Maruan

Am 16.10.2014 um 20:13 schrieb Brzrk One <br...@gmail.com>:

> I hear dual advice here...
> - don't use NonSeq for signatures
> - but use NonSeq for multiple EOFs
> Files with both multiple EOFs and signatures will have problems...
> unless you mean we should parse 2x?
>
> On Thu, Oct 16, 2014 at 12:12 PM, Maruan Sahyoun
> <sa...@fileaffairs.de>
> wrote:
>
>> depends on the parser being used. NonSeq does follow the Xref
>> information and handles multiple EOFs (incremental updates) when parsing.
>>
>> BR
>> Maruan
>>
>> Am 16.10.2014 um 17:01 schrieb Brzrk One <br...@gmail.com>:
>>
>> I've noticed that when there are multiple EOFs in the file, PDFBox
>> parsing is less reliable.
>>
>>
>> On Thu, Oct 16, 2014 at 9:44 AM, Vomlel Jan <Ja...@aipsafe.cz> wrote:
>>
>> When I use load instead of loadNonSeq, the signatures are valid in this case.
>>
>> But for some documents the load function does not read the complete document.
>> That is why I used loadNonSeq. Some signatures are then missing.
>>
>> See:
>> http://leteckaposta.cz/831516385
>> h1.pdf - original file (signature and timestamp)
>> h2.pdf - first signature added by PDFBox (timestamp is missing)
>> h3.pdf - second signature added by PDFBox (timestamp and previous signature are missing)
>>
>> Jan
>>
>> -----Original Message-----
>> From: Maruan Sahyoun [mailto:sahyoun@fileaffairs.de]
>> Sent: Thursday, October 16, 2014 2:37 PM
>> To: users@pdfbox.apache.org
>> Subject: Re: problem with pdf eof
>>
>> when signing please make sure that you load the pdf using
>> PDDocument.load instead of PDDocument.loadNonSeq.
>>
>>
>> Am 16.10.2014 um 11:57 schrieb Vomlel Jan <Ja...@aipsafe.cz>:
>>
>>
>>
>> -----Original Message-----
>> From: Maruan Sahyoun [mailto:sahyoun@fileaffairs.de]
>> Sent: Thursday, October 16, 2014 11:55 AM
>> To: users@pdfbox.apache.org
>> Subject: Re: problem with pdf eof
>>
>> When you say invalid, do you mean it's corrupted, or that e.g. you get a
>> warning sign in Adobe Reader? Would you have a sample PDF?
>>
>> When you sign a document and sign it again, the first signature points to
>> a different document revision, as you have changed the document's
>> content afterwards. So invalid in that context could mean that the
>> warning you might be getting only reflects that fact. I would need
>> to see the document to understand what's going on.
>>
>>
>> BR
>>
>> Maruan
>>
>> Am 16.10.2014 um 11:48 schrieb Vomlel Jan <Ja...@aipsafe.cz>:
>>
>> Hi Maruan and others,
>>
>> I created a signature and it seems OK.
>> But when I create a second signature (loadNonSeq, addSignature,
>> saveIncremental again), the first signature becomes invalid.
>>
>> I think the problem may be that the first page is updated (the signature
>> is invisible), but I don't understand it well enough.
>>
>>
>> Jan
>>
>> -----Original Message-----
>> From: Maruan Sahyoun [mailto:sahyoun@fileaffairs.de]
>> Sent: Monday, October 13, 2014 4:09 PM
>> To: users@pdfbox.apache.org
>> Subject: Re: problem with pdf eof
>>
>> Hi Jan,
>>
>> there are samples in the examples package showing various ways to sign a
>> document [1]. Signing a document requires incremental saving.
>>
>> OTOH, choosing the right solution should not depend on whether
>> there is a license fee or not.
>>
>> Maruan Sahyoun
>>
>> [1] http://svn.apache.org/viewvc/pdfbox/trunk/examples/src/main/java/org/apache/pdfbox/examples/signature/
>>
>>
>>
>> Am 13.10.2014 um 16:02 schrieb Vomlel Jan <Ja...@aipsafe.cz>:
>>
>> Hi Maruan (and others),
>>
>> I would like to use PDFBox and Bouncy Castle for managing PDF
>> signatures: parsing, validation, timestamping (PAdES LTV).
>>
>> We used iText for it, but it is under a commercial licence.
>> Parsing signatures seems to be working (thanks to your advice), so I
>> will try to create a timestamp.
>>
>> Is that possible with PDFBox? I found the save method on PDDocument, but
>> I'm afraid that it can change the byte representation of the PDF, so that
>> signatures become invalid. Is that true? What is the right way to create a
>> signature or timestamp with PDFBox?
>>
>>
>> Jan
>>
>>
>> -----Original Message-----
>> From: Maruan Sahyoun [mailto:sahyoun@fileaffairs.de]
>> Sent: Friday, October 10, 2014 10:44 AM
>> To: users@pdfbox.apache.org
>> Subject: Re: problem with pdf eof
>>
>> Hi Jan,
>>
>> choosing the right technology is very important, so I do understand
>> your concerns. I had to make such a decision about using PDFBox in the
>> past too.
>>
>> It can
>> If you have specific issues I can answer, I'm happy to try to do so.
>> As a general statement, PDFBox is used in production environments today
>> (as an example, we ourselves are using it for a banking customer to
>> process account statements, for an airline company to preprocess
>> archiving documents, and for various other customers).
>>
>> PDFBox is continuously enhancing the parsing as we try to deal with
>> real-world PDF files which are not always in line with the PDF
>> specification. Currently the best approach is to use
>> PDDocument.loadNonSeq (which parses documents according to the Xref
>> information) and, in case of an exception, PDDocument.load (which
>> parses sequentially). The Apache Tika project, which uses PDFBox for
>> parsing PDFs, is running the parsing and text extraction against 50k
>> PDFs made available via http://digitalcorpora.org
>>
>> What is the application you would like to be using PDFBox for? Text
>> extraction, image conversion, ...? I might be able to give you more
>> specific information for your use case.
>>
>>
>> BR
>>
>> Maruan
>>
>> Am 10.10.2014 um 10:10 schrieb Vomlel Jan <Ja...@aipsafe.cz>:
>>
>> Thank you Maruan, this function loads the document.
>>
>> I have read https://pdfbox.apache.org/ideas.html "Replace/Enhance
>> PDF parsing". I think correct parsing is very important, and I have
>> some doubts whether I can use PDFBox in production. Can you say something
>> to reassure me :-)?
>>
>>
>> Jan
>>
>> -----Original Message-----
>> From: Maruan Sahyoun [mailto:sahyoun@fileaffairs.de]
>> Sent: Friday, October 10, 2014 9:25 AM
>> To: users@pdfbox.apache.or
>> Subject: Re: problem with pdf eof
>>
>> Hi
>>
>> you can try PDDocument.loadNonSeq(InputStream is, null)
>>
>> BR
>>
>> Maruan
>>
>> Am 10.10.2014 um 09:09 schrieb Vomlel Jan <Ja...@aipsafe.cz>:
>>
>> Hello,
>> I use the PDFBox 1.8.7 PDDocument.load(InputStream is) method to parse
>> the attached PDF document.
>>
>> The method returns without an exception, but the document model is incomplete.
>>
>> The problem is the characters after EOF (offset 22939):
>> startxref
>> 22449
>> %%EOF
>> @
>> 16 0 obj
>> <<
>> /Type /Catalog
>>
>> PDFBox throws an internal IOException and ignores it, with this comment:
>>               /*
>>                * PDF files may have random data after the EOF marker. Ignore errors if
>>                * last object processed is EOF.
>>                */
>>
>> Is this PDF construction valid?
>> Which parser in PDFBox is correct? I tried ConformingPDParser, but
>> another error occurred.
>>
>>
>> Jan

RE: problem with pdf eof

Posted by Vomlel Jan <Ja...@aipsafe.cz>.
I reported the parsing error in the load function:
https://issues.apache.org/jira/browse/PDFBOX-2436
Jan


Re: problem with pdf eof

Posted by Maruan Sahyoun <sa...@fileaffairs.de>.
Sorry if that has been unclear. As of now, if you'd like to sign, you have to use load(); loadNonSeq() is not an option!

For all other cases use loadNonSeq() and if that fails load() as a fallback.

We are working on getting the missing signing support into nonSeq() but that will probably be after 2.0.

Now if you have parsing issues with load() please open an issue in Jira and attach the PDFs together with code to reproduce it. Same if you have parsing issues with loadNonSeq().

Of course if someone is willing to help getting that in … patches are welcome.

Maruan
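
The "loadNonSeq() first, load() as a fallback" advice for the non-signing case can be sketched as follows. To keep the example runnable without PDFBox on the classpath, the two parsers are represented by a hypothetical Loader interface; in a real PDFBox 1.8 application the primary loader would presumably wrap PDDocument.loadNonSeq and the fallback PDDocument.load. All names in the sketch (FallbackLoadSketch, Loader, loadWithFallback) are assumptions made for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;

final class FallbackLoadSketch {
    /** Hypothetical stand-in for a parser entry point that may fail with IOException. */
    interface Loader<T> {
        T load(byte[] data) throws IOException;
    }

    /**
     * Tries the primary loader first. If it throws IOException, the fallback
     * is tried on the same bytes. If both fail, the fallback's exception is
     * thrown with the primary failure attached as a suppressed exception.
     */
    static <T> T loadWithFallback(byte[] data, Loader<T> primary, Loader<T> fallback)
            throws IOException {
        try {
            return primary.load(data);
        } catch (IOException primaryFailure) {
            try {
                return fallback.load(data);
            } catch (IOException fallbackFailure) {
                fallbackFailure.addSuppressed(primaryFailure);
                throw fallbackFailure;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "%PDF-1.4".getBytes(StandardCharsets.ISO_8859_1);
        // Simulated parsers: the xref-based one fails, the sequential one succeeds.
        Loader<String> nonSequential = bytes -> { throw new IOException("broken xref"); };
        Loader<String> sequential = bytes -> "parsed sequentially, " + bytes.length + " bytes";
        System.out.println(loadWithFallback(data, nonSequential, sequential));
    }
}
```

The sketch takes a byte array rather than an InputStream on purpose: a stream that has been partly consumed by a failed first attempt generally cannot be read again, so the input has to be something that can be parsed twice.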



Re: problem with pdf eof

Posted by Brzrk One <br...@gmail.com>.
I hear dual advice here...
- don't use NonSeq for signatures
- but use NonSeq for multiple EOFs
Files with both multiple EOFs and signatures will have problems...
unless you mean we should parse 2x?
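
For read-only use, the advice given elsewhere in this thread (try the xref-based loader first and fall back to the sequential one on an exception) could be sketched roughly as below. This is only an illustration in plain Java: FallbackLoad, Loader and loadWithFallback are hypothetical names, not PDFBox API. In PDFBox 1.8 the two loaders would presumably wrap PDDocument.loadNonSeq and PDDocument.load. It does not address signing, where the thread recommends PDDocument.load only.

```java
import java.io.IOException;

final class FallbackLoad {

    /** Hypothetical stand-in for a document loader that may fail with an IOException. */
    interface Loader<S, T> {
        T load(S source) throws IOException;
    }

    /**
     * Tries the primary loader and, only if it throws an IOException,
     * retries with the fallback loader on the same source.
     * The source must be re-readable (a File or byte[]), not a consumed stream.
     */
    static <S, T> T loadWithFallback(S source, Loader<S, T> primary, Loader<S, T> fallback)
            throws IOException {
        try {
            return primary.load(source);
        } catch (IOException primaryFailure) {
            try {
                return fallback.load(source);
            } catch (IOException fallbackFailure) {
                // Keep the first failure visible for diagnosis.
                fallbackFailure.addSuppressed(primaryFailure);
                throw fallbackFailure;
            }
        }
    }
}
```

Note that this parses the file a second time only when the first attempt throws; a loader that returns an incomplete document without an exception (the situation reported at the start of this thread) would not trigger the fallback.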

On Thu, Oct 16, 2014 at 12:12 PM, Maruan Sahyoun <sa...@fileaffairs.de>
wrote:

> That depends on the parser being used. The NonSeq parser follows the Xref
> information and handles multiple EOFs (incremental updates) when parsing.
>
> BR
> Maruan

Re: problem with pdf eof

Posted by Maruan Sahyoun <sa...@fileaffairs.de>.
That depends on the parser being used. The NonSeq parser follows the Xref information and handles multiple EOFs (incremental updates) when parsing.

BR
Maruan
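
As a rough illustration of what "multiple EOFs" means at the byte level, here is a minimal sketch in plain Java (no PDFBox calls) that counts %%EOF markers and reads the offset that follows the last startxref keyword. The names EofScan, countEofMarkers and lastStartXref are made up for this example, and the scan is naive: it would also count a %%EOF that happens to occur inside stream data.

```java
import java.nio.charset.StandardCharsets;

final class EofScan {
    private static final byte[] EOF = "%%EOF".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] STARTXREF = "startxref".getBytes(StandardCharsets.US_ASCII);

    /** Index of the first occurrence of needle at or after 'from', or -1 if absent. */
    static int indexOf(byte[] data, byte[] needle, int from) {
        outer:
        for (int i = Math.max(from, 0); i <= data.length - needle.length; i++) {
            for (int j = 0; j < needle.length; j++) {
                if (data[i + j] != needle[j]) {
                    continue outer;
                }
            }
            return i;
        }
        return -1;
    }

    /** Number of %%EOF markers in the raw bytes; more than one hints at incremental updates. */
    static int countEofMarkers(byte[] pdf) {
        int count = 0;
        int pos = indexOf(pdf, EOF, 0);
        while (pos >= 0) {
            count++;
            pos = indexOf(pdf, EOF, pos + EOF.length);
        }
        return count;
    }

    /** Decimal offset written after the last startxref keyword, or -1 if none is found. */
    static long lastStartXref(byte[] pdf) {
        int last = -1;
        int pos = indexOf(pdf, STARTXREF, 0);
        while (pos >= 0) {
            last = pos;
            pos = indexOf(pdf, STARTXREF, pos + STARTXREF.length);
        }
        if (last < 0) {
            return -1;
        }
        int i = last + STARTXREF.length;
        // Skip whitespace between the keyword and the number.
        while (i < pdf.length && (pdf[i] == ' ' || pdf[i] == '\n' || pdf[i] == '\r' || pdf[i] == '\t')) {
            i++;
        }
        long value = 0;
        boolean sawDigit = false;
        while (i < pdf.length && pdf[i] >= '0' && pdf[i] <= '9') {
            value = value * 10 + (pdf[i] - '0');
            sawDigit = true;
            i++;
        }
        return sawDigit ? value : -1;
    }
}
```

A count above 1 suggests the file carries incremental updates (or data appended after an earlier %%EOF), which is the case the xref-based parser is described as handling.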

Am 16.10.2014 um 17:01 schrieb Brzrk One <br...@gmail.com>:

> I've noticed that when there are multiple EOFs in the file, PDFBox parsing
> is less reliable.


Re: problem with pdf eof

Posted by Brzrk One <br...@gmail.com>.
I've noticed that when there are multiple EOFs in the file, PDFBox parsing
is less reliable.

On Thu, Oct 16, 2014 at 9:44 AM, Vomlel Jan <Ja...@aipsafe.cz> wrote:

> When I use load instead of loadNonSeq, the signatures are valid in this case.
>
> But for some documents the load function does not read the complete document. That
> is why I used loadNonSeq. Some signatures are then missing.
>
> See:
> http://leteckaposta.cz/831516385
> h1.pdf - original file (signature and timestamp)
> h2.pdf - first signature added by PDFBox (timestamp is missing)
> h3.pdf - second signature added by PDFBox (timestamp and previous signature
> are missing)
>
> Jan

RE: problem with pdf eof

Posted by Vomlel Jan <Ja...@aipsafe.cz>.
When I use load instead of loadNonSeq, the signatures are valid in this case.

But for some documents the load function does not read the complete document. That is why I used loadNonSeq. Some signatures are then missing.

See:
http://leteckaposta.cz/831516385
h1.pdf - original file (signature and timestamp)
h2.pdf - first signature added by PDFBox (timestamp is missing)
h3.pdf - second signature added by PDFBox (timestamp and previous signature are missing)

Jan
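
One way to check which signatures are still physically present in files such as h1.pdf, h2.pdf and h3.pdf, independently of either parser, is to scan the raw bytes for /ByteRange arrays. In a PDF signature dictionary, /ByteRange holds four integers [offset1 length1 offset2 length2], so offset2 + length2 is where the signed revision ends. The sketch below is plain Java with made-up names (ByteRangeScan, signedRevisionEnds, coversWholeFile); it is a naive text scan, not a validator, and it assumes the arrays are stored uncompressed.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ByteRangeScan {
    // Matches "/ByteRange [a b c d]" with flexible whitespace, including padding before "]".
    private static final Pattern BYTE_RANGE = Pattern.compile(
            "/ByteRange\\s*\\[\\s*(\\d{1,18})\\s+(\\d{1,18})\\s+(\\d{1,18})\\s+(\\d{1,18})\\s*\\]");

    /** For each /ByteRange found, returns the end offset (offset2 + length2) of the signed revision. */
    static List<Long> signedRevisionEnds(byte[] pdf) {
        // ISO-8859-1 maps every byte to exactly one char, so the bytes survive the conversion.
        String text = new String(pdf, StandardCharsets.ISO_8859_1);
        List<Long> ends = new ArrayList<>();
        Matcher m = BYTE_RANGE.matcher(text);
        while (m.find()) {
            long offset2 = Long.parseLong(m.group(3));
            long length2 = Long.parseLong(m.group(4));
            ends.add(offset2 + length2);
        }
        return ends;
    }

    /** True if a signed revision ending at 'end' spans the entire file. */
    static boolean coversWholeFile(long end, byte[] pdf) {
        return end == pdf.length;
    }
}
```

If a file that should contain two signatures and a timestamp yields fewer /ByteRange entries than expected, the earlier signature data was dropped when the file was written. An entry that ends before the end of the file simply covers an earlier revision, which is normal after a later incremental update.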
 
-----Original Message-----
From: Maruan Sahyoun [mailto:sahyoun@fileaffairs.de] 
Sent: Thursday, October 16, 2014 2:37 PM
To: users@pdfbox.apache.org
Subject: Re: problem with pdf eof

When signing, please make sure that you load the PDF using PDDocument.load instead of PDDocument.loadNonSeq.


Am 16.10.2014 um 11:57 schrieb Vomlel Jan <Ja...@aipsafe.cz>:

> 
> 
> -----Original Message-----
> From: Maruan Sahyoun [mailto:sahyoun@fileaffairs.de] 
> Sent: Thursday, October 16, 2014 11:55 AM
> To: users@pdfbox.apache.org
> Subject: Re: problem with pdf eof
> 
> When you say invalid, do you mean it’s corrupted, or that you get e.g. a warning sign in Adobe Reader? Would you have a sample PDF?
> 
> When you sign a document and then sign it again, the first signature points to a different document revision, as you have changed the document’s content afterwards. So invalid in that context could mean that the warning you might be getting only reflects that fact. I would need to see the document to understand what’s going on.
> 
> BR
> 
> Maruan
> 
> Am 16.10.2014 um 11:48 schrieb Vomlel Jan <Ja...@aipsafe.cz>:
> 
>> Hi Maruan and others,
>> 
>> I created a signature and it seems OK.
>> But when I create a second signature (loadNonSeq, addSignature, saveIncremental again), the first signature becomes invalid.
>> I think the problem may be that the first page is updated (the signature is invisible), but I don't understand it well enough.
>> 
>> Jan
>> 
>> -----Original Message-----
>> From: Maruan Sahyoun [mailto:sahyoun@fileaffairs.de] 
>> Sent: Monday, October 13, 2014 4:09 PM
>> To: users@pdfbox.apache.org
>> Subject: Re: problem with pdf eof
>> 
>> Hi Jan,
>> 
>> there are samples in the examples package showing various ways to sign a document [1]. Signing a document requires incremental saving.
>> 
>> OTOH, choosing the right solution should not be based on whether there is a license fee or not.
>> 
>> Maruan Sahyoun
>> 
>> [1] http://svn.apache.org/viewvc/pdfbox/trunk/examples/src/main/java/org/apache/pdfbox/examples/signature/
>> 
>> 
>> Am 13.10.2014 um 16:02 schrieb Vomlel Jan <Ja...@aipsafe.cz>:
>> 
>>> Hi Maruan (and others),
>>> 
>>> I would like to use PDFBox and Bouncy Castle for managing PDF signatures: parsing, validation, timestamping (PAdES LTV).
>>> We used iText for it, but it is under a commercial licence.
>>> Parsing signatures seems to be working (thanks to your advice). So I will try to create a timestamp.
>>> Is it possible with PDFBox? I found the save method on PDDocument, but I'm afraid that it can change the byte representation of the PDF, so that signatures become invalid. Is that true? What is the right way to create a signature or timestamp with PDFBox?
>>> 
>>> Jan
>>> 
>>> 
>>> -----Original Message-----
>>> From: Maruan Sahyoun [mailto:sahyoun@fileaffairs.de] 
>>> Sent: Friday, October 10, 2014 10:44 AM
>>> To: users@pdfbox.apache.org
>>> Subject: Re: problem with pdf eof
>>> 
>>> Hi Jan,
>>> 
>>> choosing the right technology is very important, so I do understand your concerns. I had to make such a decision about using PDFBox in the past too.
>>> It can 
>>> If you have specific issues I can answer, I’m happy to try to do so. As a general statement, PDFBox is used in production environments today (as an example, we ourselves are using it for a banking customer to process account statements, for an airline company to preprocess archiving documents, and for various other customers).
>>> 
>>> PDFBox is continuously enhancing its parsing as we try to deal with real-world PDF files, which are not always in line with the PDF specification. Currently the best approach is to use PDDocument.loadNonSeq (which parses documents according to the Xref information) and, in case of an exception, PDDocument.load (which parses sequentially). The Apache Tika project, which uses PDFBox for parsing PDFs, is running the parsing and text extraction against 50k PDFs made available via http://digitalcorpora.org
>>> 
>>> What is the application you would like to be using PDFBox for? Text Extraction, image conversion …. - I might be able to give you more specific information for your use case.
>>> 
>>> BR
>>> 
>>> Maruan
>>> 
>>> Am 10.10.2014 um 10:10 schrieb Vomlel Jan <Ja...@aipsafe.cz>:
>>> 
>>>> Thank you Maruan, this function loads document.
>>>> 
>>>> I have read https://pdfbox.apache.org/ideas.html "Replace/Enhance PDF parsing". I think correct parsing is very important, and I have some doubts about whether I can use PDFBox in production. Can you say something to reassure me :-)?
>>>> 
>>>> Jan
>>>> 
>>>> -----Original Message-----
>>>> From: Maruan Sahyoun [mailto:sahyoun@fileaffairs.de] 
>>>> Sent: Friday, October 10, 2014 9:25 AM
>>>> To: users@pdfbox.apache.or
>>>> Subject: Re: problem with pdf eof
>>>> 
>>>> Hi 
>>>> 
>>>> you can try PDDocument.loadNonSeq(InputStream is, null) 
>>>> 
>>>> BR
>>>> 
>>>> Maruan
>>>> 
>>>> Am 10.10.2014 um 09:09 schrieb Vomlel Jan <Ja...@aipsafe.cz>:
>>>> 
>>>>> Hello,
>>>>> I use the PDFBox 1.8.7 PDDocument.load(InputStream is) method to parse the attached PDF document.
>>>>> The method returns without an exception, but the document model is incomplete.
>>>>> 
>>>>> The problem is in the characters after EOF (offset 22939):
>>>>> startxref
>>>>> 22449
>>>>> %%EOF
>>>>> @
>>>>> 16 0 obj
>>>>> << 
>>>>> /Type /Catalog
>>>>> 
>>>>> PDFBox creates an internal IOException and ignores it with this comment:
>>>>>                 /*
>>>>>                  * PDF files may have random data after the EOF marker. Ignore errors if
>>>>>                  * last object processed is EOF.
>>>>>                  */
>>>>> 
>>>>> Is this PDF construction valid?
>>>>> Which parser in PDFBox is correct? I tried ConformingPDParser, but another error occurred.
>>>>> 
>>>>> Jan
>>>> 
>>> 
>> 
> 


Re: problem with pdf eof

Posted by Maruan Sahyoun <sa...@fileaffairs.de>.
When signing, please make sure that you load the PDF using PDDocument.load instead of PDDocument.loadNonSeq.




RE: problem with pdf eof

Posted by Vomlel Jan <Ja...@aipsafe.cz>.



Re: problem with pdf eof

Posted by Maruan Sahyoun <sa...@fileaffairs.de>.
When you say invalid, do you mean that it is corrupted, or that you get, e.g., a warning sign in Adobe Reader? Do you have a sample PDF?

When you sign a document and then sign it again, the first signature points to a different document revision, because you have changed the document's content afterwards. So "invalid" in that context could mean that the warning you might be getting only reflects that fact. I would need to see the document to understand what's going on.

BR
 
Maruan
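
The point about document revisions can be illustrated without any PDF library. The following is only a sketch under simplifying assumptions: a real PDF signature digests a /ByteRange that skips the signature's own /Contents value, whereas here the "signed" range is simply the whole first revision. The class and method names (IncrementalUpdateDemo, appendRevision, firstRevisionIntact) are made up for this illustration and are not PDFBox API. It shows that appending a second revision leaves the bytes covered by the first digest untouched, while rewriting the file does not.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

/** Illustration only: why an append-only (incremental) save keeps earlier bytes verifiable. */
class IncrementalUpdateDemo {

    /** SHA-256 over data[offset .. offset+length). */
    static byte[] sha256(byte[] data, int offset, int length) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            md.update(data, offset, length);
            return md.digest();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 should be available on every Java platform", e);
        }
    }

    /** Models an incremental save: the original bytes are kept and the update is appended. */
    static byte[] appendRevision(byte[] original, byte[] update) {
        byte[] out = Arrays.copyOf(original, original.length + update.length);
        System.arraycopy(update, 0, out, original.length, update.length);
        return out;
    }

    /** True if the current file still starts with exactly the bytes of the signed revision. */
    static boolean firstRevisionIntact(byte[] signedRevision, byte[] current) {
        if (current.length < signedRevision.length) {
            return false;
        }
        byte[] expected = sha256(signedRevision, 0, signedRevision.length);
        byte[] actual = sha256(current, 0, signedRevision.length);
        return MessageDigest.isEqual(expected, actual);
    }

    public static void main(String[] args) {
        byte[] rev1 = "%PDF-1.4\n1 0 obj\n<<>>\nendobj\n%%EOF\n".getBytes(StandardCharsets.US_ASCII);
        byte[] rev2 = appendRevision(rev1, "2 0 obj\n<<>>\nendobj\n%%EOF\n".getBytes(StandardCharsets.US_ASCII));
        System.out.println("after append:  " + firstRevisionIntact(rev1, rev2));

        byte[] rewritten = rev2.clone();
        rewritten[9] = 'X'; // models a full save that changes bytes inside the first revision
        System.out.println("after rewrite: " + firstRevisionIntact(rev1, rewritten));
    }
}
```

A viewer may still show a notice for the first signature after a second incremental update, because the file now contains content added after that signature was made, even though the signed bytes themselves are unchanged.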



RE: problem with pdf eof

Posted by Vomlel Jan <Ja...@aipsafe.cz>.
Hi Maruan and others,

I created a signature and it seems OK.
But when I create a second signature (loadNonSeq, addSignature, saveIncremental again), the first signature becomes invalid.
I think the problem may be that the first page is updated (the signature is invisible), but I don't understand it well enough.

Jan



Re: problem with pdf eof

Posted by Maruan Sahyoun <sa...@fileaffairs.de>.
Hi Jan,

There are samples in the examples package for various ways to sign a document [1]. Signing a document requires incremental saving.

OTOH, the choice of the right solution should not be based on whether there is a license fee or not.

Maruan Sahyoun

[1] http://svn.apache.org/viewvc/pdfbox/trunk/examples/src/main/java/org/apache/pdfbox/examples/signature/




RE: problem with pdf eof

Posted by Vomlel Jan <Ja...@aipsafe.cz>.
Hi Maruan (and others),

I would like to use PDFBox and Bouncy Castle for managing PDF signatures: parsing, validation, and timestamping (PAdES LTV).
We used iText for it, but it is under a commercial licence.
Parsing signatures seems to be working (thanks to your advice), so I will try to create a timestamp.
Is that possible with PDFBox? I found the save method on PDDocument, but I'm afraid that it can change the byte representation of the PDF, so that the signatures become invalid. Is that true? What is the right way to create a signature or timestamp with PDFBox?

Jan


-----Original Message-----
From: Maruan Sahyoun [mailto:sahyoun@fileaffairs.de] 
Sent: Friday, October 10, 2014 10:44 AM
To: users@pdfbox.apache.org
Subject: Re: problem with pdf eof

Hi Jan,

Choosing the right technology is very important, so I do understand your concerns. I had to make such a decision about using PDFBox in the past too.
 It can 
If you have specific issues I can answer, I'm happy to try to do so. As a general statement, PDFBox is used in production environments today (as an example, we ourselves are using it for a banking customer to process account statements, for an airline company to preprocess archiving documents, and for various other customers).

PDFBox is continuously enhancing its parsing as we try to deal with real-world PDF files, which are not always in line with the PDF specification. Currently the best approach is to use PDDocument.loadNonSeq (which parses documents according to the xref information) and, in case of an exception, PDDocument.load (which parses sequentially). The Apache Tika project, which uses PDFBox for parsing PDFs, is running the parsing and text extraction against 50k PDFs made available via http://digitalcorpora.org

What is the application you would like to use PDFBox for? Text extraction, image conversion, ...? I might be able to give you more specific information for your use case.

BR

Maruan

Am 10.10.2014 um 10:10 schrieb Vomlel Jan <Ja...@aipsafe.cz>:

> Thank you Maruan, this function loads document.
> 
> I have read https://pdfbox.apache.org/ideas.html "Replace/Enhance PDF parsing". I think correct parsing is very important, and I have some doubts, if I can use pdfbox in production. Can you say something to rest me :-).
> 
> Jan
> 
> -----Original Message-----
> From: Maruan Sahyoun [mailto:sahyoun@fileaffairs.de] 
> Sent: Friday, October 10, 2014 9:25 AM
> To: users@pdfbox.apache.or
> Subject: Re: problem with pdf eof
> 
> Hi 
> 
> you can try PDDocument.loadNonSeq(InputStream is, null) 
> 
> BR
> 
> Maruan
> 
> Am 10.10.2014 um 09:09 schrieb Vomlel Jan <Ja...@aipsafe.cz>:
> 
>> Hello,
>> I use PDFBox 1.8.7  PDDocument.load(InputStream is) method to parse PDF document in attachement.
>> Method return without exception, but document model is incomplete.
>> 
>> Problem is in characters after EOF (ofset 22939):
>> startxref
>> 22449
>> %%EOF
>> @
>> 16 0 obj
>> << 
>> /Type /Catalog
>> 
>> PDFBox create internal IOException and ignore it with comment:
>>                    /*
>>                     * PDF files may have random data after the EOF marker. Ignore errors if
>>                     * last object processed is EOF.
>>                     */
>> 
>> Is this PDF construction valid?
>> Which parser in PDFBox is correct? I tried ConformingPDParser, but another error occured.
>> 
>> Jan
>> 
>> 
>> 
>> 


Re: problem with pdf eof

Posted by Peter Murray-Rust <pm...@cam.ac.uk>.
Thanks Marc, we'll take this offlist...
... copying my colleagues

On Fri, Oct 10, 2014 at 3:52 PM, Marc Davis <ma...@gmail.com> wrote:

> Peter, being an Organofluorine Chemist, this is precisely what we are
> seeking - being able to extract PDFs that contain organic structures along
> with text and tables,


We are actively developing this. See
https://bitbucket.org/petermr/xhtml2stm/wiki/Home (ChemVisitor) where we
are extracting molecules from documents where the PDF contains Paths, not
Pixels. Paths result when the author (human or program) has used a vector
drawing tool (ChemDraw does this and so do most others).

The publisher MAY carry the vector information through to the PDF (BioMedCentral
and NaturePublishingGroup do this) while other publishers (e.g. Am.
Chem. Soc.) translate these to pixel images. The first is easier but we have
made progress with interpreting the second. The result is
ChemicalMarkupLanguage, which is XML.

> we need to extract this data and transfer it into readable Word docs.


machine-readable or human-readable or both?


>   I guess in this case, since it’s XML, the .docx format is a lot easier
> to create.
>

Not really. Modern Word and word-like files use XML as the basis. If they
are well created they can contain semantic spectra, etc. But first we have
to extract those.


> Thanks,
> Marc
>
>
The main problem is actually sociopoliticolegal.  Until this year it has
not been clear whether it's legal to extract factual material from
copyrighted documents. Now, in the UK, it IS - assuming it's used for
non-commercial research. So we are starting to do this on a - hopefully -
massive scale and generating a whole new research area - knowledge-driven
research.

The value of this for this list is that it validates all the hard work done by
list members in writing PDFBox. Because the process is now legal in the UK
there is more incentive to develop and publish downstream analytic tools
and that's what we are doing (Apache2-Open, of course).



-- 
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069

Re: problem with pdf eof

Posted by Marc Davis <ma...@gmail.com>.
Peter, as an organofluorine chemist, this is precisely what we are seeking: being able to extract from PDFs that contain organic structures along with text and tables. We need to extract this data and transfer it into readable Word docs.  I guess in this case, since it’s XML, the .docx format is a lot easier to create.

Thanks,
Marc





Re: problem with pdf eof

Posted by Peter Murray-Rust <pm...@cam.ac.uk>.
On Fri, Oct 10, 2014 at 2:33 PM, Maruan Sahyoun <sa...@fileaffairs.de>
wrote:

> Hi Marc,
>
> text and image extraction is one of the regular use cases. Keeping the
> formatting is also possible but there is a different concept behind the PDF
> format and text processing. E.g. what is a paragraph within a text
> processor might be individually placed characters (glyphs) within a PDF
> file. You might want to look into PDFStreamEngine and its subclasses to see how
> to process graphics and text information of a PDF.
>
> Another sample is PDF2SVG which uses PDFBox [
> https://bitbucket.org/petermr/pdf2svg/wiki/Home]
>

Thanks for the link. See also http://www.contentmine.org

The PDF2SVG project is active and the first part of a pipeline which
includes:

PDF -> (SVG, PNG) -> (SVG, XHTML, PNG) -> (SVG, XHTML, SVG) (where bitmaps
have been converted to SVG) -> (Shapes, Text) -> Semantic Documents ->
Science

We are now able to take (most) PDFs and extract primitives which are
heuristically combined to create Characters and Paths, which are combined
to Shapes and Text. This is structured into XHTML, along with
sub/superscripts and styling (italics). In favourable cases we can extract
semantic science (currently evolutionary trees from pixel diagrams in PDFs,
and chemical reactions also from pixels in PDFs).


We have to do a significant amount of OCR because (a) diagrams have
characters in pixels and (b) scientific publishers use the worst-ever
non-compliant Fonts in their PDFs. This means we have to guess the
character / codePoint from the outline glyph or pixel map.

Some of this is good beta, some is raw alpha. We'd be delighted if anyone
is interested in hacking pixels or glyph outlines in PDFs - it's painful
but you get a warm glow of having helped the human race. Same goes for
tables and document structuring...

BR

P




-- 
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069

Re: problem with pdf eof

Posted by Maruan Sahyoun <sa...@fileaffairs.de>.
Hi Marc,

Text and image extraction is one of the regular use cases. Keeping the formatting is also possible, but the PDF format rests on a different concept than text processing. E.g. what is a paragraph within a text processor might be individually placed characters (glyphs) within a PDF file. You might want to look into PDFStreamEngine and its subclasses to see how to process the graphics and text information of a PDF.
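
To make the "individually placed glyphs" point concrete, here is a toy sketch in plain Java with no PDFBox dependency. The Glyph type, the toLines helper and the tolerance parameter are hypothetical names invented for this illustration. The sketch only groups glyphs into lines by their vertical position and orders them left to right. It ignores word spacing, fonts and rotation, and it is not how PDFBox itself extracts text.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class GlyphLines {

    /** One positioned character as a PDF page might place it (hypothetical type). */
    static final class Glyph {
        final char ch;
        final double x;
        final double y;

        Glyph(char ch, double x, double y) {
            this.ch = ch;
            this.x = x;
            this.y = y;
        }
    }

    /**
     * Groups glyphs into lines: glyphs whose y lies within yTolerance of the
     * first glyph of the current line belong to that line. Lines are returned
     * top to bottom (assuming y grows upwards), each ordered left to right.
     */
    static List<String> toLines(List<Glyph> glyphs, double yTolerance) {
        List<Glyph> sorted = new ArrayList<>(glyphs);
        // Highest y first, so lines come out top to bottom.
        sorted.sort(Comparator.comparingDouble((Glyph g) -> g.y).reversed());

        List<List<Glyph>> lines = new ArrayList<>();
        double lineY = 0.0;
        for (Glyph g : sorted) {
            if (lines.isEmpty() || Math.abs(g.y - lineY) > yTolerance) {
                lines.add(new ArrayList<Glyph>());
                lineY = g.y;
            }
            lines.get(lines.size() - 1).add(g);
        }

        List<String> result = new ArrayList<>();
        for (List<Glyph> line : lines) {
            // Within one line, reading order is left to right.
            line.sort(Comparator.comparingDouble((Glyph g) -> g.x));
            StringBuilder text = new StringBuilder();
            for (Glyph g : line) {
                text.append(g.ch);
            }
            result.add(text.toString());
        }
        return result;
    }
}
```

Even this simple grouping needs a tolerance, because glyphs on one visual line rarely share an identical baseline. That is part of why retaining formatting is harder than extracting raw text.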

Another sample is PDF2SVG which uses PDFBox [https://bitbucket.org/petermr/pdf2svg/wiki/Home]

BR

Maruan



Re: problem with pdf eof

Posted by Marc Davis <ma...@gmail.com>.
Maruan,

We’ve been thinking of using PDFBox as a PDF to Doc/x converter - is this tool ready for prime time, given that the MS formats are such a pain to work with?  I would appreciate your thoughts.

Essentially, our objective is to extract text and image while retaining some basic formatting. I think the challenge is in the latter.

Thanks,
Marc





Re: problem with pdf eof

Posted by Maruan Sahyoun <sa...@fileaffairs.de>.
Hi Jan,

Choosing the right technology is very important, so I do understand your concerns. I had to make such a decision about using PDFBox in the past too.

If you have specific issues I can answer, I’m happy to try to do so. As a general statement, PDFBox is used in production environments today (as an example, we ourselves are using it for a banking customer to process account statements, for an airline company to preprocess archiving documents, and for various other customers).

PDFBox is continuously enhancing its parsing as we try to deal with real-world PDF files, which are not always in line with the PDF specification. Currently the best approach is to use PDDocument.loadNonSeq (which parses documents according to the xref information) and, in case of an exception, PDDocument.load (which parses sequentially). The Apache Tika project, which uses PDFBox for parsing PDFs, is running the parsing and text extraction against 50k PDFs made available via http://digitalcorpora.org
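
The "try the xref-based parser, then fall back to the sequential one" approach could be wrapped roughly as below. This is a minimal sketch in plain Java. The Loader interface and the loadWithFallback helper are hypothetical names, and the two lambdas in main are stand-ins for the real parsers. With PDFBox 1.8.x they would presumably wrap PDDocument.loadNonSeq(new ByteArrayInputStream(bytes), null) and PDDocument.load(new ByteArrayInputStream(bytes)).

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;

final class FallbackLoader {

    /** A parsing step that may fail with an IOException (hypothetical helper type). */
    interface Loader<T> {
        T load(byte[] pdfBytes) throws IOException;
    }

    /**
     * Runs the primary loader and, only if it throws an IOException, the
     * fallback loader on the same bytes. If both fail, the fallback's
     * exception is thrown with the primary failure attached as suppressed.
     */
    static <T> T loadWithFallback(byte[] pdfBytes, Loader<T> primary, Loader<T> fallback)
            throws IOException {
        try {
            return primary.load(pdfBytes);
        } catch (IOException primaryFailure) {
            try {
                return fallback.load(pdfBytes);
            } catch (IOException fallbackFailure) {
                fallbackFailure.addSuppressed(primaryFailure);
                throw fallbackFailure;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] bytes = "%PDF-1.4 ...".getBytes(StandardCharsets.US_ASCII);
        // Stand-ins only: the first simulates a failing xref-based parse,
        // the second a successful sequential parse.
        Loader<String> xrefBased = b -> { throw new IOException("xref table unusable"); };
        Loader<String> sequential = b -> "parsed sequentially, " + b.length + " bytes";
        System.out.println(loadWithFallback(bytes, xrefBased, sequential));
    }
}
```

The helper takes a byte array rather than an InputStream because a stream that was partly consumed by the failed first attempt cannot simply be handed to the second parser.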

What is the application you would like to be using PDFBox for? Text Extraction, image conversion …. - I might be able to give you more specific information for your use case.

BR

Maruan



RE: problem with pdf eof

Posted by Vomlel Jan <Ja...@aipsafe.cz>.
Thank you Maruan, this function loads the document.

I have read https://pdfbox.apache.org/ideas.html "Replace/Enhance PDF parsing". I think correct parsing is very important, and I have some doubts about whether I can use PDFBox in production. Can you say something to reassure me? :-)

Jan
 


Re: problem with pdf eof

Posted by Maruan Sahyoun <sa...@fileaffairs.de>.
Hi,

You can try PDDocument.loadNonSeq(InputStream is, null).

BR

Maruan

On 10.10.2014 at 09:09, Vomlel Jan <Ja...@aipsafe.cz> wrote:

> Hello,
> I use the PDFBox 1.8.7 PDDocument.load(InputStream is) method to parse the PDF document in the attachment.
> The method returns without an exception, but the document model is incomplete.
>  
> The problem is in the characters after EOF (offset 22939):
> startxref
> 22449
> %%EOF
> @
> 16 0 obj
> << 
> /Type /Catalog
>  
> PDFBox creates an internal IOException and ignores it, with this comment:
>                     /*
>                      * PDF files may have random data after the EOF marker. Ignore errors if
>                      * last object processed is EOF.
>                      */
>  
> Is this PDF construction valid?
> Which parser in PDFBox is correct? I tried ConformingPDFParser, but another error occurred.
>  
> Jan
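
The situation in the quoted report (object data following the %%EOF marker) can be detected before a file is handed to any parser. The following is a minimal sketch in plain Java, not PDFBox code. The class and method names are invented for this illustration, and it only looks at the last %%EOF in the byte array, so it says nothing about whether such a file is valid.

```java
import java.nio.charset.StandardCharsets;

final class EofTrailerCheck {

    private static final byte[] EOF_MARKER = "%%EOF".getBytes(StandardCharsets.US_ASCII);

    /** Returns the byte offset of the last "%%EOF" marker, or -1 if there is none. */
    static int lastEofOffset(byte[] pdf) {
        for (int i = pdf.length - EOF_MARKER.length; i >= 0; i--) {
            boolean match = true;
            for (int j = 0; j < EOF_MARKER.length; j++) {
                if (pdf[i + j] != EOF_MARKER[j]) {
                    match = false;
                    break;
                }
            }
            if (match) {
                return i;
            }
        }
        return -1;
    }

    /**
     * Counts the non-whitespace bytes after the last "%%EOF" marker.
     * Returns -1 if the data contains no marker at all.
     */
    static int trailingDataLength(byte[] pdf) {
        int eof = lastEofOffset(pdf);
        if (eof < 0) {
            return -1;
        }
        int count = 0;
        for (int i = eof + EOF_MARKER.length; i < pdf.length; i++) {
            byte b = pdf[i];
            boolean whitespace = b == ' ' || b == '\r' || b == '\n'
                    || b == '\t' || b == '\f' || b == 0;
            if (!whitespace) {
                count++;
            }
        }
        return count;
    }
}
```

A result greater than zero means there is content after the final marker, as in the excerpt above, and the file may deserve a closer look or a different loading strategy.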
>  
>  
> 
> 