Posted to dev@pdfbox.apache.org by Maruan Sahyoun <sa...@fileaffairs.de> on 2013/05/15 14:56:51 UTC

[DISCUSS] PDF conformance

Hi,

currently PDFBox has a number of workarounds "hidden" in the code for real-world PDFs (e.g. PDFBOX-1172) which are not in line with the spec. There are several options to deal with that:

e.g.
a) keep the workarounds in the core code
b) throw an exception and stop working
c) handle it through a pluggable extension

WDYT?

Maruan Sahyoun


Re: [DISCUSS] PDF conformance

Posted by Maruan Sahyoun <sa...@fileaffairs.de>.
Maruan Sahyoun

FileAffairs GmbH
Josef-Schappe-Straße 21
40882 Ratingen

Tel: +49 (2102) 89497 88
Fax: +49 (2102) 89497 91
sahyoun@fileaffairs.de
www.fileaffairs.de

Managing Director: Maruan Sahyoun
Commercial register: AG Düsseldorf, HRB 53837
VAT ID: DE248275827

On 21.05.2013 at 08:00, Andreas Lehmkuehler <an...@lehmi.de> wrote:

> Hi,
> 
> On 15.05.2013 14:56, Maruan Sahyoun wrote:
>> Hi,
>> 
>> currently PDFBox has a number of workarounds "hidden" in the code for real-world PDFs (e.g. PDFBOX-1172) which are not in line with the spec. There are several options to deal with that:
>> 
>> e.g.
>> a) keep the workarounds in the core code
> IMO we can't drop them. Whenever a parsing issue arises people often
> argue that all pdf readers but PDFBox are able to handle the pdf in
> question. So people expect that a pdf reader works in any situation
> whether the pdf follows the spec or not. That's sad but that's life :-(

I agree that as long as Adobe Reader or e.g. Firefox (pdf.js) can handle a pdf, we should be able to handle it too.

> 
>> b) throw an exception and stop working
> We should add some (special) logging, so that one can detect such glitches.
> 

OK

>> c) handle it through a pluggable extension
> I'm not sure if there is one solution for every use case. Sometimes it's just a
> question of the used format (e.g. PDFBOX-1172) and sometimes there are bigger
> differences.
> 

It wouldn't be a solution for every use case. I was thinking of PDFs that cause parsing exceptions. Currently there is workaround code of different kinds for real-world PDFs.

Some are handled by calling specialized routines 
# e.g. checkForMissingCloseParen in BaseParser

Some are handled inline
# line 483 in PDFParser for %%EOF handling
# line 548 in PDFParser for handling 'obj'
# line 733 in PDFParser for incorrect xref table entry


So the extension was meant to
a) have a clean, conforming PDF parser and
b) handle these exceptions to the PDF spec in specialized routines.

Thinking about these routines, we could implement them either within the parser, similar to checkForMissingCloseParen, or by registering handlers for such situations.
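
To make the "registering handlers" variant a bit more concrete, here is a rough sketch. It is not PDFBox code: RecoveryHandler, StrictParser, MISSING_CLOSE_PAREN, the logger name and the toy "literal string" rule are all made up for illustration. The only point is the shape: the core applies the strict rule, and every deviation is repaired (and logged) by a registered handler.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.logging.Logger;

public class LenientParsingSketch {

    /** Hypothetical extension point: repairs one kind of spec violation. */
    interface RecoveryHandler {
        /** Returns a repaired token, or empty if this handler does not apply. */
        Optional<String> recover(String malformedToken);
    }

    /** Toy handler: appends a closing paren when it is missing (illustration only). */
    static final RecoveryHandler MISSING_CLOSE_PAREN = token ->
            token.startsWith("(") && !token.endsWith(")")
                    ? Optional.of(token + ")")
                    : Optional.empty();

    /** Hypothetical strict core that delegates all leniency to registered handlers. */
    static class StrictParser {
        private static final Logger LOG = Logger.getLogger("pdf.conformance");
        private final List<RecoveryHandler> handlers = new ArrayList<>();

        void register(RecoveryHandler handler) {
            handlers.add(handler);
        }

        /** Toy stand-in for a spec rule: a literal string must look like "(...)". */
        private static boolean conforms(String token) {
            return token.length() >= 2 && token.startsWith("(") && token.endsWith(")");
        }

        String parseLiteralString(String token) {
            if (conforms(token)) {
                return token.substring(1, token.length() - 1);
            }
            // Non-conforming input: ask each handler in registration order.
            for (RecoveryHandler handler : handlers) {
                Optional<String> repaired = handler.recover(token);
                if (repaired.isPresent() && conforms(repaired.get())) {
                    // Special logging so that such glitches can be detected.
                    LOG.warning("Repaired non-conforming input: " + token);
                    String fixed = repaired.get();
                    return fixed.substring(1, fixed.length() - 1);
                }
            }
            // No handler could repair it: behave like option b).
            throw new IllegalArgumentException("Not a conforming literal string: " + token);
        }
    }

    public static void main(String[] args) {
        StrictParser parser = new StrictParser();
        parser.register(MISSING_CLOSE_PAREN);
        System.out.println(parser.parseLiteralString("(Hello"));
    }
}
```

Without any registered handler the sketch behaves like option b) and throws; with handlers it behaves like option c), and the warning covers the logging request.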

Benefit:
# core objects/methods are clean from a conforming PDF perspective
# extensions stand out clearly
# easier to add handling of special situations
# developers could add their own special handling

Drawback:
# more complex architecture
# no single place where real-world parsing quirks are handled
# runtime performance impact


>> WDYT?
>> 
>> Maruan Sahyoun
> 
> BR
> Andreas Lehmkühler
> 


Re: [DISCUSS] PDF conformance

Posted by Andreas Lehmkuehler <an...@lehmi.de>.
Hi,

On 15.05.2013 14:56, Maruan Sahyoun wrote:
> Hi,
>
> currently PDFBox has a number of workarounds "hidden" in the code for real-world PDFs (e.g. PDFBOX-1172) which are not in line with the spec. There are several options to deal with that:
>
> e.g.
> a) keep the workarounds in the core code
IMO we can't drop them. Whenever a parsing issue arises people often
argue that all pdf readers but PDFBox are able to handle the pdf in
question. So people expect that a pdf reader works in any situation
whether the pdf follows the spec or not. That's sad but that's life :-(

> b) throw an exception and stop working
We should add some (special) logging, so that one can detect such glitches.

> c) handle it through a pluggable extension
I'm not sure if there is one solution for every use case. Sometimes it's just a
question of the used format (e.g. PDFBOX-1172) and sometimes there are bigger
differences.

> WDYT?
>
> Maruan Sahyoun

BR
Andreas Lehmkühler