Posted to dev@pdfbox.apache.org by Maruan Sahyoun <sa...@fileaffairs.de> on 2013/05/15 14:56:51 UTC
[DISCUSS] PDF conformance
Hi,
Currently PDFBox has a number of workarounds "hidden" in the code for real-world PDFs (e.g. PDFBOX-1172) which are not in line with the spec. There are several options to deal with that,
e.g.
a) keep the workarounds in the core code
b) throw an exception and stop working
c) handle it through a pluggable extension
WDYT?
Maruan Sahyoun
Re: [DISCUSS] PDF conformance
Posted by Maruan Sahyoun <sa...@fileaffairs.de>.
On 21.05.2013 at 08:00, Andreas Lehmkuehler <an...@lehmi.de> wrote:
> Hi,
>
> On 15.05.2013 14:56, Maruan Sahyoun wrote:
>> Hi,
>>
>> Currently PDFBox has a number of workarounds "hidden" in the code for real-world PDFs (e.g. PDFBOX-1172) which are not in line with the spec. There are several options to deal with that,
>>
>> e.g.
>> a) keep the workarounds in the core code
> IMO we can't drop them. Whenever a parsing issue arises, people often
> argue that all PDF readers but PDFBox are able to handle the PDF in
> question. So people expect that a PDF reader works in any situation,
> whether the PDF follows the spec or not. That's sad but that's life :-(
I agree that as long as Adobe Reader or e.g. Firefox (pdf.js) can handle a PDF, we should be able to handle it too.
>
>> b) throw an exception and stop working
> We should add some (special) logging, so that one can detect such glitches.
>
OK
>> c) handle it through a pluggable extension
> I'm not sure if there is one solution for every use case. Sometimes it's just a
> question of the used format (e.g. PDFBOX-1172) and sometimes there are bigger
> differences.
>
It wouldn't be a solution for every use case. I was thinking of PDFs that cause parsing exceptions. Currently there is workaround code of different kinds for real-world PDFs.
Some are handled by calling specialized routines
# e.g. checkForMissingCloseParen in BaseParser
Some are handled inline
# line 483 in PDFParser for %%EOF handling
# line 548 in PDFParser for handling 'obj'
# line 733 in PDFParser for incorrect xref table entry
So the extension was meant to
a) have a clean conforming pdf parser and
b) handle these exceptions to the PDF spec in specialized routines.
Thinking about these routines, we could either implement them within the parser, similar to checkForMissingCloseParen, or register handlers for such situations.
Benefit:
# core objects/methods are clean from a conforming PDF perspective
# extensions stand out clearly
# easier to add handling of special situations
# developers could add their own special handling
Drawback:
# more complex architecture
# no single place where real-world parsing quirks are handled
# runtime performance impact
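To make the "registering handlers" idea more concrete, here is a minimal sketch in Java. It is not PDFBox code: the names NonConformanceHandler, StrictTokenParser, register and parseEofMarker are all hypothetical, and the %%EOF check is a toy stand-in for the real parsing logic. It only shows the shape: the core accepts conforming input and hands everything else to registered handlers before giving up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

/** Hypothetical extension point: repairs one kind of spec violation. */
interface NonConformanceHandler {
    /** Returns a repaired token, or empty if this handler does not apply. */
    Optional<String> tryRepair(String rawToken);
}

/** Hypothetical strict core that delegates all leniency to handlers. */
final class StrictTokenParser {
    private final List<NonConformanceHandler> handlers = new ArrayList<>();

    void register(NonConformanceHandler handler) {
        handlers.add(handler);
    }

    /** Accepts a conforming EOF marker; otherwise asks the handlers, then fails. */
    String parseEofMarker(String rawToken) {
        if ("%%EOF".equals(rawToken)) {
            return rawToken; // conforming input never touches a handler
        }
        for (NonConformanceHandler handler : handlers) {
            Optional<String> repaired = handler.tryRepair(rawToken);
            if (repaired.isPresent()) {
                return repaired.get();
            }
        }
        throw new IllegalArgumentException("Non-conforming EOF marker: " + rawToken);
    }
}

final class HandlerDemo {
    public static void main(String[] args) {
        StrictTokenParser parser = new StrictTokenParser();
        // Toy workaround: tolerate whitespace or junk around the marker.
        parser.register(raw -> raw.trim().startsWith("%%EOF")
                ? Optional.of("%%EOF")
                : Optional.<String>empty());
        System.out.println(parser.parseEofMarker("%%EOF trailing junk"));
    }
}
```

Without any handler registered, the same call throws, which corresponds to option b). The loop over the handlers runs only for non-conforming input, so the performance drawback would mostly affect broken files.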
>> WDYT?
>>
>> Maruan Sahyoun
>
> BR
> Andreas Lehmkühler
>
Re: [DISCUSS] PDF conformance
Posted by Andreas Lehmkuehler <an...@lehmi.de>.
Hi,
On 15.05.2013 14:56, Maruan Sahyoun wrote:
> Hi,
>
> Currently PDFBox has a number of workarounds "hidden" in the code for real-world PDFs (e.g. PDFBOX-1172) which are not in line with the spec. There are several options to deal with that,
>
> e.g.
> a) keep the workarounds in the core code
IMO we can't drop them. Whenever a parsing issue arises, people often
argue that all PDF readers but PDFBox are able to handle the PDF in
question. So people expect that a PDF reader works in any situation,
whether the PDF follows the spec or not. That's sad but that's life :-(
> b) throw an exception and stop working
We should add some (special) logging, so that one can detect such glitches.
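A rough sketch of what such special logging could look like, using java.util.logging. The logger name "pdf.conformance", the class ConformanceLog and the method parseOffset are assumptions made up for illustration, not existing PDFBox names, and the whitespace check is only a toy example of a workaround.

```java
import java.util.logging.Logger;

final class ConformanceLog {
    // Hypothetical dedicated logger name, so these messages can be filtered or routed separately.
    static final Logger LOG = Logger.getLogger("pdf.conformance");

    /** Parses an offset field leniently and logs whenever the input had to be repaired. */
    static long parseOffset(String field) {
        String trimmed = field.trim();
        if (!trimmed.equals(field)) {
            // The workaround still applies, but it is no longer silent.
            LOG.warning("Workaround applied: stray whitespace in offset \"" + field + "\"");
        }
        return Long.parseLong(trimmed);
    }
}
```

Anyone who wants to detect such glitches could attach a handler to that one logger or raise its level, while the parser keeps working as before.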
> c) handle it through a pluggable extension
I'm not sure if there is one solution for every use case. Sometimes it's just a
question of the used format (e.g. PDFBOX-1172) and sometimes there are bigger
differences.
> WDYT?
>
> Maruan Sahyoun
BR
Andreas Lehmkühler