Posted to dev@pdfbox.apache.org by Daniel Wilson <wi...@gmail.com> on 2009/05/13 20:51:06 UTC
Requests for parser to be more forgiving
As you've doubtless seen, Sean is coming up with quite a set of invalid data
scenarios that crash the PDFBox parser.
As a matter of policy, what do you all think should be our handling of
these? I see 3 options, though I'm open to others:
1. Crash -- current functionality
2. Bury the error
3. Log and continue
I favor #3, but before I include code to move to that in all those places
Sean is finding, would like other developer input.
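To make the three options concrete, here is a minimal sketch of how they might differ in a parse loop. Everything in it is an assumption for illustration: `ErrorPolicyDemo`, `Policy`, `parseToken` and `parseAll` are hypothetical names, not PDFBox APIs, and "malformed" simply means a token that is not an integer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical illustration only; none of these names are PDFBox APIs.
class ErrorPolicyDemo {
    private static final Logger LOG = Logger.getLogger(ErrorPolicyDemo.class.getName());

    enum Policy { CRASH, BURY, LOG_AND_CONTINUE }

    // Stand-in for parsing one object; throws NumberFormatException on bad input.
    static int parseToken(String token) {
        return Integer.parseInt(token.trim());
    }

    static List<Integer> parseAll(List<String> tokens, Policy policy) {
        List<Integer> parsed = new ArrayList<>();
        for (String token : tokens) {
            try {
                parsed.add(parseToken(token));
            } catch (NumberFormatException e) {
                switch (policy) {
                    case CRASH:
                        throw e; // option 1: abort the whole parse
                    case BURY:
                        break;   // option 2: skip the token silently
                    case LOG_AND_CONTINUE:
                        // option 3: record what went wrong, then keep going
                        LOG.log(Level.WARNING, "Skipping malformed token: " + token, e);
                        break;
                }
            }
        }
        return parsed;
    }
}
```

With options 2 and 3 the caller gets the same partial result; the only difference is whether anyone can later find out that something was skipped.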
Thanks.
Daniel Wilson
Re: Requests for parser to be more forgiving
Posted by Andreas Lehmkühler <an...@lehmi.de>.
Daniel Wilson wrote:
> As you've doubtless seen, Sean is coming up with quite a set of invalid data
> scenarios that crash the PDFBox parser.
>
> As a matter of policy, what do you all think should be our handling of
> these? I see 3 options, though I'm open to others:
>
> 1. Crash -- current functionality
> 2. Bury the error
> 3. Log and continue
>
> I favor #3, but before I include code to move to that in all those places
> Sean is finding, would like other developer input.
I favor #3 too, but only if the cost of being more forgiving isn't too
high. I'm especially thinking of complex parser scenarios. It could be
quite difficult to just skip *every* malformed PDF.
BR
Andreas Lehmkühler
> Thanks.
>
> Daniel Wilson
>
Re: Requests for parser to be more forgiving
Posted by Ken Weinert <ke...@quarter-flash.com>.
Daniel Wilson wrote:
> 1. Crash -- current functionality
> 2. Bury the error
> 3. Log and continue
>
I'm not a major (most likely not even minor :) developer here, but I've
worked with a different library and have struggled with a lot of
different PDF issues over the years.
One thing to keep in mind when talking about this is that most PDF
readers (and acroread in particular) are *extremely* forgiving about
what they take as input and usually only error out if it is absolutely
impossible to continue on.
This matters because you'll get a lot of comments along the
lines of "well, I can display it fine, why can't PDFBox read it? PDFBox
must be broken." At least that's been my experience.
I'm inclined to handle errors leniently, but tell the user what's
actually wrong with the file.
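One way to be lenient and still tell the user what is wrong is to return the collected problems alongside the parsed result. The sketch below is only an assumption about how that could look: `LenientParseDemo`, `LenientResult` and `parseLeniently` are hypothetical names, not PDFBox APIs, and integers stand in for real PDF objects.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration only; these types are not part of PDFBox.
class LenientParseDemo {
    // Holds whatever could be parsed plus a readable list of what could not.
    static final class LenientResult {
        final List<Integer> values;
        final List<String> problems;

        LenientResult(List<Integer> values, List<String> problems) {
            this.values = Collections.unmodifiableList(values);
            this.problems = Collections.unmodifiableList(problems);
        }
    }

    static LenientResult parseLeniently(List<String> tokens) {
        List<Integer> values = new ArrayList<>();
        List<String> problems = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            String token = tokens.get(i);
            try {
                values.add(Integer.parseInt(token.trim()));
            } catch (NumberFormatException e) {
                // Keep going, but remember where and why the input was bad.
                problems.add("token " + i + ": expected an integer but found '" + token + "'");
            }
        }
        return new LenientResult(values, problems);
    }
}
```

A caller could then show `problems` to the user while still working with `values`, instead of only seeing a stack trace or nothing at all.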
Even if the file isn't broken, it can be constructed in an odd, but
still proper, way. My favorite was a particular PDF that always crashed
a tool we were using. It displayed fine, you couldn't see any anomalies
in the file or anything. It turns out that someone couldn't figure out
how to draw a horizontal line so they drew a line that was 400 pixels
wide and 2 pixels tall. This was outside what whoever wrote the tool
had thought of and it didn't handle it well.
--
Ken Weinert