Posted to dev@pdfbox.apache.org by Daniel Wilson <wi...@gmail.com> on 2009/05/13 20:51:06 UTC

Requests for parser to be more forgiving

As you've doubtless seen, Sean is coming up with quite a set of invalid data
scenarios that crash the PDFBox parser.

As a matter of policy, what do you all think should be our handling of
these?  I see 3 options, though I'm open to others:

   1. Crash -- current functionality
   2. Bury the error
   3. Log and continue

I favor #3, but before I add code to do that in all the places Sean is
finding, I would like input from the other developers.
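
To make the three options concrete, here is a minimal sketch on a toy parser that expects every token to be an integer. It is not PDFBox code: the class and method names are hypothetical, and java.util.logging is assumed only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical illustration only; none of these names exist in PDFBox.
class ErrorPolicySketch {
    private static final Logger LOG =
            Logger.getLogger(ErrorPolicySketch.class.getName());

    // Option 1 (crash): the first malformed token aborts the whole parse.
    static List<Integer> parseStrict(List<String> tokens) {
        List<Integer> out = new ArrayList<>();
        for (String token : tokens) {
            out.add(Integer.parseInt(token)); // NumberFormatException propagates
        }
        return out;
    }

    // Option 2 (bury): malformed tokens vanish without a trace.
    static List<Integer> parseSilently(List<String> tokens) {
        List<Integer> out = new ArrayList<>();
        for (String token : tokens) {
            try {
                out.add(Integer.parseInt(token));
            } catch (NumberFormatException e) {
                // swallowed: nobody ever learns the input was bad
            }
        }
        return out;
    }

    // Option 3 (log and continue): skip the bad token, but record why.
    static List<Integer> parseLenient(List<String> tokens) {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            try {
                out.add(Integer.parseInt(tokens.get(i)));
            } catch (NumberFormatException e) {
                LOG.log(Level.WARNING,
                        "Skipping malformed token at index " + i + ": " + tokens.get(i), e);
            }
        }
        return out;
    }
}
```

Options 2 and 3 return the same data; the difference is that option 3 leaves a warning someone can act on.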

Thanks.

Daniel Wilson

Re: Requests for parser to be more forgiving

Posted by Andreas Lehmkühler <an...@lehmi.de>.

Daniel Wilson wrote:
> As you've doubtless seen, Sean is coming up with quite a set of invalid data
> scenarios that crash the PDFBox parser.
> 
> As a matter of policy, what do you all think should be our handling of
> these?  I see 3 options, though I'm open to others:
> 
>    1. Crash -- current functionality
>    2. Bury the error
>    3. Log and continue
> 
> I favor #3, but before I include code to move to that in all those places
> Sean is finding, would like other developer input.
I favor #3 too, but only if the cost of being more forgiving isn't too
high. I'm especially thinking of complex parser scenarios. It could be
quite difficult to just skip *every* malformed PDF.

BR
Andreas Lehmkühler


Re: Requests for parser to be more forgiving

Posted by Ken Weinert <ke...@quarter-flash.com>.
Daniel Wilson wrote:
>    1. Crash -- current functionality
>    2. Bury the error
>    3. Log and continue
>   
I'm not a major (most likely not even minor :) developer here, but I've 
worked with a different library and have struggled with a lot of 
different PDF issues over the years.

One thing to keep in mind when talking about this is that most PDF 
readers (and acroread in particular) are *extremely* forgiving about 
what they take as input and usually only error out if it is absolutely 
impossible to continue on.

This matters mainly because you'll get a lot of comments along the
lines of "well, I can display it fine, why can't PDFBox read it? PDFBox
must be broken." At least that's been my experience.

I'm inclined to handle errors leniently, but tell the user what's
actually wrong with the file.
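
One way to do both is to hand the caller a list of problems alongside whatever could be salvaged. The following is only a sketch on a toy "key=value" line format; ParseResult and every other name in it are hypothetical and not part of any PDF library.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not taken from any PDF library.
class LenientResultSketch {

    // Holds what was salvaged plus a readable note for each defect found.
    static final class ParseResult {
        final List<String> values;
        final List<String> problems;

        ParseResult(List<String> values, List<String> problems) {
            this.values = values;
            this.problems = problems;
        }

        boolean isClean() {
            return problems.isEmpty();
        }
    }

    // Toy format: every line should look like "key=value".
    static ParseResult parse(List<String> lines) {
        List<String> values = new ArrayList<>();
        List<String> problems = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            String line = lines.get(i);
            int eq = line.indexOf('=');
            if (eq <= 0) {
                // Lenient: note the defect for the user and keep going.
                problems.add("line " + (i + 1) + ": expected key=value, got \"" + line + "\"");
                continue;
            }
            values.add(line.substring(eq + 1));
        }
        return new ParseResult(values, problems);
    }
}
```

A caller can then use the values and still show the user the problems list, which answers the "why can't you read it?" question with specifics.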

Even if the file isn't broken, it can be constructed in an odd, but 
still proper, way. My favorite was a particular PDF that always crashed 
a tool we were using. It displayed fine, you couldn't see any anomalies 
in the file or anything. It turns out that someone couldn't figure out 
how to draw a horizontal line so they drew a line that was 400 pixels 
wide and 2 pixels tall.  This was outside what whoever wrote the tool 
had thought of and it didn't handle it well.

-- 
Ken Weinert