Posted to dev@pdfbox.apache.org by "martijn.list" <ma...@gmail.com> on 2010/11/24 12:37:41 UTC

Making PDFBox more robust when handling invalid PDFs

PDFBox chokes on a lot of real world PDFs. The majority of these PDFs
contain data that is not completely PDF compliant. The PDFBox parser lets
you set forceParsing to make the parser 'keep on trucking' in
case of an error.
I have tested TextExtractor on a large batch of PDF ebooks and noticed
that even when forceParsing is set to true, some PDFs could not be read.

I have patched PDFBox to handle those PDFs with forceParsing enabled. Where
possible I have added an "if (forceParsing)" check so that the error is
ignored. This was not always possible, however, because the forceParsing
variable was not available everywhere.
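
The "if (forceParsing)" pattern described above can be illustrated with a toy parser. This is only a sketch of the idea and not PDFBox code: the class LenientNumberParser and its parse method are hypothetical names invented for the example, and the input is a plain list of numbers rather than real PDF syntax.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Toy parser, NOT PDFBox code: every name in this class is hypothetical.
final class LenientNumberParser {
    private final boolean forceParsing;

    LenientNumberParser(boolean forceParsing) {
        this.forceParsing = forceParsing;
    }

    // Parses whitespace-separated integers. A malformed token is fatal in
    // strict mode and skipped when forceParsing is enabled.
    List<Integer> parse(String input) throws IOException {
        List<Integer> values = new ArrayList<>();
        String trimmed = input.trim();
        if (trimmed.isEmpty()) {
            return values;
        }
        for (String token : trimmed.split("\\s+")) {
            try {
                values.add(Integer.parseInt(token));
            } catch (NumberFormatException e) {
                if (forceParsing) {
                    continue; // ignore the bad token and keep going
                }
                throw new IOException("Malformed token: " + token, e);
            }
        }
        return values;
    }
}
```

With forceParsing set to true, the input "4 39 x1 75" yields the three valid numbers. With forceParsing set to false, the same input raises an IOException at the bad token.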

Is there any interest in these patches to make PDFBox more robust
against faulty PDFs?

If so, should I create a JIRA bug with the patches attached?

To handle situations where "forceParsing" is not available, would it be
a good idea to define a system setting (for example pdfbox.forceparsing)
that can be used to set forceParsing on a global level?

Kind regards,

Martijn Brinkers


Re: Making PDFBox more robust when handling invalid PDFs

Posted by Jukka Zitting <jz...@adobe.com>.
Hi,

On 24/11/10 12:37, martijn.list wrote:
> I have patched PDFBox to handle those PDFs with forceParsing enabled. Where
> possible I have added an "if (forceParsing)" check so that the error is
> ignored. This was not always possible, however, because the forceParsing
> variable was not available everywhere.
>
> Is there any interest in these patches to make PDFBox more robust
> against faulty PDFs?

Definitely! See also the recent PDFBOX-789 issue where I made the 
forceParsing variable available to larger parts of the codebase.

> If so, should I create a JIRA bug with the patches attached?

Yes.

> To handle situations where "forceParsing" is not available, would it be
> a good idea to define a system setting (for example pdfbox.forceparsing)
> that can be used to set forceParsing on a global level?

See the org.apache.pdfbox.forceParsing system property that I added 
exactly for this purpose in revision 1022431 as a part of the PDFBOX-789 
fix. The value of that system property is used as the default value of 
the forceParsing variable when the client application doesn't explicitly 
specify it.
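
A minimal sketch of how a system property can act as the default for such a flag is shown below. The property name is the one mentioned above. The class ForceParsingDefault and its resolve method are hypothetical and do not reproduce the actual code from revision 1022431. Boolean.getBoolean is the standard Java call for reading a boolean system property.

```java
// Sketch only: class and method names are hypothetical, not PDFBox API.
final class ForceParsingDefault {
    // Property name as given in the thread.
    static final String PROPERTY = "org.apache.pdfbox.forceParsing";

    private ForceParsingDefault() {
    }

    // Returns the caller's explicit choice if there is one, otherwise the
    // default taken from the system property.
    static boolean resolve(Boolean explicit) {
        if (explicit != null) {
            return explicit;
        }
        // Boolean.getBoolean is true only when the property exists and
        // equals "true" (ignoring case); otherwise it is false.
        return Boolean.getBoolean(PROPERTY);
    }
}
```

Under this sketch, starting the JVM with -Dorg.apache.pdfbox.forceParsing=true turns the flag on for every caller that passes null, while an explicit true or false always wins.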

BR,

Jukka Zitting

Re: Making PDFBox more robust when handling invalid PDFs

Posted by Ad...@swmc.com.
I'd suggest testing problematic files with forceParsing off as well as on. 
I've seen at least one case where it crashed if forceParsing was on, but 
worked just fine when forceParsing was off.

I've dealt with a lot of PDFs from varying sources, so I've seen quite a 
few non-conforming documents.  While I'm sure they exist, I haven't seen 
any cases where PDDocument.load() would fail when forceParsing was off but 
work when forceParsing was on.  The compatibility patches I've added to 
handle corrupt PDFs will work the same either way.  The idea is that it 
should recover as best it can from invalid PDFs.  Enabling the force option 
additionally tries to recover from major errors, which can leave you with a 
PDDocument that has fundamental problems and causes more headaches later.

That is why I never use the force option.  If someone uploads something 
like calc.exe and I try to load it, I want PDDocument.load() to throw an 
exception letting me know something is seriously wrong with this "PDF". It 
wouldn't make sense to continue processing it and trying to do things like 
get the page count.  It'd be better to just deal with the fact that this 
is not a PDF in the catch block.  On the other hand, if someone uploads a 
non-conforming PDF, load() should (and in my experience, does) just 
recover from any minor deviations from the spec (typically missing 
characters, missing tags, or tags which are out of order).
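
The "deal with it in the catch block" approach can be sketched as follows. This is a stand-in and not PDFBox code: strictLoad is a hypothetical placeholder that only checks for the "%PDF-" marker at the start of the data, whereas a real PDDocument.load() call does far more. The class and method names are assumptions made for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Stand-in for a strict loader, NOT PDFBox code: all names are hypothetical.
final class UploadCheck {
    private static final byte[] HEADER =
            "%PDF-".getBytes(StandardCharsets.US_ASCII);

    private UploadCheck() {
    }

    // Throws if the stream does not start with the PDF header marker.
    static void strictLoad(InputStream in) throws IOException {
        for (byte expected : HEADER) {
            if (in.read() != expected) {
                throw new IOException("Not a PDF: missing %PDF- header");
            }
        }
    }

    // Treats a failed load as "this upload is not a PDF" instead of
    // continuing with a half-parsed document.
    static boolean isLoadable(byte[] upload) {
        try (InputStream in = new ByteArrayInputStream(upload)) {
            strictLoad(in);
            return true;
        } catch (IOException e) {
            return false; // reject the upload here
        }
    }
}
```

An executable such as calc.exe fails the check and is rejected in the catch block, so later steps like reading the page count are never attempted.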

Bring on the JIRA reports with whatever you come across.  I like patching 
the parser to better handle nonconforming PDFs :-)

---- 
Thanks,
Adam

- FHA 203b; 203k; HECM; VA; USDA; Conventional 
- Warehouse Lines; FHA-Authorized Originators 
- Lending and Servicing in over 45 States 
www.swmc.com   -  www.simplehecmcalculator.com   Visit  www.swmc.com/resources   for helpful links on Training, Webinars, Lender Alerts and Submitting Conditions  
This email and any content within or attached hereto from Sun West Mortgage Company, Inc. is confidential and/or legally privileged. The information is intended only for the use of the individual or entity named on this email. If you are not the intended recipient, you are hereby notified that any disclosure, copying, distribution or taking any action in reliance on the contents of this email information is strictly prohibited, and that the documents should be returned to this office immediately by email. Receipt by anyone other than the intended recipient is not a waiver of any privilege. Please do not include your social security number, account number, or any other personal or financial information in the content of the email. Should you have any questions, please call (800) 453 7884.  

Re: Making PDFBox more robust when handling invalid PDFs

Posted by "martijn.list" <ma...@gmail.com>.
On 11/24/2010 06:18 PM, Adam@swmc.com wrote:
> When you attach your patches, please include an example, if possible. 

Yes, I will do that when possible. However, it is not always
straightforward to tell what the actual problem is. If a PDF is 'corrupt' or
does not follow the PDF spec, the parser can bomb out. If forceParsing is
enabled, the PDF parser will try to continue. This can sometimes result
in strange artefacts such as empty streams. I think the parser should
try to cope with these problems if forceParsing is true. In those cases,
however, it is not always clear what to report, because the failure is a
side effect of the parser trying to gracefully handle an earlier problem in
the PDF document.

Kind regards,

Martijn Brinkers

Re: Making PDFBox more robust when handling invalid PDFs

Posted by Ad...@swmc.com.
When you attach your patches, please include an example, if possible. 
Since many PDFs have confidential information or are under copyright, this 
isn't always possible.  In those cases, it may be helpful to see the 
relevant object, like the sample below.  It's never required, but if you can include 
it I think it'll help us understand what's going on and how the patch 
addresses it.

<< /Type /Pages /Kids [
4 0 R
39 0 R
51 0 R
75 0 R
] /Count 4
/Rotate 0>>
endobj

---- 
Thanks,
Adam