Posted to dev@pdfbox.apache.org by "Martijn Brinkers (JIRA)" <ji...@apache.org> on 2010/11/26 16:23:14 UTC
[jira] Updated: (PDFBOX-908) Gracefull handle corrupt PDFs

     [ https://issues.apache.org/jira/browse/PDFBOX-908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Martijn Brinkers updated PDFBOX-908:
------------------------------------

    Attachment: PDFBOX-908.patch

patches

> Gracefull handle corrupt PDFs
> -----------------------------
>
>                 Key: PDFBOX-908
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-908
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>    Affects Versions: 1.3.1
>            Reporter: Martijn Brinkers
>         Attachments: PDFBOX-908.patch, test-corrupt-R.pdf, test-integer-too-large.pdf, test-obj-missing-bj.pdf, test-stream-missing-endobj.pdf
>
>
> I will use PDFBox for text extraction, and one of the main requirements is that it should extract as much text as possible. If a PDF document contains something that isn't strictly correct according to the PDF specification, the parser should try to recover gracefully and continue scanning where possible when forceParsing is enabled. While testing against a large batch of PDF documents (including large ebooks), I found that the parser sometimes stops parsing and/or extracting text even with forceParsing enabled. I have attached a patch that makes PDFBox handle some PDF problems more gracefully when forceParsing is enabled.
> Some of my patches handle certain situations differently from the existing code. For example, the existing code for handling a missing endobj seems to be very complex. In all of my tests it works better when the code simply assumes that the endobj is missing. Whether that assumption or the existing approach is better is of course debatable.
> A patch is included to handle situations where the data (ID) of an inline image contains the EI keyword. EI is now only accepted as the end of the image if the character before it is an end-of-line marker rather than any whitespace character.
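
The EI rule described above can be illustrated with a minimal sketch. This is not the code from the attached patch: the class name InlineImageEndFinder and both method names are hypothetical, and the sketch only checks the byte before EI (it does not validate what follows the keyword).

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper, not part of PDFBox: locates the EI that ends inline image data. */
class InlineImageEndFinder {

    /** Returns true for the two end-of-line bytes, CR and LF. */
    static boolean isEol(byte b) {
        return b == '\r' || b == '\n';
    }

    /**
     * Returns the index of the first "EI" that is directly preceded by an
     * end-of-line byte, searching from index {@code from}, or -1 if there is none.
     * An "EI" preceded by a space or tab is treated as ordinary image data.
     */
    static int findEndOfInlineImage(byte[] data, int from) {
        for (int i = Math.max(from, 1); i + 1 < data.length; i++) {
            if (data[i] == 'E' && data[i + 1] == 'I' && isEol(data[i - 1])) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // The first "EI" follows a space and is skipped; the second follows a line feed.
        byte[] data = "ID ab EI cd\nEI\nQ".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(findEndOfInlineImage(data, 0));
    }
}
```

Requiring an end-of-line byte makes a false match inside binary image data less likely, at the cost of missing an EI that a producer wrote after a plain space.
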
> I have added the method #isContinueOnError to PDFParser. By default it returns forceParsing, but implementors can override it to stop parsing when a certain limit is reached (for example, on a timeout). This can be helpful for stopping the parser when it gets stuck in an infinite loop.
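
The override idea described above might look roughly like the following sketch. The exact signature of isContinueOnError in the patch is not shown in this message, so the sketch deliberately uses a hypothetical stand-in class (SketchParser) rather than the real PDFParser; only the method name is taken from the description.

```java
/** Hypothetical stand-in for the parser; NOT the real PDFParser API. */
class SketchParser {
    private final boolean forceParsing;

    SketchParser(boolean forceParsing) {
        this.forceParsing = forceParsing;
    }

    /** Default behaviour as described in the issue: follow forceParsing. */
    protected boolean isContinueOnError() {
        return forceParsing;
    }

    /**
     * Simulated parse loop in which every object is "corrupt".
     * Returns how many corrupt objects were skipped before giving up.
     */
    int parse(int objectCount) {
        int skipped = 0;
        for (int i = 0; i < objectCount; i++) {
            if (!isContinueOnError()) {
                break; // a real parser would presumably rethrow the error here
            }
            skipped++;
        }
        return skipped;
    }
}

/** Subclass that stops recovering from errors once a deadline has passed. */
class DeadlineParser extends SketchParser {
    private final long deadlineNanos;

    DeadlineParser(boolean forceParsing, long timeoutMillis) {
        super(forceParsing);
        this.deadlineNanos = System.nanoTime() + timeoutMillis * 1_000_000L;
    }

    @Override
    protected boolean isContinueOnError() {
        // Compare via subtraction, as recommended for System.nanoTime() values.
        return super.isContinueOnError() && System.nanoTime() - deadlineNanos < 0;
    }
}
```

With a generous timeout the subclass behaves like the default; once the deadline has passed, the next error ends parsing instead of being skipped, which bounds the time spent in a loop that keeps hitting errors.
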
> BaseParser#readInt unread the data when a NumberFormatException was thrown. This resulted in an infinite loop with forceParsing enabled when testing with test-integer-too-large.pdf (see attached file). I think it's better not to unread data when an exception is about to be thrown, because the risk of running into an infinite loop is higher.
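
To illustrate why unreading before throwing can loop forever, here is a small self-contained sketch. It is not the BaseParser code: ReadIntSketch, its readInt, and the retrying caller are hypothetical and only mimic the behaviour described above using java.io.PushbackInputStream.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/** Hypothetical illustration, not the PDFBox BaseParser. */
class ReadIntSketch {

    /** Skips whitespace, reads a run of ASCII digits and parses it as an int. */
    static int readInt(PushbackInputStream in, boolean unreadOnError) throws IOException {
        int c = in.read();
        while (c == ' ' || c == '\r' || c == '\n') {
            c = in.read();
        }
        StringBuilder digits = new StringBuilder();
        while (c >= '0' && c <= '9') {
            digits.append((char) c);
            c = in.read();
        }
        if (c != -1) {
            in.unread(c); // the terminating byte is not part of the number
        }
        try {
            return Integer.parseInt(digits.toString());
        } catch (NumberFormatException e) {
            if (unreadOnError) {
                // Old behaviour: put the offending digits back before failing.
                in.unread(digits.toString().getBytes(StandardCharsets.ISO_8859_1));
            }
            throw e;
        }
    }

    /**
     * A lenient caller that simply retries after a NumberFormatException.
     * Returns the attempt that succeeded, or -1 if maxAttempts was exhausted.
     */
    static int attemptsUntilSuccess(String input, boolean unreadOnError, int maxAttempts) {
        byte[] bytes = input.getBytes(StandardCharsets.ISO_8859_1);
        PushbackInputStream in =
                new PushbackInputStream(new ByteArrayInputStream(bytes), bytes.length + 1);
        try {
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                try {
                    readInt(in, unreadOnError);
                    return attempt;
                } catch (NumberFormatException e) {
                    // Ignore and retry, as an error-tolerant parser might.
                }
            }
            return -1;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For the input "99999999999 42", the variant that does not unread fails once on the oversized number and then reads 42 on the second attempt. The variant that unreads sees the same digits on every retry, so only the attempt cap stops it.
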
> The other patches are minor, such as checks for null values.
> I have attached four test PDF documents. I corrupted these PDFs by hand to replicate situations I found in existing (copyrighted) ebooks.
