
Posted to dev@pdfbox.apache.org by "Chris Bamford (JIRA)" <ji...@apache.org> on 2013/10/10 13:28:41 UTC

[jira] [Created] (PDFBOX-1744) Be resilient to PDFs with missing version info

Chris Bamford created PDFBOX-1744:
-------------------------------------

             Summary: Be resilient to PDFs with missing version info
                 Key: PDFBOX-1744
                 URL: https://issues.apache.org/jira/browse/PDFBOX-1744
             Project: PDFBox
          Issue Type: Improvement
          Components: Parsing
    Affects Versions: 1.8.2
         Environment: PDFBox 1.8.2, IntelliJ IDEA 12.1.6, Mac OS X 10.7.5, Java 1.7, Maven 2.2.1
            Reporter: Chris Bamford
            Priority: Minor
             Fix For: 1.8.3


Proposed addition to parseHeader() in pdfbox/src/main/java/org/apache/pdfbox/pdfparser/PDFParser.java (1.8.2): default the PDF version to 1.4 when it is missing (yes, there really are docs out there like this!).
This prevents an exception caused by a negative substring offset calculation:  "String index out of range: -3"
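
To illustrate the idea only, here is a minimal standalone sketch of defaulting the version when the header carries none. It is not the attached patch and not the actual parseHeader() code; the class name HeaderVersionSketch, the method parseVersion and the "digit.digit" pattern are assumptions made for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox: shows the "fall back to 1.4" idea in isolation.
final class HeaderVersionSketch {

    // Assumption: the version appears as "%PDF-" followed by digit, dot, digit.
    private static final Pattern VERSION = Pattern.compile("%PDF-(\\d\\.\\d)");

    // Fallback suggested in this report for headers with no version.
    static final float DEFAULT_VERSION = 1.4f;

    /**
     * Returns the version found in the header line, or DEFAULT_VERSION
     * when the line is null or contains no recognisable version.
     * No substring offsets are computed, so a short header cannot
     * trigger an index-out-of-range error here.
     */
    static float parseVersion(String headerLine) {
        if (headerLine == null) {
            return DEFAULT_VERSION;
        }
        Matcher m = VERSION.matcher(headerLine);
        if (!m.find()) {
            return DEFAULT_VERSION;
        }
        return Float.parseFloat(m.group(1));
    }

    public static void main(String[] args) {
        System.out.println(parseVersion("%PDF-1.6")); // version present
        System.out.println(parseVersion("%PDF-"));    // version missing, falls back
    }
}
```

The point of the sketch is that the missing-version case is checked before any parsing happens, rather than being discovered through an exception.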

I have floated the question on the users@pdfbox.apache.org mailing list (10th October 2013) and it was suggested I default the PDF version to 1.4 in this scenario.  I have tested it locally and it works (apparently PDFBox doesn't take the version number into account anyway).

Now over to you guys to decide if this is a good idea or not in the wider scope.

Should you give the green light, I attach:
# a sample file which causes the exception
# a patch file + instructions.

My goal is text extraction, even on broken files (if possible).



--
This message was sent by Atlassian JIRA
(v6.1#6144)