Posted to dev@pdfbox.apache.org by "Timo Boehme (Created) (JIRA)" <ji...@apache.org> on 2011/11/15 16:30:52 UTC

[jira] [Created] (PDFBOX-1171) Parsing hexadecimal strings is not strict enough + FIX

Parsing hexadecimal strings is not strict enough + FIX
------------------------------------------------------

Key: PDFBOX-1171
URL: https://issues.apache.org/jira/browse/PDFBOX-1171
Project: PDFBox
Issue Type: Bug
Components: Parsing
Affects Versions: 1.7.0
Reporter: Timo Boehme
Priority: Minor

Hexadecimal strings (strings in '<','>') are parsed in BaseParser by the same method that parses literal strings (strings in '(',')'). Since parsing escape sequences and parentheses in literal strings is quite tricky, there are a number of rules to capture problematic cases. However, none of this is needed for hexadecimal strings: here we know the restricted set of allowed characters and we don't have to count opening and closing brackets.

The problem with the relaxed parsing (and the reason this is marked as a bug) shows up when parsing documents that contain trash data between objects. I have a number of them (confidential ones, however) produced by verypdf.com, which seem to have been updated a lot; after an endobj they contain e.g. <PrY... - simply some remnants of old objects. This trash would be no problem for an ISO conforming parser, since these ranges would be ignored, but the current sequential parser parses it, and the best one can hope for is that the trash is found to be unparseable and the parser searches for a new starting point via PDFParser#skipToNextObj. This is where the relaxed hexadecimal parsing becomes a problem: as in the example, the opening '<' triggers hexadecimal string parsing, which, because of a later '<', continues until the end of the document, reading all valid objects as string content. With stricter parsing we would find at the second character that it is not a hexadecimal string.

I have therefore added a method for parsing hexadecimal strings (see attached diff) which fails with an IOException if an invalid character is read. This method also runs faster than the previous parser since there are only a small number of cases to test.
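
For readers without the attached diff at hand, the following is a minimal sketch of what strict hexadecimal string parsing can look like. It is not the code from BaseParser.diff and not PDFBox API: the class and method names (StrictHexStringSketch, parseHexString, hexValue, isWhitespace) are hypothetical, and it reads from a plain InputStream rather than the parser's own source. It assumes the usual PDF rules that whitespace inside '<' ... '>' is ignored and that an odd number of digits is padded with a trailing 0.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class StrictHexStringSketch {

    /** Returns the value of a hex digit, or -1 if c is not a hex digit. */
    static int hexValue(int c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    }

    /** PDF whitespace characters: NUL, HT, LF, FF, CR, SP. */
    static boolean isWhitespace(int c) {
        return c == 0 || c == 9 || c == 10 || c == 12 || c == 13 || c == 32;
    }

    /**
     * Hypothetical strict parser: expects the stream to be positioned at '<'
     * and fails with an IOException on the first character that is neither
     * a hex digit, whitespace, nor the closing '>'.
     */
    public static byte[] parseHexString(InputStream in) throws IOException {
        if (in.read() != '<') {
            throw new IOException("Expected '<' at start of hexadecimal string");
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int high = -1; // pending high nibble, or -1 if none is pending
        while (true) {
            int c = in.read();
            if (c == -1) {
                throw new IOException("Unexpected end of input in hexadecimal string");
            }
            if (c == '>') {
                break;
            }
            if (isWhitespace(c)) {
                continue;
            }
            int digit = hexValue(c);
            if (digit < 0) {
                // Fail immediately instead of scanning on for a closing bracket.
                throw new IOException("Invalid character '" + (char) c + "' in hexadecimal string");
            }
            if (high < 0) {
                high = digit;
            } else {
                out.write((high << 4) | digit);
                high = -1;
            }
        }
        if (high >= 0) {
            out.write(high << 4); // odd digit count: the missing digit is taken as 0
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] ok = parseHexString(new ByteArrayInputStream(
                "<48656C6C6F>".getBytes(StandardCharsets.US_ASCII)));
        System.out.println(new String(ok, StandardCharsets.US_ASCII)); // prints "Hello"

        try {
            // Trash similar to the "<PrY..." example is rejected at the second character.
            parseHexString(new ByteArrayInputStream(
                    "<PrY".getBytes(StandardCharsets.US_ASCII)));
        } catch (IOException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

The point of the sketch is that there are only four cases per character (closing bracket, whitespace, hex digit, anything else), so trash is detected right away and the caller can fall back to searching for the next object.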

With this change I can now parse the mentioned (correct) documents (with forced parsing), which wasn't possible before.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Resolved] (PDFBOX-1171) Parsing hexadecimal strings is not strict enough + FIX

Posted by "Andreas Lehmkühler (Resolved JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PDFBOX-1171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andreas Lehmkühler resolved PDFBOX-1171.
----------------------------------------

       Resolution: Fixed
    Fix Version/s: 1.7.0
         Assignee: Andreas Lehmkühler

I added the proposed patch in revision 1209127 with some minor changes. I also refactored the parseCOSString method.

Thanks for the contribution
                


[jira] [Updated] (PDFBOX-1171) Parsing hexadecimal strings is not strict enough + FIX

Posted by "Timo Boehme (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PDFBOX-1171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Timo Boehme updated PDFBOX-1171:
--------------------------------

    Attachment: BaseParser.diff

Diff (against rev. 1151382 of BaseParser) of the fix adding a strict hexadecimal parser; the parseCOSString method could now be refactored a bit to remove the unnecessary hexadecimal string support.
                
