Posted to dev@pdfbox.apache.org by "Cheng Leong (JIRA)" <ji...@apache.org> on 2014/01/19 00:35:19 UTC

[jira] [Updated] (PDFBOX-399) Gibberish Output

     [ https://issues.apache.org/jira/browse/PDFBOX-399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cheng Leong updated PDFBOX-399:
-------------------------------

    Attachment: PDFBOX-399__Ignore_badly-formatted_CMap_ToUnicode_instructions.patch

Submitting a patch that ignores badly formatted CMap ToUnicode instructions.
With it, some ToUnicode resource streams can be parsed that previously threw exceptions, which were silently consumed. Text extraction then gets the correctly mapped characters.

Specifically, the patch parses a token immediately followed by <hex> with no whitespace separating them, skips all whitespace within a hex value, and returns a partially constructed CMap instead of throwing an exception.
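
To illustrate the three lenient behaviours listed above, here is a minimal sketch. It is not the attached patch and does not use PDFBox's parser API; the class name LenientCMapTokens and its tokenize method are hypothetical, and it works on a plain String rather than a stream. A real CMap parser would build mappings from these tokens; this only shows the tokenizing rules.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox: shows lenient tokenizing of
// CMap-like text (adjacent token<hex>, whitespace inside hex, partial result).
class LenientCMapTokens {

    /** Splits the input into plain tokens and normalized <HEX> tokens. */
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;
            } else if (c == '<') {
                // Hex value: collect digits, skipping any whitespace inside it.
                StringBuilder hex = new StringBuilder();
                boolean closed = false;
                i++;
                while (i < input.length()) {
                    char h = input.charAt(i++);
                    if (h == '>') {
                        closed = true;
                        break;
                    }
                    if (Character.isWhitespace(h)) {
                        continue;
                    }
                    if (Character.digit(h, 16) < 0) {
                        // Badly formatted: return what was parsed so far
                        // instead of throwing.
                        return tokens;
                    }
                    hex.append(Character.toUpperCase(h));
                }
                if (!closed) {
                    return tokens; // unterminated hex value: keep partial result
                }
                tokens.add("<" + hex + ">");
            } else {
                // Plain token: ends at whitespace or at an adjacent '<'.
                int start = i;
                while (i < input.length()
                        && !Character.isWhitespace(input.charAt(i))
                        && input.charAt(i) != '<') {
                    i++;
                }
                tokens.add(input.substring(start, i));
            }
        }
        return tokens;
    }
}
```

For example, tokenize("1 beginbfchar<00 41><0041>") splits "beginbfchar" from the hex value that follows it and reads "<00 41>" as "<0041>". Returning the partial list on a malformed hex value mirrors the idea of returning a partially constructed CMap, so the mappings read before the error remain usable.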

I don't see a problem with the previous test case example (BlackHat...), but I've modified the test case based on an example found in the wild: http://www.itsix.com/media/experienced_java_developer.pdf

> Gibberish Output
> ----------------
>
>                 Key: PDFBOX-399
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-399
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>            Reporter: Sushil Duseja
>         Attachments: BlackHat-DC-09-Marlinspike-Defeating-SSL.pdf, PDFBOX-399__Ignore_badly-formatted_CMap_ToUnicode_instructions.patch, experienced_java_developer.pdf
>
>
> While extracting text from a pdf file using PDFBox, I get garbage output (*À¾´»*) for a special text value "2007"; this text ("2007") is written in CLRDingbats font. 
> Any pointer(s)?



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)