Posted to dev@pdfbox.apache.org by "John Hewson (JIRA)" <ji...@apache.org> on 2014/12/01 19:57:12 UTC

[jira] [Commented] (PDFBOX-2532) Text extraction fails due to the usage of the internal font mapping

    [ https://issues.apache.org/jira/browse/PDFBOX-2532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14230241#comment-14230241 ] 

John Hewson commented on PDFBOX-2532:
-------------------------------------

The attached PDF is broken; even Acrobat cannot extract the text correctly. In 2.0 we already handle the encoding mechanisms needed to read this file; it just happens to contain nonsense.

> Text extraction fails due to the usage of the internal font mapping
> -------------------------------------------------------------------
>
>                 Key: PDFBOX-2532
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2532
>             Project: PDFBox
>          Issue Type: Bug
>          Components: PDModel
>    Affects Versions: 2.0.0
>            Reporter: Andreas Lehmkühler
>             Fix For: 2.0.0
>
>         Attachments: PDFBOX2247-701542.pdf
>
>
> If a PDF doesn't provide any mapping (neither an encoding nor a ToUnicode mapping), we have to decide ourselves where to get a suitable mapping. We can't use the internal font mapping of the Type1C font, as it doesn't work in every case; see PDFBOX-2377.
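
To illustrate the decision described in the issue, here is a minimal sketch of a lookup that tries a ToUnicode mapping first, then the encoding, and only then a mapping chosen by the extractor itself. It is not PDFBox code: the class MappingFallback, the method resolve, and the use of plain maps from character code to text are all assumptions made for this example, and the lookup order is only one plausible choice.

```java
import java.util.Map;
import java.util.Optional;

/** Hypothetical illustration only; none of these names are PDFBox API. */
class MappingFallback {

    /**
     * Resolves a character code to text by trying each source in order:
     * the ToUnicode mapping, then the encoding, then a caller-chosen fallback.
     * A null map stands for "the PDF does not provide this mapping".
     */
    static Optional<String> resolve(int code,
                                    Map<Integer, String> toUnicode,
                                    Map<Integer, String> encoding,
                                    Map<Integer, String> fallback) {
        if (toUnicode != null && toUnicode.containsKey(code)) {
            return Optional.of(toUnicode.get(code));
        }
        if (encoding != null && encoding.containsKey(code)) {
            return Optional.of(encoding.get(code));
        }
        // Neither mapping helped: this is the case the issue describes,
        // where the extractor has to pick a mapping source itself.
        if (fallback == null) {
            return Optional.empty();
        }
        return Optional.ofNullable(fallback.get(code));
    }

    public static void main(String[] args) {
        // Assumed sample data: code 65 maps to "A" in the fallback only.
        Map<Integer, String> fallback = Map.of(65, "A");
        System.out.println(resolve(65, null, null, fallback));
        System.out.println(resolve(66, null, null, fallback));
    }
}
```

Returning an empty Optional rather than guessing keeps the "no usable mapping" case visible to the caller, which matters when, as in the comment above, the file itself contains nonsense.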



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)