Posted to dev@pdfbox.apache.org by "Andreas Lehmkühler (JIRA)" <ji...@apache.org> on 2014/10/23 19:11:34 UTC

[jira] [Closed] (PDFBOX-1244) the text content extracted by PDFBOX is not as the same as it is displayed in Adobe reader

     [ https://issues.apache.org/jira/browse/PDFBOX-1244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andreas Lehmkühler closed PDFBOX-1244.
--------------------------------------
    Resolution: Not a Problem
      Assignee: Andreas Lehmkühler

PDFBox extracts exactly the same text as Acrobat Reader does. And yes, it is not the displayed text, which suggests that the ToUnicode mapping of the PDF is broken.
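
To illustrate how displayed and extracted text can diverge, here is a minimal toy model in plain Java. It is not PDFBox code: the class name, the character codes 0x01 and 0x02, and both lookup tables are assumptions made up for this sketch. It only shows the idea that a viewer paints glyphs from the font while text extraction reads the ToUnicode mapping, so one wrong mapping entry changes the extracted text without changing what is shown on screen.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only, not PDFBox API. Codes and tables are hypothetical.
final class ToUnicodeDemo {

    // What a viewer paints: character code -> glyph shape in the embedded font.
    static final Map<Integer, String> GLYPHS = new HashMap<>();

    // What text extraction reports: character code -> Unicode (the ToUnicode mapping).
    static final Map<Integer, String> TO_UNICODE = new HashMap<>();

    static {
        GLYPHS.put(0x01, "\u5E74");      // glyph drawn for code 0x01: 年
        GLYPHS.put(0x02, "\u671F");      // glyph drawn for code 0x02: 期
        TO_UNICODE.put(0x01, "\u5E74");  // correct entry: 年
        TO_UNICODE.put(0x02, "\u9884");  // broken entry: 预, although the glyph is 期
    }

    // Text as it appears on screen.
    static String render(int[] codes) {
        return lookup(codes, GLYPHS);
    }

    // Text as an extractor (or copy and paste) would return it.
    static String extract(int[] codes) {
        return lookup(codes, TO_UNICODE);
    }

    private static String lookup(int[] codes, Map<Integer, String> table) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            // Codes without an entry become the Unicode replacement character.
            sb.append(table.getOrDefault(code, "\uFFFD"));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int[] contentStream = {0x01, 0x02}; // the codes written in the page content
        System.out.println("displayed: " + render(contentStream));
        System.out.println("extracted: " + extract(contentStream));
    }
}
```

In this model the two strings differ only where the mapping entry is wrong, which matches the symptom in this issue. Any extractor that trusts the mapping would return the same result, so the fix has to happen in the PDF itself.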

Closed as "Not a problem"



> the text content extracted by PDFBOX is not as the same as it is displayed in Adobe reader
> ------------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-1244
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1244
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.6.0, 1.7.0
>         Environment: windows xp, Eclipse 3.2.0
>            Reporter: huangchangan
>            Assignee: Andreas Lehmkühler
>         Attachments: P020101210619863754780 214.pdf
>
>
> Hello, 
> I used PDFBox to extract the text content from the attached PDF document and found that the extracted text is "年预", but the text displayed in Adobe Reader is "年期". I want to know how to get the correct text content (as Adobe Reader shows it) from this kind of PDF document with PDFBox.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)