Posted to dev@pdfbox.apache.org by "Tilman Hausherr (JIRA)" <ji...@apache.org> on 2019/07/27 10:02:00 UTC

[jira] [Commented] (PDFBOX-4612) The ExtractText command extracts wrong text

    [ https://issues.apache.org/jira/browse/PDFBOX-4612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16894384#comment-16894384 ] 

Tilman Hausherr commented on PDFBOX-4612:
-----------------------------------------

This is caused by an incorrect glyph name (C24) in the font (page 7, font F5). Adobe Reader cannot extract it properly either; it also yields "ataxia, and death by (SOH)4 months". (The attached file is our extraction.) See also the [FAQ|https://pdfbox.apache.org/2.0/faq.html#text-extraction].
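
Since the wrong character comes from the font itself, it cannot be corrected reliably at extraction time, but it can at least be detected afterwards. The following is a minimal sketch, assuming the extracted text is already available as a Java String (for example, the contents of the attached file). The class and method names are hypothetical and not part of PDFBox; it only flags control characters such as SOH (U+0001) and makes them visible, it does not recover the intended "~".

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox: flags control characters
// that usually indicate a bad glyph-to-Unicode mapping in the font.
class ControlCharCheck {

    // Tab, LF and CR are legitimate in extracted text; other ISO control
    // characters (such as SOH, U+0001) are treated as suspicious.
    static boolean isSuspicious(char c) {
        return Character.isISOControl(c) && c != '\t' && c != '\n' && c != '\r';
    }

    // Returns the indices of all suspicious control characters in the text.
    static List<Integer> findControlChars(String text) {
        List<Integer> positions = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            if (isSuspicious(text.charAt(i))) {
                positions.add(i);
            }
        }
        return positions;
    }

    // Replaces each suspicious character with U+FFFD so that it is
    // visible when the text is reviewed.
    static String markControlChars(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            sb.append(isSuspicious(c) ? '\uFFFD' : c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The extraction reported in this issue, with SOH before the "4".
        String extracted = "ataxia, and death by \u00014 months";
        System.out.println("Suspicious positions: " + findControlChars(extracted));
        System.out.println(markControlChars(extracted));
    }
}
```

A check like this only tells you which passages need manual review; the original character is not recoverable from the extracted text alone.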

> The ExtractText command extracts wrong text
> -------------------------------------------
>
>                 Key: PDFBOX-4612
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4612
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.16
>            Reporter: Yuri
>            Priority: Major
>         Attachments: bartel2018-p7.txt
>
>
> In this PDF [http://sci-hub.tw/10.1016/j.cell.2018.03.006], ExtractText extracts the text "ataxia, and death by ~4 months" as "ataxia, and death by ^A4 months".



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
