Posted to dev@pdfbox.apache.org by "Michael Reynolds (Jira)" <ji...@apache.org> on 2020/01/30 20:23:00 UTC

[jira] [Commented] (PDFBOX-4758) Text Extractor does not handle common typographic ligatures

    [ https://issues.apache.org/jira/browse/PDFBOX-4758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17026979#comment-17026979 ] 

Michael Reynolds commented on PDFBOX-4758:
------------------------------------------

The attached unit test contains test cases whose outputs currently fail. It would be acceptable for the extractor to emit either the normalized characters (preferable) or the ligature characters themselves, so that they can be corrected post-extraction. In these test cases, however, the information appears to be lost altogether.
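
To illustrate the second option (the extractor emits the ligature characters and the caller corrects them afterwards), here is a minimal sketch using only the JDK's java.text.Normalizer. The class LigatureNormalizer and its normalize method are hypothetical names made up for this example and are not part of PDFBox. The sketch assumes the extracted text actually contains the Unicode presentation-form ligatures U+FB00 through U+FB06; it cannot help when the characters are dropped entirely, as in the failing test cases.

```java
import java.text.Normalizer;

// Hypothetical post-extraction helper; not part of PDFBox.
public class LigatureNormalizer {

    // Expands the Latin ligatures U+FB00..U+FB06 (ff, fi, fl, ffi, ffl,
    // long-s-t, st) into their component letters and leaves every other
    // character untouched.
    public static String normalize(String extracted) {
        StringBuilder out = new StringBuilder(extracted.length());
        for (int i = 0; i < extracted.length(); i++) {
            char c = extracted.charAt(i);
            if (c >= '\uFB00' && c <= '\uFB06') {
                // NFKC applies the compatibility decomposition of the ligature.
                out.append(Normalizer.normalize(String.valueOf(c), Normalizer.Form.NFKC));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Input holds the "fi" ligature (U+FB01) and the "fl" ligature (U+FB02).
        System.out.println(normalize("of\uFB01ce \uFB02ow")); // prints "office flow"
    }
}
```

Restricting the expansion to U+FB00..U+FB06 is a deliberate choice in this sketch: running NFKC over the whole string would also fold unrelated compatibility characters (superscripts, full-width forms, and so on), which may or may not be wanted.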

> Text Extractor does not handle common typographic ligatures
> -----------------------------------------------------------
>
>                 Key: PDFBOX-4758
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4758
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.1, 2.0.18
>            Reporter: Michael Reynolds
>            Priority: Major
>         Attachments: TestExtractText.java, libreoffice-ligatures-test.pdf, msword-ligatures-test.pdf
>
>
> TextExtractor mishandles typographic ligatures. I've attached test documents from both Microsoft Word and LibreOffice.
> I've checked PDFBox's output against xPDF on CentOS, and that utility handles the ligatures properly, so this appears to be a PDFBox defect.



