Posted to issues@nifi.apache.org by "Andrew M. Lim (Jira)" <ji...@apache.org> on 2022/07/11 19:53:00 UTC

[jira] [Updated] (NIFI-10218) ExtractDocumentText processor does not handle certain characters when extracting from a PDF

     [ https://issues.apache.org/jira/browse/NIFI-10218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew M. Lim updated NIFI-10218:
---------------------------------
    Attachment: example.pdf

> ExtractDocumentText processor does not handle certain characters when extracting from a PDF
> -------------------------------------------------------------------------------------------
>
>                 Key: NIFI-10218
>                 URL: https://issues.apache.org/jira/browse/NIFI-10218
>             Project: Apache NiFi
>          Issue Type: Bug
>          Components: Extensions
>            Reporter: Andrew M. Lim
>            Priority: Minor
>         Attachments: 625006.pdf, example.pdf
>
>
> When a PDF contains certain special characters ("+", "=", ">", "+-"), the text extracted from the document shows different symbols in their place.
> I've attached two PDFs that illustrate the issue differently:
> * 625006.pdf has multiple pages. When the text is extracted from a table, the affected characters show up as a ? symbol.
> * example.pdf is a single page with the same table. When the text is extracted, the same characters show up as " or # symbols.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)