Posted to issues@nifi.apache.org by "Andrew M. Lim (Jira)" <ji...@apache.org> on 2022/07/11 19:53:00 UTC

[jira] [Created] (NIFI-10218) ExtractDocumentText processor does not handle certain characters when extracting from a PDF

Andrew M. Lim created NIFI-10218:
------------------------------------

             Summary: ExtractDocumentText processor does not handle certain characters when extracting from a PDF
                 Key: NIFI-10218
                 URL: https://issues.apache.org/jira/browse/NIFI-10218
             Project: Apache NiFi
          Issue Type: Bug
          Components: Extensions
            Reporter: Andrew M. Lim
         Attachments: 625006.pdf, example.pdf

When a PDF contains certain special characters ("+", "=", ">", "+-"), those characters are replaced with different symbols in the text extracted from the document.

I've attached two PDFs that illustrate the issue differently:

* 625006.pdf has multiple pages. When the text is extracted from a table, certain characters show up as a ? symbol.
* example.pdf is a single page with the same table. When the text is extracted, the same characters show up as " or # symbols.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)