Posted to issues@nifi.apache.org by "Andrew M. Lim (Jira)" <ji...@apache.org> on 2022/07/11 19:53:00 UTC
[jira] [Updated] (NIFI-10218) ExtractDocumentText processor does not handle certain characters when extracting from a PDF
[ https://issues.apache.org/jira/browse/NIFI-10218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andrew M. Lim updated NIFI-10218:
---------------------------------
Attachment: example.pdf
> ExtractDocumentText processor does not handle certain characters when extracting from a PDF
> -------------------------------------------------------------------------------------------
>
> Key: NIFI-10218
> URL: https://issues.apache.org/jira/browse/NIFI-10218
> Project: Apache NiFi
> Issue Type: Bug
> Components: Extensions
> Reporter: Andrew M. Lim
> Priority: Minor
> Attachments: 625006.pdf, example.pdf
>
>
> When a PDF contains special characters ("+", "=", ">", "+-"), these characters show up as different symbols in the text extracted from the document.
> I've attached two PDFs that illustrate the issue differently:
> * 625006.pdf has multiple pages. When the text is extracted from a table, certain characters show up as a ? symbol.
> * example.pdf is a single page with the same table. When the text is extracted, the same characters show up as " or # symbols.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)