
Posted to dev@pdfbox.apache.org by "Josh Burchard (Jira)" <ji...@apache.org> on 2020/03/06 21:03:00 UTC

[jira] [Created] (PDFBOX-4795) Hebrew words are extracted with no whitespace between

Josh Burchard created PDFBOX-4795:
-------------------------------------

             Summary: Hebrew words are extracted with no whitespace between
                 Key: PDFBOX-4795
                 URL: https://issues.apache.org/jira/browse/PDFBOX-4795
             Project: PDFBox
          Issue Type: Bug
    Affects Versions: 2.0.19
         Environment: Windows 10
            Reporter: Josh Burchard
         Attachments: hebrew_newsletter.pdf

When I extract Hebrew text from the attached PDF, the whitespace between the words is missing from the output.

Example string as it appears in the PDF:
מאיר שמגר. ״ההלכות

And the string as PDFBox extracts it:
״ההלכותשמגר.מאיר
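
The relationship between the two strings can be hard to see in a bidirectional text viewer, so here is a minimal sketch that models the symptom using only the Java standard library. It does not call PDFBox and says nothing about the cause inside the library. The class name `HebrewExtractionSymptom` and the helper `simulateReportedOutput` are hypothetical, introduced only for illustration. It assumes the source file is compiled as UTF-8 and that the strings are stored in logical order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class HebrewExtractionSymptom {

    // Hypothetical helper: reproduces the *shape* of the reported output by
    // reversing the word order and joining the words with no separator.
    // This is a model of the symptom, not PDFBox's actual extraction logic.
    public static String simulateReportedOutput(String expected) {
        List<String> words =
                new ArrayList<>(Arrays.asList(expected.trim().split("\\s+")));
        Collections.reverse(words);      // word order flipped
        return String.join("", words);   // whitespace dropped
    }

    public static void main(String[] args) {
        String asInPdf = "מאיר שמגר. ״ההלכות";   // string as it appears in the PDF
        String extracted = "״ההלכותשמגר.מאיר";   // string as reported from PDFBox

        // Prints whether the model matches the reported extraction.
        System.out.println(simulateReportedOutput(asInPdf).equals(extracted));
    }
}
```

If the comparison holds, the characters inside each word are intact and only the word order and the separators differ, which matches the description in this report.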

The words are also output in LTR order instead of RTL. Having them in RTL order would be nice, but it doesn't matter for my use case because I'm building an index. The missing spaces, however, matter a lot.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
