Posted to dev@pdfbox.apache.org by "Andreas Lehmkühler (Jira)" <ji...@apache.org> on 2020/03/21 13:31:00 UTC

[jira] [Closed] (PDFBOX-4795) Hebrew words are extracted with no whitespace between

     [ https://issues.apache.org/jira/browse/PDFBOX-4795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andreas Lehmkühler closed PDFBOX-4795.
--------------------------------------
      Assignee: Andreas Lehmkühler
    Resolution: Not A Bug

[~jmbox80] Thanks for the detailed feedback. I'm closing this as proposed.

Regarding performance concerns: yes, any additional processing will take some additional time, but IMHO it most likely won't be an issue. There might be some rare corner cases, but in the end you need to activate the sort option as long as you don't know whether the PDFs you are processing require sorting. And no, I'm not aware of an easy way to determine whether sorting is required.
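
For readers wondering why sorting can affect whitespace at all, here is a toy sketch. It is not PDFBox code and not a description of PDFBox's actual algorithm. In PDFBox 2.0.x the sort option is enabled with PDFTextStripper.setSortByPosition(true). Everything below (the Glyph class, the extract method, the GAP_THRESHOLD value) is hypothetical and only assumes that a word break is guessed from the horizontal gap between consecutive glyphs. Under that assumption, glyphs that arrive right-to-left in stream order never show a positive gap, so no space is emitted until they are sorted by position. Bidi reordering of the output is deliberately left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class SortByPositionSketch {
    // Hypothetical stand-in for a positioned glyph; not a PDFBox class.
    static final class Glyph {
        final String text;
        final double x;      // left edge of the glyph
        final double width;  // advance width of the glyph

        Glyph(String text, double x, double width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }
    }

    // Assumed value: a gap wider than this counts as a word break.
    static final double GAP_THRESHOLD = 3.0;

    // Concatenates glyph text, inserting a space wherever the gap between
    // the previous glyph's right edge and the next glyph's left edge
    // exceeds GAP_THRESHOLD. Optionally sorts by x position first.
    static String extract(List<Glyph> streamOrder, boolean sortByPosition) {
        List<Glyph> glyphs = new ArrayList<>(streamOrder);
        if (sortByPosition) {
            glyphs.sort(Comparator.comparingDouble(g -> g.x));
        }
        StringBuilder out = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : glyphs) {
            if (prev != null && g.x - (prev.x + prev.width) > GAP_THRESHOLD) {
                out.append(' ');
            }
            out.append(g.text);
            prev = g;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Two two-glyph words, written to the stream from right to left.
        List<Glyph> stream = Arrays.asList(
                new Glyph("A", 50, 10),
                new Glyph("B", 40, 10),
                new Glyph("C", 20, 10),
                new Glyph("D", 10, 10));
        System.out.println(extract(stream, false)); // prints "ABCD"
        System.out.println(extract(stream, true));  // prints "DC BA"
    }
}
```

In the unsorted pass every glyph starts to the left of the previous one, so the computed gap is always negative and the word break between C and B is missed. After sorting, the 10-unit gap between C and B exceeds the threshold and a space is emitted.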

> Hebrew words are extracted with no whitespace between
> -----------------------------------------------------
>
>                 Key: PDFBOX-4795
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4795
>             Project: PDFBox
>          Issue Type: Bug
>    Affects Versions: 2.0.19
>         Environment: Windows 10
>            Reporter: Josh Burchard
>            Assignee: Andreas Lehmkühler
>            Priority: Major
>         Attachments: PDFBOX-4795-hebrew_newsletter_sorted.txt, hebrew_newsletter.pdf
>
>
> When I extract Hebrew text from the included PDF, white space delimiting the words is not output.
> Example string of text as appears in the PDF:
> מאיר שמגר. ״ההלכות
> And the string as PDFBox extracts it:
> ״ההלכותשמגר.מאיר
> The words themselves are presented LTR, instead of RTL.  It would be nice to have them RTL, but in my particular use case that doesn't matter as I'm creating an index.  The spaces between matter a lot, however.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
