Posted to dev@pdfbox.apache.org by "Amir (JIRA)" <ji...@apache.org> on 2014/09/08 23:17:28 UTC

[jira] [Commented] (PDFBOX-2259) PDFTextStripper has problem with semi-space characters

    [ https://issues.apache.org/jira/browse/PDFBOX-2259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14126114#comment-14126114 ] 

Amir commented on PDFBOX-2259:
------------------------------

Would you please check this issue again? Semi-spaces are very common in many non-English languages.

> PDFTextStripper has problem with semi-space characters
> ------------------------------------------------------
>
>                 Key: PDFBOX-2259
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2259
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.6
>            Reporter: Amir
>         Attachments: test.pdf
>
>
> In some right-to-left languages, the parts of a compound word are separated by a "semi-space" character (see the list of Unicode spaces: https://www.cs.tut.fi/~jkorpela/chars/spaces.html). When the input document contains such words, PDFTextStripper drops the semi-space character and concatenates the parts into one word.
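
To make the report easier to verify, here is a minimal sketch of a check that could be run on extracted text. It is not part of the original report. It assumes the "semi-space" is one of a few candidate code points (U+200C ZERO WIDTH NON-JOINER, U+200B ZERO WIDTH SPACE, U+2009 THIN SPACE, U+202F NARROW NO-BREAK SPACE); which one test.pdf actually uses is not stated in the issue. The class and method names are hypothetical, and the PDF extraction step itself is not shown.

```java
public class SemiSpaceCheck {

    // Candidate "semi-space" code points. This list is an assumption;
    // the issue does not say which character the attached PDF contains.
    private static final char[] CANDIDATES = {
        '\u200C', // zero width non-joiner
        '\u200B', // zero width space
        '\u2009', // thin space
        '\u202F'  // narrow no-break space
    };

    // Counts how many candidate semi-space characters occur in the text.
    public static int countSemiSpaces(String text) {
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            for (char candidate : CANDIDATES) {
                if (c == candidate) {
                    count++;
                    break;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // A compound word whose two parts are separated by U+200C.
        String withSemiSpace = "\u0645\u06CC\u200C\u062E\u0648\u0627\u0647\u0645";
        // The same word with the separator dropped, as the report describes.
        String concatenated = "\u0645\u06CC\u062E\u0648\u0627\u0647\u0645";

        System.out.println("with semi-space: " + countSemiSpaces(withSemiSpace));
        System.out.println("concatenated:    " + countSemiSpaces(concatenated));
    }
}
```

If the string returned by PDFTextStripper.getText(PDDocument) for a page known to contain such words gives a count of zero, that would be consistent with the behaviour described above.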


