Posted to dev@pdfbox.apache.org by "TOMER MAHLIN (JIRA)" <ji...@apache.org> on 2015/11/10 13:25:10 UTC

[jira] [Updated] (PDFBOX-3096) Lack of Bidi (Arabic / Hebrew) text reordering in text extracted with PDFbox

     [ https://issues.apache.org/jira/browse/PDFBOX-3096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

TOMER MAHLIN updated PDFBOX-3096:
---------------------------------
    Summary: Lack of Bidi (Arabic / Hebrew) text reordering in text extracted with PDFbox  (was: Lack of Bidi (Arabic / Hebrew) test reordering in text extracted with PDFbox)

> Lack of Bidi (Arabic / Hebrew) text reordering in text extracted with PDFbox
> ----------------------------------------------------------------------------
>
>                 Key: PDFBOX-3096
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3096
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>            Reporter: TOMER MAHLIN
>         Attachments: PDFBox_HebrewExtractedText.PNG
>
>
> Rendering rules for Bidi (Arabic / Hebrew) text differ between regular Windows / Android / iOS environments and the Adobe environment. Adobe expects text to appear in visual bidi layout, while modern systems work with logical bidi layout.
> When text is extracted from a PDF file, it should be converted / normalized to logical bidi layout.
> Example:
> Assuming capital letters stand for Bidi letters.
> 1. In an Adobe document you see: CBA
> 2. When you extract the content and display it in Notepad (or a web browser or any similar tool), you see ABC, while the expectation is to see CBA.
> Assuming you have real text with both Hebrew and English (or Arabic and English) characters, the resulting display is completely ruined after text extraction. Moreover, even if we ignore the display and focus on text manipulation (search, comparison, concatenation etc.), comparing the same text authored in Notepad with the text extracted from a PDF file will fail.
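
The normalization requested above can be illustrated with a rough sketch. It is not PDFBox code and not the full Unicode Bidirectional Algorithm (UAX #9); it assumes a left-to-right base direction and only reverses each run of strong right-to-left characters (plus whitespace enclosed by such a run), so digits and punctuation inside Hebrew or Arabic runs would not be handled correctly. The function names `is_rtl` and `visual_to_logical` are hypothetical.

```python
import unicodedata

# Strong right-to-left bidi classes: R (e.g. Hebrew) and AL (Arabic letters).
RTL_CLASSES = {"R", "AL"}


def is_rtl(ch):
    """Return True if the character has a strong right-to-left bidi class."""
    return unicodedata.bidirectional(ch) in RTL_CLASSES


def visual_to_logical(line):
    """Convert a visually ordered line to logical order (simplified).

    Assumption: base direction is left-to-right, and every maximal run of
    RTL characters (with whitespace enclosed between RTL characters) is
    stored in display order and therefore has to be reversed.
    """
    out = []
    i = 0
    n = len(line)
    while i < n:
        if not is_rtl(line[i]):
            # Left-to-right or neutral character outside a run: keep in place.
            out.append(line[i])
            i += 1
            continue
        # Extend the run over RTL characters; whitespace is included only
        # if another RTL character follows it.
        j = i
        end = i + 1
        while j < n:
            if is_rtl(line[j]):
                end = j + 1
                j += 1
            elif line[j].isspace():
                j += 1
            else:
                break
        out.append(line[i:end][::-1])
        i = end
    return "".join(out)


if __name__ == "__main__":
    # Visual order as extracted: English, then Hebrew letters stored
    # left to right as they appear on the page.
    extracted = "abc \u05d2\u05d1\u05d0 def"
    print(visual_to_logical(extracted))
```

With the issue's notation, the extracted string CBA would come out of this sketch as ABC in memory, which a logical-order renderer such as Notepad then displays as CBA, matching what the PDF shows.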



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: dev-help@pdfbox.apache.org