
Posted to dev@pdfbox.apache.org by "Maruan Sahyoun (JIRA)" <ji...@apache.org> on 2015/11/10 14:09:11 UTC

[jira] [Commented] (PDFBOX-3096) Lack of Bidi (Arabic / Hebrew) text reordering in text extracted with PDFbox

    [ https://issues.apache.org/jira/browse/PDFBOX-3096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14998529#comment-14998529 ] 

Maruan Sahyoun commented on PDFBOX-3096:
----------------------------------------

To better understand the potential issues, and whether they are related to Bidi or caused by other text extraction problems, I'd like to
- know which version of PDFBox you were using for your tests
- have a sample PDF
- have a sample of the extracted text, with a description of where that text is wrong

{quote}
2. Bidi engine compliant with UBA (http://unicode.org/reports/tr9/) should be used for resolution of the issue.
{quote}

Although we can use a different engine for Bidi support, could you elaborate a little on where the text extraction fails **because** we are currently using the Java (Oracle JDK) Bidi engine?
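
For reference, here is a minimal sketch of how the JDK's java.text.Bidi class can be used to reorder a string run by run. It is only an illustration under stated assumptions and is not necessarily what PDFBox's text extraction does internally: the class name BidiSketch and the method reorder are hypothetical, mirrored characters such as brackets are not swapped, and characters are reversed one char at a time, which ignores surrogate pairs and combining marks.

```java
import java.text.Bidi;

// Hypothetical helper for illustration only; not part of PDFBox.
final class BidiSketch {

    // Reorders the runs of a single line of text with the JDK Bidi engine
    // and reverses the characters inside every right-to-left run.
    static String reorder(String text) {
        Bidi bidi = new Bidi(text, Bidi.DIRECTION_DEFAULT_LEFT_TO_RIGHT);

        // Purely left-to-right text needs no reordering.
        if (bidi.isLeftToRight()) {
            return text;
        }

        int runCount = bidi.getRunCount();
        byte[] levels = new byte[runCount];
        Integer[] runs = new Integer[runCount];
        for (int i = 0; i < runCount; i++) {
            levels[i] = (byte) bidi.getRunLevel(i);
            runs[i] = i;
        }

        // Permute the run indices according to their embedding levels.
        // Only the runs array is reordered; levels stays in logical order.
        Bidi.reorderVisually(levels, 0, runs, 0, runCount);

        StringBuilder result = new StringBuilder(text.length());
        for (int i = 0; i < runCount; i++) {
            int run = runs[i];
            int start = bidi.getRunStart(run);
            int limit = bidi.getRunLimit(run);
            if ((levels[run] & 1) != 0) {
                // Odd level means a right-to-left run: emit its chars reversed.
                // Simplification: no mirroring, no surrogate-pair handling.
                for (int j = limit - 1; j >= start; j--) {
                    result.append(text.charAt(j));
                }
            } else {
                // Even level means a left-to-right run: copy it unchanged.
                result.append(text, start, limit);
            }
        }
        return result.toString();
    }

    public static void main(String[] args) {
        // Latin text followed by three Hebrew letters (alef, bet, gimel).
        String sample = "abc \u05D0\u05D1\u05D2";
        System.out.println(reorder(sample));
    }
}
```

In this sketch a string consisting of a single right-to-left run is simply reversed, so applying reorder twice to such a string gives back the original. A sample PDF where this kind of run-based reordering produces the wrong result would help to pin down whether the Bidi engine itself is the problem.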

> Lack of Bidi (Arabic / Hebrew) text reordering in text extracted with PDFbox
> ----------------------------------------------------------------------------
>
>                 Key: PDFBOX-3096
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3096
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>            Reporter: TOMER MAHLIN
>         Attachments: PDFBox_HebrewExtractedText.PNG
>
>
> Rendering rules for Bidi (Arabic / Hebrew) text differ between regular Windows / Android / iOS environments and the Adobe environment. Adobe expects text to appear in visual Bidi layout, while modern systems work with logical Bidi layout.
> When text is extracted from a PDF file it should be converted / normalized to logical Bidi layout.
> Example:
> Assuming capital letters stand for Bidi letters.
> 1. In the Adobe document you see: CBA
> 2. When you extract the content and display it in Notepad (or a web browser or any similar tool) you will see ABC, while the expectation is to see CBA.
> Assuming you have real text with both Hebrew and English (or Arabic and English) characters, the resulting display is completely ruined after text extraction. Moreover, even if we ignore the display and focus on text manipulation (search, comparison, concatenation etc.), it will fail when the same text authored in Notepad is compared with the text extracted from the PDF file.


