Posted to dev@pdfbox.apache.org by "Tilman Hausherr (JIRA)" <ji...@apache.org> on 2015/11/04 18:14:27 UTC

[jira] [Commented] (PDFBOX-2252) PDFTextStripper has problem with documents with mixed language directions

    [ https://issues.apache.org/jira/browse/PDFBOX-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14989943#comment-14989943 ] 

Tilman Hausherr commented on PDFBOX-2252:
-----------------------------------------

A fourth mixed document can be found here:
http://www.konto.org/download/merkblatt-basiskonto-asylbewerber-arabisch.pdf

Text extraction gives odd results with both PDFBox and Adobe Reader. Near the third occurrence of the word "Duldung" there is a "2" in the wrong place. Adobe has it twice.

> PDFTextStripper has problem with documents with mixed language directions
> -------------------------------------------------------------------------
>
>                 Key: PDFBOX-2252
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2252
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.6, 2.0.0
>            Reporter: Amir
>            Assignee: Maruan Sahyoun
>            Priority: Critical
>             Fix For: 2.1.0
>
>         Attachments: BidiMirroring.txt, IsMirroredDeviations.txt, PDFTextStripper-201709271718.patch, PDFTextStripper-201709272018.patch, PDFTextStripper.java.patch, PDFTextStripper.java.patch, atest.pdf, bugzilla867751.pdf, content_diffs.xlsx, overlap.jpg, pdfs_directionality.xlsx, pdfs_directionality3.xlsx, test.pdf, wikipedia_dl_lyric_test.pdf
>
>
> When the input document of PDFTextStripper is a combination of right-to-left and left-to-right languages, the output characters of one language are reversed.
> A sample bilingual pdf document is attached.
> PDFTextStripper has a variable "isRtlDominant" in "writePage" function, which is defined as follows:     boolean isRtlDominant = rtlCount > ltrCount;
> This class clearly counts the number of RTL characters and decides whether the whole content should be reversed or not. That is not correct: it must operate on each word, not on the whole document.
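
To illustrate the difference the description points at, here is a minimal sketch that is not taken from PDFTextStripper. It assumes the input is text in visual order, so the characters of an RTL word arrive reversed. The class name PerWordDirection and the methods isRtlDominant, fixWholeText and fixPerWord are hypothetical names chosen for this example. It only reorders characters inside space-separated words; a complete solution would also have to handle word order, digits and mirrored characters, as the Unicode Bidirectional Algorithm does.

```java
public class PerWordDirection {

    // True when the string holds more strong RTL characters than strong LTR ones.
    // Digits, spaces and punctuation are neutral or weak and are not counted.
    static boolean isRtlDominant(String s) {
        int rtlCount = 0;
        int ltrCount = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            byte dir = Character.getDirectionality(cp);
            if (dir == Character.DIRECTIONALITY_RIGHT_TO_LEFT
                    || dir == Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC) {
                rtlCount++;
            } else if (dir == Character.DIRECTIONALITY_LEFT_TO_RIGHT) {
                ltrCount++;
            }
            i += Character.charCount(cp);
        }
        return rtlCount > ltrCount;
    }

    // One decision for the whole text, comparable to the single
    // "rtlCount > ltrCount" check quoted in the issue description.
    static String fixWholeText(String visualText) {
        return isRtlDominant(visualText)
                ? new StringBuilder(visualText).reverse().toString()
                : visualText;
    }

    // One decision per space-separated word: only RTL-dominant words are reversed.
    static String fixPerWord(String visualLine) {
        String[] words = visualLine.split(" ", -1);
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < words.length; i++) {
            if (i > 0) {
                out.append(' ');
            }
            String word = words[i];
            out.append(isRtlDominant(word)
                    ? new StringBuilder(word).reverse().toString()
                    : word);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Hebrew gimel-bet-alef (a visually ordered RTL word) followed by a German word.
        String mixed = "\u05D2\u05D1\u05D0 Duldung";
        System.out.println(fixWholeText(mixed));
        System.out.println(fixPerWord(mixed));
    }
}
```

With this input, the whole-text decision sees more LTR than RTL characters and leaves the Hebrew word in visual order, while the per-word decision reverses only that word and leaves "Duldung" untouched.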


