Posted to users@pdfbox.apache.org by "Amir H. Jadidinejad" <am...@yahoo.com.INVALID> on 2014/08/02 22:45:00 UTC

Problem with mixed RTL/LTR pdfs

Hi,
I can extract the content of a monolingual PDF file using the following code:
        PDFTextStripper stripper = new PDFTextStripper();
        PDDocument doc = PDDocument.load(file);
        stripper.setSortByPosition(true);
        String txt = stripper.getText(doc);
        doc.close();


It's perfect when the input document is monolingual.

The problem is that when the input document combines right-to-left and left-to-right languages, the output characters of one of the languages are reversed!

A sample bilingual PDF document is attached.

Would you please help me with this issue?

Thanks.

Re: Problem with mixed RTL/LTR pdfs

Posted by "Amir H. Jadidinejad" <am...@yahoo.com.INVALID>.
After reading "PDFTextStripper.java", I think it's a bug.
This class has a variable "isRtlDominant" in the "writePage" function, which is defined as follows:
        boolean isRtlDominant = rtlCount > ltrCount;
The class evidently counts the RTL and LTR characters and uses that comparison to decide whether the whole content should be reversed. That is not correct: the decision must be made for each word, not for the whole document.
Any idea for solving the problem with minimal changes is welcome.
Thanks.
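
To make the per-word idea concrete, here is a minimal sketch in plain Java. It is not PDFBox code: the class "RtlRunReverser" and its methods are hypothetical names used only for illustration, and it assumes the extracted line arrives in visual (left-to-right on the page) order. It reverses each maximal run of strongly right-to-left characters and leaves everything else untouched, instead of reversing the whole content based on a single count.

```java
// Hypothetical helper, not part of PDFBox: reverses only the RTL runs of a line.
final class RtlRunReverser {

    // True for characters with a strong right-to-left direction (Hebrew, Arabic, ...).
    static boolean isStrongRtl(int codePoint) {
        byte d = Character.getDirectionality(codePoint);
        return d == Character.DIRECTIONALITY_RIGHT_TO_LEFT
            || d == Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC;
    }

    // Assumes 'visual' is in visual order; reverses each maximal run of
    // strong-RTL characters in place and copies all other characters as-is.
    static String reverseRtlRuns(String visual) {
        StringBuilder out = new StringBuilder(visual.length());
        StringBuilder run = new StringBuilder();
        for (int i = 0; i < visual.length(); ) {
            int cp = visual.codePointAt(i);
            i += Character.charCount(cp);
            if (isStrongRtl(cp)) {
                run.appendCodePoint(cp);      // still inside an RTL run
            } else {
                out.append(run.reverse());    // flush the finished RTL run, reversed
                run.setLength(0);
                out.appendCodePoint(cp);      // LTR or neutral character: keep position
            }
        }
        out.append(run.reverse());            // flush a trailing RTL run, if any
        return out.toString();
    }

    public static void main(String[] args) {
        // Hebrew letters written in visual order between two Latin words.
        String visual = "abc \u05D2\u05D1\u05D0 def";
        System.out.println(reverseRtlRuns(visual));
    }
}
```

Because spaces are direction-neutral, every RTL word is handled as its own run, which matches the "operate on each word" suggestion. The sketch does not reorder the words of a multi-word RTL phrase and does not treat combining marks specially; a complete solution would need something closer to the Unicode bidirectional algorithm.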


