Posted to users@pdfbox.apache.org by Jason Day <so...@gmail.com> on 2010/03/04 07:56:05 UTC

How to handle Arabic characters?

I want to extract text from a PDF file that contains some Arabic
characters. PDFTextStripper works fine with this
pdf<http://issues.apache.org/jira/secure/attachment/12390533/hello3.pdf>,
but with another
pdf<http://www.un.org/chinese/sc/committees/1267/ConsolidatedList.pdf>
it does not extract the correct Arabic characters. Instead, I get
something like "afii52364afii62811afii62760afii62762". Could anybody
figure out why? Many thanks!

My code is as follows:

import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;


public class PDFboxTest {
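    // Extracts the text of pages 1-2; returns an empty string on I/O failure.
    public static String getTxt(File f) {
        String text = "";
        PDDocument pdfdocument = null;
        try {
            pdfdocument = PDDocument.load(f);
            PDFTextStripper stripper = new PDFTextStripper();
            stripper.setStartPage(1);
            stripper.setEndPage(2);
            text = stripper.getText(pdfdocument);

            System.out.println(f.getName() + " length is: " + text.length() + "\n");
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            // Close the document even if extraction throws.
            if (pdfdocument != null) {
                try {
                    pdfdocument.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }

        return text;
    }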

    public static void main(String[] args){
        File file = new File(args[0]);
        System.out.println(PDFboxTest.getTxt(file));
    }
}
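
A rough diagnostic sketch, in case it helps narrow things down: the
"afii..." strings in the output look like PostScript glyph names rather
than text, which would suggest that the font in the second PDF gives the
stripper no usable Unicode mapping for those glyphs. That cause is only an
assumption and has not been confirmed for this file. The helper below uses
nothing but java.util.regex to list such tokens in text that has already
been extracted; the class name GlyphNameCheck and the method
findGlyphNames are hypothetical names made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GlyphNameCheck {
    // Assumption: leaked glyph names have the form "afii" followed by five digits,
    // as in the sample output quoted above.
    private static final Pattern AFII = Pattern.compile("afii\\d{5}");

    // Returns every "afiiNNNNN" token found in the text, in order of appearance.
    public static List<String> findGlyphNames(String text) {
        List<String> names = new ArrayList<String>();
        Matcher m = AFII.matcher(text);
        while (m.find()) {
            names.add(m.group());
        }
        return names;
    }

    public static void main(String[] args) {
        String sample = "afii52364afii62811afii62760afii62762";
        // prints [afii52364, afii62811, afii62760, afii62762]
        System.out.println(findGlyphNames(sample));
    }
}
```

Running findGlyphNames on the string returned by getTxt would show whether
a given PDF is affected: an empty list means no such tokens were found.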

-- 
Life is limited, so many unimportant things simply have to be left undone
Tech blog [研究研究]: http://www.yanjiuyanjiu.com
Personal blog [灵魂机器]: http://www.soulmachine.cn