Posted to users@pdfbox.apache.org by Jason Day <so...@gmail.com> on 2010/03/04 07:56:05 UTC
How to handle Arabic characters?
I want to extract text from a PDF file that contains some Arabic
characters. PDFTextStripper works fine with this
pdf<http://issues.apache.org/jira/secure/attachment/12390533/hello3.pdf>,
but with another
pdf<http://www.un.org/chinese/sc/committees/1267/ConsolidatedList.pdf>
it does not extract the correct Arabic characters. Instead, I get
something like "afii52364afii62811afii62760afii62762". Could anybody
figure out why? Many thanks!
My code is as follows:
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;

public class PDFboxTest {

    // Extracts the text of pages 1-2; returns "" if extraction fails.
    public static String getTxt(File f) {
        String text = "";
        PDDocument pdfdocument = null;
        try {
            pdfdocument = PDDocument.load(f);
            PDFTextStripper stripper = new PDFTextStripper();
            stripper.setStartPage(1);
            stripper.setEndPage(2);
            text = stripper.getText(pdfdocument);
            System.out.println(f.getName() + " length is: " + text.length() + "\n");
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            // Close the document even when extraction throws.
            if (pdfdocument != null) {
                try {
                    pdfdocument.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
        return text;
    }

    public static void main(String[] args) {
        File file = new File(args[0]);
        System.out.println(PDFboxTest.getTxt(file));
    }
}
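
To check quickly whether a given PDF is affected, the extracted text can be scanned for tokens like the ones quoted above. The following is only a rough sketch: the class name GlyphNameCheck and the method countLeakedGlyphNames are hypothetical, and it assumes that the bad output always has the form "afii" followed by digits, which may not cover every case. It does not use PDFBox and does not convert anything to Arabic; it only counts the suspicious tokens.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GlyphNameCheck {

    // Assumption: leaked glyph names look like "afii" followed by digits,
    // as in the sample output quoted in this post.
    private static final Pattern AFII_NAME = Pattern.compile("afii\\d+");

    // Counts how many "afiiNNNNN"-style tokens appear in the text.
    public static int countLeakedGlyphNames(String text) {
        Matcher matcher = AFII_NAME.matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String sample = "afii52364afii62811afii62760afii62762";
        System.out.println("Leaked glyph names: " + countLeakedGlyphNames(sample));
    }
}
```

A non-zero count for the string returned by getTxt would suggest that glyph names, rather than Unicode characters, ended up in the output for that file.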
--
Since life is limited, many unimportant things simply have to be left undone.
Technical blog [研究研究]: http://www.yanjiuyanjiu.com
Personal blog [灵魂机器]: http://www.soulmachine.cn