Posted to dev@pdfbox.apache.org by reinhard schwab <re...@aon.at> on 2010/09/04 13:32:36 UTC

text extraction

The text extracted with

PDDocument doc = PDDocument.load(new URL(
        "http://people.ischool.berkeley.edu/~hearst/irbook/print/chap10.pdf"));
PDFTextStripper stripper = new PDFTextStripper();
stripper.writeText(doc, new OutputStreamWriter(System.out));

comes out garbled, like this:

¡ ¢¤£¦¥¨§ª© ­®©°¯±¢²§ª³ ´¶µ¸·¹¢º© » ¥¼µ½§ff·fi¥ffifl¼´²Â
 "!$#&%ª')(+* ,-%ª.ff/0%ff132"%ff45.ff6
,-.7'84:97!;.7'< "!>=?.ª!>'fi*ª1A@B.C4®*
ACM Press
New York
Addison-Wesley
D)EGFIH J>KMLON8P$QRH ESPUTffVffWYXZE>TR[\PUQ]L_^`E>ababE>cedgfUahX;ijija
...

best regards
reinhard