Posted to users@pdfbox.apache.org by NiTiN <pu...@gmail.com> on 2008/12/10 08:27:57 UTC
how can i extract the content of my pdf file
Hi,
I don't know how to extract all the content of a given PDF file using PDFBox.
Please point me in the right direction.
Thank you,
NiTiN
Re: how can i extract the content of my pdf file
Posted by Daniel Manzke <da...@googlemail.com>.
Hi,
I use this code to extract the text of my PDF files so that I can add it to
a Lucene index:
// Note: the type and encoding parameters are not used in this method body.
public Reader extractText(InputStream stream,
                          String type,
                          String encoding) throws IOException {
    try {
        PDFParser parser = new PDFParser(new BufferedInputStream(stream));
        try {
            parser.parse();
            PDDocument document = parser.getPDDocument();
            // Collect the extracted text in memory.
            CharArrayWriter writer = new CharArrayWriter();
            PDFTextStripper stripper = new PDFTextStripper();
            stripper.setLineSeparator("\n");
            stripper.writeText(document, writer);
            return new CharArrayReader(writer.toCharArray());
        } finally {
            // Always release the document, even if parsing or stripping failed.
            try {
                PDDocument doc = parser.getPDDocument();
                if (doc != null) {
                    doc.close();
                }
            } catch (IOException e) {
                // ignore
            }
        }
    } catch (Throwable e) {
        // On any failure, log it and fall back to empty content.
        logger.log(Level.WARNING, "Failed to extract PDF text content", e);
        return new StringReader("");
    } finally {
        stream.close();
    }
}
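The method above hands back a Reader rather than a String. If you just want the text itself, a minimal sketch along these lines should work. It uses only java.io, and the class name ReaderText and the method readAll are names I made up for illustration, not part of PDFBox. In main, a CharArrayReader stands in for the result of extractText so the snippet runs without a PDF at hand.

```java
import java.io.CharArrayReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringWriter;

public class ReaderText {

    // Hypothetical helper (not a PDFBox API): drains a Reader into a String
    // and closes the Reader when done.
    public static String readAll(Reader reader) throws IOException {
        try (Reader in = reader) {
            StringWriter out = new StringWriter();
            char[] buffer = new char[4096];
            int n;
            // read() returns -1 once the end of the stream is reached.
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return out.toString();
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the Reader returned by extractText(...).
        Reader extracted = new CharArrayReader("page one\npage two\n".toCharArray());
        String text = readAll(extracted);
        System.out.print(text);
    }
}
```

Because extractText returns an empty StringReader when extraction fails, readAll would give you an empty String in that case, so you may want to check for that before indexing.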
--
Kind regards,
Daniel Manzke