
Posted to users@pdfbox.apache.org by NiTiN <pu...@gmail.com> on 2008/12/10 08:27:57 UTC

how can i extract the content of my pdf file

Hi,

I don't know how to extract all the content of a given PDF file using PDFBox.
Please point me in the right direction.


Thank you ,
NiTiN

Re: how can i extract the content of my pdf file

Posted by Daniel Manzke <da...@googlemail.com>.

Hi,
I use this code to extract the text of my PDF files so that I can add it to
the Lucene index:

    public Reader extractText(InputStream stream,
                              String type,
                              String encoding) throws IOException {
        try {
            PDFParser parser =
                new PDFParser(new BufferedInputStream(stream));
            try {
                parser.parse();
                PDDocument document = parser.getPDDocument();
                CharArrayWriter writer = new CharArrayWriter();

                PDFTextStripper stripper = new PDFTextStripper();
                stripper.setLineSeparator("\n");
                stripper.writeText(document, writer);

                return new CharArrayReader(writer.toCharArray());
            } finally {
                try {
                    PDDocument doc = parser.getPDDocument();
                    if (doc != null) {
                        doc.close();
                    }
                } catch (IOException e) {
                    // ignore
                }
            }
        } catch (Throwable e) {
            logger.log(Level.WARNING,
                       "Failed to extract PDF text content", e);
            return new StringReader("");
        } finally {
            stream.close();
        }
    }
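
The method above hands back a Reader rather than a String, which suits
Lucene but is less convenient if you just want to look at the text. Here is
a minimal sketch of draining such a Reader into a String using only the JDK.
The names ReaderDemo, readAll and fakeExtractText are made up for this
example; fakeExtractText is only a stand-in that mimics the
CharArrayWriter to CharArrayReader handoff, so no PDF is parsed here.

```java
import java.io.CharArrayReader;
import java.io.CharArrayWriter;
import java.io.IOException;
import java.io.Reader;

public class ReaderDemo {

    // Drains the given Reader into a String and closes it afterwards.
    public static String readAll(Reader reader) throws IOException {
        try (Reader in = reader) {
            StringBuilder text = new StringBuilder();
            char[] buffer = new char[4096];
            int count;
            while ((count = in.read(buffer)) != -1) {
                text.append(buffer, 0, count);
            }
            return text.toString();
        }
    }

    // Hypothetical stand-in for extractText(...): it fills a
    // CharArrayWriter and wraps the result in a CharArrayReader,
    // the same way the method above returns its result.
    public static Reader fakeExtractText() {
        CharArrayWriter writer = new CharArrayWriter();
        writer.append("first line\nsecond line");
        return new CharArrayReader(writer.toCharArray());
    }

    public static void main(String[] args) throws IOException {
        // With the real method this would be:
        // readAll(extractText(stream, type, encoding))
        System.out.println(readAll(fakeExtractText()));
    }
}
```

Note that this keeps the whole text in memory, so for very large documents
you may prefer to process the Reader in chunks instead.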


-- 
Kind regards

Daniel Manzke