Posted to users@pdfbox.apache.org by David Patterson <pa...@gmail.com> on 2017/05/11 10:47:38 UTC

Extract Text from page object?

Is it possible to
(a) iterate over the PDF by page [I believe the answer is "Yes"]
(b) extract the text from a page [Don't know]

This would allow some nice capabilities, but with an added complexity of
words that split between pages.

Thanks for the info.

Dave Patterson

Re: Extract Text from page object?

Posted by David Patterson <pa...@gmail.com>.

Toël, thank you. That helps me a lot.

The support for this project is great.

David Patterson


Re: Extract Text from page object?

Posted by Hartmann Toël <To...@elanders.com>.

(a) yes
(b) yes

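Very basic example code (file is a java.io.File):
            // imports (PDFBox 2.x): java.io.StringWriter,
            // org.apache.pdfbox.pdmodel.PDDocument, org.apache.pdfbox.text.PDFTextStripper
            String txt;
            try (PDDocument doc = PDDocument.load(file)) {  // closes doc automatically
                int nbPages = doc.getNumberOfPages();       // valid pages are 1..nbPages
                PDFTextStripper stripper = new PDFTextStripper();
                stripper.setStartPage(1);                   // first page to extract (1-based)
                stripper.setEndPage(1);                     // last page to extract (inclusive)
                StringWriter out = new StringWriter();
                stripper.writeText(doc, out);
                txt = out.toString().trim();
            }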
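
To cover (a) and (b) together, the same start/end-page calls can be moved into a loop. The following is only a sketch, assuming PDFBox 2.x on the classpath; the class and method names (PageTextExtractor, extractPages) are made up for illustration and are not part of PDFBox.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

// Hypothetical helper for illustration; not part of PDFBox.
public class PageTextExtractor {

    /** Returns one trimmed string per page, in page order. */
    public static List<String> extractPages(PDDocument doc) throws IOException {
        PDFTextStripper stripper = new PDFTextStripper();
        List<String> pages = new ArrayList<>();
        // PDFTextStripper page numbers are 1-based and the end page is inclusive.
        for (int page = 1; page <= doc.getNumberOfPages(); page++) {
            stripper.setStartPage(page);
            stripper.setEndPage(page);
            pages.add(stripper.getText(doc).trim());
        }
        return pages;
    }
}
```

With a document opened as in the example above, PageTextExtractor.extractPages(doc) would return the text of page 1 at index 0, page 2 at index 1, and so on.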

Please check the sample code included in PDFBox for better examples.
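
On the concern in the question about words that split between pages: once the text is available per page, one possible approach is to glue pages together when a page ends in a hyphenated word fragment. This is a plain-Java heuristic sketch, not a PDFBox feature; PageJoiner and its methods are hypothetical names, and it assumes each page's text is already a separate string with no running headers or footers in between. It will also wrongly merge genuine compounds that happen to break at the hyphen (e.g. "well-" / "known").

```java
import java.util.List;

// Hypothetical helper for illustration; not part of PDFBox.
public class PageJoiner {

    /**
     * Concatenates per-page texts. If a page ends with a hyphen directly
     * after a letter, the hyphen is dropped and the next non-empty page is
     * appended without a separator; otherwise pages are joined by one space.
     */
    public static String join(List<String> pages) {
        StringBuilder sb = new StringBuilder();
        boolean glue = false; // true when the previous page ended mid-word
        for (String raw : pages) {
            String page = raw.trim();
            if (page.isEmpty()) {
                continue; // skip blank pages, keeping the pending glue state
            }
            if (sb.length() > 0 && !glue) {
                sb.append(' ');
            }
            glue = endsWithHyphenatedWord(page);
            sb.append(glue ? page.substring(0, page.length() - 1) : page);
        }
        if (glue) {
            sb.append('-'); // nothing followed the fragment, so restore the hyphen
        }
        return sb.toString();
    }

    private static boolean endsWithHyphenatedWord(String s) {
        int n = s.length();
        return n >= 2 && s.charAt(n - 1) == '-' && Character.isLetter(s.charAt(n - 2));
    }
}
```

For example, joining the pages "The docu-", "ment ends here." and "Next page." gives "The document ends here. Next page.".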

Best regards
Toël Hartmann

On 11 May 2017, at 12:47, David Patterson <pa...@gmail.com> wrote:

> Is it possible to
> (a) iterate over the PDF by page [I believe the answer is "Yes"]
> (b) extract the text from a page [Don't know]
> 
> This would allow some nice capabilities, but with an added complexity of
> words that split between pages.
> 
> Thanks for the info.
> 
> Dave Patterson


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org