Posted to users@pdfbox.apache.org by David Patterson <pa...@gmail.com> on 2017/05/11 10:47:38 UTC
Extract Text from page object?
Is it possible to
(a) iterate over the PDF by page [I believe the answer is "Yes"]
(b) extract the text from a page [Don't know]
This would allow some nice capabilities, but with an added complexity of
words that split between pages.
Thanks for the info.
Dave Patterson
Re: Extract Text from page object?
Posted by David Patterson <pa...@gmail.com>.
Toël, thank you. That helps me a lot.
The support for this project is great.
David Patterson
On Thu, May 11, 2017 at 7:48 AM, Hartmann Toël <To...@elanders.com>
wrote:
> (a) yes
> (b) yes
Re: Extract Text from page object?
Posted by Hartmann Toël <To...@elanders.com>.
(a) yes
(b) yes
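Very basic example code:
import java.io.File;
import java.io.StringWriter;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;  // PDFBox 2.x package

// "file" is a java.io.File pointing at the PDF to read
StringWriter out = new StringWriter();
try (PDDocument doc = PDDocument.load(file)) {  // closes doc even on error
    int nbPages = doc.getNumberOfPages();  // upper bound for a page loop
    PDFTextStripper stripper = new PDFTextStripper();
    stripper.setStartPage(1);  // page numbers are 1-based and inclusive
    stripper.setEndPage(1);    // same value as start: extract page 1 only
    stripper.writeText(doc, out);
}
String txt = out.toString().trim();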
Please check the sample code included in PDFBox for better examples.
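On the words that split between pages mentioned in the question: once each page's text has been extracted separately (for instance by running the stripper once per page with the start and end page set to the same number), the pieces still have to be joined. The following is only a rough sketch of one possible heuristic, in plain Java with no PDFBox calls. The names PageJoiner and joinPages are made up for illustration. It assumes a page break inside a word shows up as a trailing hyphen followed by a lowercase letter on the next page, so it will also wrongly merge genuine compounds such as "well-" / "known".

```java
import java.util.List;

// Hypothetical helper, not part of PDFBox.
final class PageJoiner {
    private PageJoiner() {}

    /**
     * Joins per-page text into one string. When a page ends with
     * letter + '-' and the next page starts with a lowercase letter,
     * the hyphen is dropped and the two halves are glued together.
     * Otherwise the pages are separated by a newline.
     */
    static String joinPages(List<String> pages) {
        StringBuilder sb = new StringBuilder();
        for (String page : pages) {
            String text = page.trim();
            if (text.isEmpty()) {
                continue;  // skip blank pages
            }
            if (sb.length() == 0) {
                sb.append(text);  // first non-empty page
                continue;
            }
            int last = sb.length() - 1;
            boolean splitWord = last >= 1
                    && sb.charAt(last) == '-'
                    && Character.isLetter(sb.charAt(last - 1))
                    && Character.isLowerCase(text.charAt(0));
            if (splitWord) {
                sb.setLength(last);  // drop the trailing hyphen
            } else {
                sb.append('\n');     // ordinary page boundary
            }
            sb.append(text);
        }
        return sb.toString();
    }
}
```

To use it, collect the trimmed text of each page into a list in page order and pass that list to joinPages.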
Best regards
Toël Hartmann
---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org