Posted to users@pdfbox.apache.org by lars <lo...@comhem.se> on 2008/12/14 19:17:22 UTC

Can PDFTextStripper be configured to skip page footers, page numbers, etc?

Hello all!

Is it possible to configure a PDFTextStripper instance so that it does
not include page footer text and page numbers in the extracted text?

Thx,

Lars

Re: Can PDFTextStripper be configured to skip page footers, page numbers, etc?

Posted by Jeremias Maerki <de...@jeremias-maerki.ch>.

Actually, there is a concept in PDF to mark content semantically. It's
called Tagged PDF. But for the extraction to profit from that the input
PDF would have to have tags in the first place (which most PDFs don't).
AFAIK, PDFBox doesn't support tagged PDF, yet.

On 19.12.2008 17:31:26 Andreas Lehmkühler wrote:
> Hi Lars
> 
> > Is it possible to configure a PDFTextStripper instance so that it does
> > not include page footer text and page numbers in the extracted text?
> As far as I know, there are no special commands for the page footer,
> header or numbers. Consequently it is impossible to determine these parts
> of a page and of course impossible to exclude them.
> 
> BR
> Andreas

Jeremias Maerki

Re: Can PDFTextStripper be configured to skip page footers, page numbers, etc?

Posted by Andreas Lehmkühler <an...@lehmi.de>.

Hi Lars

> Is it possible to configure a PDFTextStripper instance so that it does
> not include page footer text and page numbers in the extracted text?
As far as I know, there are no special commands for the page footer,
header or numbers. Consequently it is impossible to determine these parts
of a page and of course impossible to exclude them.

BR
Andreas
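
Since the PDF itself carries no footer marker, one possible workaround is
to filter the text after extraction. The sketch below is only a heuristic
and not a PDFBox feature: it assumes you already have the text of each
page as a separate String, and it drops the last non-blank line of a page
when the same line (with digits ignored) closes more than half of the
pages. The class name FooterFilter and its methods are hypothetical names
chosen for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of PDFBox: removes repeated footer lines
// from text that has already been extracted page by page.
public class FooterFilter {

    // Replace digit runs with '#' so "Page 3" and "Page 12" compare equal.
    static String normalize(String line) {
        return line.trim().replaceAll("\\d+", "#");
    }

    // Index of the last line that is not blank, or -1 if there is none.
    static int lastNonBlank(String[] lines) {
        for (int i = lines.length - 1; i >= 0; i--) {
            if (!lines[i].trim().isEmpty()) {
                return i;
            }
        }
        return -1;
    }

    // Drops the last non-blank line of a page when its normalized form
    // closes at least two pages and more than half of all pages.
    static List<String> stripRepeatedFooters(List<String> pages) {
        Map<String, Integer> counts = new HashMap<>();
        List<String[]> split = new ArrayList<>();
        for (String page : pages) {
            String[] lines = page.split("\\R");
            split.add(lines);
            int last = lastNonBlank(lines);
            if (last >= 0) {
                counts.merge(normalize(lines[last]), 1, Integer::sum);
            }
        }

        List<String> result = new ArrayList<>();
        for (String[] lines : split) {
            int last = lastNonBlank(lines);
            int end = lines.length;
            if (last >= 0) {
                int n = counts.get(normalize(lines[last]));
                if (n >= 2 && n * 2 > pages.size()) {
                    end = last; // cut the footer line and anything after it
                }
            }
            result.add(String.join("\n", Arrays.copyOfRange(lines, 0, end)));
        }
        return result;
    }
}
```

The per-page strings could be collected by calling setStartPage and
setEndPage on the PDFTextStripper with the same page number before each
getText call. The heuristic will misfire on documents whose pages
genuinely end with the same sentence, so the threshold is a guess that
may need tuning for a given set of files.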