
Posted to users@pdfbox.apache.org by rahul bhalla <ur...@gmail.com> on 2013/04/18 15:19:58 UTC

ignore header and footer

Hello,
How can I ignore the header and footer of a PDF document while extracting
text? When I extract the text of the document, the footer is read as if it
were the start of the next page's content.

-- 
Regards
Rahul Bhalla

Re: ignore header and footer

Posted by Eliot Kimber <ek...@rsicms.com>.

For a general solution you must know the geometric bounds of the header and
footer. You then compare the x/y location of each text string (that is,
contiguous sequence of characters within the PDF data stream) to see if they
are within or without that boundary (with some heuristic for overlap).
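
As a rough illustration of that comparison, here is a minimal sketch in
plain Java. It assumes you have already collected each text string together
with its x/y position; the Fragment class, the 72 pt margins, and the
top-down y axis are all made up for the example and are not PDFBox API. If
you would rather not collect positions yourself, PDFBox's
PDFTextStripperByArea is worth a look, since it extracts text from named
rectangular regions of a page.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class BodyZoneFilter {

    /** Hypothetical holder for one text string and the x/y of its origin. */
    static final class Fragment {
        final String text;
        final double x;
        final double y;

        Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Keeps only the fragments whose origin lies inside the body rectangle. */
    static List<String> bodyText(List<Fragment> fragments, Rectangle2D body) {
        List<String> kept = new ArrayList<>();
        for (Fragment f : fragments) {
            if (body.contains(f.x, f.y)) {
                kept.add(f.text);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Assumed layout: 612 x 792 pt page, y measured from the top,
        // with 72 pt header and footer bands excluded from the body.
        Rectangle2D body = new Rectangle2D.Double(0, 72, 612, 648);
        List<Fragment> page = Arrays.asList(
                new Fragment("Running header", 72, 40),
                new Fragment("Body paragraph", 72, 300),
                new Fragment("Page 7", 300, 760));
        System.out.println(bodyText(page, body));
    }
}
```

This tests only the origin of each string. A string that straddles the
boundary needs the overlap heuristic mentioned above, for example comparing
the string's full bounding box against the zone.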

If the publications you're operating on are consistent in their page layout
then this can be relatively easy, but if they're not, you may need to have
humans do "zoning" on each page manually by some means (e.g., you build a
visual tool or use Acrobat to add boxes or whatever).

If the header and footer contents are consistent you may be able to
recognize them just by looking at the text content, but that depends on the
details of the documents you're processing.
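
One way to do that text-only recognition, sketched in plain Java: treat any
line that shows up on most pages as header or footer text. The sketch
assumes you already have the extracted lines for each page. The class name,
the 0.8 threshold, and the trick of replacing digit runs with '#' (so that
"Page 1" and "Page 2" compare equal) are illustrative choices, not anything
PDFBox provides.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class RepeatedLineFilter {

    /** Trims the line and replaces digit runs with '#'. */
    static String normalize(String line) {
        return line.trim().replaceAll("\\d+", "#");
    }

    /** Returns normalized lines that occur on at least minShare of the pages. */
    static Set<String> findRepeated(List<List<String>> pages, double minShare) {
        Map<String, Integer> pageCount = new HashMap<>();
        for (List<String> page : pages) {
            Set<String> seenOnPage = new HashSet<>();
            for (String line : page) {
                String key = normalize(line);
                // Count each distinct line once per page.
                if (!key.isEmpty() && seenOnPage.add(key)) {
                    pageCount.merge(key, 1, Integer::sum);
                }
            }
        }
        Set<String> repeated = new HashSet<>();
        for (Map.Entry<String, Integer> e : pageCount.entrySet()) {
            if (e.getValue() >= minShare * pages.size()) {
                repeated.add(e.getKey());
            }
        }
        return repeated;
    }

    /** Drops every line of the page whose normalized form is in repeated. */
    static List<String> strip(List<String> page, Set<String> repeated) {
        List<String> kept = new ArrayList<>();
        for (String line : page) {
            if (!repeated.contains(normalize(line))) {
                kept.add(line);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<List<String>> pages = Arrays.asList(
                Arrays.asList("Annual Report", "Intro text", "Page 1"),
                Arrays.asList("Annual Report", "More text", "Page 2"),
                Arrays.asList("Annual Report", "Final text", "Page 3"));
        Set<String> repeated = findRepeated(pages, 0.8);
        System.out.println(strip(pages.get(1), repeated));
    }
}
```

This will also drop a body line that happens to repeat on most pages, so it
only suits documents where that cannot happen or does not matter.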

Cheers,

E.

-- 
Eliot Kimber
Senior Solutions Architect, RSI Content Solutions
"Bringing Strategy, Content, and Technology Together"
Main: 512.554.9368
www.rsicms.com
www.rsuitecms.com
Book: DITA For Practitioners, from XML Press,
http://xmlpress.net/publications/dita/practitioners-1/