Posted to user@nutch.apache.org by ha...@hsbc.com on 2018/11/14 14:53:03 UTC

Block certain parts of HTML code from being indexed

Hello All,

I am using Nutch 1.15 and am wondering whether there is a feature for excluding certain parts of the HTML, such as the header and footer, from being indexed.

Kind regards,
Hany Shehata
Solutions Architect, Marketing and Communications IT
Corporate Functions | HSBC Operations, Services and Technology (HOST)
ul. Kapelanka 42A, 30-347 Kraków, Poland




RE: Block certain parts of HTML code from being indexed

Posted by Yossi Tamari <yo...@pipl.com>.
Hi Hany,

The Tika parser supports Boilerpipe for header and footer removal, but I don't know how well it works.
You can test it online at https://boilerpipe-web.appspot.com/
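
For reference, a minimal sketch of how Boilerpipe extraction might be switched on in conf/nutch-site.xml. It assumes the property names and values documented in nutch-default.xml (tika.extractor and tika.extractor.boilerpipe.algorithm) apply to your Nutch version. It also assumes HTML is actually handled by the parse-tika plugin rather than parse-html, which depends on your plugin.includes and parse-plugins.xml settings and is not shown here.

```xml
<!-- Sketch only: overrides for conf/nutch-site.xml, assuming the
     property names from nutch-default.xml are valid in your version. -->
<property>
  <name>tika.extractor</name>
  <!-- The default is "none"; "boilerpipe" turns on boilerplate removal. -->
  <value>boilerpipe</value>
</property>
<property>
  <name>tika.extractor.boilerpipe.algorithm</name>
  <!-- Documented values: DefaultExtractor, ArticleExtractor, CanolaExtractor. -->
  <value>ArticleExtractor</value>
</property>
```

Which algorithm strips headers and footers best is likely to depend on the page layout, so it is worth comparing the parsed text of a few representative pages before re-indexing.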

