Posted to user@nutch.apache.org by Ashit Patel <as...@yahoo.com> on 2005/05/24 01:50:46 UTC

How do I exclude portions of the HTML content from being indexed

Hi,

I would like to direct Nutch to exclude parts of a
page from crawling & indexing. Is there a way to do so
using special tags/configuration?

Thanks,
Ashit

Re: How do I exclude portions of the HTML content from being indexed

Posted by Andy Liu <an...@gmail.com>.
You can do this by modifying the parse-html plugin.  You'll see that
the HtmlParser makes calls to DOMContentUtils to extract the text from
the page.  Make changes to getText() to exclude any content that you
don't want.
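
As a rough illustration of the idea (not the actual Nutch source), here is a
minimal standalone sketch of a recursive text extractor that skips marked
subtrees. The class name ExcludingTextExtractor, the "noindex" class marker,
and the getText() signature are all assumptions made up for this example, and
it parses well-formed XHTML with the JDK's XML parser rather than the HTML
parser the plugin uses. In the plugin itself, the equivalent check would go
into the DOM walk inside DOMContentUtils.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class ExcludingTextExtractor {

    // Hypothetical marker: any element whose class list contains this token
    // is dropped, together with everything beneath it.
    static final String SKIP_CLASS = "noindex";

    // True if the node is an element tagged with the skip marker.
    static boolean isExcluded(Node node) {
        if (node.getNodeType() != Node.ELEMENT_NODE) {
            return false;
        }
        // getAttribute returns "" when the attribute is absent.
        String cls = ((Element) node).getAttribute("class");
        for (String token : cls.trim().split("\\s+")) {
            if (token.equals(SKIP_CLASS)) {
                return true;
            }
        }
        return false;
    }

    // Depth-first walk that appends text nodes to sb, skipping excluded subtrees.
    static void getText(StringBuilder sb, Node node) {
        if (isExcluded(node)) {
            return; // do not descend, so nothing below this node is collected
        }
        if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(text);
            }
            return;
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            getText(sb, children.item(i));
        }
    }

    // Assumption for the demo only: the input is well-formed XHTML.
    static String extract(String xhtml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
        StringBuilder sb = new StringBuilder();
        getText(sb, doc.getDocumentElement());
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body><p>Keep this.</p>"
            + "<div class=\"nav noindex\">Skip this menu.</div>"
            + "<p>Keep this too.</p></body></html>";
        System.out.println(extract(page)); // prints "Keep this. Keep this too."
    }
}
```

Returning early from the walk is what keeps the whole subtree out of the
extracted text, so nested markup inside an excluded element is dropped as
well. Note that this only affects the text that gets indexed; any link
extraction done elsewhere would need its own check.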

Andy

On 5/23/05, Ashit Patel <as...@yahoo.com> wrote:
> Hi,
> 
> I would like to direct Nutch to exclude parts of a
> page from crawling & indexing. Is there a way to do so
> using special tags/configuration?
> 
> Thanks,
> Ashit
>