Posted to dev@nutch.apache.org by Massimo Miccoli <mm...@iltrovatore.it> on 2005/09/06 11:46:06 UTC

How to skip hidden URLs inside a div tag?

Hi nutch dev,

After fetching about 100 million pages, I see many search engine spammers
who use a hidden div tag (negative position) to include many URLs
that users don't see when they access the page. These links alter the boost
(by inlink count), so I want to skip these URLs.
How can I do that?

Thanks,

Massimo


Re: How to skip hidden URLs inside a div tag?

Posted by Andrzej Bialecki <ab...@getopt.org>.
Massimo Miccoli wrote:
> Hi nutch dev,
> 
> After fetching about 100 million pages, I see many search engine spammers
> who use a hidden div tag (negative position) to include many URLs
> that users don't see when they access the page. These links alter the boost
> (by inlink count), so I want to skip these URLs.
> How can I do that?

Implement an HtmlParseFilter, similar to the creativecommons plugin. Your
plugin would then remove the matching tags from the parsed page.
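
As a rough illustration of the removal step only, here is a minimal sketch
that walks a DOM tree and drops div elements whose inline style looks hidden.
It uses plain JDK DOM classes rather than the Nutch HtmlParseFilter interface
(its signature is not shown in this thread), so it would still have to be
wrapped in a real plugin. The class name HiddenDivStripper, its methods, and
the style heuristic are all hypothetical. It also assumes lowercase element
names, and it cannot see styles that come from an external stylesheet.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class HiddenDivStripper {

    // Heuristic (an assumption, not a complete spam detector): an inline
    // style with a negative left/top/text-indent value, display:none, or
    // visibility:hidden is treated as "hidden".
    private static final Pattern HIDDEN_STYLE = Pattern.compile(
        "(?i)(?:left|top|text-indent)\\s*:\\s*-\\s*\\d"
        + "|display\\s*:\\s*none|visibility\\s*:\\s*hidden");

    static boolean looksHidden(Element e) {
        // getAttribute returns "" when the attribute is absent.
        return HIDDEN_STYLE.matcher(e.getAttribute("style")).find();
    }

    /** Removes hidden-looking div elements; returns how many were removed. */
    static int stripHiddenDivs(Document doc) {
        NodeList divs = doc.getElementsByTagName("div");
        // The NodeList is live, so collect matches first and remove afterwards.
        List<Element> doomed = new ArrayList<>();
        for (int i = 0; i < divs.getLength(); i++) {
            Element div = (Element) divs.item(i);
            if (looksHidden(div)) {
                doomed.add(div);
            }
        }
        int removed = 0;
        for (Element div : doomed) {
            Node parent = div.getParentNode();
            if (parent != null) {
                parent.removeChild(div);
                removed++;
            }
        }
        return removed;
    }

    /** Parses a well-formed XHTML string (demo helper only). */
    static Document parse(String xhtml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(xhtml)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(
            "<html><body>"
            + "<div style=\"position:absolute; left:-9999px\">"
            + "<a href=\"http://spam.example/\">spam</a></div>"
            + "<div><a href=\"http://ok.example/\">ok</a></div>"
            + "</body></html>");
        int removed = stripHiddenDivs(doc);
        int linksLeft = doc.getElementsByTagName("a").getLength();
        System.out.println("removed divs: " + removed + ", links left: " + linksLeft);
    }
}
```

Because the links inside a removed div disappear from the tree, any outlink
extraction that runs on the cleaned tree would no longer see them.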

In fact, if you have some spare cycles, you could implement a more 
generic "html cleanup" plugin, where you could specify a list of XPaths 
to match (and optionally replace).
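
A sketch of what the core of such a generic cleanup could look like, using
the standard javax.xml.xpath API from the JDK. The class XPathCleaner and the
example expressions are hypothetical, the sketch only removes matches (the
optional replace step is left out), and hooking it into Nutch as a plugin
with a configurable expression list is not shown.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathCleaner {

    private final List<String> expressions;

    public XPathCleaner(List<String> expressions) {
        this.expressions = new ArrayList<>(expressions);
    }

    /** Removes every node matched by any expression; returns the count. */
    public int clean(Document doc) throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        int removed = 0;
        for (String expr : expressions) {
            NodeList hits =
                (NodeList) xpath.evaluate(expr, doc, XPathConstants.NODESET);
            // Copy the matches before mutating the tree.
            List<Node> nodes = new ArrayList<>();
            for (int i = 0; i < hits.getLength(); i++) {
                nodes.add(hits.item(i));
            }
            for (Node n : nodes) {
                Node parent = n.getParentNode();
                // Attribute nodes have no parent and are skipped here.
                if (parent != null) {
                    parent.removeChild(n);
                    removed++;
                }
            }
        }
        return removed;
    }

    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(
                "<html><body>"
                + "<div style=\"position: absolute; left: -5000px\">"
                + "<a href=\"http://spam.example/\">spam</a></div>"
                + "<div><a href=\"http://ok.example/\">ok</a></div>"
                + "</body></html>")));

        // Example expressions (assumptions): translate(..., ' ', '') strips
        // spaces so "left: -5000px" and "left:-5000px" both match.
        XPathCleaner cleaner = new XPathCleaner(Arrays.asList(
            "//div[contains(translate(@style, ' ', ''), 'left:-')]",
            "//div[contains(translate(@style, ' ', ''), 'display:none')]"));

        System.out.println("removed nodes: " + cleaner.clean(doc));
    }
}
```

Keeping the expressions in a list means new spam patterns can be handled by
adding an XPath rather than changing code.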

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com