Posted to user@nutch.apache.org by Cam Bazz <ca...@gmail.com> on 2011/07/19 19:10:49 UTC

selective crawl

Hello,

If I identify certain pages as pages of interest in the parse-html
plugin, how can I index only the pages I mark as interesting and
exclude the rest? I still need to be able to extract outlinks from
the pages that are not of interest.

What would be the correct approach to do that?

Best.

Re: selective crawl

Posted by Markus Jelsma <ma...@openindex.io>.
So you still want to crawl and parse every page (for outlinks) but index only
some of them. One option is to use a parse filter to mark a page as interesting
(for example, by adding a flag to its metadata) and to write an indexing filter
that indexes a page only when that mark is present.
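
A minimal sketch of that two-step idea, in plain Java so it runs without
Nutch on the classpath. Everything in it is a hypothetical stand-in: the
ParsedPage class, the "interesting" metadata key, and the keyword test used
to decide what is interesting are assumptions for illustration, not Nutch
classes. In Nutch itself the corresponding extension points would be
HtmlParseFilter and IndexingFilter, whose exact signatures vary by version
and are not reproduced here.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for illustration only; these are NOT Nutch classes.
final class SelectiveIndexing {

    // Assumed metadata key; any key agreed on by both filters would do.
    static final String INTERESTING_KEY = "interesting";

    /** Simplified parse result: page text, outlinks, and free-form metadata. */
    static final class ParsedPage {
        final String url;
        final String text;
        final List<String> outlinks;
        final Map<String, String> metadata = new HashMap<>();

        ParsedPage(String url, String text, List<String> outlinks) {
            this.url = url;
            this.text = text;
            this.outlinks = new ArrayList<>(outlinks);
        }
    }

    /**
     * Parse-filter step: flags interesting pages in the metadata.
     * Outlinks are left untouched, so every page still feeds the crawl.
     */
    static ParsedPage markIfInteresting(ParsedPage page) {
        // Assumed criterion: the page text mentions "product".
        if (page.text.contains("product")) {
            page.metadata.put(INTERESTING_KEY, "true");
        }
        return page;
    }

    /**
     * Indexing-filter step: builds the index document for flagged pages
     * and returns null for all others, meaning "do not index this page".
     */
    static Map<String, String> toIndexDocument(ParsedPage page) {
        if (!"true".equals(page.metadata.get(INTERESTING_KEY))) {
            return null;
        }
        Map<String, String> doc = new HashMap<>();
        doc.put("url", page.url);
        doc.put("content", page.text);
        return doc;
    }

    public static void main(String[] args) {
        List<ParsedPage> pages = List.of(
            new ParsedPage("http://example.com/", "category listing",
                           List.of("http://example.com/item/1")),
            new ParsedPage("http://example.com/item/1", "product details",
                           List.of()));

        for (ParsedPage page : pages) {
            markIfInteresting(page);
            boolean indexed = toIndexDocument(page) != null;
            System.out.println(page.url + " indexed=" + indexed
                + " outlinks=" + page.outlinks.size());
        }
    }
}
```

The point of the split is that the decision is made once, at parse time, and
only consulted at index time. Pages that are not flagged are still parsed, so
their outlinks remain available to the crawl.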
