Posted to user@nutch.apache.org by Cam Bazz <ca...@gmail.com> on 2011/07/19 19:10:49 UTC
selective crawl
Hello,
If I identify certain pages as pages of interest in the parse-html
plugin, how can I index only the pages I mark as interesting and
exclude the rest? I still need to be able to extract outlinks from
the pages that are not of interest.
What would be the correct approach to do that?
Best.
Re: selective crawl
Posted by Markus Jelsma <ma...@openindex.io>.
So you still want to crawl and parse every page (for outlinks) but index only
some of them. You could use a parse filter to mark a page as interesting (for
example by adding a flag to its metadata) and write an indexing filter that
indexes pages conditionally, based on that mark.
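
A rough sketch of that two-step idea, in plain Java so it runs without Nutch
on the classpath. Everything in it is a stand-in: Page, MARK_KEY, parseFilter
and indexFilter are hypothetical names, and the "contains product" rule is
only a placeholder for whatever makes a page interesting. In a real plugin
the two methods would presumably become implementations of the HtmlParseFilter
and IndexingFilter extension points, and as far as I know returning null from
an indexing filter is how a document gets skipped, but check that against
your Nutch version.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Standalone model of "mark at parse time, decide at index time".
// None of these types are Nutch classes; they only mirror the idea.
class SelectiveIndexSketch {

    // Hypothetical metadata key; any name works if both filters agree on it.
    static final String MARK_KEY = "interesting";

    // Stand-in for a parsed page: text, outlinks and parse metadata.
    static class Page {
        final String url;
        final String text;
        final List<String> outlinks;
        final Map<String, String> parseMeta = new HashMap<>();

        Page(String url, String text, List<String> outlinks) {
            this.url = url;
            this.text = text;
            this.outlinks = outlinks;
        }
    }

    // Parse-filter step: runs for every page and only sets the mark.
    // Outlinks are left untouched, so uninteresting pages still feed the crawl.
    static Page parseFilter(Page page) {
        if (page.text.contains("product")) { // placeholder rule, an assumption
            page.parseMeta.put(MARK_KEY, "true");
        }
        return page;
    }

    // Indexing-filter step: returns null to skip pages without the mark.
    static Map<String, String> indexFilter(Page page) {
        if (!"true".equals(page.parseMeta.get(MARK_KEY))) {
            return null; // not marked, so do not index
        }
        Map<String, String> doc = new HashMap<>();
        doc.put("url", page.url);
        doc.put("content", page.text);
        return doc;
    }

    public static void main(String[] args) {
        List<Page> pages = Arrays.asList(
            new Page("http://example.com/", "a hub of links",
                     Arrays.asList("http://example.com/p1")),
            new Page("http://example.com/p1", "a product page",
                     Collections.<String>emptyList()));

        List<String> frontier = new ArrayList<>();           // outlinks to follow
        List<Map<String, String>> index = new ArrayList<>(); // indexed documents

        for (Page page : pages) {
            parseFilter(page);
            frontier.addAll(page.outlinks);  // collected for every page
            Map<String, String> doc = indexFilter(page);
            if (doc != null) {
                index.add(doc);              // only marked pages get here
            }
        }
        System.out.println("indexed=" + index.size()
            + " outlinks=" + frontier.size());
    }
}
```

The point of splitting it this way is that the decision about what is
interesting lives in the parse step, where the page content is available,
while the crawl itself (fetch, parse, outlink extraction) is not affected
by the mark at all.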