Posted to user@nutch.apache.org by Daniel Naber <lu...@danielnaber.de> on 2007/12/13 00:35:51 UTC

continuous crawling?

Hi,

How do people use Nutch to crawl continuously? Do you run the "recrawl"
script from the Wiki as a cron job? I'd prefer a process that runs
forever and keeps the index mostly up to date at all times.
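
To make the "runs forever" idea concrete, here is a minimal sketch of such
a loop in Python. It is not part of Nutch: the function names, the pause
length and the injectable runner are assumptions for illustration, and the
commands of one crawl cycle (for example the steps the "recrawl" script
performs) have to be supplied by the caller.

```python
import subprocess
import time


def run_cycle(commands, runner=subprocess.call):
    """Run one crawl cycle: each command in order, stopping at the first failure.

    `commands` is a list of argv lists supplied by the caller; nothing here
    is specific to Nutch. `runner` must return the command's exit code.
    """
    for argv in commands:
        if runner(argv) != 0:
            return False
    return True


def crawl_continuously(commands, pause_seconds=60.0, max_cycles=None,
                       runner=subprocess.call, sleep=time.sleep):
    """Repeat the crawl cycle, pausing after each one.

    Loops forever when max_cycles is None. Returns a tuple
    (completed, failed) once max_cycles cycles have been attempted.
    """
    completed = 0
    failed = 0
    while max_cycles is None or completed + failed < max_cycles:
        if run_cycle(commands, runner=runner):
            completed += 1
        else:
            failed += 1
        # A failed cycle does not end the loop; the next cycle retries.
        sleep(pause_seconds)
    return completed, failed
```

Compared with a cron job, a single long-running loop cannot start a new
cycle while the previous one is still running, which is the main design
reason to prefer it here.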

In my case, I'm not trying to index the complete web but only interesting
sites. Whether a site is interesting will be decided during crawling by a
plugin I'm planning to write. Does anybody have experience with this kind
of use case? As I understand it, I'll need to modify the Generator class
so that it completely ignores a page (and its links) if my plugin
considers the page irrelevant.
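
The intended behaviour (drop an irrelevant page together with its links)
can be sketched independently of Nutch. The Python below is only an
illustration under stated assumptions: `fetch` and `is_relevant` are
hypothetical callables standing in for the fetcher and the planned plugin,
and none of this is the Generator class or any other Nutch API.

```python
from collections import deque


def focused_crawl(seeds, fetch, is_relevant, max_pages=1000):
    """Breadth-first crawl that drops irrelevant pages along with their links.

    fetch(url) returns (text, outlinks) and is_relevant(url, text) returns
    a bool. Both are hypothetical callables supplied by the caller.
    Returns the list of URLs that were kept, in the order they were visited.
    """
    seeds = list(seeds)
    queue = deque(seeds)
    seen = set(seeds)
    kept = []
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        text, outlinks = fetch(url)
        fetched += 1
        if not is_relevant(url, text):
            # Irrelevant page: do not keep it and do not queue its outlinks.
            continue
        kept.append(url)
        for link in outlinks:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return kept
```

Note that a page reachable only through an irrelevant page is never
fetched in this sketch, even if it would have been relevant itself.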

Regards
 Daniel

-- 
http://www.danielnaber.de