Posted to user@nutch.apache.org by Daniel Naber <lu...@danielnaber.de> on 2007/12/13 00:35:51 UTC
continuous crawling?
Hi,
How do people use Nutch to crawl continuously? Do you use the "recrawl"
script from the Wiki and start it via a cron job? I'd prefer a process that
runs forever and keeps the index mostly up to date at all times.
In my case, I'm not trying to index the complete web, only interesting
sites. Whether a site is interesting will be decided during crawling by a
plugin I'm planning to write. Does anybody have experience with this kind
of use case? As far as I understand, I'll need to modify the Generator class
so that it completely ignores a page (and its links) if my plugin
considers the page irrelevant.
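
To make the pruning idea concrete, here is a minimal sketch in Python. It is
not Nutch code and does not use the Generator class or any Nutch API. The
names focused_crawl, fetch_links, is_relevant and the toy GRAPH are all
hypothetical, and is_relevant stands in for the planned plugin. It only shows
the intended behaviour: an irrelevant page is dropped, and its outlinks are
never queued.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List, Set


def focused_crawl(
    seeds: Iterable[str],
    fetch_links: Callable[[str], List[str]],
    is_relevant: Callable[[str], bool],
) -> List[str]:
    """Breadth-first crawl that drops irrelevant pages and their outlinks."""
    queue: deque = deque()
    seen: Set[str] = set()
    for seed in seeds:
        if seed not in seen:
            seen.add(seed)
            queue.append(seed)

    kept: List[str] = []
    while queue:
        url = queue.popleft()
        if not is_relevant(url):
            # Irrelevant page: do not keep it and do not follow its links.
            continue
        kept.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return kept


# Toy in-memory link graph standing in for the web (hypothetical data).
GRAPH: Dict[str, List[str]] = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["e"],
    "d": [],
    "e": [],
}

# Pretend the relevance check (the hypothetical plugin) rejects page "c".
result = focused_crawl(["a"], lambda u: GRAPH.get(u, []), lambda u: u != "c")
print(result)
```

In this toy run, page "e" is never visited because it is only reachable
through the rejected page "c". A real setup would also have to decide what
happens when a pruned page is linked from a relevant page later on.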
Regards
Daniel
--
http://www.danielnaber.de