Posted to user@nutch.apache.org by Bent Hugh <be...@gmail.com> on 2007/12/31 06:09:49 UTC

How to effectively manage crawl and recrawl?

I need to know a few things about how to manage a Nutch crawl.

1. I have done a full crawl in which all possible intranet sites have
been discovered and indexed. I don't want to lose this index. Instead,
I want to update the same index by recrawling these sites, so that if
any page has changed, the content for that URL in the index is
updated. Is this possible?

2. During the recrawl of the same set of websites, if the crawler
finds links to new pages (on the same website or on other websites)
that are not currently present in the index, they should also be
fetched and added to the index. Is this possible?