Posted to user@nutch.apache.org by pavankumar <ma...@gmail.com> on 2007/12/24 07:02:34 UTC

To avoid recrawl to index unchanged content.

Hi,
    I can crawl successfully with the Nutch 0.9 API by following the steps
in the documentation. But when I re-crawl, Nutch also indexes the URLs/files
that have not changed. How can I make Nutch index only the content that has
changed? I cannot assume which files/URLs have changed during a given
period, so I need to fetch all of them. But I want to index only those
files/URLs whose content has changed since the last crawl, so that the
re-crawl time is reduced. At the moment my re-crawl takes longer than a
fresh crawl. How can I reduce the time spent re-crawling? Is it better to
do a fresh crawl every time or to re-crawl?
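
To make the behaviour I am after concrete, here is a rough sketch in plain
Java. It is only an illustration and does not use the Nutch API: the class
ChangeDetector and its methods md5Hex and shouldIndex are hypothetical
names, and it assumes the fetched content is available as a string and that
the signatures from the previous crawl can be kept in memory. Every URL is
still fetched, but only content whose digest differs from the previous
crawl would be passed on to indexing.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper for illustration only; not part of Nutch. */
class ChangeDetector {
    // URL -> hex MD5 digest of the content seen on the previous crawl.
    private final Map<String, String> lastSignatures = new HashMap<>();

    /** Computes a lowercase hex MD5 digest of the given content. */
    static String md5Hex(String content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(content.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    /**
     * Returns true if the URL is new or its content differs from the last
     * crawl, and records the new signature for the next comparison.
     */
    boolean shouldIndex(String url, String content) {
        String signature = md5Hex(content);
        String previous = lastSignatures.put(url, signature);
        return !signature.equals(previous);
    }
}
```

With something like this, a re-crawl would call shouldIndex once per fetched
URL and skip indexing whenever it returns false.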