
Posted to dev@nutch.apache.org by tittutomen <su...@gmail.com> on 2009/10/14 12:58:58 UTC

Recrawl Strategy with Nutch!

We have crawled a million URLs and we want to continuously recrawl these
sites for updates.

The DFS cluster has 4 machines, with 1 master and 4 slaves. Crawling the
1 million sites took around 10 days.
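
For reference, the crawl rate implied by these figures can be worked out
with a short sketch. It assumes exactly 1,000,000 URLs, exactly 10 days, and
the 4 slaves sharing the work evenly; the name crawl_throughput is only
illustrative and is not part of Nutch.

```python
def crawl_throughput(total_urls, days, workers):
    """Return rough crawl rates, assuming work is spread evenly over workers."""
    seconds = days * 24 * 60 * 60
    per_second = total_urls / seconds
    return {
        "urls_per_day": total_urls / days,
        "urls_per_second": per_second,
        "urls_per_second_per_worker": per_second / workers,
    }


# Assumed round numbers taken from the figures above.
stats = crawl_throughput(total_urls=1_000_000, days=10, workers=4)
for name, value in stats.items():
    print(f"{name}: {value:.2f}")
```

That is roughly 100,000 URLs per day, or a little over one URL per second
across the whole cluster, so a full recrawl at the same rate would again take
about 10 days.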

What recrawl strategy would let us pick up the updates quickly? How can we
optimize the Nutch recrawl script so that frequently changing sites are
recrawled quickly and the index is rebuilt?

Could we build the index incrementally from the crawl db in some way?

Please suggest.
