Posted to user@nutch.apache.org by carmmello <ca...@globo.com> on 2005/05/23 23:59:48 UTC

Maintenance of Nutch: crawl everything again?

Hi,
I think the nutch-user mailing list is becoming a very useful tool for people interested in running Nutch.
For that reason, I am asking a question whose answer I could not find, neither in the tutorial (including Stefan's) nor in the mailing list. It has to do with the maintenance of the crawled (and indexed) sites once their expiry time comes; this is really not clear to me.

For testing purposes, I indexed about 3,000 sites with an expiry time of just 1 day (set in the site.xml configuration file). After that 1 day, I ran the command "bin/nutch generate db segments" with the single option "-refetchonly". When I then fetched the generated segment, I got about 30,000 sites. So I really don't know how to keep up to date, on a regular basis, the sites that actually matter for a specific search purpose.
So, my question is: to maintain the data, is it necessary to repeat the whole crawl process from scratch, or is there some quicker way that fetches only the pages modified or added since the original crawl?
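For reference, the refetch cycle being asked about usually looks like the sequence below. This is only a sketch based on the Nutch 0.x command-line tools: the `db` and `segments` paths are the ones used in the question above, and the segment-discovery step assumes segment directories are named by timestamp, which may differ in your setup.

```shell
# Sketch of one periodic refetch pass (Nutch 0.x-era commands; adjust paths to your layout).
# Assumes the web db lives in ./db and segments in ./segments, as in the question.

# 1. Generate a fetchlist containing only pages whose refetch interval has expired.
bin/nutch generate db segments -refetchonly

# 2. Locate the newly created segment (assumed here to be the newest directory under segments/).
segment=`ls -d segments/* | tail -1`

# 3. Fetch just those due pages into the new segment.
bin/nutch fetch $segment

# 4. Fold the fetch results back into the web db so link data stays current.
bin/nutch updatedb db $segment

# 5. Index the new segment so searches see the refreshed content.
bin/nutch index $segment
```

Run on a schedule (e.g. from cron), a pass like this refetches only expired pages rather than recrawling everything; whether the 30,000-page fetchlist described above is expected depends on how many discovered-but-unfetched links the generate step also includes.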
Thanks