Posted to user@nutch.apache.org by Yoni Amir <yo...@targetize.com> on 2006/12/04 12:24:34 UTC

Re: Re-crawl

I am struggling with the same questions. I don't understand how Nutch
decides whether to re-fetch content that has not been updated, or how and
where to configure that behavior.

Any help will be greatly appreciated :)

Yoni

On Mon, 2006-11-27 at 07:27 -0800, karthik085 wrote:
> The first time I let Nutch crawl, if some URLs are not fetched, Nutch reports
> an error in the log file. Is there a way for Nutch to re-crawl and update only
> the affected (non-fetched) URLs, without doing any operations on the valid ones?
> 
> Also, suppose I want to re-crawl the same website after a few days or months,
> and some of its content has been updated while some has not. What does
> Nutch do in this case? What operations does it perform for the
> 1. updated content
> 2. not-updated content
> in the current database (the local database from the previous crawl)?
> 
> Does it just get the updated contents? Does it get all?
> 
> If Nutch gets everything (updated and non-updated), is there a way to
> ask Nutch to get only the updated content?
>