Posted to user@nutch.apache.org by Ali Nazemian <al...@gmail.com> on 2014/06/01 16:46:38 UTC

Incremental crawling with nutch

Hi everybody,
I am going to use Nutch to crawl some news websites. These sites are
updated regularly, so I need to re-crawl them at least every 2 hours.
The problem is that I want the re-crawl to be incremental: Nutch should
fetch only URLs that are new and have not been fetched before (except
the main page of each site, which must be re-fetched to extract new
URLs). In each re-crawl, only the newly fetched URLs should be sent to
Solr for indexing. Could somebody guide me through this scenario with
Nutch 1.8?
Best regards.

-- 
A.Nazemian