Posted to dev@nutch.apache.org by vivek <vi...@gmail.com> on 2013/02/28 08:26:38 UTC

Crawling

Hi,

I am Vivek and I am working on Nutch. I have a few doubts regarding crawling:

1) When a page is fetched and becomes due for re-fetching after 30 days
(the default), what happens if the page no longer exists on the web?
Is it removed from the crawldb and segments, or does it still remain
there? If a page is stale and I want to remove it from my crawled data,
how can I do that?
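For the 404 case, would something like the following in conf/nutch-site.xml
be the right approach? I am assuming the db.update.purge.404 property from
nutch-default.xml is what controls this, so that updatedb drops entries
whose status has become db_gone:

    <!-- sketch, assuming db.update.purge.404 purges gone pages -->
    <property>
      <name>db.update.purge.404</name>
      <value>true</value>
      <description>If true, updatedb removes entries with status
      db_gone from the crawldb.</description>
    </property>

As far as I understand, this would only clean the crawldb; the old
segments would still hold the fetched content until they are deleted
or merged. Is that right?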


2) How do I refresh a crawl? Suppose I have crawled 100,000 URLs; after
the 5th depth I want fetching to start again from the beginning without
stopping the process, i.e. a continuous crawl from depth 1 to 5,
repeating cyclically.
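What I have in mind is roughly the loop below, driving the standard 1.x
commands by hand (the paths and the -topN value are placeholders, not my
real settings, and it assumes the crawldb has already been injected with
seed URLs):

    # sketch: endless crawl cycles of depth 5 using the standard commands
    while true; do
      for depth in 1 2 3 4 5; do
        bin/nutch generate crawl/crawldb crawl/segments -topN 1000
        segment=$(ls -d crawl/segments/* | tail -1)   # newest segment
        bin/nutch fetch "$segment"
        bin/nutch parse "$segment"
        bin/nutch updatedb crawl/crawldb "$segment"
      done
    done

Is a wrapper script like this the expected way to do it, or does Nutch
provide a built-in option for such cyclic re-crawling?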


-- 
Thanks and Regards,
VIVEK KOUL