Posted to user@nutch.apache.org by Emmanuel JOKE <jo...@gmail.com> on 2007/07/05 15:56:55 UTC

Recrawling question

I have crawled a few websites. Everything worked fine, and now I have one
crawldb, plus segments, a linkdb and indexes.

I have decided to recrawl those websites, so I will:
- generate a list of urls from my existing crawldb
- fetch this segment
- update the existing crawldb
- invert links and finally index.
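
As a rough sketch, the first three steps could look like the shell function
below. The directory layout (crawldb and segments under one crawl folder), the
NUTCH variable and the function name recrawl are assumptions made for
illustration, not something fixed by Nutch. The invertlinks and index calls are
left out on purpose, since which segments to pass to them is exactly the open
question.

```shell
#!/bin/sh
# Sketch of one recrawl cycle (generate, fetch, updatedb).
# Assumed layout: <crawl>/crawldb and <crawl>/segments.
NUTCH=${NUTCH:-bin/nutch}   # assumed path to the nutch launcher script

recrawl() {
  crawl=$1

  # 1. Generate a fetch list from the existing crawldb.
  #    This creates a new, timestamp-named directory under segments/.
  $NUTCH generate "$crawl/crawldb" "$crawl/segments" || return 1

  # Segment names are timestamps, so the last one in sorted order is the newest.
  segment=$(ls -d "$crawl"/segments/* | sort | tail -n 1)

  # 2. Fetch the newly generated segment.
  $NUTCH fetch "$segment" || return 1

  # 3. Update the existing crawldb with the results of the fetch.
  $NUTCH updatedb "$crawl/crawldb" "$segment" || return 1

  # 4. invertlinks and index would follow here (arguments not shown).
}
```

It would be called as `recrawl crawl`, where `crawl` is the folder that holds
the crawldb and segments.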

Should I create a new index and linkdb from all the segments in the folder
(the segments from my first crawl plus the segments from this crawl)?
Or should I run the index (or invertlinks) command against the existing
index (or existing linkdb) and specify only the new segments that
have just been crawled?