Posted to user@nutch.apache.org by John Thompson <jo...@gmail.com> on 2008/07/08 18:58:43 UTC

Crawling the internet and adding to the index over time

Hi,

Maybe I'm missing something very obvious, but I've been trying to figure
this out for a full day now and haven't made a lot of progress.  I've tried
reading through a lot of the code behind the commands below, but the Hadoop
framework kind of hurts readability.  Anyway, here's what I've done:

   1. inject
   2. *loop on these:*
      1. generate
      2. fetch
      3. updatedb
   3. invertlinks
   4. index
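
For concreteness, here is roughly that sequence as I ran it (the "urls" seed
directory, the depth of 3, and the ls/tail trick for grabbing the newest
segment are approximations from memory, not my exact shell history):

  nutch inject crawl_dir/crawldb urls

  # loop on generate/fetch/updatedb; 3 rounds here is arbitrary
  for i in 1 2 3; do
    nutch generate crawl_dir/crawldb crawl_dir/segments
    segment=`ls -d crawl_dir/segments/* | tail -1`   # newest segment
    nutch fetch $segment
    nutch updatedb crawl_dir/crawldb $segment
  done

  nutch invertlinks crawl_dir/linkdb -dir crawl_dir/segments
  nutch index crawl_dir/indexes crawl_dir/crawldb crawl_dir/linkdb \
    crawl_dir/segments/*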

All I want to do is, at some point after finishing step 4, return to step 2,
loop a few more times, and then add the new results to my old index.  The
recrawl scripts posted on the Nutch wiki are obsolete.  There are 4 separate
merge commands - is the actual "merge" command, which just merges indexes, the
only one I need?
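
For reference, the four merge-type commands I see in the bin/nutch usage
listing are below; my (possibly wrong) understanding of what each one operates
on is in the comments:

  nutch merge       <output_index>   <indexes ...>    # merges Lucene indexes
  nutch mergedb     <output_crawldb> <crawldbs ...>   # merges crawl databases
  nutch mergelinkdb <output_linkdb>  <linkdbs ...>    # merges link databases
  nutch mergesegs   <output_dir>     <segments ...>   # merges segments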

My intuition is that I can just follow steps 1-3 and then do:

nutch index crawl_dir/new_indexes crawl_dir/crawldb crawl_dir/linkdb crawl_dir/segments*
nutch merge crawl_dir/merged_indexes crawl_dir/indexes crawl_dir/new_indexes
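
Spelled out, the recrawl I have in mind would look roughly like this (only a
single extra generate/fetch/updatedb round is shown; new_indexes and
merged_indexes are just names I picked, and I'm not sure whether index should
be handed all the segments again or only the new one):

  # pick up new/changed pages
  nutch generate crawl_dir/crawldb crawl_dir/segments
  segment=`ls -d crawl_dir/segments/* | tail -1`
  nutch fetch $segment
  nutch updatedb crawl_dir/crawldb $segment

  # rebuild the link database, index into a fresh directory, then merge
  # the new index with the existing one
  nutch invertlinks crawl_dir/linkdb -dir crawl_dir/segments
  nutch index crawl_dir/new_indexes crawl_dir/crawldb crawl_dir/linkdb \
    crawl_dir/segments/*
  nutch merge crawl_dir/merged_indexes crawl_dir/indexes crawl_dir/new_indexes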

Running those index and merge commands creates these two files:

segments_2  segments.gen

But what I expected to see is a file named index or index.done.  Does anyone
have any advice?

Best,
John