Posted to user@nutch.apache.org by Enzo Michelangeli <en...@gmail.com> on 2007/06/11 02:58:53 UTC

Incremental indexing

As the size of my data keeps growing, and the indexing time grows even
faster, I'm trying to switch from a "reindex everything at every crawl" model
to an incremental one. I intend to keep the segments separate, but to index
only the segment fetched during the last cycle, and then merge the new index
(and perhaps the linkdb) into the existing ones.
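The per-cycle flow I have in mind is roughly the following (untested;
$latest_segment is just a placeholder for the segment fetched in the current
cycle, and I'm assuming the usual index/crawldb/linkdb/segment arguments to
"bin/nutch index"):

  # index only the newest segment against the current crawldb and linkdb
  new_indexes=$crawl_dir/new_indexes
  $nutch_dir/nutch index $new_indexes $crawl_dir/crawldb $crawl_dir/linkdb \
      $crawl_dir/segments/$latest_segment

  # ...then fold $new_indexes into the existing $index_dir (see question 2)

I have a few questions: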

1. In an incremental scenario, how do I remove references to expired
segments from the indexes?

2. Looking at http://wiki.apache.org/nutch/MergeCrawl, it appears that I can
call "bin/nutch merge" with only two parameters: the original index directory
as the destination, and the directory to be merged into it:

  $nutch_dir/nutch merge $index_dir $new_indexes

But when I do that, the merged data end up in a subdirectory called
$index_dir/merge_output. Shouldn't I instead create a new, empty destination
directory, do the merge, and then replace the original with the newly merged
directory? For example:

  merged_indexes=$crawl_dir/merged_indexes
  rm -rf $merged_indexes # just in case it's already there
  $nutch_dir/nutch merge $merged_indexes $index_dir $new_indexes
  rm -rf $index_dir.old # just in case it's already there
  mv $index_dir $index_dir.old
  mv $merged_indexes $index_dir
  rm -rf $index_dir.old

3. Regarding the linkdb: does it make sense to run "$nutch_dir/nutch
invertlinks" on the latest segment only, and then merge the resulting linkdb
into the current one with "$nutch_dir/nutch mergelinkdb", rather than
recreating the linkdb from the whole set of segments every time? In other
words, can invertlinks work incrementally, or does it need a view of all
segments in order to work correctly?
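Concretely, something along these lines per cycle (again untested, with
$latest_segment as a placeholder, and assuming mergelinkdb takes the output
linkdb first, followed by the linkdbs to merge):

  # build a small linkdb from the newest segment only
  new_linkdb=$crawl_dir/new_linkdb
  $nutch_dir/nutch invertlinks $new_linkdb $crawl_dir/segments/$latest_segment

  # merge it with the existing linkdb into a fresh directory, then swap
  merged_linkdb=$crawl_dir/merged_linkdb
  $nutch_dir/nutch mergelinkdb $merged_linkdb $crawl_dir/linkdb $new_linkdb
  mv $crawl_dir/linkdb $crawl_dir/linkdb.old
  mv $merged_linkdb $crawl_dir/linkdb
  rm -rf $crawl_dir/linkdb.old $new_linkdb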

Thanks,

Enzo