Posted to user@nutch.apache.org by Jacob Brunson <ja...@gmail.com> on 2006/09/16 04:11:13 UTC

Recrawl Script segment merging

I'm looking over the Intranet Recrawl script here:
http://wiki.apache.org/nutch/IntranetRecrawl
and I'm a little confused about segment merging and deleting.

####Start code snip####
# Merge segments and cleanup unused segments
mergesegs_dir=$crawl_dir/mergesegs_dir
$nutch_dir/nutch mergesegs $mergesegs_dir -dir $segments_dir

for segment in `ls -d $segments_dir/* | tail -$depth`
do
  echo "Removing Temporary Segment: $segment"
  rm -rf $segment
done

cp -R $mergesegs_dir/* $segments_dir
rm -rf $mergesegs_dir
####End code snip####

As I understand it, this merges ALL segments into one new segment,
deletes the NEW segments created by the recrawl, and then adds the
merged segment alongside the existing ones.

For example, if I had existing segment1 and segment2,
then the recrawl creates segment3 and segment4,
then we merge all the segments into mergedsegment1-2-3-4,
then delete the new segment3 and segment4,
and copy mergedsegment1-2-3-4 back into the segments dir,
so that the segments dir now holds segment1, segment2, and
mergedsegment1-2-3-4 (which itself already contains the data from
segment1 and segment2).
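
To check my reading of the example above, here is a dry run of the same cleanup steps on empty directories. It is only a sketch: Nutch is not involved, the mergesegs call is replaced by a plain mkdir, and the timestamp-style segment names are made up for illustration.

```shell
#!/bin/sh
# Dry run of the cleanup logic on empty directories; no Nutch involved.
# All segment names below are made-up timestamps (assumptions).
crawl_dir=$(mktemp -d)
segments_dir=$crawl_dir/segments
mergesegs_dir=$crawl_dir/mergesegs_dir
depth=2

# Two pre-existing segments plus two created by a depth-2 recrawl
mkdir -p "$segments_dir/20060901000000" "$segments_dir/20060902000000" \
         "$segments_dir/20060915000000" "$segments_dir/20060916000000"

# Stand-in for: $nutch_dir/nutch mergesegs $mergesegs_dir -dir $segments_dir
mkdir -p "$mergesegs_dir/20060916041113"

# Same selection as the wiki loop: the last $depth entries in name order
for segment in `ls -d $segments_dir/* | tail -n "$depth"`
do
  echo "Removing Temporary Segment: $segment"
  rm -rf $segment
done

cp -R $mergesegs_dir/* $segments_dir
rm -rf $mergesegs_dir

# List what is left in the segments dir, space-separated
remaining=$(ls "$segments_dir" | xargs)
echo "Remaining: $remaining"
# prints: Remaining: 20060901000000 20060902000000 20060916041113
```

The two oldest segments survive next to the merged one, which matches the segment1, segment2, mergedsegment1-2-3-4 outcome described above.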

It seems to me that we should either be merging only the new segments,
or we should be deleting all existing segments.  Can someone confirm
this or explain to me what in fact the script is doing?
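
For reference, this is roughly what I mean by the second option (deleting all existing segments once they have been merged). It is an untested sketch against empty directories: merge_stub is a hypothetical stand-in for the real mergesegs call, and the segment names are again made up.

```shell
#!/bin/sh
# Sketch of the "delete every pre-merge segment" variant; dry run only.
# All segment names below are made-up timestamps (assumptions).
crawl_dir=$(mktemp -d)
segments_dir=$crawl_dir/segments
mergesegs_dir=$crawl_dir/mergesegs_dir

mkdir -p "$segments_dir/20060901000000" "$segments_dir/20060902000000" \
         "$segments_dir/20060915000000" "$segments_dir/20060916000000"

# Hypothetical stand-in for:
#   $nutch_dir/nutch mergesegs $mergesegs_dir -dir $segments_dir
merge_stub() {
  mkdir -p "$mergesegs_dir/20060916041113"
}

# Clean up only if the merge step reported success
if merge_stub
then
  # Every segment in the dir was an input to the merge, so remove them all
  for segment in "$segments_dir"/*
  do
    echo "Removing merged-in segment: $segment"
    rm -rf "$segment"
  done
  cp -R "$mergesegs_dir"/* "$segments_dir"
  rm -rf "$mergesegs_dir"
fi

# List what is left in the segments dir, space-separated
remaining=$(ls "$segments_dir" | xargs)
echo "Remaining: $remaining"
# prints: Remaining: 20060916041113
```

With this variant only the merged segment is left, so nothing is stored twice. Whether that is what the wiki script intends is exactly what I am unsure about.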

-- 
http://JacobBrunson.com