Posted to user@nutch.apache.org by Matthew Holt <mh...@redhat.com> on 2006/07/20 00:25:50 UTC

Reworked recrawl script for 0.8.0

Hi all,
 I reworked the 0.7.2 recrawl script 
(http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html) 
for nutch-0.8.0-dev.

I thought I had it refactored completely, and it doesn't error out, but 
I must be calling some of the commands in the improper order. Can you 
please take a look at it and see if you can spot what is wrong? Thanks.

        Matt

#!/bin/bash

# A simple script to run a Nutch re-crawl

if [ -n "$1" ]
then
  crawl_dir=$1
else
  echo "Usage: recrawl crawl_dir [depth] [adddays]"
  exit 1
fi

if [ -n "$2" ]
then
  depth=$2
else
  depth=5
fi

if [ -n "$3" ]
then
  adddays=$3
else
  adddays=0
fi

webdb_dir=$crawl_dir/crawldb
segments_dir=$crawl_dir/newsegs
linkdb_dir=$crawl_dir/linkdb
index_dir=$crawl_dir/index

# create the segments dir if it does not already exist
mkdir -p $segments_dir

# The generate/fetch/update cycle
for ((i=1; i <= depth ; i++))
do
  bin/nutch generate $webdb_dir $segments_dir -adddays $adddays
  segment=`ls -d $segments_dir/* | tail -1`
  bin/nutch fetch $segment
  bin/nutch updatedb $webdb_dir $segment
done
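
# Note: the `ls -d ... | tail -1` above is assumed to pick up the segment
# just generated, since Nutch names segments with a timestamp and they
# therefore sort chronologically; each pass fetches only that newest segment.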

# Update segments
bin/nutch invertlinks $linkdb_dir -dir $segments_dir

# Index segments
new_indexes=$crawl_dir/newindexes
ls -d $segments_dir/* | tail -$depth | xargs bin/nutch index $new_indexes $webdb_dir $linkdb_dir

# De-duplicate indexes
# "bogus" argument is ignored but needed due to
# a bug in the number of args    expected
bin/nutch dedup $new_indexes

# Merge indexes
bin/nutch merge $index_dir $new_indexes
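
For reference, the script can be run like this (assuming it is saved as 
"recrawl" in the Nutch home directory, since it calls bin/nutch by a 
relative path; the crawl directory name "crawl" is just an example):

  ./recrawl crawl          # defaults: depth 5, adddays 0
  ./recrawl crawl 3 1      # explicit depth and adddays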