Posted to user@nutch.apache.org by "Nemani, Raj" <Ra...@turner.com> on 2010/08/20 00:49:23 UTC
Nutch Recrawl
Hi all,
I am using the following script to do my re-crawl. It is a slightly modified version of the script found here:
http://wiki.apache.org/nutch/Crawl
I have a small site that I would like to crawl with this script, maybe 3 times a day, on a Windows server by scheduling it through the Windows Scheduled Tasks feature. The plan is to create a Windows batch file that calls the Cygwin bash shell and passes it my script, as shown below:
C:\cygwin\bin\bash.exe -l nutchrecrawl
where "nutchrecrawl" is the file containing the script below.
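For what it's worth, the Scheduled Task could point at a small batch wrapper instead of the bash command line directly. A minimal sketch (the paths here are assumptions based on a default Cygwin install and would need adjusting):

```bat
@echo off
rem Hypothetical Task Scheduler wrapper; adjust paths to your install.
rem -l runs bash as a login shell so the Cygwin environment gets set up
rem before the recrawl script executes.
C:\cygwin\bin\bash.exe -l /cygdrive/c/users/rnemani.turner/nutchrecrawl
```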
Is the overall approach I am taking correct for achieving my objective?
How does Nutch determine that existing documents have been updated and hence need to be re-crawled?
I have also read that the "db.fetch.interval.default" property controls how and when Nutch decides to re-crawl an existing document, and that the default is 30 days.
Let us say I change "db.fetch.interval.default" from 30 days to, say, 20 minutes (1,200 seconds).
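If it helps, here is roughly how that override might look in conf/nutch-site.xml (a sketch based on my reading of the docs; the value is in seconds, so 20 minutes is 1200, versus the shipped default of 2592000, i.e. 30 days):

```xml
<!-- conf/nutch-site.xml: override the default re-fetch interval.
     Value is in seconds: 1200 s = 20 minutes.
     The shipped default is 2592000 s = 30 days. -->
<property>
  <name>db.fetch.interval.default</name>
  <value>1200</value>
  <description>Re-fetch a page after this many seconds.</description>
</property>
```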
Then I run the script the first time and index the results into my Solr/Lucene index.
Next I make a text change to an existing page on my site (one already indexed during the first run) and immediately re-run the script, indexing the results again. Assuming these steps all happen within 20 minutes, I should not yet see my change in the index. If I then run the script a third time, after the 20 minutes have passed, I should see my change in the index. Is my understanding correct?
I have also read about the -adddays argument that can be passed to the
'generate' step. How does this option work?
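From what I can tell, -adddays makes the generate step pretend the current time is N days later when it checks each page's fetch interval, so pages become due for re-fetch that many days early (-adddays 0 changes nothing). A small arithmetic sketch of that check, with made-up numbers:

```shell
# Hypothetical illustration of the generate step's due-date test:
# a page is selected when  now + adddays*86400 >= last_fetch + interval.
last_fetch=0        # epoch seconds of the page's last fetch (assumed)
interval=2592000    # 30 days, the default db.fetch.interval.default
now=2505600         # "now" is only 29 days later, so not yet due...
adddays=1           # ...but -adddays 1 shifts the clock one day forward
due=$(( now + adddays*86400 >= last_fetch + interval ? 1 : 0 ))
echo $due           # prints 1: the page would be selected for fetching
```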
Sorry for the long email, but I wanted to make sure I provided all the
information needed to make the issue easy to understand. I really
appreciate your help.
Thanks
Raj
**************************************************************
depth=2
threads=50
adddays=0
#topN=15 # Uncomment this line if you want to limit each generate to the top-scoring topN URLs
# Arguments for rm and mv
RMARGS="-rf"
MVARGS="--verbose"
# Parse arguments
if [ "$1" == "safe" ]
then
  safe=yes
fi
if [ -z "$NUTCH_HOME" ]
then
  NUTCH_HOME=/cygdrive/c/users/rnemani.turner/nutch
  cd /cygdrive/c/users/rnemani.turner/nutch
  echo "runbot: $0 could not find environment variable NUTCH_HOME"
  echo "runbot: NUTCH_HOME=$NUTCH_HOME has been set by the script"
else
  echo "runbot: $0 found environment variable NUTCH_HOME=$NUTCH_HOME"
fi
if [ -n "$topN" ]
then
  topN="-topN $topN"
else
  topN=""
fi
steps=7
echo "----- Inject (Step 1 of $steps) -----"
$NUTCH_HOME/bin/nutch inject crawl/crawldb urls
echo "----- Generate, Fetch, Parse, Update (Step 2 of $steps) -----"
for ((i=0; i < $depth; i++))
do
  echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"
  # Pass the topN and adddays settings declared above (they were unused before)
  $NUTCH_HOME/bin/nutch generate crawl/crawldb crawl/segments $topN -adddays $adddays
  if [ $? -ne 0 ]
  then
    echo "runbot: Stopping at depth `expr $i + 1`. No more URLs to fetch."
    break
  fi
  segment=`ls -d crawl/segments/* | tail -1`
  $NUTCH_HOME/bin/nutch fetch $segment -threads $threads
  if [ $? -ne 0 ]
  then
    echo "runbot: fetch of $segment at depth `expr $i + 1` failed."
    echo "runbot: Deleting segment $segment."
    rm $RMARGS $segment
    continue
  fi
  $NUTCH_HOME/bin/nutch updatedb crawl/crawldb $segment
done
echo "----- Merge Segments (Step 3 of $steps) -----"
$NUTCH_HOME/bin/nutch mergesegs crawl/MERGEDsegments crawl/segments/*
if [ "$safe" != "yes" ]
then
  rm $RMARGS crawl/segments
else
  rm $RMARGS crawl/BACKUPsegments
  mv $MVARGS crawl/segments crawl/BACKUPsegments
fi
mv $MVARGS crawl/MERGEDsegments crawl/segments
echo "----- Invert Links (Step 4 of $steps) -----"
$NUTCH_HOME/bin/nutch invertlinks crawl/linkdb crawl/segments/*
echo "----- Index (Step 5 of $steps) -----"
$NUTCH_HOME/bin/nutch index crawl/NEWindexes crawl/crawldb crawl/linkdb \
  crawl/segments/*
echo "----- Dedup (Step 6 of $steps) -----"
$NUTCH_HOME/bin/nutch dedup crawl/NEWindexes
echo "----- Merge Indexes (Step 7 of $steps) -----"
$NUTCH_HOME/bin/nutch merge crawl/NEWindex crawl/NEWindexes
#echo "----- Loading New Index (Step 8 of $steps) -----"
if [ "$safe" != "yes" ]
then
  rm $RMARGS crawl/NEWindexes
  rm $RMARGS crawl/index
else
  rm $RMARGS crawl/BACKUPindexes
  rm $RMARGS crawl/BACKUPindex
  mv $MVARGS crawl/NEWindexes crawl/BACKUPindexes
  mv $MVARGS crawl/index crawl/BACKUPindex
fi
mv $MVARGS crawl/NEWindex crawl/index
echo "runbot: FINISHED: Crawl completed!"
echo ""
******************************************************************