Posted to commits@nutch.apache.org by Apache Wiki <wi...@apache.org> on 2007/08/20 15:22:44 UTC
[Nutch Wiki] Update of "Crawl" by susam
http://wiki.apache.org/nutch/Crawl
== Introduction ==
This is a script to crawl an intranet or the web. It does not crawl using the 'bin/nutch crawl' command or the 'Crawl' class present in Nutch, therefore the filters in 'conf/crawl-urlfilter.txt' have no effect on this script. The filters for this script must be set in 'conf/regex-urlfilter.txt'.
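For example, a minimal 'conf/regex-urlfilter.txt' that restricts the crawl to a single domain might look like the following (the domain and the file-extension list are purely illustrative):
{{{
# Skip URLs ending in these file extensions
-\.(gif|jpg|png|ico|css|zip|gz)$

# Accept only hosts in the (hypothetical) example.com domain
+^http://([a-z0-9]*\.)*example.com/

# Reject everything else
-.
}}}
Lines are tested top to bottom; the first matching '+' or '-' rule decides whether a URL is kept.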
== Steps ==
The complete job of this script has been divided broadly into 8 steps.
# Inject URLs
# Generate, Fetch, Parse, Update Loop
# Merge Segments
# Invert Links
# Index
# Dedup
# Merge Indexes
# Reload index
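The steps above correspond to the 'bin/nutch' invocations sketched below. This is an illustrative outline, not the script itself: the 'run' wrapper merely prints each command (drop the 'echo' to execute for real), and 'SEGMENT' stands for the newest directory under 'crawl/segments'.
{{{
#!/bin/bash
# Outline of the eight steps as bin/nutch commands.
# 'run' only prints each command here, so the outline is safe to run.
run() { echo "+ $*"; }

run bin/nutch inject crawl/crawldb urls                      # 1. Inject
run bin/nutch generate crawl/crawldb crawl/segments          # 2. Generate,
run bin/nutch fetch crawl/segments/SEGMENT -threads 50       #    fetch, parse
run bin/nutch updatedb crawl/crawldb crawl/segments/SEGMENT  #    and update (looped per depth)
run bin/nutch mergesegs crawl/MERGEDsegments crawl/segments  # 3. Merge segments
run bin/nutch invertlinks crawl/linkdb crawl/segments        # 4. Invert links
run bin/nutch index crawl/NEWindexes crawl/crawldb crawl/linkdb crawl/segments  # 5. Index
run bin/nutch dedup crawl/NEWindexes                         # 6. Dedup
run bin/nutch merge crawl/index crawl/NEWindexes             # 7. Merge indexes
run touch webapps/ROOT/WEB-INF/web.xml                       # 8. Reload index
}}}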
== Modes of Execution ==
The script can be executed in two modes:
* Normal Mode
* Safe Mode
=== Normal Mode ===
If the script is executed with the command 'bin/runbot', it will delete all the intermediate directories, such as fetched segments and generated indexes, so as to save space. It will also reload the index after it finishes crawling, and the new crawl DB will go live.
'''Caution:''' This also means that if something goes wrong during the crawl and the resultant crawl DB is corrupt or incomplete, it might not return results for any query. And since this crawl DB goes live in normal mode, your visitors may see no results at all.
=== Safe Mode ===
Alternatively, the script can be executed in safe mode as 'bin/runbot safe' which will prevent deletion of these directories.
If errors occur, you can take recovery action because the directories haven't been deleted. You can then manually merge the segments, generate indexes, etc. from the directories and make the resultant crawl DB go live.
Safe Mode also suppresses the automatic reloading of the new index. Therefore, the resultant crawl DB does not go live immediately after crawling. This gives you a chance to first test the new crawl DB for valid results. If it is found to work, you can make this new DB go live.
=== Normal Mode vs. Safe Mode ===
Ideally, you should run the script in safe mode a couple of times to make sure the crawl is running fine. Once you are confident that everything will go fine, you need not run it in safe mode.
== Tinkering ==
Adjust the variables 'depth', 'threads', 'adddays' and 'topN' as per your needs. Delete or comment out the 'topN' assignment if you do not wish to set a 'topN' value.
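For instance, a deeper but gentler crawl might be configured like this (the values are illustrative; choose what suits your site and bandwidth):
{{{
depth=5      # follow links up to five hops from the injected URLs
threads=20   # fewer fetcher threads, to be gentler on target hosts
adddays=5
topN=1000    # fetch at most the 1000 top-scoring URLs per depth
# Comment out the topN line above to fetch everything generated.
}}}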
=== NUTCH_HOME ===
If you are not executing the script as 'bin/runbot' from the Nutch directory, you should either set the environment variable 'NUTCH_HOME' or edit the following in the script:
{{{if [ -z "$NUTCH_HOME" ]
then
NUTCH_HOME=.}}}
Set 'NUTCH_HOME' to the path of the Nutch directory. (This is needed only if you are not setting it as an environment variable; if the environment variable is set, the above assignment is ignored.)
=== CATALINA_HOME ===
'CATALINA_HOME' points to the Tomcat installation directory. You must either set this as an environment variable or set it by editing the following lines in the script:
{{{if [ -z "$CATALINA_HOME" ]
then
CATALINA_HOME=/opt/apache-tomcat-6.0.10}}}
As in the previous section, if this variable is set in the environment, the above assignment is ignored.
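Alternatively, both variables can be exported in the environment before invoking the script, in which case the in-script defaults are ignored. The paths below are examples only; substitute your own installation directories:
{{{
export NUTCH_HOME=/usr/local/nutch
export CATALINA_HOME=/usr/local/apache-tomcat-6.0.10
cd "$NUTCH_HOME" && bin/runbot safe
}}}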
== Can it re-crawl? ==
The author has used this script to re-crawl a couple of times. However, no real-world testing has been done for re-crawling. Therefore, you may try using the script for re-crawl. Whether or not it works properly for you, please let us know.
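If you do want periodic re-crawls, a cron entry along the following lines could drive the script (the user, paths and schedule are examples only):
{{{
# /etc/crontab: re-crawl every Sunday at 3 a.m. in safe mode
0 3 * * 0  nutch  cd /usr/local/nutch && bin/runbot safe >> logs/runbot.log 2>&1
}}}
Running in safe mode from cron means a failed crawl never goes live automatically; you inspect and reload the index yourself.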
== Script ==
{{{
#!/bin/bash
# bash is required: the script uses the for (( ... )) loop construct
# Runs the Nutch bot to crawl or re-crawl
# Usage: bin/runbot [safe]
# If executed in 'safe' mode, it doesn't delete the temporary
# directories generated during crawl. This might be helpful for
# analysis and recovery in case a crawl fails.
#
# Author: Susam Pal
depth=2
threads=50
adddays=5
topN=2 # Comment this statement if you don't want to set topN value
# Parse arguments
if [ "$1" == "safe" ]
then
safe=yes
fi
if [ -z "$NUTCH_HOME" ]
then
NUTCH_HOME=.
echo runbot: $0 could not find environment variable NUTCH_HOME
echo runbot: NUTCH_HOME=$NUTCH_HOME has been set by the script
else
echo runbot: $0 found environment variable NUTCH_HOME=$NUTCH_HOME
fi
if [ -z "$CATALINA_HOME" ]
then
CATALINA_HOME=/opt/apache-tomcat-6.0.10
echo runbot: $0 could not find environment variable CATALINA_HOME
echo runbot: CATALINA_HOME=$CATALINA_HOME has been set by the script
else
echo runbot: $0 found environment variable CATALINA_HOME=$CATALINA_HOME
fi
if [ -n "$topN" ]
then
topN="-topN $topN"
else
topN=""
fi
steps=8
echo "----- Inject (Step 1 of $steps) -----"
$NUTCH_HOME/bin/nutch inject crawl/crawldb urls
echo "----- Generate, Fetch, Parse, Update (Step 2 of $steps) -----"
for((i=0; i < $depth; i++))
do
echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"
$NUTCH_HOME/bin/nutch generate crawl/crawldb crawl/segments $topN -adddays $adddays
if [ $? -ne 0 ]
then
echo "runbot: Stopping at depth $((i + 1)). No more URLs to fetch."
break
fi
segment=`ls -d crawl/segments/* | tail -1`
$NUTCH_HOME/bin/nutch fetch $segment -threads $threads
if [ $? -ne 0 ]
then
echo "runbot: fetch $segment at depth $((i + 1)) failed. Deleting segment $segment."
rm -rf "$segment"
continue
fi
# Uncomment the next line if fetcher.parse is set to false in your
# configuration, so that fetched content is parsed in a separate step:
#$NUTCH_HOME/bin/nutch parse $segment
$NUTCH_HOME/bin/nutch updatedb crawl/crawldb $segment
done
echo "----- Merge Segments (Step 3 of $steps) -----"
$NUTCH_HOME/bin/nutch mergesegs crawl/MERGEDsegments crawl/segments/*
if [ "$safe" != "yes" ]
then
rm -rf crawl/segments/*
else
mkdir -p crawl/FETCHEDsegments
mv -v crawl/segments/* crawl/FETCHEDsegments
fi
mv -v crawl/MERGEDsegments/* crawl/segments
rmdir crawl/MERGEDsegments
echo "----- Invert Links (Step 4 of $steps) -----"
$NUTCH_HOME/bin/nutch invertlinks crawl/linkdb crawl/segments/*
echo "----- Index (Step 5 of $steps) -----"
$NUTCH_HOME/bin/nutch index crawl/NEWindexes crawl/crawldb crawl/linkdb crawl/segments/*
echo "----- Dedup (Step 6 of $steps) -----"
$NUTCH_HOME/bin/nutch dedup crawl/NEWindexes
echo "----- Merge Indexes (Step 7 of $steps) -----"
$NUTCH_HOME/bin/nutch merge crawl/index crawl/NEWindexes
if [ "$safe" != "yes" ]
then
rm -rf crawl/NEWindexes
fi
echo "----- Reloading index on the search site (Step 8 of $steps) -----"
if [ "$safe" != "yes" ]
then
touch ${CATALINA_HOME}/webapps/ROOT/WEB-INF/web.xml
echo Done!
else
echo runbot: Cannot reload index in safe mode.
echo runbot: Please reload it manually using the following command:
echo runbot: touch ${CATALINA_HOME}/webapps/ROOT/WEB-INF/web.xml
fi
echo "runbot: FINISHED: Crawl completed!"
}}}