== Introduction ==
This is a script to crawl an intranet or the web. It does not crawl using the 'bin/nutch crawl' tool or the 'Crawl' class present in Nutch; therefore the filters in 'conf/crawl-urlfilter.txt' have no effect on this script. The filters for this script must be set in 'conf/regex-urlfilter.txt'.
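
For example, to restrict the crawl to a single domain, 'conf/regex-urlfilter.txt' might contain rules like the following ('example.com' is only a placeholder for your own domain):

{{{
# Accept URLs within example.com and its subdomains
+^http://([a-z0-9]*\.)*example.com/

# Reject everything else
-.
}}}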

== Steps ==
The complete job of this script has been divided broadly into 8 steps.

 # Inject URLs
 # Generate, Fetch, Parse, Update Loop
 # Merge Segments
 # Invert Links
 # Index
 # Dedup
 # Merge Indexes
 # Reload index
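
Each step corresponds to one command in the full script at the end of this page; in outline:

{{{
Step 1: bin/nutch inject                       # seed the crawl DB with the URL list
Step 2: bin/nutch generate / fetch / updatedb  # looped 'depth' times
Step 3: bin/nutch mergesegs
Step 4: bin/nutch invertlinks
Step 5: bin/nutch index
Step 6: bin/nutch dedup
Step 7: bin/nutch merge
Step 8: touch $CATALINA_HOME/webapps/ROOT/WEB-INF/web.xml   # reload the index
}}}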

== Modes of Execution ==
The script can be executed in two modes:
 * Normal Mode
 * Safe Mode

=== Normal Mode ===
If the script is executed as 'bin/runbot', it deletes intermediate directories, such as the fetched segments and the newly generated indexes, once they have been merged, in order to save space. It also reloads the index after crawling finishes, so the new crawl DB goes live immediately.

'''Caution:''' This also means that if something goes wrong during the crawl and the resulting crawl DB is corrupt or incomplete, it may not return results for any query. Since that crawl DB goes live automatically in normal mode, your visitors may see no results at all.

=== Safe Mode ===
Alternatively, the script can be executed in safe mode as 'bin/runbot safe', which prevents these directories from being deleted.
If errors occur, you can take recovery action because the directories are still intact: manually merge the segments, generate the indexes, and so on, and then make the resultant crawl DB go live.

Safe Mode also suppresses the automatic reloading of the new index. Therefore, the resultant crawl DB does not go live immediately after crawling. This gives you a chance to first test the new crawl DB for valid results. If it is found to work, you can make this new DB go live.
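
For example, after a safe-mode run you can finish the job by hand with the same commands the script uses (a sketch; it assumes 'crawl/segments' holds the segments you want indexed):

{{{
bin/nutch mergesegs crawl/MERGEDsegments crawl/segments/*
# ...move the merged output into crawl/segments, then:
bin/nutch invertlinks crawl/linkdb crawl/segments/*
bin/nutch index crawl/NEWindexes crawl/crawldb crawl/linkdb crawl/segments/*
bin/nutch dedup crawl/NEWindexes
bin/nutch merge crawl/index crawl/NEWindexes
touch $CATALINA_HOME/webapps/ROOT/WEB-INF/web.xml   # make the new index live
}}}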

=== Normal Mode vs. Safe Mode ===
Ideally, you should run the script in safe mode a couple of times to make sure the crawl is running fine. Once you are confident that everything will go well, you need not run it in safe mode.

== Tinkering ==
Adjust the variables 'depth', 'threads', 'adddays' and 'topN' to suit your needs. Delete or comment out the 'topN' assignment if you do not wish to set a 'topN' value.
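
For reference, the defaults at the top of the script are shown below; commenting out the last line disables the 'topN' limit:

{{{
depth=2     # number of generate/fetch/update rounds
threads=50  # fetcher threads
adddays=5   # passed to 'nutch generate' as -adddays
#topN=2     # commented out: no limit on URLs fetched per round
}}}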

=== NUTCH_HOME ===
If you are not executing the script as 'bin/runbot' from the Nutch directory, you should either set the environment variable 'NUTCH_HOME' or edit the following lines in the script:

{{{if [ -z "$NUTCH_HOME" ]
then
  NUTCH_HOME=.}}}

Set 'NUTCH_HOME' to the path of the Nutch directory if you are not setting it as an environment variable; when the environment variable is set, the above assignment is skipped.
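
For example, to set it as an environment variable instead (the path is only an illustration):

{{{
export NUTCH_HOME=/usr/local/nutch
}}}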

=== CATALINA_HOME ===
'CATALINA_HOME' points to the Tomcat installation directory. You must either set it as an environment variable or set it by editing the following lines in the script:

{{{if [ -z "$CATALINA_HOME" ]
then
  CATALINA_HOME=/opt/apache-tomcat-6.0.10}}}

As in the previous section, if this variable is set in the environment, the above assignment is skipped.
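
For example (the path shown is the script's default; substitute your own Tomcat location):

{{{
export CATALINA_HOME=/opt/apache-tomcat-6.0.10
bin/runbot safe
}}}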

== Can it re-crawl? ==
The author has used this script to re-crawl a couple of times; however, no real-world testing has been done for re-crawling. You are therefore encouraged to try the script for re-crawls. Whether it works well or fails, please let us know.
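
If you do experiment with periodic re-crawls, a cron entry along these lines is one way to drive them (the path and schedule are only an illustration):

{{{
# Re-crawl every Sunday at 3 a.m., running from the Nutch directory
0 3 * * 0  cd /usr/local/nutch && bin/runbot safe >> logs/runbot.log 2>&1
}}}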

== Script ==
{{{
#!/bin/bash

# Runs the Nutch bot to crawl or re-crawl
# Usage: bin/runbot [safe]
#        If executed in 'safe' mode, it doesn't delete the temporary
#        directories generated during crawl. This might be helpful for
#        analysis and recovery in case a crawl fails.
#
# Author: Susam Pal

depth=2
threads=50
adddays=5
topN=2 # Comment this statement if you don't want to set topN value

# Parse arguments
if [ "$1" == "safe" ]
then
  safe=yes
fi

if [ -z "$NUTCH_HOME" ]
then
  NUTCH_HOME=.
  echo runbot: $0 could not find environment variable NUTCH_HOME
  echo runbot: NUTCH_HOME=$NUTCH_HOME has been set by the script 
else
  echo runbot: $0 found environment variable NUTCH_HOME=$NUTCH_HOME 
fi

if [ -z "$CATALINA_HOME" ]
then
  CATALINA_HOME=/opt/apache-tomcat-6.0.10
  echo runbot: $0 could not find environment variable CATALINA_HOME
  echo runbot: CATALINA_HOME=$CATALINA_HOME has been set by the script 
else
  echo runbot: $0 found environment variable CATALINA_HOME=$CATALINA_HOME 
fi

if [ -n "$topN" ]
then
  topN="-topN $topN"
else
  topN=""
fi

steps=8
echo "----- Inject (Step 1 of $steps) -----"
$NUTCH_HOME/bin/nutch inject crawl/crawldb urls

echo "----- Generate, Fetch, Parse, Update (Step 2 of $steps) -----"
for((i=0; i < $depth; i++))
do
  echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"
  $NUTCH_HOME/bin/nutch generate crawl/crawldb crawl/segments $topN -adddays $adddays
  if [ $? -ne 0 ]
  then
    echo "runbot: Stopping at depth $depth. No more URLs to fetch."
    break
  fi
  # Pick the segment just generated: segment directories are named by
  # timestamp, so the last one listed is the newest.
  segment=`ls -d crawl/segments/* | tail -1`

  $NUTCH_HOME/bin/nutch fetch $segment -threads $threads
  if [ $? -ne 0 ]
  then
    echo "runbot: fetch $segment at depth $depth failed. Deleting segment $segment."
    rm -rf $segment
    continue
  fi

  # Uncomment the parse step if fetching and parsing are decoupled in your
  # configuration (i.e. 'fetcher.parse' is false); otherwise the fetcher
  # has already parsed the pages.
  #$NUTCH_HOME/bin/nutch parse $segment
  $NUTCH_HOME/bin/nutch updatedb crawl/crawldb $segment
done

echo "----- Merge Segments (Step 3 of $steps) -----"
$NUTCH_HOME/bin/nutch mergesegs crawl/MERGEDsegments crawl/segments/*
if [ "$safe" != "yes" ]
then
  rm -rf crawl/segments/*
else
  mkdir -p crawl/FETCHEDsegments
  mv --verbose crawl/segments/* crawl/FETCHEDsegments
fi

mv --verbose crawl/MERGEDsegments/* crawl/segments
rmdir crawl/MERGEDsegments

echo "----- Invert Links (Step 4 of $steps) -----"
$NUTCH_HOME/bin/nutch invertlinks crawl/linkdb crawl/segments/*

echo "----- Index (Step 5 of $steps) -----"
$NUTCH_HOME/bin/nutch index crawl/NEWindexes crawl/crawldb crawl/linkdb crawl/segments/*

echo "----- Dedup (Step 6 of $steps) -----"
$NUTCH_HOME/bin/nutch dedup crawl/NEWindexes

echo "----- Merge Indexes (Step 7 of $steps) -----"
$NUTCH_HOME/bin/nutch merge crawl/index crawl/NEWindexes

if [ "$safe" != "yes" ]
then
  rm -rf crawl/NEWindexes
fi

echo "----- Reloading index on the search site (Step 8 of $steps) -----"
if [ "$safe" != "yes" ]
then
  touch ${CATALINA_HOME}/webapps/ROOT/WEB-INF/web.xml
  echo Done!
else
  echo runbot: Cannot reload index in safe mode.
  echo runbot: Please reload it manually using the following command:
  echo runbot: touch ${CATALINA_HOME}/webapps/ROOT/WEB-INF/web.xml
fi

echo "runbot: FINISHED: Crawl completed!"
}}}