Posted to user@nutch.apache.org by Brian Demers <br...@gmail.com> on 2007/08/09 17:04:20 UTC

intranet recrawl 0.9

All,

Does anyone have an updated recrawl script for 0.9?

Also, does anyone have a link that describes each phase of a crawl /
recrawl (for 0.9)

It looks like the process changes with each version. I searched the
wiki, but I am still unclear.

thanks

Re: index time for lucene

Posted by Erick Erickson <er...@gmail.com>.
This is probably a more appropriate question for the Lucene users' list.

Have you searched any of the documents on the Lucene website? E.g.
http://lucene.apache.org/java/docs/benchmarks.html
and
http://wiki.apache.org/lucene-java/ImproveIndexingSpeed

That said, it's impossible to answer your question; you'll have
to try it and measure. There are too many variables. How big is
each document? How many fields are you indexing? What
size are they? What hardware are you running on? How much
RAM do you have? What speed do you require? Etc.

Even if you answered all of those questions, we'd
be able to give you no more than a WAG. So I recommend
you write a simple indexer and measure.
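
For instance, a throwaway timing harness along these lines
(a minimal sketch against the Lucene 2.x API of the time; the
index path, field layout, and tuning values below are
illustrative placeholders, not recommendations) is usually
enough to get a real number:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class IndexTimer {
    public static void main(String[] args) throws Exception {
        // Create a fresh index; path and tuning values are placeholders.
        IndexWriter writer =
            new IndexWriter("/tmp/bench-index", new StandardAnalyzer(), true);
        writer.setMaxBufferedDocs(1000); // see ImproveIndexingSpeed on the wiki
        writer.setMergeFactor(50);

        long start = System.currentTimeMillis();
        for (int i = 0; i < 700000; i++) { // stand-in for your 700,000 objects
            Document doc = new Document();
            doc.add(new Field("id", Integer.toString(i),
                    Field.Store.YES, Field.Index.UN_TOKENIZED));
            // Replace this constant text with your real document bodies;
            // content size and analysis dominate indexing time.
            doc.add(new Field("body", "sample body text for doc " + i,
                    Field.Store.NO, Field.Index.TOKENIZED));
            writer.addDocument(doc);
        }
        writer.optimize();
        writer.close();
        System.out.println("Indexed 700000 docs in "
                + (System.currentTimeMillis() - start) + " ms");
    }
}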


Best
Erick

On 9/12/07, Dmitry <dm...@hotmail.com> wrote:
>
> I would be interested in the indexing times for Lucene to index about
> 700,000 document objects. Are there any ways to improve this time? What
> does indexing time depend on?
>
> thanks,
> DT
> www.ejinz.com
>
>

index time for lucene

Posted by Dmitry <dm...@hotmail.com>.
I would be interested in the indexing times for Lucene to index about
700,000 document objects. Are there any ways to improve this time? What
does indexing time depend on?

thanks,
DT
www.ejinz.com 


Re: intranet recrawl 0.9

Posted by Susam Pal <su...@gmail.com>.
I have written this script to crawl with Nutch 0.9. I have tried to
make it work for re-crawls as well, but I have never tested it on a
real-world re-crawl; I use it for crawling.

You may try it out. We can make changes if it turns out not to be
appropriate for re-crawls.

Regards,
Susam Pal
http://susam.in/

#!/bin/bash

# Runs the Nutch bot to crawl or re-crawl
# Usage: bin/runbot [safe]
#        If executed in 'safe' mode, it doesn't delete the temporary
#        directories generated during crawl. This might be helpful for
#        analysis and recovery in case a crawl fails.
#
# Author: Susam Pal

depth=2
threads=50
adddays=5
topN=2 # Comment this line out if you don't want to set a topN value

# Parse arguments
if [ "$1" == "safe" ]
then
  safe=yes
fi

if [ -z "$NUTCH_HOME" ]
then
  NUTCH_HOME=.
  echo runbot: $0 could not find environment variable NUTCH_HOME
  echo runbot: NUTCH_HOME=$NUTCH_HOME has been set by the script
else
  echo runbot: $0 found environment variable NUTCH_HOME=$NUTCH_HOME
fi

if [ -z "$CATALINA_HOME" ]
then
  CATALINA_HOME=/opt/apache-tomcat-6.0.10
  echo runbot: $0 could not find environment variable CATALINA_HOME
  echo runbot: CATALINA_HOME=$CATALINA_HOME has been set by the script
else
  echo runbot: $0 found environment variable CATALINA_HOME=$CATALINA_HOME
fi

if [ -n "$topN" ]
then
  topN="--topN $rank"
else
  topN=""
fi

steps=8
echo "----- Inject (Step 1 of $steps) -----"
$NUTCH_HOME/bin/nutch inject crawl/crawldb urls

echo "----- Generate, Fetch, Parse, Update (Step 2 of $steps) -----"
for((i=0; i < $depth; i++))
do
  echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"
  $NUTCH_HOME/bin/nutch generate crawl/crawldb crawl/segments $topN -adddays $adddays
  if [ $? -ne 0 ]
  then
    echo "runbot: Stopping at depth $depth. No more URLs to fetch."
    break
  fi
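  # The segment just written by 'generate' is the newest directory under
  # crawl/segments, so take the last one in sorted order.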
  segment=`ls -d crawl/segments/* | tail -1`

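  # Fetch the segment; in 0.9 the fetcher also parses pages by default
  # (unless run with -noParsing).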
  $NUTCH_HOME/bin/nutch fetch $segment -threads $threads
  if [ $? -ne 0 ]
  then
    echo "runbot: fetch $segment at depth $depth failed. Deleting it."
    rm -rf $segment
    continue
  fi

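  # Fold the fetch results back into the crawldb so newly discovered links
  # become candidates for the next generate pass.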
  $NUTCH_HOME/bin/nutch updatedb crawl/crawldb $segment
done

echo "----- Merge Segments (Step 3 of $steps) -----"
$NUTCH_HOME/bin/nutch mergesegs crawl/MERGEDsegments crawl/segments/*
if [ "$safe" != "yes" ]
then
  rm -rf crawl/segments/*
else
  mkdir crawl/FETCHEDsegments
  mv --verbose crawl/segments/* crawl/FETCHEDsegments
fi

mv --verbose crawl/MERGEDsegments/* crawl/segments
rmdir crawl/MERGEDsegments

echo "----- Invert Links (Step 4 of $steps) -----"
$NUTCH_HOME/bin/nutch invertlinks crawl/linkdb crawl/segments/*

echo "----- Index (Step 5 of $steps) -----"
$NUTCH_HOME/bin/nutch index crawl/NEWindexes crawl/crawldb crawl/linkdb crawl/segments/*

echo "----- Dedup (Step 6 of $steps) -----"
$NUTCH_HOME/bin/nutch dedup crawl/NEWindexes

echo "----- Merge Indexes (Step 7 of $steps) -----"
$NUTCH_HOME/bin/nutch merge crawl/index crawl/NEWindexes

if [ "$safe" != "yes" ]
then
  rm -rf crawl/NEWindexes
fi

echo "----- Reloading index on the search site (Step 8 of $steps) -----"
if [ "$safe" != "yes" ]
then
  touch ${CATALINA_HOME}/webapps/ROOT/WEB-INF/web.xml
  echo Done!
else
  echo runbot: Cannot reload the index in safe mode.
  echo runbot: Please reload it manually using the following command:
  echo runbot: touch ${CATALINA_HOME}/webapps/ROOT/WEB-INF/web.xml
fi

echo "runbot: FINISHED: Crawl completed!"

On 8/9/07, Brian Demers <br...@gmail.com> wrote:
> All,
>
> Does anyone have an updated recrawl script for 0.9?
>
> Also, does anyone have a link that describes each phase of a crawl /
> recrawl (for 0.9)
>
> It looks like the process changes with each version. I searched the
> wiki, but I am still unclear.
>
> thanks
>