Posted to user@nutch.apache.org by Brian Demers <br...@gmail.com> on 2007/08/09 17:04:20 UTC
intranet recrawl 0.9
All,
Does anyone have an updated recrawl script for 0.9?
Also, does anyone have a link that describes each phase of a crawl /
recrawl (for 0.9)
It looks like it changes with each version. I searched the wiki, but I
am still unclear.
thanks
Re: index time for lucene
Posted by Erick Erickson <er...@gmail.com>.
This is probably a more appropriate question for the Lucene users' list.
Have you searched any of the documents on the Lucene website? e.g.
http://lucene.apache.org/java/docs/benchmarks.html
and
http://wiki.apache.org/lucene-java/ImproveIndexingSpeed
That said, it's impossible to answer your question; you'll have to
try it and measure. There are too many variables. How big is
each document? How many fields are you indexing? What
size are they? What hardware are you running on? How much
RAM do you have? What do you require for speed? etc. etc.
Even if you provided an answer to those questions we'd
be able to give you no more than a WAG. So I recommend
you make a simple indexer and measure.
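As a starting point, a crude wall-clock measurement can be wrapped around
whatever indexer you build. The sketch below is only illustrative: the
commented java invocation assumes the Lucene demo jars and the
org.apache.lucene.demo.IndexFiles demo class; substitute your own indexing
command and document count.

```shell
#!/bin/sh
# Rough throughput measurement around an indexing run.
# The commented java line is a placeholder (jar names are assumptions);
# replace it with your real indexer and remove the sleep.
docs=700000                 # number of documents you intend to index
start=`date +%s`
# java -cp lucene-core.jar:lucene-demos.jar \
#     org.apache.lucene.demo.IndexFiles /path/to/docs
sleep 2                     # stands in for the indexing run in this sketch
end=`date +%s`
elapsed=`expr $end - $start`
echo "indexed $docs docs in ${elapsed}s," \
     "about `expr $docs / $elapsed` docs/sec"
```

Run it several times with different settings (RAM buffer, merge factor,
number of fields) and compare the docs/sec figure rather than trusting any
single run.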
Best
Erick
On 9/12/07, Dmitry <dm...@hotmail.com> wrote:
>
> I would be interested in the indexing time for Lucene to index about
> 700,000 document objects. Are there any ways to improve this time? What
> does the indexing time depend on?
>
> thanks,
> DT
> www.ejinz.com
>
>
index time for lucene
Posted by Dmitry <dm...@hotmail.com>.
I would be interested in the indexing time for Lucene to index about
700,000 document objects. Are there any ways to improve this time? What
does the indexing time depend on?
thanks,
DT
www.ejinz.com
Re: intranet recrawl 0.9
Posted by Susam Pal <su...@gmail.com>.
I have written this script to crawl with Nutch 0.9. I have tried to
make sure it works for re-crawls as well, but I have never tested
re-crawls in the real world. I use this script for my own crawls.
You may try it out. We can make changes if it turns out not to be
appropriate for re-crawls.
Regards,
Susam Pal
http://susam.in/
#!/bin/bash
# Runs the Nutch bot to crawl or re-crawl
# Usage: bin/runbot [safe]
# If executed in 'safe' mode, it doesn't delete the temporary
# directories generated during the crawl. This might be helpful for
# analysis and recovery in case a crawl fails.
#
# Author: Susam Pal

depth=2
threads=50
adddays=5
topN=2 # Comment this statement out if you don't want to set a topN value

# Parse arguments
if [ "$1" = "safe" ]
then
  safe=yes
fi

if [ -z "$NUTCH_HOME" ]
then
  NUTCH_HOME=.
  echo "runbot: $0 could not find environment variable NUTCH_HOME"
  echo "runbot: NUTCH_HOME=$NUTCH_HOME has been set by the script"
else
  echo "runbot: $0 found environment variable NUTCH_HOME=$NUTCH_HOME"
fi

if [ -z "$CATALINA_HOME" ]
then
  CATALINA_HOME=/opt/apache-tomcat-6.0.10
  echo "runbot: $0 could not find environment variable CATALINA_HOME"
  echo "runbot: CATALINA_HOME=$CATALINA_HOME has been set by the script"
else
  echo "runbot: $0 found environment variable CATALINA_HOME=$CATALINA_HOME"
fi

if [ -n "$topN" ]
then
  topN="-topN $topN"
else
  topN=""
fi

steps=8
echo "----- Inject (Step 1 of $steps) -----"
$NUTCH_HOME/bin/nutch inject crawl/crawldb urls

echo "----- Generate, Fetch, Parse, Update (Step 2 of $steps) -----"
for((i=0; i < $depth; i++))
do
  echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"
  $NUTCH_HOME/bin/nutch generate crawl/crawldb crawl/segments $topN \
      -adddays $adddays
  if [ $? -ne 0 ]
  then
    echo "runbot: Stopping at depth `expr $i + 1`. No more URLs to fetch."
    break
  fi

  segment=`ls -d crawl/segments/* | tail -1`
  $NUTCH_HOME/bin/nutch fetch $segment -threads $threads
  if [ $? -ne 0 ]
  then
    echo "runbot: fetch of $segment at depth `expr $i + 1` failed. Deleting it."
    rm -rf $segment
    continue
  fi

  $NUTCH_HOME/bin/nutch updatedb crawl/crawldb $segment
done

echo "----- Merge Segments (Step 3 of $steps) -----"
$NUTCH_HOME/bin/nutch mergesegs crawl/MERGEDsegments crawl/segments/*
if [ "$safe" != "yes" ]
then
  rm -rf crawl/segments/*
else
  mkdir -p crawl/FETCHEDsegments
  mv --verbose crawl/segments/* crawl/FETCHEDsegments
fi
mv --verbose crawl/MERGEDsegments/* crawl/segments
rmdir crawl/MERGEDsegments

echo "----- Invert Links (Step 4 of $steps) -----"
$NUTCH_HOME/bin/nutch invertlinks crawl/linkdb crawl/segments/*

echo "----- Index (Step 5 of $steps) -----"
$NUTCH_HOME/bin/nutch index crawl/NEWindexes crawl/crawldb \
    crawl/linkdb crawl/segments/*

echo "----- Dedup (Step 6 of $steps) -----"
$NUTCH_HOME/bin/nutch dedup crawl/NEWindexes

echo "----- Merge Indexes (Step 7 of $steps) -----"
$NUTCH_HOME/bin/nutch merge crawl/index crawl/NEWindexes
if [ "$safe" != "yes" ]
then
  rm -rf crawl/NEWindexes
fi

echo "----- Reloading index on the search site (Step 8 of $steps) -----"
if [ "$safe" != "yes" ]
then
  touch ${CATALINA_HOME}/webapps/ROOT/WEB-INF/web.xml
  echo "Done!"
else
  echo "runbot: Cannot reload index in safe mode."
  echo "runbot: Please reload it manually using the following command:"
  echo "runbot: touch ${CATALINA_HOME}/webapps/ROOT/WEB-INF/web.xml"
fi

echo "runbot: FINISHED: Crawl completed!"
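For periodic re-crawls the script can be run from cron. The install and log
paths below are assumptions; adjust them to your own layout.

```shell
# One-off crawl (deletes temporary directories on success):
./bin/runbot

# 'safe' mode keeps intermediate segments for analysis and recovery:
./bin/runbot safe

# Example crontab entry (edit with `crontab -e`): nightly re-crawl at 2 a.m.
0 2 * * * cd /opt/nutch && ./bin/runbot >> /var/log/runbot.log 2>&1
```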
On 8/9/07, Brian Demers <br...@gmail.com> wrote:
> All,
>
> Does anyone have an updated recrawl script for 0.9?
>
> Also, does anyone have a link that describes each phase of a crawl /
> recrawl (for 0.9)
>
> it looks like it changes each version. I searched the wiki, but i am
> still unclear.
>
> thanks
>