Posted to user@nutch.apache.org by Max Lynch <ih...@gmail.com> on 2010/08/03 22:19:02 UTC

Nutch script feedback

Hello,
Now that I'm getting the hang of Nutch, I've started building a simple script
that satisfies my crawling needs.  However, the concept of repeating crawls
and how Nutch deals with duplicates is still not clear to me.

Basically, I've got a set of seed URLs already injected, so that's not part
of this script, but I would like to constantly hit the same domains to
update my index when new documents are found.  I have been successful in
restricting crawls to those domains (a sketch of that URL filter is included
after the script), so that's not a problem.  Here is my script:

#!/bin/bash
# Usage: <this script> <crawl_dir>, run from the Nutch home directory.
export JAVA_HOME=/usr/lib/jvm/java-6-sun
echo "Using $1 as the crawl folder"

set -e

# Simple lock so that overlapping runs (e.g. from cron) don't collide.
lockfile="/tmp/crawl.lock"

if [ ! -e "$lockfile" ]; then
    touch "$lockfile"
    # Make sure the lock is released when the script exits, even on error.
    trap 'rm -f "$lockfile"' EXIT
else
    echo "Already running!"
    exit
fi

# Go to a depth of 5
for i in 1 2 3 4 5
do
    # Generate a fetch list of up to 50,000 top-scoring URLs as a new segment,
    # then pick up the segment that was just created (the newest directory).
    bin/nutch generate "$1/crawldb" "$1/segments" -topN 50000
    s1=$(ls -d "$1"/segments/2* | tail -1)
    echo "$s1"
    # Fetch without parsing, parse in a separate step, then update the crawldb
    # with the newly discovered links.
    bin/nutch fetch "$s1" -noParsing -threads 100
    bin/nutch parse "$s1"
    bin/nutch updatedb "$1/crawldb" "$s1" -filter -normalize
done
# Build the linkdb from all segments and push everything to Solr.
bin/nutch invertlinks "$1/linkdb" -dir "$1/segments"
bin/nutch solrindex http://127.0.0.1:8983/solr/ "$1/crawldb" "$1/linkdb" "$1"/segments/*
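
For reference, the domain restriction mentioned above lives in Nutch's regex
URL filter rather than in the script.  The following is only a sketch of what
those rules look like; example.com and example.org are placeholders for the
real domains, and a real setup would normally keep the default skip rules from
the stock conf/regex-urlfilter.txt above these lines:

# Sketch only: write a minimal regex-urlfilter.txt that accepts URLs on the
# crawl domains and rejects everything else.  Rules are tried top to bottom
# and the first match wins, so the final "-." must come last.
cat > conf/regex-urlfilter.txt <<'EOF'
# accept pages only on the crawl domains (placeholders)
+^http://([a-z0-9-]*\.)*example\.com/
+^http://([a-z0-9-]*\.)*example\.org/
# reject everything else
-.
EOF

The -filter flag passed to updatedb in the script is what applies these rules
to the links discovered during each round.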


What happens with my segments and fetches after this script runs?  If I run
it again, will new segments be created that possibly contain duplicate
documents or links that other segments already had?  Do I need to run
mergesegs?  Again, my sole goal is to constantly hit a set of domains and
find new content when it is available.  As such, I'm not really concerned
with search depth, just that I'm getting most of the pages.
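
For reference, if mergesegs is the right answer here, I assume it would be run
along the lines of the sketch below after the loop finishes; I'd like to
confirm whether that step is actually needed.

# Sketch only, not part of the script above: "crawl" stands for the crawl
# folder passed to the script, and merged_segments is a placeholder output
# directory.  This merges all existing segments into a single new segment,
# which would then replace the old segments directory by hand.
bin/nutch mergesegs crawl/merged_segments -dir crawl/segments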

I would greatly appreciate any feedback or help.  Thanks.