Posted to user@nutch.apache.org by Matthew Holt <mh...@redhat.com> on 2006/07/28 21:40:52 UTC

Recrawling... methodology?

I need some help clarifying whether recrawling is doing exactly what I
think it is. Here's how I think a recrawl should work:

I crawl my intranet with a depth of 2. Later, I recrawl using the script 
found below: 
http://wiki.apache.org/nutch/IntranetRecrawl#head-e58e25a0b9530bb6fcdfb282fd27a207fc0aff03 


In my recrawl, I also specify a depth of 2. I expect it to re-fetch each
of the pages from before and, if a page has changed, update that page's
content in the index. If a page has changed and contains new links,
those links are followed to a maximum depth of 2.
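
For reference, the generate step I'm talking about is the one below
(these are the variables from the script; my reading of -adddays is only
an assumption, so please correct me if it's wrong):

  # My understanding: -adddays shifts "now" forward by $adddays days, so pages
  # whose fetch interval is about to expire get re-selected for fetching; on its
  # own it shouldn't pull brand new pages into the fetch list.
  $nutch_dir/nutch generate $webdb_dir $segments_dir -adddays $adddays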

This is how I think a typical recrawl should work. However, when I
recrawl using the script linked above, tons of new pages are indexed,
whether they have changed or not. It seems as if crawling the content
with a depth of 2 and then recrawling with a depth of 2 actually adds a
couple more crawl depth levels, so the outcome is a crawl with a depth
of 4 (instead of a depth-2 crawl followed by a recrawl that just catches
new and changed pages).
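
If I understand the loop correctly, the compounding would look something
like this (my own sketch of the first recrawl pass, not verified):

  # Pass 1: generate selects the pages that are due, *plus* any URLs that
  # updatedb queued as unfetched during the last round of the original crawl.
  $nutch_dir/nutch generate $webdb_dir $segments_dir -adddays $adddays
  segment=`ls -d $segments_dir/* | tail -1`
  $nutch_dir/nutch fetch $segment                 # fetches one level past the old frontier
  $nutch_dir/nutch updatedb $webdb_dir $segment   # queues the links found there as unfetched

  # Pass 2: those newly queued links are now eligible, so the frontier moves
  # out one more level -- effectively depth 3 and 4 relative to the original crawl.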

The current steps of the recrawl are roughly as follows (paraphrasing
the script):

  depth=2   # however many depth levels were specified
  for ((i=0; i < depth; i++))
  do
    $nutch_dir/nutch generate $webdb_dir $segments_dir -adddays $adddays
    segment=`ls -d $segments_dir/* | tail -1`
    $nutch_dir/nutch fetch $segment
    $nutch_dir/nutch updatedb $webdb_dir $segment
  done

followed by invertlinks, index, dedup, and merge.
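
For completeness, the post-loop steps in my copy of the script look
roughly like this (the linkdb/index paths are just placeholders for my
local directories, and the exact arguments are my adaptation of the wiki
script, so treat them as approximate):

  $nutch_dir/nutch invertlinks $linkdb_dir -dir $segments_dir
  $nutch_dir/nutch index $new_indexes_dir $webdb_dir $linkdb_dir $segments_dir/*
  $nutch_dir/nutch dedup $new_indexes_dir
  $nutch_dir/nutch merge $index_dir $new_indexes_dir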

Basically, what made me wonder is that the original crawl took me 2
minutes, while the recrawl (same depth specified) has been running for
over 3 hours and is still going. After I have recrawled once, I believe
it then speeds up.
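
One thing I plan to check (assuming the readdb tool in my version
supports -stats) is how many unfetched URLs are sitting in the db before
and after the recrawl:

  # If db_unfetched is already large before the recrawl starts, that would
  # explain why generate keeps handing the fetcher brand new pages.
  $nutch_dir/nutch readdb $webdb_dir -stats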

Thanks for any feedback you can offer,
 Matt
