Posted to user@nutch.apache.org by Matthew Holt <mh...@redhat.com> on 2006/07/28 21:40:52 UTC

Recrawling... methodology?

I need some help clarifying whether recrawling is doing exactly what I
think it is. Here's how I think a recrawl should work:

I crawl my intranet with a depth of 2. Later, I recrawl using the script 
found below: 
http://wiki.apache.org/nutch/IntranetRecrawl#head-e58e25a0b9530bb6fcdfb282fd27a207fc0aff03 


In my recrawl, I also specify a depth of 2. I expect it to re-fetch each
of the pages from before and, if a page has changed, update that page's
content in the index. If a page has changed and contains new links,
those links are followed to a maximum depth of 2.
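
For reference, the generate step I'm talking about is the one below
(these are the variables from the script; my reading of -adddays is only
an assumption, so please correct me if it's wrong):

  # My understanding: -adddays shifts "now" forward by $adddays days, so pages
  # whose fetch interval is about to expire get re-selected for fetching; on its
  # own it shouldn't pull brand new pages into the fetch list.
  $nutch_dir/nutch generate $webdb_dir $segments_dir -adddays $adddays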

This is how I think a typical recrawl should work. However, when I
recrawl using the script linked above, tons of new pages are indexed,
whether they have changed or not. It seems as if crawling the content
with a depth of 2 and then recrawling with a depth of 2 actually adds a
couple more crawl depth levels, so the outcome is a crawl with a depth
of 4 (instead of a depth-2 crawl followed by a recrawl that just catches
new and changed pages).
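
If I understand the loop correctly, the compounding would look something
like this (my own sketch of the first recrawl pass, not verified):

  # Pass 1: generate selects the pages that are due, *plus* any URLs that
  # updatedb queued as unfetched during the last round of the original crawl.
  $nutch_dir/nutch generate $webdb_dir $segments_dir -adddays $adddays
  segment=`ls -d $segments_dir/* | tail -1`
  $nutch_dir/nutch fetch $segment                 # fetches one level past the old frontier
  $nutch_dir/nutch updatedb $webdb_dir $segment   # queues the links found there as unfetched

  # Pass 2: those newly queued links are now eligible, so the frontier moves
  # out one more level -- effectively depth 3 and 4 relative to the original crawl.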

The current steps of the recrawl are roughly as follows (paraphrasing
the script):

  depth=2   # however many depth levels were specified
  for ((i=0; i < depth; i++))
  do
    $nutch_dir/nutch generate $webdb_dir $segments_dir -adddays $adddays
    segment=`ls -d $segments_dir/* | tail -1`
    $nutch_dir/nutch fetch $segment
    $nutch_dir/nutch updatedb $webdb_dir $segment
  done

followed by invertlinks, index, dedup, and merge.
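
For completeness, the post-loop steps in my copy of the script look
roughly like this (the linkdb/index paths are just placeholders for my
local directories, and the exact arguments are my adaptation of the wiki
script, so treat them as approximate):

  $nutch_dir/nutch invertlinks $linkdb_dir -dir $segments_dir
  $nutch_dir/nutch index $new_indexes_dir $webdb_dir $linkdb_dir $segments_dir/*
  $nutch_dir/nutch dedup $new_indexes_dir
  $nutch_dir/nutch merge $index_dir $new_indexes_dir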

Basically, what made me wonder is that the original crawl took me 2
minutes, while the recrawl (same depth specified) has been running for
over 3 hours and is still going. After I have recrawled once, I believe
it then speeds up.
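
One thing I plan to check (assuming the readdb tool in my version
supports -stats) is how many unfetched URLs are sitting in the db before
and after the recrawl:

  # If db_unfetched is already large before the recrawl starts, that would
  # explain why generate keeps handing the fetcher brand new pages.
  $nutch_dir/nutch readdb $webdb_dir -stats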

Thanks for any feedback you can offer,
 Matt
