Posted to user@nutch.apache.org by Steven Yelton <st...@missiondata.com> on 2006/02/24 18:30:10 UTC

Incremental search of a single domain

This has all probably been hashed out ad nauseam, but I haven't seen an 
end-to-end howto on what I am trying to do.  If I can get all the kinks 
worked out (and understand all the pieces), I'll be glad to write one.

I have a domain that has several hundred thousand documents.  I would 
like to:
   * Set up an initial index and db using the crawl tool (to some 
reasonable depth) to get me started
   * Hook up the NutchBean to actually do the searches
   * Crawl the 'next 1000 (or so) links' each day to go 'deeper' into 
the site, and refresh the index after each of these incremental crawls
   * Keep the pages fresh (no more than 15 days old)
   * Remove pages when they disappear from the server
   * Use a finite amount of resources

Here is what I have so far:
   * nutch crawl myurls -dir myindex -depth 5
        This creates 5 segments with:
           Number of pages: 32509
           Number of links: 545061
     I assume this means that I have fetched and indexed 32509 
pages and found 545061 links in the process (does this mean that I have 
512552 pages to go?).  A sketch of how I check these counts is after 
this list.

   * Set up the NutchBean to serve searches

   * Change db.default.fetch.interval to 15 (the property override 
itself is sketched after this list)

   * Daily, create a new segment, index it, dedup, and merge it into the 
main index
       # Grab some pages and update the database
       nutch generate index/db index/segments -topN  1000
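        # grab the segment directory generate just created (segment names
        # are timestamps, so the newest one sorts last)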
       s1=`ls -d index/segments/2* | tail -1`
       nutch fetch $s1
       nutch updatedb index/db $s1

        #update the segments from the db (scores and anchor text)
       nutch updatesegs index/db index/segments index/workdir

       #index for the additional segment we added
       nutch index $s1 -dir index/workdir

       #delete duplicate content
       nutch dedup index/workdir index/segments

       #merge all segments into the master index
       nutch merge -workingdir index/workdir index/index index/segments/2*

        rm -rf index/workdir

   * Tell the search server that the index changed ('reload' the NutchBean)
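
A couple of supporting pieces, in case they matter.  These are just 
sketches of how I have it set up, so don't take the exact options as 
gospel.  To watch the page/link counts grow after each round, I check 
the db stats (readdb -stats, if I have the flag right):

        # print the "Number of pages" / "Number of links" counts for the db
        nutch readdb index/db -stats

And the 15 day refresh is just the property override; I put it in 
conf/nutch-site.xml, which (as I understand it) overrides the 30 day 
default from nutch-default.xml:

        <property>
          <name>db.default.fetch.interval</name>
          <value>15</value>
          <description>Refetch pages every 15 days.</description>
        </property>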


This all seems to work well and I happily do this for 15 days. 

<time passes>

Now, as I understand it, nutch will see that a page is older than 15 
days, refetch it, and put it in one of my new segments.  The copy in the 
old segment is ignored and the page inside the new segment will be used.
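
One way I have been convincing myself of this is to dump a single page 
record out of the db and look at its next fetch date; something like 
the following (the URL is just a placeholder):

        # dump the db record for one URL; the output includes the page's
        # next scheduled fetch date, score, and outlink count
        nutch readdb index/db -pageurl http://www.example.com/some/page.html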

Finally, my questions:
   * I now have over 600k links and 40k pages in my database.  How can I 
get nutch to fetch existing content (make sure it's fresh) instead of 
fetching new content?  Is there a deterministic approach nutch takes (or 
a way to influence it)?
   * Is there any way to know when I can safely delete a segment?  That 
is, how can I make sure all the pages in an old segment have been fetched 
in a subsequent one?  (My rough guess is sketched after this list.)
   * I see some mention of inverting links in the Internet crawl.  This 
isn't done in the 0.7.1 crawltool (which I used to develop my 
incremental updates).  Why would I want/need to do this in my situation 
(a single site crawl)?
   * Is there anything fundamentally wrong (or even screwy) with a setup 
like this?  Are my assumptions correct?  I realize that with these 
numbers I will never 'catch up' with the initial crawl if all I am doing 
is refreshing content (I guess I can do another 'big' segment each week, 
or something).
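
For the segment deletion question above, the best I have come up with 
is a pure age check: once a segment is older than the fetch interval 
plus a few days of slack, assume everything in it has been refetched 
into a newer segment and remove it.  Segment directory names are 
timestamps (yyyyMMddHHmmss), so the check is just a string compare.  
This is only a sketch; the 'assume' part is exactly what I don't know 
how to verify:

        # delete segments older than fetch interval (15) + slack (5) days
        cutoff=`date -d '20 days ago' +%Y%m%d%H%M%S`
        for seg in index/segments/2*; do
            if [ "`basename $seg`" \< "$cutoff" ]; then
                echo "would delete $seg"   # swap echo for 'rm -rf' once sure
            fi
        done
        # (presumably followed by another dedup/merge so the master index
        # no longer references the deleted segments)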



Sorry for the long post, and thanks in advance!
Steven