Posted to user@nutch.apache.org by Steven Yelton <st...@missiondata.com> on 2006/02/24 18:30:10 UTC
Incremental search of a single domain
This has all probably been hashed out ad nauseam, but I haven't seen an
end-to-end howto on what I am trying to do. If I can get all the kinks
worked out (and understand all the pieces), I'll be glad to write one.
I have a domain that has several hundred thousand documents. I would
like to:
* Set up an initial index and db using the crawl tool (to some
reasonable depth) to get me started
* Hook up the NutchBean to actually do the searches
* Continually crawl the 'next 1000 (or so) links' daily to go
'deeper' into the site. Refresh the index after each of these
incremental crawls
* Keep the pages fresh (no more than 15 days old)
* Remove pages when they disappear from the server
* Use a finite amount of resources
Here is what I have so far:
* nutch crawl myurls -dir myindex -depth 5
This creates 5 segments with:
Number of pages: 32509
Number of links: 545061
I assume this means that I have fetched and indexed 32509
pages and found 545061 links in the process (does this mean that I have
512552 pages to go?)
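(As an aside, the same counts can be read back out of the db at any
time with the readdb tool; assuming the db from the crawl above ended
up at myindex/db, something like this should print them:)
nutch readdb myindex/db -stats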
* Set up the NutchBean to serve searches
* Change db.default.fetch.interval=15
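(The property itself can be overridden in conf/nutch-site.xml, which
takes precedence over the 30-day default in nutch-default.xml; roughly
like this, with the value in days:)
<property>
  <name>db.default.fetch.interval</name>
  <value>15</value>
  <description>Default number of days between re-fetches of a page.</description>
</property>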
* Daily, create a new segment, index it, dedup, and merge it into the
main index
# Grab some pages and update the database
nutch generate index/db index/segments -topN 1000
s1=`ls -d index/segments/2* | tail -1`
nutch fetch $s1
nutch updatedb index/db $s1
# update the segments with scores and anchors from the db
nutch updatesegs index/db index/segments index/workdir
# index the newly fetched segment
nutch index $s1 -dir index/workdir
#delete duplicate content
nutch dedup index/workdir index/segments
#merge all segments into the master index
nutch merge -workingdir index/workdir index/index index/segments/2*
# clean up the temporary working dir
rm -rf index/workdir
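(To run this daily, a cron entry pointing at the block above should be
enough; for example, assuming it is saved as a script at
/usr/local/nutch/bin/daily-crawl.sh, which is just a made-up path:)
# run the incremental fetch/index cycle every night at 02:30
30 2 * * * /usr/local/nutch/bin/daily-crawl.sh >> /var/log/nutch-daily.log 2>&1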
* Tell the search server that the index changed ('reload' the NutchBean)
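(The crudest way I know to do that reload is to bounce the servlet
container hosting the search webapp, since the NutchBean keeps the old
index open; a rough sketch, assuming Tomcat with CATALINA_HOME set:)
# restart the container so the webapp re-opens the merged index
$CATALINA_HOME/bin/shutdown.sh
sleep 5
$CATALINA_HOME/bin/startup.sh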
This all seems to work well and I happily do this for 15 days.
<time passes>
Now as I understand it nutch will see that a page is older than 15 days
and will refetch it and put it in one of my new segments. The old
segment is ignored and the page inside the new segment will be used.
Finally, my questions:
* I now have over 600k links and 40k pages in my database. How can I
get nutch to fetch existing content (make sure it's fresh) instead of
fetching new content? Is there a deterministic approach nutch takes (or
a way to influence it)?
* Is there any way to know when I can safely delete a segment? That
is, how can I make sure all the pages in an old segment have been fetched
in a subsequent one?
* I see some mention of inverting links in the Internet crawl. This
isn't done in the 0.7.1 crawltool (which I used to develop my
incremental updates). Why would I want/need to do this in my situation
(a single site crawl)?
* Is there anything fundamentally wrong (or even screwy) with a setup
like this? Are my assumptions correct? I realize that with these
numbers I will never 'catch up' with the initial crawl if all I am doing
is refreshing content (I guess I can do another 'big' segment each week,
or something).
Sorry for the long post, and thanks in advance!
Steven