Posted to dev@nutch.apache.org by ph...@comcast.net on 2005/09/27 21:20:19 UTC

Phasing out old segments

Hi All,
	I have been using Nutch for a while now, with 300M+ urls crawled over the 
last few months, so we have a lot of segments containing a mix of new data and 
recrawl data. While it is assumed safe to delete segments older than the refresh 
rate, it is not certain that all the urls in the old segments have actually been 
recrawled, given the sheer number of urls in the database.
Some of the older segments also contain the top-level homepages of many of the 
domains, so I'd like to be sure that these are refreshed in a newer 
segment. 

Is anybody tackling this problem? 
If not, I have been thinking of building the following tool:

1) Read a segment or a collection of segments.
2) Compare each url to its entry in the webdb:
	If the url was marked with -adddays in a previous fetchlist generation, 
ignore it.
	If the url was not accessible at the last crawl, add it to the list.
	If the url was last crawled with a 200 or 30x status more than refresh-rate 
days ago, add it to the list. (I'm not sure how to handle 303 pages.)
3) Sort the list and run it against the url filters.
4) Generate a fetchlist from these urls. If there are no urls in the list, 
then the segment is ready for deletion.
This part can be just a report on the state of the segments, or it can generate 
the segment without marking it in the webdb. A rough sketch of the step 2 
decision logic follows below.
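
To make the step 2 checks concrete, here is a minimal, self-contained Java 
sketch of the per-url decision plus the step 4 "nothing left to refetch" test. 
It deliberately does not use the real webdb or segment reader APIs; the 
UrlEntry class, its fields (status, lastFetchTime, manuallyScheduled), and the 
refreshRateDays parameter are hypothetical stand-ins for whatever the actual 
readers return.

import java.util.ArrayList;
import java.util.List;

// Hypothetical record of what we know about a url from the old segment/webdb.
class UrlEntry {
    String url;
    int status;                 // status of the last fetch, e.g. 200, 301, 404
    long lastFetchTime;         // epoch millis of the last fetch attempt
    boolean manuallyScheduled;  // true if scheduled via -adddays

    UrlEntry(String url, int status, long lastFetchTime, boolean manuallyScheduled) {
        this.url = url;
        this.status = status;
        this.lastFetchTime = lastFetchTime;
        this.manuallyScheduled = manuallyScheduled;
    }
}

public class SegmentPhaseOut {

    static final long DAY_MS = 24L * 60 * 60 * 1000;

    // Step 2: decide which urls from an old segment still need a refetch.
    static List<String> selectForRefetch(List<UrlEntry> entries,
                                         int refreshRateDays,
                                         long now) {
        List<String> refetch = new ArrayList<String>();
        for (UrlEntry e : entries) {
            if (e.manuallyScheduled) {
                continue;  // already handled by a -adddays fetchlist
            }
            boolean fetchFailed = e.status == 0 || e.status >= 400;
            boolean stale = (e.status == 200 || (e.status >= 300 && e.status < 400))
                    && (now - e.lastFetchTime) > refreshRateDays * DAY_MS;
            if (fetchFailed || stale) {
                refetch.add(e.url);
            }
        }
        return refetch;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        List<UrlEntry> entries = new ArrayList<UrlEntry>();
        entries.add(new UrlEntry("http://example.com/", 200, now - 40 * DAY_MS, false));
        entries.add(new UrlEntry("http://example.com/gone", 404, now - 5 * DAY_MS, false));

        // Step 4: if nothing needs refetching, the segment is ready for deletion.
        List<String> list = selectForRefetch(entries, 30, now);
        System.out.println(list.isEmpty() ? "segment can be deleted"
                                          : "refetch: " + list);
    }
}

In the real tool the resulting list would then be sorted and run through the 
url filters (step 3) before writing the fetchlist.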

I'd welcome any comments. 
If others find it useful, I will be happy to post it once it's done.

Phoebe.