Posted to dev@nutch.apache.org by ph...@comcast.net on 2005/09/27 21:20:19 UTC
Phasing out old segments
Hi All,
I have been using Nutch for a while now, with 300M+ urls crawled over the
last few months, so we have a lot of segments containing a mix of new data and
recrawl data. While it is assumed safe to delete segments older than the
refresh rate, it is not certain that all the urls in the old segments have
been recrawled, given the sheer number of urls in the database.
Some of the older segments also contain the top-level homepages of many of the
domains, so I'd like to be sure that these are refreshed in a newer
segment.
Is anybody tackling this problem?
If not, I have been thinking of building the following tool:
1) Read a segment or a collection of segments.
2) Compare each url to its entry in webdb:
If the url was marked with -adddays in a previous fetchlist generation,
ignore it.
If the url was not accessible at the last crawl, add it to the list.
If the url was last crawled with a 200 or 30x status longer than refresh-rate
days ago, add it to the list. (Not sure about 303 pages.)
3) Sort the list and run it against the url filters.
4) Generate a fetchlist with these urls. If there are no urls in the list,
then the segment is ready for deletion.
This step could produce just a report on the state of the segments, or
generate the segment without marking it in webdb.
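The per-url decision in step 2 could be sketched roughly as below. This is plain Java, not the actual Nutch API: PageRecord, its fields, and needsRefetch are hypothetical stand-ins for a webdb entry and the filtering logic, just to illustrate the rules above.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for a webdb page entry; field names are
// illustrative, not the real Nutch Page class.
class PageRecord {
    String url;
    int status;            // HTTP status of the last fetch; -1 means not accessible
    long lastFetchMillis;  // time of the last fetch attempt
    boolean adddaysMarked; // already pushed forward by a -adddays fetchlist run

    PageRecord(String url, int status, long lastFetchMillis, boolean adddaysMarked) {
        this.url = url;
        this.status = status;
        this.lastFetchMillis = lastFetchMillis;
        this.adddaysMarked = adddaysMarked;
    }
}

public class SegmentPhaseOut {
    /** Refresh rate in days; an assumed configuration value. */
    static final long REFRESH_DAYS = 30;

    /** Step 2: should this url go on the refetch list? */
    static boolean needsRefetch(PageRecord p, long nowMillis) {
        if (p.adddaysMarked) {
            return false;  // marked with -adddays previously: ignore it
        }
        if (p.status < 0) {
            return true;   // not accessible at the last crawl: add to list
        }
        long ageDays = TimeUnit.MILLISECONDS.toDays(nowMillis - p.lastFetchMillis);
        boolean okStatus = p.status == 200 || (p.status >= 300 && p.status < 310);
        // 200 or 30x, but last crawled longer than refresh-rate days ago
        return okStatus && ageDays > REFRESH_DAYS;
    }
}
```

A url from a given segment that passes needsRefetch would then go through the url filters (step 3) and into the new fetchlist (step 4); a segment contributing nothing to that list is a deletion candidate.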
I'd welcome any comments.
If others find it useful, I will be happy to post it once it's done.
Phoebe.