Posted to dev@nutch.apache.org by Andrzej Bialecki <ab...@getopt.org> on 2005/08/04 23:33:05 UTC

Detecting unmodified content patches (Re: near-term plan)

Doug Cutting wrote:
> Andrzej Bialecki wrote:
> 
>> So, I would propose a deadline of Aug 8 for the last commits, and then 
>> perhaps Aug 15 for the release?
> 
> 
> Sounds good to me.  Thanks for helping with this!

Unfortunately, the patches related to detecting the unmodified content 
will have to wait until after the release.

Here's the problem: It's quite easy to add this checking and recording 
capability to all fetcher plugins, fetchlist generation and db update 
tools, and I've done this in my local patches. However, after a while I 
discovered a serious problem in the way Nutch currently manages "phasing 
out" of old segment data. If we assume that we always refresh after some 
fixed interval (30 days, or whatever), then we can safely delete 
segments older than 30 days. If the interval varies, then potentially we 
could be stuck with some segments with very old (but still valid) data. 
This is very inefficient, because after a while a given segment might 
hold only a couple of such pages, while the rest of its pages 
would have to be removed again and again by deduplication because newer 
copies would exist in newer segments.
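
As a rough illustration of the difference (a toy model, not Nutch code; the
function names and the dict layout are assumptions made for this sketch), a
fixed interval allows pruning by age alone, whereas a variable interval means
checking which segment holds the newest copy of each page:

```python
from datetime import date, timedelta

def deletable_fixed(segment_dates, today, interval_days=30):
    """Fixed refetch interval: any segment older than the interval can go."""
    cutoff = today - timedelta(days=interval_days)
    return sorted(name for name, fetched in segment_dates.items() if fetched < cutoff)

def live_page_counts(segments):
    """Variable interval: count, per segment, the pages whose newest copy is there.

    `segments` maps a segment name to (fetch_date, list_of_urls).
    """
    newest = {}  # url -> name of the segment with the most recent fetch of that url
    for name, (fetched, urls) in segments.items():
        for url in urls:
            if url not in newest or segments[newest[url]][0] < fetched:
                newest[url] = name
    counts = {name: 0 for name in segments}
    for name in newest.values():
        counts[name] += 1
    return counts
```

In this model an old segment with a count of 1 still cannot be deleted, yet
every other page in it is dead weight that deduplication has to discard.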

Moreover (and this is the worst problem), if such segments are lost, the 
information in the webdb must be updated so as to force refetching, even 
though "If-Modified-Since" or the MD5 indicates that the page is 
unchanged since the last fetch. Currently the only way to do 
this is to "add days" - but with a variable refetch interval that 
doesn't make much sense. I think we need a better way to track 
which pages are "missing" from the segments and have to be re-fetched, 
or a better DB update mechanism for when we lose some segments.
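
A small sketch of why "add days" fits poorly with variable intervals. It
assumes that "add days" simply shifts the current time forward when deciding
which pages are due; the function name and data layout are hypothetical, not
Nutch code:

```python
from datetime import date, timedelta

def due_pages(next_fetch, today, added_days=0):
    """Return the URLs that are due once `added_days` is added to today.

    `next_fetch` maps a URL to the date of its next scheduled fetch.
    """
    effective = today + timedelta(days=added_days)
    return sorted(url for url, when in next_fetch.items() if when <= effective)

# Pages with different refetch intervals, all present in one lost segment.
schedule = {
    "daily": date(2005, 8, 5),
    "monthly": date(2005, 9, 3),
    "rarely": date(2006, 2, 1),
}
print(due_pages(schedule, date(2005, 8, 4), added_days=30))
```

With 30 added days only "daily" and "monthly" become due; "rarely" stays
missing unless the shift is made so large that nearly everything in the webdb
is refetched as well.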

Perhaps we should extend the Page to record which segment holds the 
latest version of the page? But segments don't have unique IDs now (a 
directory name is too fragile and too easily changed) ...
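
One way this idea could look, as a minimal sketch only: the `segment_id` and
`force_refetch` fields and both helper functions are hypothetical and do not
exist in the current Page class. The ID is generated rather than derived from
the directory name:

```python
import uuid
from dataclasses import dataclass

@dataclass
class Page:
    url: str
    segment_id: str = ""         # hypothetical: ID of the segment with the latest copy
    force_refetch: bool = False  # hypothetical: set when that segment is lost

def new_segment_id():
    """Generate an ID that survives renaming or moving the segment directory."""
    return uuid.uuid4().hex

def mark_lost(pages, lost_segment_id):
    """Flag every page whose latest copy was in the lost segment; return the count."""
    count = 0
    for page in pages:
        if page.segment_id == lost_segment_id:
            page.force_refetch = True
            count += 1
    return count
```

With such a field, losing a segment would only affect the pages that actually
point at it, regardless of their individual refetch intervals.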

Related question: in the FetchListEntry we have a "fetch" flag. I think 
that after minor modifications to the FetchListTool (so that it generates only 
the entries we are supposed to fetch) we could get rid of this flag, 
or change its semantics to mean "unconditionally fetch, even if unmodified".
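
The second option could be sketched roughly as follows. This is a toy model of
the proposed semantics, not the existing FetchListEntry; `request_headers` and
its parameters are assumed names:

```python
from email.utils import formatdate

def request_headers(force, last_fetch_epoch):
    """Build the HTTP request headers for one fetchlist entry.

    force=True models the proposed "unconditionally fetch" meaning of the flag;
    otherwise the request is made conditional on the last fetch time, if known.
    """
    headers = {}
    if not force and last_fetch_epoch is not None:
        # HTTP dates are always expressed in GMT.
        headers["If-Modified-Since"] = formatdate(last_fetch_epoch, usegmt=True)
    return headers
```

A forced entry sends no conditional header, so the server returns the full
content even if it has not changed.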

Any comments?

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com