Posted to dev@nutch.apache.org by Andrzej Bialecki <ab...@getopt.org> on 2005/12/30 17:31:01 UTC
Adaptive fetch interval & unmodified content detection, episode II
Hi,
I've been working on a set of patches to implement this functionality
for the mapred branch.
I have a workable solution now, but before I decide to commit it I'd
like to solicit some comments. Please see the latest patch available
from JIRA NUTCH-61.
Based on the past discussions, I decided to implement a maximum limit
for fetch interval, after which pages are unconditionally refetched,
even if they are marked as UNMODIFIED. The reason for this is that pages
could be stuck in this state for a very long time, and in the meantime
the segments that contain copies of such pages could be expired (deleted
or lost).
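The hard-limit rule described above could be sketched roughly as follows; the class, method, and enum names here are illustrative, not the actual code in the NUTCH-61 patch:

```java
// Sketch of the unconditional-refetch cutoff described above; names
// (RefetchPolicy, Decision) are illustrative, not from the patch.
class RefetchPolicy {
    enum Decision { FORCE_FETCH, CHECK_IF_MODIFIED }

    private final long maxIntervalMs;

    RefetchPolicy(long maxIntervalMs) {
        this.maxIntervalMs = maxIntervalMs;
    }

    /** Past the hard limit we always refetch, even pages previously
     *  marked UNMODIFIED, so a fresh copy exists in a live segment
     *  before the old segments expire. */
    Decision decide(long lastFetchMs, long nowMs) {
        if (nowMs - lastFetchMs >= maxIntervalMs) {
            return Decision.FORCE_FETCH;
        }
        return Decision.CHECK_IF_MODIFIED;
    }
}
```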
All protocol plugins have been changed to check for content
modification, and return a specific status if it's unmodified, avoiding
fetching the actual content.
Modification is also checked based on a page signature, using the
recently added pluggable signature implementations.
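A minimal sketch of that signature check, assuming an MD5-over-raw-content signature (the pluggable implementations may instead hash normalized or parsed text):

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Sketch of signature-based change detection; an MD5 digest of the raw
// content stands in for whatever pluggable Signature is configured.
class SignatureCheck {
    static byte[] signature(byte[] content) {
        try {
            return MessageDigest.getInstance("MD5").digest(content);
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException(e); // MD5 is always present in the JDK
        }
    }

    /** The page counts as changed only if old and new signatures differ. */
    static boolean changed(byte[] oldSig, byte[] newSig) {
        return !MessageDigest.isEqual(oldSig, newSig);
    }
}
```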
The main remaining doubt that I have is about the adaptive fetch
interval functionality. The patch contains a framework for pluggable
FetchSchedule implementations, which modify the fetch interval and the
next fetch time based on the following information:
* previous fetch time
* previous modification time (may be 0 if unknown)
* previous fetch interval
* current fetch time
* current modification time (may be 0 if unknown)
* a boolean value "changed", based on checking the page signatures (old
vs. new), if the page's content is available
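Put together, the plug point could look roughly like the interface below, with a naive adaptive rule (halve the interval when the page changed, double it when it did not) as one possible implementation; both the interface shape and the bounds are illustrative, not the patch's actual API:

```java
// Hypothetical sketch of a pluggable FetchSchedule contract, built
// from the inputs listed above; the real interface may differ.
interface FetchScheduleSketch {
    /** Compute the next fetch interval in ms.
     *  Modification times may be 0 if unknown. */
    long nextInterval(long prevFetchTime, long prevModifiedTime,
                      long prevInterval, long fetchTime,
                      long modifiedTime, boolean changed);
}

/** Naive adaptive rule: shrink the interval when the page changed,
 *  grow it when it did not, clamped to illustrative bounds. */
class SimpleAdaptiveSchedule implements FetchScheduleSketch {
    static final long MIN = 60_000L;                    // 1 minute
    static final long MAX = 30L * 24 * 3600 * 1000;     // 30 days

    public long nextInterval(long prevFetchTime, long prevModifiedTime,
                             long prevInterval, long fetchTime,
                             long modifiedTime, boolean changed) {
        long next = changed ? prevInterval / 2 : prevInterval * 2;
        return Math.max(MIN, Math.min(MAX, next));
    }
}
```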
For efficiency reasons, most of this information is stored and passed to
processing jobs inside instances of CrawlDatum - during the key step of
DB update, the other parts of the segments (such as Content, ParseData
or ParseText) are not used, which prevents easy access to other page
metadata. For now, I added both the signature and the modifiedTime to
CrawlDatum as separate attributes, but I'm considering putting them (and
any other values that users might want to add to CrawlDB) into a
Properties attribute.
The reason for this is that reality may be more complicated than the
simple model above. Various sites use additional information to control
re-fetching, besides the "Last-Modified" header that we use now, such as:
* Expires header
* ETag header
* Caching headers
* page metadata
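The first two validators above translate directly into a conditional HTTP request: sending them lets the server answer "304 Not Modified" so the protocol plugin can return an unmodified status without downloading the body. A sketch using the JDK's HttpURLConnection (class and method names here are illustrative, not Nutch's protocol API):

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Sketch of attaching stored validators to an outgoing request;
// nothing is actually sent until the caller connects.
class ConditionalFetch {
    static HttpURLConnection prepare(String url, long lastModifiedMs,
                                     String etag) {
        try {
            HttpURLConnection conn =
                (HttpURLConnection) new URL(url).openConnection();
            if (lastModifiedMs > 0) {
                // sent as the If-Modified-Since request header
                conn.setIfModifiedSince(lastModifiedMs);
            }
            if (etag != null) {
                conn.setRequestProperty("If-None-Match", etag);
            }
            return conn;
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}
```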
Additionally, some schemes for phasing out old segments might want to
store some segment information inside the CrawlDb, such as the last
segment name, where the latest copy can be found.
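An extensible, lazily allocated Properties attribute on CrawlDatum, as proposed above, might look like the following; the wrapper class and keys are illustrative:

```java
import java.util.Properties;

// Sketch of per-page extensible metadata (signature, modifiedTime,
// last segment name, ...) on CrawlDatum; the Properties instance is
// only allocated on first write, so empty datums stay cheap.
class DatumMetadata {
    private Properties meta; // null until first write

    void put(String key, String value) {
        if (meta == null) meta = new Properties();
        meta.setProperty(key, value);
    }

    String get(String key) {
        return meta == null ? null : meta.getProperty(key);
    }
}
```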
So, I'll hold off on committing these patches until we can reach some
agreement on how to proceed. We should keep as little information in
CrawlDB as possible, but no less than is necessary... ;-)
Please review the patches and play around with them - they already work
properly.
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
Re: Adaptive fetch interval & unmodified content detection, episode II
Posted by Doug Cutting <cu...@nutch.org>.
Andrzej Bialecki wrote:
> For efficiency reasons, most of this information is stored and passed to
> processing jobs inside instances of CrawlDatum - for the key step of DB
> update any other parts of segments (such as Content, ParseData or
> ParseText) are not used, which prevents easy access to other page
> metadata. For now, I added both the signature and the modifiedTime to
> CrawlDatum as separate attributes, but I'm considering to put them (and
> any other values that users might want to add to CrawlDB) into a
> Properties attribute.
Yes, I agree that CrawlDatum should have extensible properties. If
these are empty, then no Properties instance should be allocated.
This is great stuff. I look forward to getting it committed!
Doug