Posted to dev@nutch.apache.org by Doug Cutting <cu...@nutch.org> on 2006/01/06 20:43:25 UTC

Re: Adaptive fetch interval & unmodified content detection, episode II

Andrzej Bialecki wrote:
> For efficiency reasons, most of this information is stored and passed to 
> processing jobs inside instances of CrawlDatum - for the key step of DB 
> update any other parts of segments (such as Content, ParseData or 
> ParseText) are not used, which prevents easy access to other page 
> metadata. For now, I added both the signature and the modifiedTime to 
> CrawlDatum as separate attributes, but I'm considering putting them (and 
> any other values that users might want to add to CrawlDB) into a 
> Properties attribute.

Yes, I agree that CrawlDatum should have extensible properties.  If 
these are empty, then no Properties instance should be allocated.
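
To make the allocation point concrete, here is a minimal sketch of what I mean. DatumSketch is a hypothetical stand-in, not the real CrawlDatum, and the method names are assumptions for illustration only. The Properties field stays null until the first value is set, so entries that carry no extra metadata pay nothing for it.

```java
import java.util.Properties;

// Hypothetical stand-in for CrawlDatum; names here are illustrative only.
class DatumSketch {
    // Remains null until the first property is set, so a datum
    // without extra metadata never allocates a Properties instance.
    private Properties props;

    public void setProperty(String key, String value) {
        if (props == null) {
            props = new Properties(); // allocate lazily, on first write
        }
        props.setProperty(key, value);
    }

    public String getProperty(String key) {
        // Reads never trigger allocation.
        return props == null ? null : props.getProperty(key);
    }

    public boolean hasProperties() {
        return props != null && !props.isEmpty();
    }

    // Exposed only so the sketch can show whether allocation happened.
    public boolean isAllocated() {
        return props != null;
    }

    public static void main(String[] args) {
        DatumSketch d = new DatumSketch();
        System.out.println("allocated before set: " + d.isAllocated());
        d.setProperty("signature", "abc123");
        System.out.println("allocated after set: " + d.isAllocated());
        System.out.println("signature = " + d.getProperty("signature"));
    }
}
```

Serialization could follow the same idea, for example by writing a zero count when there are no properties, but that part is not shown here.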

This is great stuff.  I look forward to getting it committed!

Doug