Posted to dev@nutch.apache.org by Doug Cutting <cu...@nutch.org> on 2006/01/06 20:43:25 UTC
Re: Adaptive fetch interval & unmodified content detection, episode II
Andrzej Bialecki wrote:
> For efficiency reasons, most of this information is stored and passed to
> processing jobs inside instances of CrawlDatum - for the key step of DB
> update any other parts of segments (such as Content, ParseData or
> ParseText) are not used, which prevents easy access to other page
> metadata. For now, I added both the signature and the modifiedTime to
> CrawlDatum as separate attributes, but I'm considering putting them (and
> any other values that users might want to add to CrawlDB) into a
> Properties attribute.
Yes, I agree that CrawlDatum should have extensible properties. If
these are empty, then no Properties instance should be allocated.
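To illustrate the lazy-allocation idea, here is a rough sketch in Java. The class name LazyPropsDatum and its methods are hypothetical and are not the actual CrawlDatum API; the keys "signature" and "modifiedTime" just echo the two values mentioned above. The point is only that the Properties field stays null until the first value is set, so entries with no extra metadata pay nothing for it.

```java
import java.util.Properties;

// Hypothetical stand-in for CrawlDatum; not the real Nutch class.
class LazyPropsDatum {
    // Stays null until a property is actually set, so entries
    // without extra metadata never allocate a Properties instance.
    private Properties props;

    public void setProperty(String key, String value) {
        if (props == null) {
            props = new Properties(); // allocate on first use only
        }
        props.setProperty(key, value);
    }

    public String getProperty(String key) {
        // No allocation on read: a missing map simply means "no value".
        return props == null ? null : props.getProperty(key);
    }

    public boolean isAllocated() {
        return props != null;
    }
}
```

A real implementation would also need to write and read these properties compactly during serialization, for example by writing a zero count when nothing is set, but that part is left out of this sketch.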
This is great stuff. I look forward to getting it committed!
Doug