Posted to dev@nutch.apache.org by Andrzej Bialecki <ab...@getopt.org> on 2005/12/30 17:31:01 UTC
Adaptive fetch interval & unmodified content detection, episode II
Hi,
I've been working on a set of patches to implement this functionality
for the mapred branch.
I have a workable solution now, but before I decide to commit it I'd
like to solicit some comments. Please see the latest patch available
from JIRA NUTCH-61.
Based on the past discussions, I decided to implement a maximum limit
for fetch interval, after which pages are unconditionally refetched,
even if they are marked as UNMODIFIED. The reason for this is that pages
could be stuck in this state for a very long time, and in the meantime
the segments that contain copies of such pages could be expired (deleted
or lost).
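The hard-limit rule described above could be sketched roughly as follows; the class, method, and enum names here are illustrative, not the actual code in the NUTCH-61 patch:

```java
// Sketch of the unconditional-refetch cutoff described above; names
// (RefetchPolicy, Decision) are illustrative, not from the patch.
class RefetchPolicy {
    enum Decision { FORCE_FETCH, CHECK_IF_MODIFIED }

    private final long maxIntervalMs;

    RefetchPolicy(long maxIntervalMs) {
        this.maxIntervalMs = maxIntervalMs;
    }

    /** Past the hard limit we always refetch, even pages previously
     *  marked UNMODIFIED, so a fresh copy exists in a live segment
     *  before the old segments expire. */
    Decision decide(long lastFetchMs, long nowMs) {
        if (nowMs - lastFetchMs >= maxIntervalMs) {
            return Decision.FORCE_FETCH;
        }
        return Decision.CHECK_IF_MODIFIED;
    }
}
```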
All protocol plugins have been changed to check for content
modification, and return a specific status if it's unmodified, avoiding
fetching the actual content.
Modification is also checked based on a page signature, using the
recently added pluggable signature implementations.
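A minimal sketch of that signature check, assuming an MD5-over-raw-content signature (the pluggable implementations may instead hash normalized or parsed text):

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Sketch of signature-based change detection; an MD5 digest of the raw
// content stands in for whatever pluggable Signature is configured.
class SignatureCheck {
    static byte[] signature(byte[] content) {
        try {
            return MessageDigest.getInstance("MD5").digest(content);
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException(e); // MD5 is always present in the JDK
        }
    }

    /** The page counts as changed only if old and new signatures differ. */
    static boolean changed(byte[] oldSig, byte[] newSig) {
        return !MessageDigest.isEqual(oldSig, newSig);
    }
}
```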
The main remaining doubt that I have is about the adaptive fetch
interval functionality. The patch contains a framework for pluggable
FetchSchedule implementations, which modify the fetch interval and the
next fetch time based on the following information:
* previous fetch time
* previous modification time (may be 0 if unknown)
* previous fetch interval
* current fetch time
* current modification time (may be 0 if unknown)
* a boolean value "changed", based on checking the page signatures (old
vs. new), if the page's content is available
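Put together, the plug point could look roughly like the interface below, with a naive adaptive rule (halve the interval when the page changed, double it when it did not) as one possible implementation; both the interface shape and the bounds are illustrative, not the patch's actual API:

```java
// Hypothetical sketch of a pluggable FetchSchedule contract, built
// from the inputs listed above; the real interface may differ.
interface FetchScheduleSketch {
    /** Compute the next fetch interval in ms.
     *  Modification times may be 0 if unknown. */
    long nextInterval(long prevFetchTime, long prevModifiedTime,
                      long prevInterval, long fetchTime,
                      long modifiedTime, boolean changed);
}

/** Naive adaptive rule: shrink the interval when the page changed,
 *  grow it when it did not, clamped to illustrative bounds. */
class SimpleAdaptiveSchedule implements FetchScheduleSketch {
    static final long MIN = 60_000L;                    // 1 minute
    static final long MAX = 30L * 24 * 3600 * 1000;     // 30 days

    public long nextInterval(long prevFetchTime, long prevModifiedTime,
                             long prevInterval, long fetchTime,
                             long modifiedTime, boolean changed) {
        long next = changed ? prevInterval / 2 : prevInterval * 2;
        return Math.max(MIN, Math.min(MAX, next));
    }
}
```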
For efficiency reasons, most of this information is stored and passed to
processing jobs inside instances of CrawlDatum - during the key step of
DB update, the other parts of the segments (such as Content, ParseData
or ParseText) are not used, which prevents easy access to other page
metadata. For now, I added both the signature and the modifiedTime to
CrawlDatum as separate attributes, but I'm considering putting them (and
any other values that users might want to add to CrawlDB) into a
Properties attribute.
The reason for this is that reality may be more complicated than the
simple model above. Various sites use additional information to control
re-fetching, besides the "Last-Modified" header that we use now, such as:
* Expires header
* ETag header
* Caching headers
* page metadata
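The first two validators above translate directly into a conditional HTTP request: sending them lets the server answer "304 Not Modified" so the protocol plugin can return an unmodified status without downloading the body. A sketch using the JDK's HttpURLConnection (class and method names here are illustrative, not Nutch's protocol API):

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Sketch of attaching stored validators to an outgoing request;
// nothing is actually sent until the caller connects.
class ConditionalFetch {
    static HttpURLConnection prepare(String url, long lastModifiedMs,
                                     String etag) {
        try {
            HttpURLConnection conn =
                (HttpURLConnection) new URL(url).openConnection();
            if (lastModifiedMs > 0) {
                // sent as the If-Modified-Since request header
                conn.setIfModifiedSince(lastModifiedMs);
            }
            if (etag != null) {
                conn.setRequestProperty("If-None-Match", etag);
            }
            return conn;
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}
```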
Additionally, some schemes for phasing out old segments might want to
store some segment information inside the CrawlDb, such as the last
segment name, where the latest copy can be found.
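An extensible, lazily allocated Properties attribute on CrawlDatum, as proposed above, might look like the following; the wrapper class and keys are illustrative:

```java
import java.util.Properties;

// Sketch of per-page extensible metadata (signature, modifiedTime,
// last segment name, ...) on CrawlDatum; the Properties instance is
// only allocated on first write, so empty datums stay cheap.
class DatumMetadata {
    private Properties meta; // null until first write

    void put(String key, String value) {
        if (meta == null) meta = new Properties();
        meta.setProperty(key, value);
    }

    String get(String key) {
        return meta == null ? null : meta.getProperty(key);
    }
}
```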
So, I'll hold off on committing these patches until we can reach some
agreement on how to proceed. We should keep as little information in
CrawlDB as possible, but no less than is necessary... ;-)
Please review the patches and play around with them - they already work
properly.
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
Re: Adaptive fetch interval & unmodified content detection, episode II
Posted by Doug Cutting <cu...@nutch.org>.
Andrzej Bialecki wrote:
> For efficiency reasons, most of this information is stored and passed to
> processing jobs inside instances of CrawlDatum - for the key step of DB
> update any other parts of segments (such as Content, ParseData or
> ParseText) are not used, which prevents easy access to other page
> metadata. For now, I added both the signature and the modifiedTime to
> CrawlDatum as separate attributes, but I'm considering to put them (and
> any other values that users might want to add to CrawlDB) into a
> Properties attribute.
Yes, I agree that CrawlDatum should have extensible properties. If
these are empty, then no Properties instance should be allocated.
This is great stuff. I look forward to getting it committed!
Doug