Posted to user@nutch.apache.org by Sourajit Basak <so...@gmail.com> on 2012/08/14 09:56:19 UTC

adaptive fetches

What is the "adaptive fetch schedule" as dictated by the property
*db.fetch.schedule.adaptive.sync_delta*? If this is set to true, how does
the property *db.fetch.interval.default* come into effect?

I guess the 'generate' phase checks the modified timestamp of every
page in the crawldb. If a page has changed, Nutch decides whether to
re-fetch it based on the property
*db.fetch.schedule.adaptive.sync_delta_rate*. Is this assumption correct?

If yes, what does the default fetch interval mean in this context? In such
cases the re-fetch seems to be affected by how often I run "generate".

RE: adaptive fetches

Posted by j....@thomsonreuters.com.
Not experienced but this may help a bit...

The fetchTime field is used by Mapper to decide if it is time to fetch
this url. For a well written overview see this link
http://pascaldimassimo.com/2010/06/11/how-to-re-crawl-with-nutch/

Also see the Nutch API documentation for AbstractFetchSchedule at
http://nutch.apache.org/apidocs-2.0/org/apache/nutch/crawl/AbstractFetchSchedule.html#setFetchSchedule%28java.lang.String,%20org.apache.nutch.storage.WebPage,%20long,%20long,%20long,%20long,%20int%29

The default re-fetch schedule is somewhat simplistic. Whether or not the
page has changed, the fetchInterval remains unchanged, and the updated
page's fetchTime is always set to fetchTime + fetchInterval * 1000
(a month with Nutch 2.0). See
http://nutch.apache.org/apidocs-2.0/org/apache/nutch/crawl/DefaultFetchSchedule.html
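
To make that arithmetic concrete, here is a minimal Java sketch of the
calculation. It is not Nutch's own code: the class and method names are
made up for illustration, and the 2592000-second interval is my assumption
for the "a month" figure (30 days), with fetchTime in milliseconds and
fetchInterval in seconds.

```java
public class DefaultScheduleSketch {
    // Assumed default interval: 30 days expressed in seconds.
    static final int DEFAULT_INTERVAL_SECONDS = 30 * 24 * 60 * 60;

    // fetchTime is in milliseconds, fetchInterval in seconds, hence * 1000.
    // The result is the same whether or not the page changed.
    static long nextFetchTime(long fetchTimeMillis, int fetchIntervalSeconds) {
        return fetchTimeMillis + fetchIntervalSeconds * 1000L;
    }

    public static void main(String[] args) {
        long fetched = System.currentTimeMillis();
        long next = nextFetchTime(fetched, DEFAULT_INTERVAL_SECONDS);
        long days = (next - fetched) / (24L * 60 * 60 * 1000);
        System.out.println("days until next fetch: " + days);
    }
}
```

Under these assumptions the next fetch time lands 30 days after the last
one, which would explain a crawldb entry that shows one month added.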

A better implementation for most cases is the AdaptiveFetchSchedule.
The FetchSchedule implementation can be changed by copying the
db.fetch.schedule.class property from conf/nutch-default.xml to
conf/nutch-site.xml and changing the value. See
http://nutch.apache.org/apidocs-2.0/org/apache/nutch/crawl/AdaptiveFetchSchedule.html
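
As a rough illustration of what "adaptive" means here, the sketch below
shortens the interval when a page was found modified and lengthens it when
it was not, within fixed bounds. This is only my understanding of the
general idea, not Nutch's actual code: the class name, method name, rates
and bounds are all hypothetical values chosen for the example, and the
sync_delta adjustment is not modelled at all.

```java
public class AdaptiveScheduleSketch {
    // Hypothetical tuning values, chosen only for this example.
    static final double INC_RATE = 0.4;   // grow interval if page unchanged
    static final double DEC_RATE = 0.2;   // shrink interval if page changed
    static final int MIN_INTERVAL_SECONDS = 60;
    static final int MAX_INTERVAL_SECONDS = 365 * 24 * 60 * 60;

    // Returns the next fetch interval in seconds, clamped to the bounds.
    static int nextInterval(int intervalSeconds, boolean modified) {
        double next = modified
                ? intervalSeconds * (1.0 - DEC_RATE)
                : intervalSeconds * (1.0 + INC_RATE);
        long rounded = Math.round(next);
        return (int) Math.max(MIN_INTERVAL_SECONDS,
                Math.min(MAX_INTERVAL_SECONDS, rounded));
    }

    public static void main(String[] args) {
        int interval = 30 * 24 * 60 * 60; // start from 30 days
        // A page that keeps changing is revisited more and more often.
        for (int i = 0; i < 3; i++) {
            interval = nextInterval(interval, true);
            System.out.println("interval after change: " + interval + " s");
        }
    }
}
```

In a scheme like this the default interval would only be the starting
point; after that, each update adjusts the interval from the page's own
history.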

 
-----Original Message-----
From: Sourajit Basak [mailto:sourajit.basac@gmail.com] 
Sent: Tuesday, August 14, 2012 5:21 PM
To: user@nutch.apache.org
Subject: Re: adaptive fetches

On a second thought, it doesn't seem that the 'generate' phase checks
for the modified timestamp of every page. It seems to be pre-calculated
by a previous generate-fetch-update cycle.

Experienced guys can comment on how a next fetch time is calculated.
From the crawldb output, it seems to have added a month to the last
fetch time, though I only checked my target site's home pages.

On Tue, Aug 14, 2012 at 1:26 PM, Sourajit Basak <so...@gmail.com> wrote:

> What is "adaptive fetch schedule" as dictated by the property *
> db.fetch.schedule.adaptive.sync_delta* ? If this is set to true how does
> property *db.fetch.interval.default* come to effect ?
>
> I guess the 'generate' phase checks for the modified timestamp of every
> page in the crawldb. If a page does change, Nutch decides whether to
> re-fetch based on the property - "*
> db.fetch.schedule.adaptive.sync_delta_rate*". Is this assumption correct ?
>
> If yes, what does the default fetch interval mean in this context. The
> re-fetch seems to be affected for such cases by how often I run "generate".
>

Re: adaptive fetches

Posted by Sourajit Basak <so...@gmail.com>.
On a second thought, it doesn't seem that the 'generate' phase checks for
the modified timestamp of every page. It seems to be pre-calculated by a
previous generate-fetch-update cycle.

Experienced guys can comment on how a next fetch time is calculated. From
the crawldb output, it seems to have added a month to the last fetch time,
though I only checked my target site's home pages.

On Tue, Aug 14, 2012 at 1:26 PM, Sourajit Basak <so...@gmail.com> wrote:

> What is "adaptive fetch schedule" as dictated by the property *
> db.fetch.schedule.adaptive.sync_delta* ? If this is set to true how does
> property *db.fetch.interval.default* come to effect ?
>
> I guess the 'generate' phase checks for the modified timestamp of every
> page in the crawldb. If a page does change, Nutch decides whether to
> re-fetch based on the property - "*
> db.fetch.schedule.adaptive.sync_delta_rate*". Is this assumption correct ?
>
> If yes, what does the default fetch interval mean in this context. The
> re-fetch seems to be affected for such cases by how often I run "generate".
>