Posted to dev@nutch.apache.org by lin weijian <li...@gmail.com> on 2012/08/15 14:11:36 UTC

A FetchSchedule bug makes fetch time becoming more and more big

Hi,
	When DbUpdateReducer executes, it calls setFetchSchedule for each fetched page. This function
adds the fetch interval to the new fetch time, regardless of whether the interval has already been added. As a result, the fetch time keeps growing. It should add the fetch interval to the last fetch time instead.
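
To make the drift concrete, here is a minimal sketch of the behaviour as I read it. It does not use any Nutch classes: Page, updateDrifting and updateAnchored are hypothetical names invented for this illustration, and the 30-day interval is just an example value. It shows what happens when the update step runs three times over the same page without a new fetch in between.

```java
public class FetchTimeDrift {
    static final long DAY_MS = 24L * 60 * 60 * 1000;

    // Hypothetical stand-in for a page record; not Nutch's WebPage class.
    static final class Page {
        long prevFetchTime; // when the page was actually fetched last
        long fetchTime;     // when the page is due to be fetched next

        Page(long fetchedAt) {
            this.prevFetchTime = fetchedAt;
            this.fetchTime = fetchedAt;
        }
    }

    // Reported behaviour: the interval is added to whatever fetch time is
    // currently stored, so every update run pushes it further out.
    static void updateDrifting(Page p, long intervalMs) {
        p.fetchTime = p.fetchTime + intervalMs;
    }

    // Suggested behaviour: the interval is added to the last actual fetch
    // time, so repeating the update gives the same result.
    static void updateAnchored(Page p, long intervalMs) {
        p.fetchTime = p.prevFetchTime + intervalMs;
    }

    public static void main(String[] args) {
        long intervalMs = 30 * DAY_MS; // example value only
        Page drifting = new Page(0L);
        Page anchored = new Page(0L);

        for (int run = 1; run <= 3; run++) {
            updateDrifting(drifting, intervalMs);
            updateAnchored(anchored, intervalMs);
            System.out.println("run " + run
                + ": drifting=" + (drifting.fetchTime / DAY_MS) + "d"
                + ", anchored=" + (anchored.fetchTime / DAY_MS) + "d");
        }
        // After three runs the drifting page is due in 90 days,
        // while the anchored page is still due in 30 days.
    }
}
```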

    Thanks.

Re: A FetchSchedule bug makes fetch time becoming more and more big

Posted by Ferdy Galema <fe...@kalooga.com>.
Hi,

Yeah, this is something I noticed too a while ago. Although it does not
directly break crawling, it is not a nice implementation.
Note that the Generator tries to correct for a fetch time that is too far
off in the future (in the shouldFetch method of AbstractFetchSchedule).
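
For readers who do not have the source at hand, the correction works roughly like the sketch below. This is a paraphrase and not the actual Nutch code: the Page class, the field names and the 0.9 shrink factor are assumptions made for the illustration, so check AbstractFetchSchedule itself for the real logic.

```java
public class FutureFetchTimeClamp {
    // Hypothetical stand-in for a page record; not Nutch's WebPage class.
    static final class Page {
        long fetchTime;        // next due time, in milliseconds
        int fetchIntervalSec;  // re-fetch interval, in seconds

        Page(long fetchTime, int fetchIntervalSec) {
            this.fetchTime = fetchTime;
            this.fetchIntervalSec = fetchIntervalSec;
        }
    }

    // Returns true if the page is due. A fetch time that lies more than the
    // maximum interval in the future is pulled back to the current time.
    static boolean shouldFetch(Page page, long curTime, int maxIntervalSec) {
        if (page.fetchTime - curTime > maxIntervalSec * 1000L) {
            if (page.fetchIntervalSec > maxIntervalSec) {
                // Assumed shrink factor: keep the interval below the maximum.
                page.fetchIntervalSec = Math.round(maxIntervalSec * 0.9f);
            }
            page.fetchTime = curTime;
        }
        return page.fetchTime <= curTime;
    }

    public static void main(String[] args) {
        final long dayMs = 24L * 60 * 60 * 1000;
        final int daySec = 24 * 60 * 60;
        long now = 1000L * dayMs;
        int maxIntervalSec = 90 * daySec;

        Page farOff = new Page(now + 100 * dayMs, 30 * daySec);
        Page notYetDue = new Page(now + 10 * dayMs, 30 * daySec);

        System.out.println(shouldFetch(farOff, now, maxIntervalSec));    // true, fetch time was reset
        System.out.println(shouldFetch(notYetDue, now, maxIntervalSec)); // false, not due yet
    }
}
```

So a fetch time that has drifted beyond the maximum interval eventually gets reset, which is why the drift does not stop the crawl outright.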

As a matter of fact, I have refactored the updating process slightly so
that it only updates the fetch time once (that is, directly after a fetch).
The best part is that this change allows running several generate-fetch
cycles without running the updater every time. There is a slight downside,
but I will post it in the issue after I have attached a patch for this
improvement:
https://issues.apache.org/jira/browse/NUTCH-1457

Ferdy.

On Wed, Aug 15, 2012 at 2:11 PM, lin weijian <li...@gmail.com> wrote:

>
> Hi,
> When DbUpdateReducer executes, it will call setFetchSchedule for a
> fetched page. This function will
> add fetch interval to the new fetch time, no matter if it has been added
> up. It makes the fetch time becoming more and more big.    It's should add
> fetch interval to last fetch time.
>
>     Thanks.
>