Posted to user@nutch.apache.org by Zoltán Zvara <zo...@gmail.com> on 2017/11/10 15:12:39 UTC

db.fetch.schedule.adaptive.min_interval not respected by Nutch 1.13

Dear Community,

db.fetch.schedule.adaptive.min_interval is not respected by Nutch 1.13. It is set to "86400", but a specific index of a site is fetched every 1-2 hours. What could be the problem?

Other configurations are:
db.fetch.schedule.class = "org.apache.nutch.crawl.AdaptiveFetchSchedule"
db.fetch.schedule.adaptive.min_interval = "86400"
db.fetch.schedule.adaptive.inc_rate = "0.4"
db.fetch.schedule.adaptive.dec_rate = "0.2"
db.fetch.schedule.adaptive.sync_delta = "true"
db.fetch.schedule.adaptive.sync_delta_rate = "0.3"

The generate step runs with top: 50000, number-of-lists: 50, number-of-segments: 1.

Thanks,
Zoltán

Re: db.fetch.schedule.adaptive.min_interval not respected by Nutch 1.13

Posted by Zoltán Zvara <zo...@gmail.com>.
We found the problem. Looking at the code of `AdaptiveFetchSchedule`, a `defaultInterval` is used as the initial fetch interval of each record; it is read from the configuration parameter "db.fetch.interval.default". That parameter was not set in our configuration, and the `AbstractFetchSchedule` implementation falls back to 0, which forced a re-fetch in every consecutive fetch phase. Sneaky. :-)
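
To illustrate the effect, here is a minimal sketch in plain Java. It is not Nutch's actual code: the class `FetchIntervalSketch` and its helper methods are hypothetical, and `java.util.Properties` stands in for the real configuration object. It only models an interval read with an in-code fallback of 0 seconds and shows that such a record is due again at the very next iteration.

```java
import java.util.Properties;

// Simplified model, NOT Nutch's actual classes: it only mimics a schedule that
// reads its initial interval with an in-code fallback when the key is unset.
final class FetchIntervalSketch {

    // Hypothetical helper: read the default interval (seconds) with a fallback.
    static int defaultInterval(Properties conf, int fallbackSeconds) {
        String raw = conf.getProperty("db.fetch.interval.default");
        return raw == null ? fallbackSeconds : Integer.parseInt(raw.trim());
    }

    // Next fetch time in epoch milliseconds for a record fetched at fetchTimeMs.
    static long nextFetchTime(long fetchTimeMs, int intervalSeconds) {
        return fetchTimeMs + intervalSeconds * 1000L;
    }

    // A record is due when its next fetch time is not in the future.
    static boolean isDue(long nextFetchTimeMs, long nowMs) {
        return nextFetchTimeMs <= nowMs;
    }

    public static void main(String[] args) {
        long fetchedAt = 0L;
        long oneHourLater = 3_600_000L; // roughly one crawl iteration

        Properties unset = new Properties(); // key missing, fallback 0 applies
        Properties explicit = new Properties();
        explicit.setProperty("db.fetch.interval.default", "86400");

        // prints "true": with interval 0 the record is due again immediately
        System.out.println(isDue(nextFetchTime(fetchedAt, defaultInterval(unset, 0)), oneHourLater));
        // prints "false": with 86400 seconds the record is not due after one hour
        System.out.println(isDue(nextFetchTime(fetchedAt, defaultInterval(explicit, 0)), oneHourLater));
    }
}
```

In our case the remedy was simply to set "db.fetch.interval.default" explicitly instead of relying on the in-code fallback.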

To avoid banal issues like this, the in-code default values should be the same as the defaults in "nutch-site.xml".
Otherwise you never know what will happen.

Cheers,
Zoltán

On 2017-11-18 15:48:06, Zoltán Zvara <zo...@gmail.com> wrote:
Hi Sebastian,

We tried it but sites still get fetched every 1-2 hours, which is roughly one iteration.

Any other ideas? Maybe on how to debug it?

Thanks,
Zoltán
On 2017-11-12 15:34:45, Sebastian Nagel <wa...@googlemail.com> wrote:
Hi Zoltán,

it's probably a bug (NUTCH-1564), try to set sync_delta to false.

Best,
Sebastian



Re: db.fetch.schedule.adaptive.min_interval not respected by Nutch 1.13

Posted by Zoltán Zvara <zo...@gmail.com>.
Hi Sebastian,

We tried it but sites still get fetched every 1-2 hours, which is roughly one iteration.

Any other ideas? Maybe on how to debug it?

Thanks,
Zoltán
On 2017-11-12 15:34:45, Sebastian Nagel <wa...@googlemail.com> wrote:
Hi Zoltán,

it's probably a bug (NUTCH-1564), try to set sync_delta to false.

Best,
Sebastian



Re: db.fetch.schedule.adaptive.min_interval not respected by Nutch 1.13

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi Zoltán,

it's probably a bug (NUTCH-1564), try to set sync_delta to false.

Best,
Sebastian
