Posted to user@nutch.apache.org by Zoltán Zvara <zo...@gmail.com> on 2017/11/10 15:12:39 UTC
db.fetch.schedule.adaptive.min_interval not respected by Nutch 1.13
Dear Community,
db.fetch.schedule.adaptive.min_interval is not respected by Nutch 1.13. It is set to "86400", but a specific index of a site is fetched every 1-2 hours. What could be the problem?
Other configurations are:
db.fetch.schedule.class = "org.apache.nutch.crawl.AdaptiveFetchSchedule"
db.fetch.schedule.adaptive.min_interval = "86400"
db.fetch.schedule.adaptive.inc_rate = "0.4"
db.fetch.schedule.adaptive.dec_rate = "0.2"
db.fetch.schedule.adaptive.sync_delta = "true"
db.fetch.schedule.adaptive.sync_delta_rate = "0.3"
The generate step is run with top: 50000, number-of-lists: 50, number-of-segments: 1
Thanks,
Zoltán
Re: db.fetch.schedule.adaptive.min_interval not respected by Nutch 1.13
Posted by Zoltán Zvara <zo...@gmail.com>.
We found the problem. Looking at the code of `AdaptiveFetchSchedule`, a `defaultInterval` is used the first time each record is scheduled, and it is read from the configuration parameter "db.fetch.interval.default". That parameter was not set in our configuration, so the `AbstractFetchSchedule` implementation fell back to 0, which forced a re-fetch in every consecutive fetch phase. Sneaky. :-)
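For anyone hitting the same thing, a minimal sketch of setting the parameter explicitly in nutch-site.xml could look like the fragment below. The value 2592000 (30 days, in seconds) is an assumption based on what nutch-default.xml normally ships; pick whatever interval suits your crawl.

```xml
<!-- Goes inside the <configuration> element of nutch-site.xml. -->
<!-- The value is illustrative (30 days in seconds), not a recommendation. -->
<property>
  <name>db.fetch.interval.default</name>
  <value>2592000</value>
</property>
```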
To avoid banal issues like this, the default values in the code should be the same as the defaults in "nutch-site.xml".
Otherwise you never know what will happen.
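To illustrate the point, here is a small standalone sketch, not Nutch's actual code. It uses `java.util.Properties` as a stand-in for the Hadoop configuration object, and the class name, method names, and the 2592000 constant are all hypothetical choices made for this example. It only contrasts a fallback of 0 with a fallback that matches the documented default.

```java
import java.util.Properties;

public class FetchIntervalDefaults {
    // Assumed documented default: 30 days in seconds (hypothetical constant for this sketch).
    static final int DOCUMENTED_DEFAULT_SECONDS = 2592000;

    // Models the behaviour described above: a missing property silently becomes 0.
    static int intervalWithZeroFallback(Properties conf) {
        return Integer.parseInt(conf.getProperty("db.fetch.interval.default", "0"));
    }

    // Suggested alternative: the in-code fallback matches the documented default.
    static int intervalWithDocumentedFallback(Properties conf) {
        return Integer.parseInt(conf.getProperty(
                "db.fetch.interval.default",
                String.valueOf(DOCUMENTED_DEFAULT_SECONDS)));
    }

    public static void main(String[] args) {
        Properties conf = new Properties(); // property deliberately left unset
        System.out.println(intervalWithZeroFallback(conf));       // prints 0
        System.out.println(intervalWithDocumentedFallback(conf)); // prints 2592000
    }
}
```

With the second variant, forgetting to set the property would give a 30-day interval instead of an immediate re-fetch.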
Cheers,
Zoltán
On 2017-11-18 15:48:06, Zoltán Zvara <zo...@gmail.com> wrote:
Hi Sebastian,
We tried it but sites still get fetched every 1-2 hours, which is roughly one iteration.
Any other ideas? Maybe on how to debug it?
Thanks,
Zoltán
On 2017-11-12 15:34:45, Sebastian Nagel <wa...@googlemail.com> wrote:
Hi Zoltán,
it's probably a bug (NUTCH-1564), try to set sync_delta to false.
Best,
Sebastian
On 11/10/2017 04:12 PM, Zoltán Zvara wrote:
> Dear Community,
>
> db.fetch.schedule.adaptive.min_interval is not respected by Nutch 1.13. It is set to "86400", but a specific index of a site is fetched every 1-2 hours. What could be the problem?
>
> Other configurations are:
> db.fetch.schedule.class = "org.apache.nutch.crawl.AdaptiveFetchSchedule"
> db.fetch.schedule.adaptive.min_interval = "86400"
> db.fetch.schedule.adaptive.inc_rate = "0.4"
> db.fetch.schedule.adaptive.dec_rate = "0.2"
> db.fetch.schedule.adaptive.sync_delta = "true"
> db.fetch.schedule.adaptive.sync_delta_rate = "0.3"
>
> On generate the top is: 50000, number-of-lists: 50, number-of-segments: 1
>
> Thanks,
> Zoltán
>
Re: db.fetch.schedule.adaptive.min_interval not respected by Nutch 1.13
Posted by Zoltán Zvara <zo...@gmail.com>.
Hi Sebastian,
We tried it but sites still get fetched every 1-2 hours, which is roughly one iteration.
Any other ideas? Maybe on how to debug it?
Thanks,
Zoltán
Re: db.fetch.schedule.adaptive.min_interval not respected by Nutch 1.13
Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi Zoltán,
It's probably a bug (NUTCH-1564). Try setting db.fetch.schedule.adaptive.sync_delta to false.
Best,
Sebastian