Posted to dev@nutch.apache.org by Mathijs Homminga <ma...@kalooga.com> on 2012/02/28 14:09:25 UTC
[nutchgora] AbstractFetchSchedule.forceFetch method resets fetch status
Hi,
Does anyone know why the AbstractFetchSchedule.forceFetch method sets the page.status to STATUS_UNFETCHED?
The DbUpdateReducer calls this method when the page.fetchInterval exceeds the (current) db.fetch.interval.max.
As I understand it, we call this method to keep all fetchIntervals in the webtable within the current maximum, but why reset the page status?
I bumped into this because my db.fetch.interval.default > db.fetch.interval.max ;))
After a couple of successful crawl cycles, all of my webpages were still STATUS_UNFETCHED.
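To make the effect concrete, here is a minimal stand-alone model of what is described above. It is not the Nutch source: Page, ScheduleModel and updateAfterSuccessfulFetch are hypothetical stand-ins, the interval values are illustrative, and clamping to exactly the maximum is an assumption (the real method may pick a different value). The only behavior taken from this message is that the update step calls forceFetch when fetchInterval exceeds the maximum, and that forceFetch resets the status to unfetched.

```java
// Hypothetical model for illustration only; these are not the Nutch classes.
enum Status { UNFETCHED, FETCHED }

final class Page {
    Status status = Status.UNFETCHED;
    int fetchInterval; // seconds

    Page(int fetchInterval) {
        this.fetchInterval = fetchInterval;
    }
}

final class ScheduleModel {
    final int maxInterval; // stands in for db.fetch.interval.max

    ScheduleModel(int maxInterval) {
        this.maxInterval = maxInterval;
    }

    // Models the reported behavior: bring the interval within the maximum
    // AND reset the status, which discards the result of the fetch.
    void forceFetch(Page page) {
        if (page.fetchInterval > maxInterval) {
            page.fetchInterval = maxInterval; // assumption: clamp to the max itself
        }
        page.status = Status.UNFETCHED;
    }

    // Models the update step for a page that was just fetched successfully.
    void updateAfterSuccessfulFetch(Page page) {
        page.status = Status.FETCHED;
        if (page.fetchInterval > maxInterval) {
            forceFetch(page);
        }
    }
}

class ForceFetchDemo {
    public static void main(String[] args) {
        int defaultInterval = 60 * 60 * 24 * 90; // 90 days, illustrative
        int maxInterval = 60 * 60 * 24 * 30;     // 30 days, illustrative

        Page page = new Page(defaultInterval);   // default > max, as in my config
        new ScheduleModel(maxInterval).updateAfterSuccessfulFetch(page);

        // prints: UNFETCHED 2592000
        System.out.println(page.status + " " + page.fetchInterval);
    }
}
```

In this model a page whose interval starts above the maximum never keeps its fetched status after the update step, which matches what I see in the webtable.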
Cheers,
Mathijs
Re: [nutchgora] AbstractFetchSchedule.forceFetch method resets fetch status
Posted by Mathijs Homminga <ma...@kalooga.com>.
Yes, thanks.
It is related. However, it applies not only to DB_GONE pages, but to all pages whose fetchInterval exceeds the max interval.
Actually, I'm still a bit puzzled by the scheduling-related parameters and the way the AbstractFetchSchedule handles them.
Why do pages with a fetchInterval > maxInterval suddenly have to be fetched?
I would say that if we encounter such pages, we correct the fetchInterval (set it to the maxInterval) and leave it there. Also, I would suggest that we only do this at DbUpdate time.
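A rough sketch of what I mean, as a stand-alone model rather than a patch: PageRecord, FetchState and clampInterval are hypothetical names and do not exist in Nutch. The idea is that the update step only corrects the interval and leaves the status (and everything else) alone.

```java
// Hypothetical model for illustration only; these are not the Nutch classes.
enum FetchState { UNFETCHED, FETCHED }

final class PageRecord {
    FetchState state;
    int fetchInterval; // seconds

    PageRecord(FetchState state, int fetchInterval) {
        this.state = state;
        this.fetchInterval = fetchInterval;
    }
}

final class ClampingUpdate {
    // Corrects an over-long interval to maxInterval and touches nothing else.
    // Returns true if the interval was changed.
    static boolean clampInterval(PageRecord page, int maxInterval) {
        if (page.fetchInterval > maxInterval) {
            page.fetchInterval = maxInterval;
            return true;
        }
        return false;
    }
}

class ClampDemo {
    public static void main(String[] args) {
        int maxInterval = 60 * 60 * 24 * 30; // 30 days, illustrative
        PageRecord page = new PageRecord(FetchState.FETCHED, 60 * 60 * 24 * 90);

        ClampingUpdate.clampInterval(page, maxInterval);

        // prints: FETCHED 2592000
        System.out.println(page.state + " " + page.fetchInterval);
    }
}
```

With this, a fetched page stays fetched and simply gets a shorter interval.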
Mathijs
On Feb 28, 2012, at 14:41 , Markus Jelsma wrote:
> https://issues.apache.org/jira/browse/NUTCH-578
> https://issues.apache.org/jira/browse/NUTCH-1245
>
> Is your issue similar to these?
>
> On Tuesday 28 February 2012 14:09:25 Mathijs Homminga wrote:
>> Hi,
>>
>> Does anyone know why the AbstractFetchSchedule.forceFetch method sets the
>> page.status to STATUS_UNFETCHED?
>>
>> The DbUpdateReducer calls this method when the page.fetchInterval exceeds
>> the (current) db.fetch.interval.max. As I understand it, we call this
>> method to keep all fetchIntervals in the webtable within the current
>> maximum, but why reset the page status?
>>
>> I bumped into this because my db.fetch.interval.default >
>> db.fetch.interval.max ;)) After a couple of successful crawl cycles, all
>> of my webpages were still STATUS_UNFETCHED.
>>
>> Cheers,
>> Mathijs
>
> --
> Markus Jelsma - CTO - Openindex
Re: [nutchgora] AbstractFetchSchedule.forceFetch method resets fetch status
Posted by Markus Jelsma <ma...@openindex.io>.
https://issues.apache.org/jira/browse/NUTCH-578
https://issues.apache.org/jira/browse/NUTCH-1245
Is your issue similar to these?
On Tuesday 28 February 2012 14:09:25 Mathijs Homminga wrote:
> Hi,
>
> Does anyone know why the AbstractFetchSchedule.forceFetch method sets the
> page.status to STATUS_UNFETCHED?
>
> The DbUpdateReducer calls this method when the page.fetchInterval exceeds
> the (current) db.fetch.interval.max. As I understand it, we call this
> method to keep all fetchIntervals in the webtable within the current
> maximum, but why reset the page status?
>
> I bumped into this because my db.fetch.interval.default >
> db.fetch.interval.max ;)) After a couple of successful crawl cycles, all
> of my webpages were still STATUS_UNFETCHED.
>
> Cheers,
> Mathijs
--
Markus Jelsma - CTO - Openindex