Posted to user@nutch.apache.org by Richard Braman <rb...@bramantax.com> on 2006/03/07 18:54:20 UTC
retry later
When an error occurs while fetching and you get
org.apache.nutch.protocol.RetryLater because the maximum number of retries
has been reached, Nutch says it has given up and will retry later. When
does that retry occur? How would you make a fetchlist of all URLs that
have failed? Is this information maintained somewhere?
Richard Braman
mailto:rbraman@taxcodesoftware.org
561.748.4002 (voice)
http://www.taxcodesoftware.org <http://www.taxcodesoftware.org/>
Free Open Source Tax Software
Re: retry later
Posted by Doug Cutting <cu...@apache.org>.
Richard Braman wrote:
> When an error occurs while fetching and you get
> org.apache.nutch.protocol.RetryLater because the maximum number of retries
> has been reached, Nutch says it has given up and will retry later. When
> does that retry occur? How would you make a fetchlist of all URLs that
> have failed? Is this information maintained somewhere?
Each URL in the crawldb has a retry count: the number of times it has
been tried without a conclusive result. When the maximum
(db.fetch.retry.max) is reached, the page is considered gone. Until then it
will be generated for fetch along with other pages. There is no command
that generates a fetchlist containing only pages whose retry count is
greater than zero.
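As a hedged illustration, the retry ceiling Doug mentions is the
db.fetch.retry.max property and could be overridden in conf/nutch-site.xml.
Only the property name is real Nutch configuration; the value 5 is an
arbitrary example, not a recommendation:

```xml
<!-- Sketch of a nutch-site.xml override. db.fetch.retry.max is a real
     Nutch property; the value 5 is just an illustrative example. -->
<property>
  <name>db.fetch.retry.max</name>
  <value>5</value>
  <description>Maximum number of inconclusive fetch attempts before a
  page is considered gone.</description>
</property>
```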
Doug
Re: retry later
Posted by mos <mo...@gmail.com>.
Hi Andrzej,
Thanks, for going into this subject.
I'm glad that this issue will be resolved in version 0.8. That makes
me hopeful. :)
Sure, fixing this bug in version 0.7.1 won't be necessary if the new
version 0.8 becomes available in the next few weeks.
And until then the workaround works for me: just do complete
recrawls and don't reuse the existing web-db from a previous crawl.
;)
Greetings
Oliver
On 3/8/06, Andrzej Bialecki <ab...@getopt.org> wrote:
> Thanks for your persistence on this subject... ;-) I agree, it's a real
> issue. Most developers (myself included) are concentrating on the 0.8
> branch now, which has a fix for this.
>
> Basically, the whole premise of pages being "truly gone" seems to be
> ill-defined. If we can't reach a page even 1000 times during a given
> period, it doesn't automatically mean it's truly gone; it could mean that
> the server is temporarily down and we tried too often in that
> period... so, as long as the links from other pages are valid, we should
> still attempt to check the status of that page from time to time.
>
> That's the reasoning behind the fix that went into 0.8: if the last fetch
> was a long time ago (longer than the maximum interval for the
> installation), then we force a refetch anyway, and if it doesn't succeed
> we just increase the interval by 50%.
>
> Now, fixing this the same way in 0.7 would mean that pages no longer end
> up in the PAGE_GONE state. Is this a fix of broken behavior or new
> behavior (a new feature)? I'm not sure...
>
> --
> Best regards,
> Andrzej Bialecki <><
> ___. ___ ___ ___ _ _ __________________________________
> [__ || __|__/|__||\/| Information Retrieval, Semantic Web
> ___|||__|| \| || | Embedded Unix, System Integration
> http://www.sigram.com Contact: info at sigram dot com
Re: retry later
Posted by Andrzej Bialecki <ab...@getopt.org>.
mos wrote:
>> When an error occurs while fetching and you get
>> org.apache.nutch.protocol.RetryLater because the maximum number of retries
>> has been reached, Nutch says it has given up and will retry later. When
>> does that retry occur?
>>
>
> That's an issue I reported some weeks ago and which is, in my opinion,
> an annoying bug in Nutch 0.7.1:
>
> Nutch says that it "will retry later" those pages. In reality the next
> fetch date is set to infinity and those pages are lost forever.
> In consequence, pages which are temporarily unavailable
> will never be indexed when doing recrawls.
> That's the reason why recrawling on the basis of an existing webdb doesn't
> make sense with Nutch 0.7.1. To make sure that temporarily unavailable
> pages are considered, you have to make a completely new crawl of all pages
> (and throw away the old crawl).
>
> I mentioned this issue on this list a few times and reported it on Jira:
> http://issues.apache.org/jira/browse/NUTCH-205
>
> Unfortunately no Nutch developer seems to be interested in this
> serious issue.....
>
Thanks for your persistence on this subject... ;-) I agree, it's a real
issue. Most developers (myself included) are concentrating on the 0.8
branch now, which has a fix for this.
Basically, the whole premise of pages being "truly gone" seems to be
ill-defined. If we can't reach a page even 1000 times during a given
period, it doesn't automatically mean it's truly gone; it could mean that
the server is temporarily down and we tried too often in that
period... so, as long as the links from other pages are valid, we should
still attempt to check the status of that page from time to time.
That's the reasoning behind the fix that went into 0.8: if the last fetch
was a long time ago (longer than the maximum interval for the
installation), then we force a refetch anyway, and if it doesn't succeed
we just increase the interval by 50%.
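The two rules described above (force a refetch once the last fetch is older
than the installation's maximum interval, and grow the interval by 50% on
each further failure) can be sketched as follows. This is an illustrative
model only; RefetchBackoff, shouldForceRefetch, and nextInterval are
hypothetical names, not actual Nutch 0.8 classes:

```java
// Illustrative model of the 0.8-era backoff described above.
// RefetchBackoff and its methods are hypothetical names, not Nutch APIs.
public class RefetchBackoff {

    // Maximum fetch interval configured for the installation (days);
    // the value here is an arbitrary example.
    static final float MAX_INTERVAL_DAYS = 30.0f;

    /** Force a refetch when the last fetch is older than the maximum interval. */
    static boolean shouldForceRefetch(float daysSinceLastFetch) {
        return daysSinceLastFetch > MAX_INTERVAL_DAYS;
    }

    /** After another failed fetch, grow the interval by 50%. */
    static float nextInterval(float currentIntervalDays) {
        return currentIntervalDays * 1.5f;
    }

    public static void main(String[] args) {
        // A page last fetched 45 days ago gets retried even if it kept failing...
        System.out.println(shouldForceRefetch(45.0f));
        // ...and on failure its interval grows from 30 to 45 days.
        System.out.println(nextInterval(30.0f));
    }
}
```

Under a scheme like this a page never reaches a permanent dead state; its
retry interval just keeps stretching.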
Now, fixing this the same way in 0.7 would mean that pages no longer end
up in the PAGE_GONE state. Is this a fix of broken behavior or new
behavior (a new feature)? I'm not sure...
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
Re: retry later
Posted by mos <mo...@gmail.com>.
> When an error occurs while fetching and you get
> org.apache.nutch.protocol.RetryLater because the maximum number of retries
> has been reached, Nutch says it has given up and will retry later. When
> does that retry occur?
That's an issue I reported some weeks ago and which is, in my opinion,
an annoying bug in Nutch 0.7.1:
Nutch says that it "will retry later" those pages. In reality the next
fetch date is set to infinity and those pages are lost forever.
In consequence, pages which are temporarily unavailable
will never be indexed when doing recrawls.
That's the reason why recrawling on the basis of an existing webdb doesn't
make sense with Nutch 0.7.1. To make sure that temporarily unavailable
pages are considered, you have to make a completely new crawl of all pages
(and throw away the old crawl).
I mentioned this issue on this list a few times and reported it on Jira:
http://issues.apache.org/jira/browse/NUTCH-205
Unfortunately no Nutch developer seems to be interested in this
serious issue.....
Greetings
Oliver
On 3/7/06, Richard Braman <rb...@bramantax.com> wrote:
> When an error occurs while fetching and you get
> org.apache.nutch.protocol.RetryLater because the maximum number of retries
> has been reached, Nutch says it has given up and will retry later. When
> does that retry occur? How would you make a fetchlist of all URLs that
> have failed? Is this information maintained somewhere?
>
>
> Richard Braman
> mailto:rbraman@taxcodesoftware.org
> 561.748.4002 (voice)
>
> http://www.taxcodesoftware.org <http://www.taxcodesoftware.org/>
> Free Open Source Tax Software