Posted to user@nutch.apache.org by Richard Braman <rb...@bramantax.com> on 2006/03/07 18:54:20 UTC

retry later

When you get an error while fetching and you get
org.apache.nutch.protocol.retrylater because the max retries have been
reached, Nutch says it has given up and will retry later. When does that
retry occur?  How would you make a fetchlist of all URLs that have
failed?  Is this information maintained somewhere?
 

Richard Braman
mailto:rbraman@taxcodesoftware.org
561.748.4002 (voice) 

http://www.taxcodesoftware.org
Free Open Source Tax Software

 

Re: retry later

Posted by Doug Cutting <cu...@apache.org>.
Richard Braman wrote:
> When you get an error while fetching and you get
> org.apache.nutch.protocol.retrylater because the max retries have been
> reached, Nutch says it has given up and will retry later. When does that
> retry occur?  How would you make a fetchlist of all URLs that have
> failed?  Is this information maintained somewhere?

Each URL in the crawldb has a retry count: the number of times it has 
been tried without a conclusive result.  When the maximum 
(db.fetch.retry.max) is reached, the page is considered gone.  Until then it 
will be generated for fetch along with the other pages.  There is no command 
that generates a fetchlist for only those pages whose retry count is greater 
than zero.
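
A rough sketch of doing it by hand (assuming a 0.8-style crawldb stored as 
Hadoop SequenceFiles of Text/CrawlDatum pairs; the accessor name and the 
part-file path below are assumptions, not verified against a particular 
release) would be to read the db directly and print only entries with a 
non-zero retry count:

// Sketch: list URLs from one crawldb part file whose retry count is > 0.
// Class and method names are assumed from 0.8-era Nutch/Hadoop APIs.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;

public class ListRetriedUrls {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path part = new Path(args[0]);          // e.g. crawldb/current/part-00000/data
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, part, conf);
    Text url = new Text();
    CrawlDatum datum = new CrawlDatum();
    while (reader.next(url, datum)) {
      if (datum.getRetriesSinceFetch() > 0) {
        System.out.println(url + "\tretries=" + datum.getRetriesSinceFetch());
      }
    }
    reader.close();
  }
}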

Doug

Re: retry later

Posted by mos <mo...@gmail.com>.
Hi Andrzej,

Thanks for going into this subject.
I'm glad that this issue will be resolved in version 0.8. That makes
me hopeful. :)

Sure, fixing this bug in version 0.7.1 won't be necessary if the new
version 0.8 becomes available in the next few weeks.
And the workaround works for me until then: just do complete
recrawls and don't reuse the existing web-db of a previous crawl.
;)

Greetings
Oliver


On 3/8/06, Andrzej Bialecki <ab...@getopt.org> wrote:
> Thanks for your persistence on this subject... ;-) I agree, it's a real
> issue. Most developers (myself included) concentrate on the 0.8 branch now,
> which has a fix for this.
>
> Basically, the whole premise of pages being "truly gone" seems to be
> ill-defined. If we can't reach a page even after 1000 tries during a given
> period, that doesn't automatically mean it's truly gone; it could mean that
> the server is temporarily down and we tried too often in that
> period... so, as long as the links from other pages are valid, we should
> still attempt to check the status of that page from time to time.
>
> That's the reasoning behind the fix that went into 0.8: if the last fetch
> was a long time ago (longer than the maximum interval for the installation),
> then we force a refetch anyway, and if it doesn't succeed we just increase
> the interval by 50%.
>
> Now, fixing this the same way in 0.7 would mean that pages no longer end
> up in the PAGE_GONE state. Is this a fix for broken behavior or new
> behavior (a new feature)? I'm not sure...
>
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>
>

Re: retry later

Posted by Andrzej Bialecki <ab...@getopt.org>.
mos wrote:
>> When you get an error while fetching and you get
>> org.apache.nutch.protocol.retrylater because the max retries have been
>> reached, Nutch says it has given up and will retry later. When does that
>> retry occur?
>>     
>
> That's an issue I reported some weeks ago and which is, in my opinion,
> an annoying bug in Nutch 0.7.1:
>
> Nutch says that it "will retry later" for those pages. In reality, the
> next fetch date is set to infinity and those pages are lost forever.
> As a consequence, pages that are temporarily unavailable will never be
> indexed when doing recrawls.
> That's why a recrawl on the basis of an existing webdb doesn't make
> sense with Nutch 0.7.1.  To make sure that temporarily unavailable pages
> are considered, you have to make a completely new crawl of all pages
> (and throw away the old crawl).
>
> I mentioned this issue on this list a few times and reported it in Jira:
> http://issues.apache.org/jira/browse/NUTCH-205
>
> Unfortunately, no Nutch developer seems to be interested in this
> serious issue...
>   

Thanks for your persistence on this subject... ;-) I agree, it's a real 
issue. Most developers (myself included) concentrate on the 0.8 branch now, 
which has a fix for this.

Basically, the whole premise of pages being "truly gone" seems to be 
ill-defined. If we can't reach a page even after 1000 tries during a given 
period, that doesn't automatically mean it's truly gone; it could mean that 
the server is temporarily down and we tried too often in that 
period... so, as long as the links from other pages are valid, we should 
still attempt to check the status of that page from time to time.

That's the reasoning behind the fix that went into 0.8: if the last fetch 
was a long time ago (longer than the maximum interval for the installation), 
then we force a refetch anyway, and if it doesn't succeed we just increase 
the interval by 50%.
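
In rough pseudo-Java the policy looks something like this (names are 
illustrative only, not taken from the actual 0.8 scheduling code):

// Sketch of the retry scheduling described above; all names are made up.
public class RetryPolicySketch {

  // Decide when a previously failed page becomes eligible for fetching again.
  static long nextEligibleFetch(long lastFetch, long interval, long maxInterval, long now) {
    if (now - lastFetch > maxInterval) {
      return now;                 // last fetch too long ago: force a refetch anyway
    }
    return lastFetch + interval;  // otherwise wait out the normal interval
  }

  // After yet another failed refetch, grow the interval by 50%.
  static long backOff(long interval) {
    return interval + interval / 2;
  }
}

So instead of being written off as gone, the page just gets retried less 
and less often.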

Now, fixing this the same way in 0.7 would mean that pages no longer end 
up in the PAGE_GONE state. Is this a fix for broken behavior or new 
behavior (a new feature)? I'm not sure...

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: retry later

Posted by mos <mo...@gmail.com>.
> When you get an error while fetching and you get
> org.apache.nutch.protocol.retrylater because the max retries have been
> reached, Nutch says it has given up and will retry later. When does that
> retry occur?

That's an issue I reported some weeks ago and which is, in my opinion,
an annoying bug in Nutch 0.7.1:

Nutch says that it "will retry later" for those pages. In reality, the
next fetch date is set to infinity and those pages are lost forever.
As a consequence, pages that are temporarily unavailable will never be
indexed when doing recrawls.
That's why a recrawl on the basis of an existing webdb doesn't make
sense with Nutch 0.7.1.  To make sure that temporarily unavailable pages
are considered, you have to make a completely new crawl of all pages
(and throw away the old crawl).
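
To illustrate what that means (made-up names, not the actual 0.7.1 source):

// Sketch of the 0.7.1 behaviour described above: once the retry counter
// passes the configured maximum, the next fetch date is pushed out
// "forever", so the page is never generated for fetch again.
public class RetryLaterSketch {

  static class PageRecord {
    int retries;
    long nextFetchTime;
  }

  static void onFailedFetch(PageRecord page, int maxRetries, long now, long retryDelay) {
    page.retries++;
    if (page.retries >= maxRetries) {
      page.nextFetchTime = Long.MAX_VALUE;   // "retry later" that never happens
    } else {
      page.nextFetchTime = now + retryDelay; // a real retry, some time later
    }
  }
}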

I mentioned this issue on this list a few times and reported it in Jira:
http://issues.apache.org/jira/browse/NUTCH-205

Unfortunately, no Nutch developer seems to be interested in this
serious issue...

Greetings
Oliver




On 3/7/06, Richard Braman <rb...@bramantax.com> wrote:
> When you get an error while fetching and you get
> org.apache.nutch.protocol.retrylater because the max retries have been
> reached, Nutch says it has given up and will retry later. When does that
> retry occur?  How would you make a fetchlist of all URLs that have
> failed?  Is this information maintained somewhere?
>
>
> Richard Braman
> mailto:rbraman@taxcodesoftware.org
> 561.748.4002 (voice)
>
> http://www.taxcodesoftware.org
> Free Open Source Tax Software
>
>
>
>