Posted to user@nutch.apache.org by Teruhiko Kurosaka <Ku...@basistech.com> on 2005/12/21 01:35:16 UTC

Can nutch be used as link checker? What does http.max.delay error mean?

Is there a way to configure nutch to show the
pages that have broken links?

In the default setting, the crawl log lists each URL
whose fetch failed.  But it does not
tell me which page contains that broken link.

By the way, what does
"org.apache.nutch.protocol.RetryLater: Exceeded http.max.delays: retry later"
mean?

I increased the http.max.delays parameter to 100 but I still see this.
How large does the value need to be for a medium-sized intranet web site
for a small company? Is the error saying that I should rerun the crawler,
or is it simply informing me that it will try again in the same session?

-kuro

Re: Can nutch be used as link checker? What does http.max.delay error mean?

Posted by Stefan Groschupf <sg...@media-style.com>.
On 21.12.2005 at 01:35, Teruhiko Kurosaka wrote:

> Is there a way to configure nutch to show the
> pages that have broken links?

Well, you might be able to hack Nutch to do this, but it would not make
much sense. There are other tools for that.
>
> In the default setting, the crawl log lists each URL
> whose fetch failed.  But it does not
> tell me which page contains that broken link.
>
> By the way, what does
> "org.apache.nutch.protocol.RetryLater: Exceeded http.max.delays: retry later"
> mean?
It means the page was not fetched, e.g. because you are fetching from only
one host with many fetcher threads but only one thread per host configured,
so the other threads keep waiting for that host until they give up.
>
> I increased the http.max.delays parameter to 100 but I still see this.
> How large does the value need to be for a medium-sized intranet web site
> for a small company? Is the error saying that I should rerun the crawler,
> or is it simply informing me that it will try again in the same session?

If you fetch only one host, it is a good idea to set
fetcher.threads.per.host and fetcher.threads.fetch to the same value.
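
For illustration, the override in conf/nutch-site.xml might look roughly
like the fragment below. This is only a sketch: the value 10 is an assumed
example, not a recommendation, and the property elements go inside whatever
root element your version's nutch-site.xml already uses.

```xml
<!-- Illustrative values only; 10 is an assumption, tune it for your site. -->
<property>
  <name>fetcher.threads.fetch</name>
  <value>10</value>
</property>
<property>
  <name>fetcher.threads.per.host</name>
  <value>10</value>
</property>
```

Keep in mind that with a single intranet host, all fetcher threads may then
hit that server at the same time, so pick a number the server can tolerate.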

HTH
Stefan