Posted to user@nutch.apache.org by og...@yahoo.com on 2005/09/15 17:36:25 UTC

Re: [Nutch-general] Re: [nutch] - http.max.delays: retry later issue?

Lukas,

The end of your message shows that you got things right.  Yes, in that
scenario your fetching will take longer, simply because one host has so
many more URLs than the others.
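
For reference, the settings Doug mentions below are regular Nutch
properties, so you can override them in conf/nutch-site.xml.  A minimal
sketch follows; the values are purely illustrative, not the shipped
defaults:

  <property>
    <name>http.max.delays</name>
    <value>100</value>
  </property>
  <property>
    <name>fetcher.server.delay</name>
    <value>5.0</value>
  </property>
  <property>
    <name>fetcher.threads.per.host</name>
    <value>1</value>
  </property>

With numbers like these, a thread would wait at most roughly
100 x 5 = 500 seconds for a busy host before giving up on the page
(that is the RetryLater you saw), and only one thread would hit a
given host at a time.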

As for what is considered a host, I believe it's the full host name (e.g.
www.example.com), not the domain nor the IP (even though, as you say, a
single host name may actually map to multiple physical machines,
and a single physical server can host multiple distinct hosts).
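
Just to illustrate what I mean, here is a quick sketch using plain
java.net.URL (not necessarily the exact code the fetcher uses to group
URLs), with the example URLs from your message:

  import java.net.URL;

  public class HostCheck {
    public static void main(String[] args) throws Exception {
      // Politeness grouping is by the full host name of the URL.
      URL a = new URL("http://www.abc.com/1.html");
      URL b = new URL("http://www.abc.com/2.html");
      URL c = new URL("http://www.xyz.com/1.html");
      System.out.println(a.getHost());  // www.abc.com
      System.out.println(b.getHost());  // www.abc.com  (same host as a)
      System.out.println(c.getHost());  // www.xyz.com  (a different host)
    }
  }

So your two abc.com pages would count as the same host, regardless of
how many physical machines sit behind that name.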

Otis

--- Lukas Vlcek <lu...@gmail.com> wrote:

> Hi Doug,
> 
> I see what you mean and it makes sense now.
> 
> However, this leads me to the question of what exactly the
> fetcher.threads.per.host value is used for. More specifically, what
> does *host* mean in the Nutch configuration world?
> 
> Does it mean that if the fetcher.threads.per.host value is set to 1,
> then concurrent crawling of two documents from the same domain name
> is forbidden (e.g. http://www.abc.com/1.html and
> http://www.abc.com/2.html), even though these two documents might be
> physically located on two different servers without our knowledge?
> 
> On the other hand, one physical server can be assigned multiple
> domain names, so crawling http://www.abc.com/1.html and
> http://www.xyz.com/1.html concurrently means that the same server
> could be serving both.
>  
> When setting the fetcher.threads.per.host value, what should I keep
> in mind: the DNS domain name (meaning just $1 from
> http://(a-zA-Z\-_0-9).*/.*) or the IP address (nslookup)?
> 
> Also, I can see that this subject has already been discussed in the
> NUTCH-69 ticket, but no solution was reached.
> 
> I don't want to make this message longer, but imagine the following
> situation. I start crawling with three URLs in mind:
> url1 (contains 2 pages),
> url2 (50 pages),
> url3 (2000 pages).
> Now, when crawling is started with 3 threads, then after url1 is
> crawled one thread becomes redundant and the error rate starts
> growing. After url2 is crawled (potentially not fully, due to thread
> collision) there are three threads left for the one huge url3 only.
> This means that I can't get url3 crawled fully, because we are not
> able to avoid thread collision, in spite of the fact that three
> threads were needed at the beginning.
> 
> Anyway, thanks for the answer!
> Lukas
> 
> On 9/14/05, Doug Cutting <cu...@nutch.org> wrote:
> > Lukas Vlcek wrote:
> > > 050913 113818 fetching http://xxxx_some_page_xxxx.html
> > > org.apache.nutch.protocol.RetryLater: Exceeded http.max.delays:
> > > retry later.
> > 
> > This page will be retried the next time the fetcher is run.
> > 
> > This message means that this thread has waited http.max.delays
> > times for fetcher.server.delay seconds, and each time it found that
> > another thread was already accessing a page at this site.  To avoid
> > these, increase http.max.delays to a larger number, or, if you're
> > crawling only servers that you control, set fetcher.threads.per.host
> > to something greater than one (making the fetcher faster, but
> > impolite).
> > 
> > Doug
> >
> 
> 