Posted to user@nutch.apache.org by Lukas Vlcek <lu...@gmail.com> on 2005/09/14 10:30:05 UTC

[nutch] - http.max.delays: retry later issue?

Hi,
I am trying to figure out why some HTML pages didn't get crawled,
and it seems to me that there may be some issues in Nutch.

I believe that the following parameters are the most important in my
case when fetching pages with Nutch 0.7:

<property>
  <name>http.timeout</name>
  <value>10000</value>
  <description>The default network timeout, in milliseconds.</description>
</property>

<property>
  <name>http.max.delays</name>
  <value>3</value>
  <description>The number of times a thread will delay when trying to
  fetch a page.  Each time it finds that a host is busy, it will wait
  fetcher.server.delay.  After http.max.delays attempts, it will give
  up on the page for now.</description>
</property>

<property>
  <name>fetcher.server.delay</name>
  <value>5.0</value>
  <description>The number of seconds the fetcher will delay between
   successive requests to the same server.</description>
</property>

If I understand it correctly, Nutch should try to fetch a page at
least three times, with no less than 5 seconds between individual
attempts.

However, when I look into the crawl log file I can see that one particular
page didn't get into the index; there are only two error messages for it:

error #1:

050913 113818 fetching http://xxxx_some_page_xxxx.html
org.apache.nutch.protocol.RetryLater: Exceeded http.max.delays: retry later.
        at org.apache.nutch.protocol.httpclient.Http.blockAddr(Http.java:133)
        at org.apache.nutch.protocol.httpclient.Http.getProtocolOutput(Http.java:201)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:135)

error #2:

050913 113959 fetch of http://xxxx_some_page_xxxx.html failed with:
java.lang.Exception: org.apache.nutch.protocol.RetryLater: Exceeded
http.max.delays: retry later

I would expect one more record after these two exceptions: either
another error or a message saying that the page was successfully
fetched. But no other message can be found in the log file, and the
page is NOT in the index after fetching has finished.

Can anyone explain to me what I am misunderstanding?

Regards,
Lukas

Re: [nutch] - http.max.delays: retry later issue?

Posted by Doug Cutting <cu...@nutch.org>.
Matthias Jaekle wrote:
> Adjusting the amount of downloads dynamically according to the response
> time should be great.
> 
> But where is the advantage doing this per unique name?

So that we can get a larger portion of the content on sites that do have 
a lot of capacity.  Serializing and delaying are expensive for the 
fetcher, so we'd like to avoid them when they're not needed.  A dynamic 
delay helps avoid unneeded delays, but does not permit multiple threads. 
If the "industry standard" is to serialize on hostnames, not IPs, then we 
needn't punish ourselves by serializing on IPs.  If, by serializing on 
hostnames, multiple threads end up accessing the same IP, then a dynamic 
delay should make all threads slow down to an inoffensive rate.  We want 
to be as polite as needed, but no more polite than that.  Serializing on 
IPs needlessly limits our fetching rate for fast sites.

Also, note that Nutch currently partitions fetchlists by hostname, not 
IP, so that multiple fetchers may already access the same IP.  So we are 
not currently consistent.

Doug

Re: [Nutch-general] Re: [nutch] - http.max.delays: retry later issue?

Posted by Kelvin Tan <ke...@relevanz.com>.
You know, OC/nutch-84 already provides these mechanisms, i.e. via the DefaultFetchList class

1. Block by hostname
2. Configurable wait time by time taken to download.

And this is a good example where, if Matthias' requirements are unique, he can always implement a new FetchList which blocks by IP. No point trying to please everyone.

In Nutch-speak, I guess the FetchList has to be an extension point.
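
Something like the following is what I have in mind -- names and
signatures are hypothetical, not the actual OC/nutch-84 API:

// Hypothetical sketch of a pluggable fetch-list policy; illustrative only.
public interface FetchListPolicy {
  /** Decide whether this URL may be fetched right now. */
  boolean mayFetch(String url);

  /** Report how long the last fetch from this host took, so the policy
   *  can adapt its per-host wait time. */
  void reportFetchTime(String host, long elapsedMillis);
}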

k

On Wed, 21 Sep 2005 21:07:28 +0200, Matthias Jaekle wrote:
>> So most other crawlers use the hostname, not the ip.  That's good
>> to know.
>
> Google and Yahoo, yes. The others I am not sure about.
>
>> Perhaps a dynamic property would help.  If the elapsed time of
>> the previous request is some fraction of the delay then we might
>> lessen the delay.  Similarly, if it is greater or if we get 503s,
>> then we might increase it.  For example, if the fraction were .5
>> and the delay is 2 seconds, then sites which respond faster than
>> a second would get their delay decreased, and sites which respond
>> in more than a second or that return 503 would have their delay
>> increased.  Do you think this would be effective with your site?
>
> Adjusting the number of downloads dynamically according to the
> response time would be great.
>
> But where is the advantage in doing this per unique host name?
>
> If there is no real reason to do so, I would do it dynamically per
> IP or second-level domain, but not per subdomain.
>
> Matthias



Re: [nutch] - http.max.delays: retry later issue?

Posted by Matthias Jaekle <ja...@eventax.de>.
> So most other crawlers use the hostname, not the ip.  That's good to
> know.

Google and Yahoo, yes. The others I am not sure about.

> Perhaps a dynamic property would help.  If the elapsed time of the 
> previous request is some fraction of the delay then we might lessen the 
> delay.  Similarly, if it is greater or if we get 503s, then we might 
> increase it.  For example, if the fraction were .5 and the delay is 2 
> seconds, then sites which respond faster than a second would get their 
> delay decreased, and sites which respond in more than a second or that 
> return 503 would have their delay increased.  Do you think this would be 
> effective with your site?

Adjusting the number of downloads dynamically according to the
response time would be great.

But where is the advantage in doing this per unique host name?

If there is no real reason to do so, I would do it dynamically per
IP or second-level domain, but not per subdomain.

Matthias


Re: [Nutch-general] Re: [nutch] - http.max.delays: retry later issue?

Posted by og...@yahoo.com.
> Perhaps a dynamic property would help.  If the elapsed time of the
> previous request is some fraction of the delay then we might lessen the
> delay.  Similarly, if it is greater or if we get 503s, then we might
> increase it.  For example, if the fraction were .5 and the delay is 2
> seconds, then sites which respond faster than a second would get their
> delay decreased, and sites which respond in more than a second or that
> return 503 would have their delay increased.  Do you think this would be
> effective with your site?

This sounds like something that might be effective for all sites.  Nice
and adaptive, just like that refetch frequency calculation.

Otis


Re: [nutch] - http.max.delays: retry later issue?

Posted by Doug Cutting <cu...@nutch.org>.
Matthias Jaekle wrote:
> All big search engines seem to treat each subdomain as a separate
> host and stress each subdomain as a separate host.

So most other crawlers use the hostname, not the ip.  That's good to 
know.  So if we did make this change, we would not be less polite than 
others.

> In our case that's bad. We cannot handle one request for each
> subdomain at the same time, so we have to answer many search engines
> with 503, depending on the system load. It is especially bad if all
> the big search engines crawl our subdomains at the same time.
> 
> So using IP addresses to limit the access to one host is much better.

Better for your site, yes, but for sites that have the capacity it 
would be better to be more aggressive than we are now, so that we can 
crawl more of the site.  I've recently been running dmoz-seeded crawls 
of the whole web, and I find it hard to use much bandwidth without 
getting a lot of "http.max.delays exceeded" errors, meaning I'm simply 
unable to fetch much of many popular sites.

Perhaps a dynamic property would help.  If the elapsed time of the 
previous request is some fraction of the delay then we might lessen the 
delay.  Similarly, if it is greater or if we get 503s, then we might 
increase it.  For example, if the fraction were .5 and the delay is 2 
seconds, then sites which respond faster than a second would get their 
delay decreased, and sites which respond in more than a second or that 
return 503 would have their delay increased.  Do you think this would be 
effective with your site?
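
Roughly, in code (a sketch of the rule only, with made-up names --
nothing like this exists in Nutch today):

// Adaptive per-host delay: shrink it when the server answers quickly,
// grow it when the server is slow or returns 503.
public class AdaptiveDelay {
  private static final float FRACTION = 0.5f; // tunable fraction of the delay
  private long delayMillis = 2000;            // start at 2 seconds

  /** Adjust the delay after each response from this host. */
  public void update(long elapsedMillis, int httpStatus) {
    long threshold = (long) (FRACTION * delayMillis); // e.g. 0.5 * 2000 ms = 1 s
    if (httpStatus == 503 || elapsedMillis > threshold) {
      delayMillis *= 2;                              // slow or overloaded: back off
    } else if (elapsedMillis < threshold) {
      delayMillis = Math.max(delayMillis / 2, 100);  // fast: be a bit more aggressive
    }
  }

  public long delayMillis() { return delayMillis; }
}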

Doug

Re: [nutch] - http.max.delays: retry later issue?

Posted by Matthias Jaekle <ja...@eventax.de>.
>> In this case, a host is an IP address.
> I've thought about this more, and wonder if perhaps this should be 
> switched so that host names are blocked from simultaneous fetching rather 
> than IP addresses.  I recently spoke with Carlos Castillo, author of the 
> WIRE crawler (http://www.cwr.cl/projects/WIRE/) and it blocks hosts by 
> name, not IP.  What do others think?

I think IP addresses are much better.

We are splitting one of our big websites into many subdomains.
We think our customers can recognize our URLs better this way.
I think there are other people out there doing the same.
All subdomains are handled by the same host / IP.

All big search engines seem to treat each subdomain as a separate
host and stress each subdomain as a separate host.
In our case that's bad. We cannot handle one request for each
subdomain at the same time, so we have to answer many search engines
with 503, depending on the system load. It is especially bad if all
the big search engines crawl our subdomains at the same time.

So using IP addresses to limit the access to one host is much better.

Matthias

Re: [nutch] - http.max.delays: retry later issue?

Posted by Lukas Vlcek <lu...@gmail.com>.
Hi,

I haven't had a chance to look into the code yet and I am probably not
the best one to answer this question, but if blocking is changed from
IP to hostname, how would that help in my example scenario described a
few mails above?

As far as I understand this problem, there is no way a crawler can
learn what physical server structure is behind a specific domain name
(or IP address) [not to mention that this can change dynamically
during fetching, correct?]. On the other hand, it can very well learn
how fast the response is (and in fact this is the ONLY information it
can learn, correct?). So it all comes down to the question of how
aggressive or polite the crawler should be. I believe the big
companies (like Google) must have a much more sophisticated crawler
system, because they never know what they are going to face.

How about implementing new crawler behaviour which is more dynamically
driven, as opposed to a static_number_of_threads? Something which allows
me to define a maximum number of threads in total and per host, and some
factor which says that if a host/IP responds *slowly*, then the number
of concurrent threads for that host (or in total) is lowered. [I know it
won't be that easy... :-)]
This means more complicated analysis will be needed during the fetching
process, but I think it is worth it.
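
Something along these lines is what I imagine (all names made up just to
illustrate the idea -- nothing like this exists in Nutch):

// Per-host thread budget: shrinks when a host responds slowly, grows back
// toward the global cap when it responds quickly.  Illustrative only.
import java.util.HashMap;
import java.util.Map;

public class HostThreadBudget {
  private final int maxThreadsPerHost;
  private final long slowMillis;
  private final Map<String, Integer> budget = new HashMap<String, Integer>();

  public HostThreadBudget(int maxThreadsPerHost, long slowMillis) {
    this.maxThreadsPerHost = maxThreadsPerHost;
    this.slowMillis = slowMillis;
  }

  /** Record how long the last fetch from this host took and adjust its budget. */
  public synchronized void report(String host, long elapsedMillis) {
    Integer current = budget.get(host);
    int threads = (current == null) ? maxThreadsPerHost : current.intValue();
    if (elapsedMillis > slowMillis) {
      threads = Math.max(1, threads - 1);                  // slow host: fewer threads
    } else {
      threads = Math.min(maxThreadsPerHost, threads + 1);  // fast host: allow more
    }
    budget.put(host, Integer.valueOf(threads));
  }

  /** How many concurrent fetcher threads this host may use right now. */
  public synchronized int allowedThreads(String host) {
    Integer current = budget.get(host);
    return (current == null) ? maxThreadsPerHost : current.intValue();
  }
}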

Just my 2 cents.

Regards,
Lukas

On 9/21/05, Doug Cutting <cu...@nutch.org> wrote:
> Doug Cutting wrote:
> > Lukas Vlcek wrote:
> >
> >> However, this leads me to the question what exactly
> >> fetcher.threads.per.host value is used for? More specifically what
> >> *host* means in Nutch configuration world?
> >
> >
> > In this case, a host is an IP address.
> 
> I've thought about this more, and wonder if perhaps this should be
> switched so that host names are blocked from simultaneous fetching rather
> than IP addresses.  I recently spoke with Carlos Castillo, author of the
> WIRE crawler (http://www.cwr.cl/projects/WIRE/) and it blocks hosts by
> name, not IP.  What do others think?
> 
> Doug
>

Re: [nutch] - http.max.delays: retry later issue?

Posted by Doug Cutting <cu...@nutch.org>.
Doug Cutting wrote:
> Lukas Vlcek wrote:
> 
>> However, this leads me to the question what exactly
>> fetcher.threads.per.host value is used for? More specifically what
>> *host* means in Nutch configuration world?
> 
> 
> In this case, a host is an IP address.

I've thought about this more, and wonder if perhaps this should be 
switched so that host names are blocked from simultaneous fetching rather 
than IP addresses.  I recently spoke with Carlos Castillo, author of the 
WIRE crawler (http://www.cwr.cl/projects/WIRE/) and it blocks hosts by 
name, not IP.  What do others think?

Doug

Re: [nutch] - http.max.delays: retry later issue?

Posted by Doug Cutting <cu...@nutch.org>.
Lukas Vlcek wrote:
> However, this leads me to the question what exactly
> fetcher.threads.per.host value is used for? More specifically what
> *host* means in Nutch configuration world?

In this case, a host is an IP address.

> Does it mean that if fetcher.threads.per.host value is set to 1 then
> concurrent crawling of two documents from the same domain name is
> forbidden (e.g.: http://www.abc.com/1.html and
> http://www.abc.com/2.html) while in fact these two documents might be
> physically located on two different servers without our knowledge?

Since IP address is used, if a site uses round-robin DNS, then we could 
get two different IP addresses for the same host name and fetch them 
simultaneously.  However in practice the DNS lookup will probably be 
cached somewhere (by the JVM or by our DNS server) so that we'll almost 
always get the same address for a given host.

> On the other hand one physical server can be assigned multiple domain
> names so crawling for http://www.abc.com/1.html and
> http://www.xyz.com/1.html concurrently means that the same server
> could be in charge.

In this case only a single thread will be permitted to access this 
server at a time.
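
In plain Java terms (just an illustration of the idea, not the actual
fetcher code path), the key used for blocking is whatever the hostname
resolves to:

import java.net.InetAddress;
import java.net.UnknownHostException;

public class ResolveDemo {
  public static void main(String[] args) throws UnknownHostException {
    // If both hostnames resolve to the same address, only one thread at a
    // time is allowed to fetch from them.
    InetAddress a = InetAddress.getByName("www.abc.com");
    InetAddress b = InetAddress.getByName("www.xyz.com");
    System.out.println(a.getHostAddress().equals(b.getHostAddress())
        ? "same server: fetches are serialized"
        : "different servers: fetches may proceed in parallel");
  }
}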

> When setting fetcher.threads.per.host value what should I have in
> mind: DNS domain name (meaning just $1 from
> http://(a-zA-Z\-_0-9).*/.*) or IP address (nslookup)?

IP address.

> I don't want to make this message longer but imagine the following situation:
> I start crawling with three urls in mind like:
> url1 (contains 2 pages),
> url2 (50 pages),
> url3 (2000 pages)
> Now, when crawling is started with 3 threads, then after url1 is
> crawled one thread becomes redundant and the error rate starts
> growing. After url2 is crawled (potentially not fully, due to thread
> collisions) there are three threads left for one huge url3 only. This
> means that I can't get url3 crawled fully, because we are not able to
> avoid thread collisions, in spite of the fact that three threads were
> needed at the beginning.

If you set http.max.delays to a large value then you will get no errors.
All the threads will be used initially; then, as hosts are exhausted,
threads will block each other.

Doug

Re: [Nutch-general] Re: [nutch] - http.max.delays: retry later issue?

Posted by og...@yahoo.com.
Lukas,

The end of your message shows that you got things right.  Yes, in that
scenario your fetching will take longer, simply because one host has so
many more URLs than the others.

As for what is considered a host, I believe it's a full host name (e.g.
www.example.com), not a domain nor the IP (even though, as you say, a
single host name may actually be mapped to multiple physical machines,
and a single physical server can host multiple distinct hosts).

Otis

--- Lukas Vlcek <lu...@gmail.com> wrote:

> Hi Doug,
> 
> I see what you mean and it makes a lot more sense now.
> 
> However, this leads me to the question what exactly
> fetcher.threads.per.host value is used for? More specifically what
> *host* means in Nutch configuration world?
> 
> Does it mean that if fetcher.threads.per.host value is set to 1 then
> concurrent crawling of two documents from the same domain name is
> forbidden (e.g.: http://www.abc.com/1.html and
> http://www.abc.com/2.html) while in fact these two documents might be
> physically located on two different servers without our knowledge?
> 
> On the other hand one physical server can be assigned multiple domain
> names so crawling for http://www.abc.com/1.html and
> http://www.xyz.com/1.html concurrently means that the same server
> could be in charge.
>  
> When setting fetcher.threads.per.host value what should I have in
> mind: DNS domain name (meaning just $1 from
> http://(a-zA-Z\-_0-9).*/.*) or IP address (nslookup)?
> 
> Also I can see that this subject has been already discussed in
> NUTCH-69 ticket but no solution was made.
> 
> I don't want to make this message longer but imagine the following
> situation:
> I start crawling with three urls in mind like:
> url1 (contains 2 pages),
> url2 (50 pages),
> url3 (2000 pages)
> Now, when crawling is started with 3 threads, then after url1 is
> crawled one thread becomes redundant and the error rate starts
> growing. After url2 is crawled (potentially not fully, due to thread
> collisions) there are three threads left for one huge url3 only. This
> means that I can't get url3 crawled fully, because we are not able to
> avoid thread collisions, in spite of the fact that three threads were
> needed at the beginning.
> 
> Anyway, thanks for the answer!
> Lukas
> 
> On 9/14/05, Doug Cutting <cu...@nutch.org> wrote:
> > Lukas Vlcek wrote:
> > > 050913 113818 fetching http://xxxx_some_page_xxxx.html
> > > org.apache.nutch.protocol.RetryLater: Exceeded http.max.delays:
> > > retry later.
> > 
> > This page will be retried the next time the fetcher is run.
> > 
> > This message means that this thread has waited http.max.delays times
> > for fetcher.server.delay seconds, and each time it found that another
> > thread was already accessing a page at this site.  To avoid these,
> > increase http.max.delays to a larger number, or, if you're crawling
> > only servers that you control, set fetcher.threads.per.host to
> > something greater than one (making the fetcher faster, but impolite).
> > 
> > Doug
> >
> 


Re: [nutch] - http.max.delays: retry later issue?

Posted by Lukas Vlcek <lu...@gmail.com>.
Hi Doug,

I see what you mean and it makes a lot more sense now.

However, this leads me to the question what exactly
fetcher.threads.per.host value is used for? More specifically what
*host* means in Nutch configuration world?

Does it mean that if fetcher.threads.per.host value is set to 1 then
concurrent crawling of two documents from the same domain name is
forbidden (e.g.: http://www.abc.com/1.html and
http://www.abc.com/2.html) while in fact these two documents might be
physically located on two different servers without our knowledge?

On the other hand one physical server can be assigned multiple domain
names so crawling for http://www.abc.com/1.html and
http://www.xyz.com/1.html concurrently means that the same server
could be in charge.
 
When setting fetcher.threads.per.host value what should I have in
mind: DNS domain name (meaning just $1 from
http://(a-zA-Z\-_0-9).*/.*) or IP address (nslookup)?

Also I can see that this subject has been already discussed in
NUTCH-69 ticket but no solution was made.

I don't want to make this message longer but imagine the following situation:
I start crawling with three urls in mind like:
url1 (contains 2 pages),
url2 (50 pages),
url3 (2000 pages)
Now, when crawling is started with 3 threads, then after url1 is
crawled one thread becomes redundant and the error rate starts
growing. After url2 is crawled (potentially not fully, due to thread
collisions) there are three threads left for one huge url3 only. This
means that I can't get url3 crawled fully, because we are not able to
avoid thread collisions, in spite of the fact that three threads were
needed at the beginning.

Anyway, thanks for the answer!
Lukas

On 9/14/05, Doug Cutting <cu...@nutch.org> wrote:
> Lukas Vlcek wrote:
> > 050913 113818 fetching http://xxxx_some_page_xxxx.html
> > org.apache.nutch.protocol.RetryLater: Exceeded http.max.delays: retry later.
> 
> This page will be retried the next time the fetcher is run.
> 
> This message means that this thread has waited http.max.delays times for
> fetcher.server.delay seconds, and each time it found that another thread
> was already accessing a page at this site.  To avoid these, increase
> http.max.delays to a larger number, or, if you're crawling only servers
> that you control, set fetcher.threads.per.host to something greater than
> one (making the fetcher faster, but impolite).
> 
> Doug
>

Re: [nutch] - http.max.delays: retry later issue?

Posted by Doug Cutting <cu...@nutch.org>.
Lukas Vlcek wrote:
> 050913 113818 fetching http://xxxx_some_page_xxxx.html
> org.apache.nutch.protocol.RetryLater: Exceeded http.max.delays: retry later.

This page will be retried the next time the fetcher is run.

This message means that this thread has waited http.max.delays times for 
fetcher.server.delay seconds, and each time it found that another thread 
was already accessing a page at this site.  To avoid these, increase 
http.max.delays to a larger number, or, if you're crawling only servers 
that you control, set fetcher.threads.per.host to something greater than 
one (making the fetcher faster, but impolite).
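
For example, something like this in the file where you override the
defaults (the values here are only illustrative):

<property>
  <name>http.max.delays</name>
  <value>100</value>
</property>

<property>
  <name>fetcher.threads.per.host</name>
  <value>2</value>
</property>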

Doug