Posted to user@nutch.apache.org by Carsten Lehmann <ca...@googlemail.com> on 2006/12/21 11:40:29 UTC

unavailable robots.txt kills fetch (not NUTCH-344)

Dear List,

I think there is another robots.txt-related problem, not addressed by
NUTCH-344, which also results in an aborted fetch.

I am sure that in my last fetch all fetcher threads died while they
were waiting for a robots.txt file to be delivered by a web server
that was not responding properly.

I looked at the access log of the squid proxy that all fetcher threads
go through. It ends with many HTTP 504 ("gateway timeout") errors
caused by one particular robots.txt URL:

<....>
1166652253.332 899427 127.0.0.1 TCP_MISS/504 1450 GET
http://gso.gbv.de/robots.txt - DIRECT/193.174.240.8 text/html
1166652343.350 899664 127.0.0.1 TCP_MISS/504 1450 GET
http://gso.gbv.de/robots.txt - DIRECT/193.174.240.8 text/html
1166652353.560 899871 127.0.0.1 TCP_MISS/504 1450 GET
http://gso.gbv.de/robots.txt - DIRECT/193.174.240.8 text/html

These entries show that it takes 15 minutes before each request ends
with a timeout. This can be read off the squid log: the first column
is the time of the request (Unix time in seconds) and the second
column is the duration of the request in milliseconds, so
900000 ms / 1000 / 60 = 15 minutes.
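
For reference, here is a minimal sketch (plain Java, not Nutch code) of
that calculation, taking the first log line above as input:

// Minimal sketch: compute the request duration from one squid access.log line.
// Column 1 is the request timestamp (Unix time, seconds), column 2 the elapsed
// time in milliseconds.
public class SquidLogDuration {
    public static void main(String[] args) {
        String line = "1166652253.332 899427 127.0.0.1 TCP_MISS/504 1450 GET "
                + "http://gso.gbv.de/robots.txt - DIRECT/193.174.240.8 text/html";
        String[] fields = line.trim().split("\\s+");
        long elapsedMs = Long.parseLong(fields[1]);     // 899427 ms
        double minutes = elapsedMs / 1000.0 / 60.0;     // ~15 minutes
        System.out.printf("robots.txt request took %.1f minutes%n", minutes);
    }
}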

As far as I understand it, every time a fetcher thread tries to get
this robots.txt file, the thread waits for the full duration of the
request (15 minutes). If this is right, then all 17 fetcher threads
were caught in this trap at the time fetching was aborted: there are
17 requests in the squid log that had not yet timed out when the
message "aborting with 17 threads" was written to the Nutch log file.

Setting fetcher.max.crawl.delay cannot help here. I see 296 access
attempts in total for this robots.txt URL in the squid log of this
crawl, even though fetcher.max.crawl.delay is set to 30.
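
As far as I can tell, fetcher.max.crawl.delay only caps the Crawl-Delay
value announced *inside* a robots.txt file (in seconds); it says nothing
about how long the robots.txt request itself may take, which should
instead be bounded by http.timeout. A hypothetical snippet just to show
which settings are involved (property names as in nutch-default.xml;
whether http.timeout is actually honoured for these proxied robots.txt
requests is exactly my question):

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.util.NutchConfiguration;

// Hypothetical check of the two settings discussed here.
public class ShowFetchTimeouts {
    public static void main(String[] args) {
        Configuration conf = NutchConfiguration.create();
        // Largest Crawl-Delay (seconds) accepted from robots.txt before a page is skipped.
        int maxCrawlDelay = conf.getInt("fetcher.max.crawl.delay", 30);
        // Network timeout (milliseconds) for a single HTTP request, robots.txt included.
        int httpTimeout = conf.getInt("http.timeout", 10000);
        System.out.println("fetcher.max.crawl.delay = " + maxCrawlDelay + " s");
        System.out.println("http.timeout            = " + httpTimeout + " ms");
    }
}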

Are these assumptions correct? If so, should I open a Jira issue?

Thanks in advance,

Best regards
Carsten

Re: unavailable robots.txt kills fetch (not NUTCH-344)

Posted by Andrzej Bialecki <ab...@getopt.org>.
Carsten Lehmann wrote:
> Dear List,
>
> I think there is another robots.txt-related problem, not addressed by
> NUTCH-344, which also results in an aborted fetch.
>
> I am sure that in my last fetch all fetcher threads died while they
> were waiting for a robots.txt file to be delivered by a web server
> that was not responding properly.
>
> I looked at the access log of the squid proxy that all fetcher threads
> go through. It ends with many HTTP 504 ("gateway timeout") errors
> caused by one particular robots.txt URL:
>
> <....>
> 1166652253.332 899427 127.0.0.1 TCP_MISS/504 1450 GET
> http://gso.gbv.de/robots.txt - DIRECT/193.174.240.8 text/html
> 1166652343.350 899664 127.0.0.1 TCP_MISS/504 1450 GET
> http://gso.gbv.de/robots.txt - DIRECT/193.174.240.8 text/html
> 1166652353.560 899871 127.0.0.1 TCP_MISS/504 1450 GET
> http://gso.gbv.de/robots.txt - DIRECT/193.174.240.8 text/html
>
> These entries show that it takes 15 minutes before each request ends
> with a timeout. This can be read off the squid log: the first column
> is the time of the request (Unix time in seconds) and the second
> column is the duration of the request in milliseconds, so
> 900000 ms / 1000 / 60 = 15 minutes.
>
> As far as I understand it, every time a fetcher thread tries to get
> this robots.txt file, the thread waits for the full duration of the
> request (15 minutes). If this is right, then all 17 fetcher threads
> were caught in this trap at the time fetching was aborted: there are
> 17 requests in the squid log that had not yet timed out when the
> message "aborting with 17 threads" was written to the Nutch log file.
>
> Setting fetcher.max.crawl.delay cannot help here. I see 296 access
> attempts in total for this robots.txt URL in the squid log of this
> crawl, even though fetcher.max.crawl.delay is set to 30.
>
> Are these assumptions correct? If so, should I open a Jira issue?

Please file a report, and most of all indicate which version of Nutch 
you are using (or SVN revision if it's not an official release).

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com