Posted to dev@nutch.apache.org by "Carsten Lehmann (JIRA)" <ji...@apache.org> on 2006/12/24 14:26:22 UTC

[jira] Commented: (NUTCH-419) unavailable robots.txt kills fetch

    [ http://issues.apache.org/jira/browse/NUTCH-419?page=comments#action_12460696 ] 
            
Carsten Lehmann commented on NUTCH-419:
---------------------------------------

Some more explanations:

Above I meant http://gso.gbv.de/XYZ, not http://XYZ.gso.gbv.de, of course.


I have attached two other log extracts:

a) squid_access_log_tail1000.txt 

This file contains the last 1000 lines of the squid access log.
It shows what the fetcher was actually doing before the fetch was aborted.
It ends with a number of requests to that particular robots.txt URL.

b) last_robots.txt_requests_squidlog.txt

This file shows the last requests to that particular robots.txt URL.

It might be of concern that near the end of this file the line
1166652145.652 1042451 127.0.0.1 TCP_MISS/504 1450 GET http://gso.gbv.de/robots.txt - DIRECT/193.174.240.8 text/html
repeats 14 times.
This means that there were 14 simultaneous requests to this URL, right?
Are requests to the robots.txt file not covered by "fetcher.server.delay", which is set to "2.0" in my configuration?
In any case, this looks like misbehaviour.
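
To make the question about fetcher.server.delay concrete, here is a minimal, hypothetical Java sketch (my own illustration, not Nutch's actual code) of a per-host gate that would also cover robots.txt requests. If robots.txt fetches bypass such a gate, several threads that hit the same previously unseen host at roughly the same time would each issue their own robots.txt request, which would be one way to explain the 14 identical entries above.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    /**
     * Hypothetical sketch only -- this is NOT Nutch's actual implementation.
     * A per-host gate that enforces a minimum delay between any two requests
     * to the same host, including requests for /robots.txt.
     */
    public class HostDelayGate {

      private final long delayMs; // e.g. 2000 ms for fetcher.server.delay = 2.0
      private final Map<String, Object> hostLocks = new ConcurrentHashMap<>();
      private final Map<String, Long> lastRequest = new ConcurrentHashMap<>();

      public HostDelayGate(long delayMs) {
        this.delayMs = delayMs;
      }

      /**
       * Blocks the calling thread until at least delayMs has passed since the
       * previous request to the same host, then records the new request time.
       */
      public void acquire(String host) throws InterruptedException {
        Object lock = hostLocks.computeIfAbsent(host, h -> new Object());
        synchronized (lock) {
          long last = lastRequest.getOrDefault(host, 0L);
          long wait = last + delayMs - System.currentTimeMillis();
          if (wait > 0) {
            Thread.sleep(wait);
          }
          lastRequest.put(host, System.currentTimeMillis());
        }
      }
    }

Whether Nutch's fetcher actually routes robots.txt requests through its per-host delay is exactly the open question here.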

> unavailable robots.txt kills fetch
> ----------------------------------
>
>                 Key: NUTCH-419
>                 URL: http://issues.apache.org/jira/browse/NUTCH-419
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 0.8.1
>         Environment: Fetcher is behind a squid proxy, but I am pretty sure this is irrelevant. 
> Nutch in local mode, running on a linux machine with 2GB RAM. 
>            Reporter: Carsten Lehmann
>         Attachments: last_robots.txt_requests_squidlog.txt, nutch-log.txt, squid_access_log_tail1000.txt
>
>
> I think there is another robots.txt-related problem which is not
> addressed by NUTCH-344, but which also results in an aborted fetch.
> I am sure that in my last fetch all 17 fetcher threads died while
> they were waiting for a robots.txt file to be delivered by a web
> server that was not responding properly.
> I looked at the squid access log, which is used by all fetcher threads.
> It ends with many HTTP 504 errors ("gateway timeout") caused by a
> certain robots.txt URL:
> <....>
> 1166652253.332 899427 127.0.0.1 TCP_MISS/504 1450 GET http://gso.gbv.de/robots.txt - DIRECT/193.174.240.8 text/html
> 1166652343.350 899664 127.0.0.1 TCP_MISS/504 1450 GET http://gso.gbv.de/robots.txt - DIRECT/193.174.240.8 text/html
> 1166652353.560 899871 127.0.0.1 TCP_MISS/504 1450 GET http://gso.gbv.de/robots.txt - DIRECT/193.174.240.8 text/html
> These entries mean that it takes 15 minutes before the request ends
> with a timeout.
> This can be calculated from the squid log: the first column is the
> request time (in UTC seconds), the second column is the duration of
> the request (in ms), and 900000 ms / 1000 / 60 = 15 minutes.
> As far as I understand it, every time a fetcher thread tries to get
> this robots.txt file, the thread busy-waits for the duration of the
> request (15 minutes).
> If this is right, then all 17 fetcher threads were caught in this trap
> when fetching was aborted, as there are 17 requests in the squid log
> which did not time out before the message "aborting with 17 threads"
> was written to the Nutch log file.
> Setting fetcher.max.crawl.delay cannot help here.
> I see 296 access attempts in total for this robots.txt URL in the
> squid log of this crawl, even though fetcher.max.crawl.delay is set to 30.
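
As a quick sanity check of the arithmetic in the description above, here is a small Java sketch (my own illustration, not part of Nutch) that parses one of the quoted log lines, assuming squid's native access log format in which the second field is the elapsed request time in milliseconds:

    /**
     * Quick check of the duration arithmetic above. Assumes squid's native
     * access.log format, where the first field is the request timestamp
     * (epoch seconds) and the second field is the elapsed time in milliseconds.
     */
    public class SquidLogDuration {

      public static void main(String[] args) {
        String line = "1166652253.332 899427 127.0.0.1 TCP_MISS/504 1450 GET "
            + "http://gso.gbv.de/robots.txt - DIRECT/193.174.240.8 text/html";

        String[] fields = line.trim().split("\\s+");
        long elapsedMs = Long.parseLong(fields[1]);   // 899427 ms
        double minutes = elapsedMs / 1000.0 / 60.0;   // ~= 15 minutes

        System.out.printf("request took %.1f minutes%n", minutes);
        // prints: request took 15.0 minutes
      }
    }

Durations such as 899427 ms are roughly 900000 ms, i.e. about 15 minutes per attempt, which matches the timeout described in the issue.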
