Posted to dev@nutch.apache.org by "Doug Cook (JIRA)" <ji...@apache.org> on 2009/02/28 20:06:12 UTC

[jira] Commented: (NUTCH-419) unavailable robots.txt kills fetch

    [ https://issues.apache.org/jira/browse/NUTCH-419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12677704#action_12677704 ] 

Doug Cook commented on NUTCH-419:
---------------------------------

I ran into this same problem, and spent some time debugging it. Here's what's going on:

Symptom:
Slow sites, or sites that hang, are causing Fetcher threads to hang for very long periods of time.

I was running a fetcher with 400 threads, and found it getting slower and slower as the fetch progressed. A look at the stack over time showed more and more threads blocked in the same place, in MultiThreadedHttpConnectionManager, waiting for a connection, called via http.getResponse() in getRobotRulesSet(). After hours of running, more than 390 of the 400 threads were blocked!

A little sleuthing revealed what's going on. This happens to be the one place in the code where we call http.getResponse() directly, instead of via getProtocolOutput, which does its own connection throttling, so here we're relying on MultiThreadedHttpConnectionManager's throttling. The problem was that MultiThreadedHttpConnectionManager was ignoring the connection timeout and waiting indefinitely for other running threads to release a connection to the same site. If any large site times out on /robots.txt fetches, threads "pile up" over time: they don't time out, and since the RobotRules cache doesn't cache timeouts, every new thread hitting that site tries to fetch /robots.txt itself and hangs for an ever-increasing amount of time while it waits for the (ever-growing number of) earlier threads to give up and relinquish the single connection.

We seem to be victims of HttpClient's somewhat byzantine parameter architecture. We do set the timeout, but we set it on the MultiThreadedHttpConnectionManager, which ignores it. If I set the parameter on the HttpClient instead, the problem goes away. I haven't looked at the HttpClient source, but I'm guessing that what happens is that internally, HttpClient is calling MultiThreadedHttpConnectionManager.getConnectionWithTimeout or some such, thereby overriding the connection manager's own timeouts. At any rate, the fix is very straightforward, and a patch follows.
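
To illustrate the idea (just a sketch, not the patch itself; timeoutMs stands in for the configured http.timeout value), the change amounts to setting the timeout on the HttpClient's own params as well, along these lines with the Commons HttpClient 3.x API:

    import org.apache.commons.httpclient.HttpClient;
    import org.apache.commons.httpclient.MultiThreadedHttpConnectionManager;
    import org.apache.commons.httpclient.params.HttpConnectionManagerParams;

    // Sketch only -- not the actual patch.
    static HttpClient configureClient(int timeoutMs) {
      MultiThreadedHttpConnectionManager connectionManager =
          new MultiThreadedHttpConnectionManager();

      // Connect/read timeouts for individual requests live on the manager...
      HttpConnectionManagerParams managerParams = connectionManager.getParams();
      managerParams.setConnectionTimeout(timeoutMs);
      managerParams.setSoTimeout(timeoutMs);

      HttpClient client = new HttpClient(connectionManager);

      // ...but the wait for a free pooled connection is bounded by the
      // client's own params, so the timeout has to be set here too, or a
      // thread can wait indefinitely for the per-host connection.
      client.getParams().setConnectionManagerTimeout(timeoutMs);
      return client;
    }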

I highly recommend this patch; it hugely sped up my crawl, and I suspect it will do the same for others with similar configurations and at least one slow or hung large-ish site in the crawl.

There are two related issues: first, the RobotRules cache doesn't keep track of failure cases, so sites that are down will be accessed many, many times; and second, simultaneous /robots.txt requests to the same site will all hit the site. I've got a locally modified version of getRobotRulesSet() that fixes these two problems as well. It's not perfect, but it makes the fetch both faster and more polite. I'll try to spiff that up and include it in a future patch, if anyone's interested.
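
To give the flavor of it, here is a rough sketch of the approach (not my actual code; RobotRuleSet, EMPTY_RULES, parse() and fetchRobotsTxt() are stand-ins): only the first thread to miss the cache does the real /robots.txt fetch for a host, and failures get cached just like successes, so a down site is only probed once.

    // Rough sketch only; uses java.util.concurrent (ConcurrentHashMap,
    // FutureTask, Callable, ExecutionException). RobotRuleSet, EMPTY_RULES,
    // parse() and fetchRobotsTxt() are placeholders, not the real Nutch code.
    private final ConcurrentMap<String, FutureTask<RobotRuleSet>> cache =
        new ConcurrentHashMap<String, FutureTask<RobotRuleSet>>();

    RobotRuleSet getRobotRulesSet(final String host) throws InterruptedException {
      FutureTask<RobotRuleSet> task = cache.get(host);
      if (task == null) {
        FutureTask<RobotRuleSet> newTask = new FutureTask<RobotRuleSet>(
            new Callable<RobotRuleSet>() {
              public RobotRuleSet call() {
                try {
                  // only one thread per host ever does the real fetch
                  return parse(fetchRobotsTxt(host));
                } catch (Exception e) {
                  // cache the failure too, so a down site isn't re-polled
                  // by every new thread that encounters it
                  return EMPTY_RULES;
                }
              }
            });
        task = cache.putIfAbsent(host, newTask);
        if (task == null) {      // we won the race: do the fetch
          task = newTask;
          newTask.run();
        }
      }
      try {
        return task.get();       // everyone else waits for / reuses the result
      } catch (ExecutionException e) {
        return EMPTY_RULES;
      }
    }

In practice the cached failure entries would probably want some kind of expiry, so a temporarily unreachable site isn't skipped for the whole crawl.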

> unavailable robots.txt kills fetch
> ----------------------------------
>
>                 Key: NUTCH-419
>                 URL: https://issues.apache.org/jira/browse/NUTCH-419
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 0.8.1
>         Environment: Fetcher is behind a squid proxy, but I am pretty sure this is irrelevant. 
> Nutch in local mode, running on a linux machine with 2GB RAM. 
>            Reporter: Carsten Lehmann
>         Attachments: last_robots.txt_requests_squidlog.txt, nutch-log.txt, squid_access_log_tail1000.txt
>
>
> I think there is another robots.txt-related problem which is not
> addressed by NUTCH-344, but which also results in an aborted fetch.
> I am sure that in my last fetch all 17 fetcher threads died while
> waiting for a robots.txt file to be delivered by a web server that
> was not responding properly.
> I looked at the squid access log (all fetch threads go through the
> proxy). It ends with many HTTP 504 errors ("gateway timeout") caused
> by one particular robots.txt URL:
> <....>
> 1166652253.332 899427 127.0.0.1 TCP_MISS/504 1450 GET
> http://gso.gbv.de/robots.txt - DIRECT/193.174.240.8 text/html
> 1166652343.350 899664 127.0.0.1 TCP_MISS/504 1450 GET
> http://gso.gbv.de/robots.txt - DIRECT/193.174.240.8 text/html
> 1166652353.560 899871 127.0.0.1 TCP_MISS/504 1450 GET
> http://gso.gbv.de/robots.txt - DIRECT/193.174.240.8 text/html
> These entries mean that it takes 15 minutes before the request ends
> with a timeout. This can be calculated from the squid log: the first
> column is the request time (in UTC seconds) and the second column is
> the duration of the request (in ms), so 900000 / 1000 / 60 = 15 minutes.
> As far as I understand it, every time a fetch thread tries to get
> this robots.txt file, the thread waits for the duration of the
> request (15 minutes).
> If this is right, then all 17 fetcher threads were caught in this
> trap when fetching was aborted, as there are 17 requests in the
> squid log that had not yet timed out when the message "aborting
> with 17 threads" was written to the Nutch log file.
> Setting fetcher.max.crawl.delay cannot help here: I see 296 access
> attempts for this robots.txt URL in the squid log of this crawl,
> even though fetcher.max.crawl.delay is set to 30.
