Posted to dev@nutch.apache.org by "Julien Nioche (JIRA)" <ji...@apache.org> on 2014/06/27 09:49:25 UTC

[jira] [Resolved] (NUTCH-385) Improve description of thread related configuration for Fetcher

     [ https://issues.apache.org/jira/browse/NUTCH-385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche resolved NUTCH-385.
---------------------------------

    Resolution: Fixed

trunk : Committed revision 1605978.
2.x : Committed revision 1605979.

Thanks Lufeng, I've modified the description as per your comment.

bq. Another issue is that the property "fetcher.max.crawl.delay" is not consistent with "fetcher.server.delay" and "fetcher.server.min.delay". Would it be more suitable to rename it to "fetcher.server.max.delay"?

Let's track this in a separate issue - we'd need to make the change backwards compatible, etc. "fetcher.robots.max.delay" could be a more accurate name for it, as it applies only to the values coming from the robots directives.
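For reference, a minimal sketch of how these delay-related properties relate, using the Hadoop Configuration API that Nutch builds on. The property names are the real ones under discussion; the defaults and the clamping policy shown here are simplified assumptions for illustration, not the actual Fetcher behaviour:

{code:java}
import org.apache.hadoop.conf.Configuration;

// Sketch of how the delay-related Fetcher properties relate to each other.
// Property names are real Nutch settings; the defaults and clamping policy
// shown here are simplified assumptions, not the actual Fetcher code.
public class DelayConfigSketch {

  // robotsCrawlDelayMs is the Crawl-delay from robots.txt, or null if absent.
  public static long effectiveDelayMs(Configuration conf, Long robotsCrawlDelayMs) {
    // Delay between successive requests to the same host when robots.txt
    // gives no Crawl-delay.
    long serverDelay = (long) (conf.getFloat("fetcher.server.delay", 5.0f) * 1000);
    // Lower bound on the per-host delay.
    long minDelay = (long) (conf.getFloat("fetcher.server.min.delay", 0.0f) * 1000);
    // Cap on a robots.txt Crawl-delay. Despite the "crawl" in its name it only
    // constrains the robots value, hence the proposed "fetcher.robots.max.delay".
    long maxCrawlDelay = (long) (conf.getFloat("fetcher.max.crawl.delay", 30.0f) * 1000);

    if (robotsCrawlDelayMs == null) {
      return Math.max(serverDelay, minDelay);
    }
    // In practice a host demanding more than the cap may be skipped entirely;
    // clamping here is purely illustrative.
    return Math.min(robotsCrawlDelayMs, maxCrawlDelay);
  }
}
{code}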

Thanks!

 



> Improve description of thread related configuration for Fetcher
> ---------------------------------------------------------------
>
>                 Key: NUTCH-385
>                 URL: https://issues.apache.org/jira/browse/NUTCH-385
>             Project: Nutch
>          Issue Type: Bug
>          Components: documentation, fetcher
>            Reporter: Chris Schneider
>            Assignee: Julien Nioche
>             Fix For: 1.9
>
>         Attachments: NUTCH-385.patch
>
>
> For some time I've been puzzled by the interaction between two parameters that control how often the fetcher can access a particular host:
> 1) The server delay, which comes back from the remote server during our processing of the robots.txt file, and which can be limited by fetcher.max.crawl.delay.
> 2) The fetcher.threads.per.host value, particularly when this is greater than the default of 1.
> According to my (limited) understanding of the code in HttpBase.java:
> Suppose that fetcher.threads.per.host is 2, and that (by chance) the fetcher ends up keeping either 1 or 2 fetcher threads pointed at a particular host continuously. In other words, it never tries to point a third thread at the host, and it always points a second thread at the host before the first thread finishes accessing it. Since HttpBase.unblockAddr never gets called with (((Integer)THREADS_PER_HOST_COUNT.get(host)).intValue() == 1), it never puts System.currentTimeMillis() + crawlDelay into BLOCKED_ADDR_TO_TIME for the host. Thus, the server delay is never used at all: the fetcher continuously retrieves pages from the host, often with 2 threads accessing it simultaneously.
> Suppose instead that the fetcher finally does allow the last thread to complete before it points another thread at the target host. When that last fetcher thread calls HttpBase.unblockAddr, it will now put System.currentTimeMillis() + crawlDelay into BLOCKED_ADDR_TO_TIME for the host. This, in turn, prevents any thread from accessing the host until the delay has elapsed, even though zero threads are currently accessing it (see the sketch after this description).
> I see this behavior as inconsistent. More importantly, the current implementation certainly doesn't seem to answer my original question about appropriate definitions for what appear to be conflicting parameters. 
> In a nutshell, how could we possibly honor the server delay if we allow more than one fetcher thread to simultaneously access the host?
> It would be one thing if, whenever fetcher.threads.per.host > 1, this trumped the server delay, causing the latter to be ignored completely. That is certainly not the case in the current implementation, as it will wait for the server delay whenever the number of threads accessing a given host drops to zero.
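To make the two scenarios in the description concrete, here is a condensed sketch of the per-host bookkeeping it attributes to HttpBase. The names THREADS_PER_HOST_COUNT, BLOCKED_ADDR_TO_TIME and unblockAddr come from the description above; the blockAddr counterpart, the synchronization and the hard-coded constants are assumptions for illustration only, not the real HttpBase code:

{code:java}
import java.util.HashMap;
import java.util.Map;

// Condensed sketch of the per-host blocking logic described in the issue.
// Field and method names follow the description; everything else is an
// assumption for illustration, not the actual HttpBase implementation.
public class HttpBaseSketch {

  private final Map<String, Integer> THREADS_PER_HOST_COUNT = new HashMap<>();
  private final Map<String, Long> BLOCKED_ADDR_TO_TIME = new HashMap<>();
  private final int maxThreadsPerHost = 2; // fetcher.threads.per.host
  private final long crawlDelay = 5000L;   // server delay, in milliseconds

  // Called before a fetch: returns false if the caller must wait.
  public synchronized boolean blockAddr(String host) {
    Long unblockTime = BLOCKED_ADDR_TO_TIME.get(host);
    if (unblockTime != null && unblockTime > System.currentTimeMillis()) {
      return false; // still inside the crawl-delay window
    }
    int count = THREADS_PER_HOST_COUNT.getOrDefault(host, 0);
    if (count >= maxThreadsPerHost) {
      return false; // host already saturated with fetcher threads
    }
    THREADS_PER_HOST_COUNT.put(host, count + 1);
    return true;
  }

  // Called after a fetch: the crawl delay is armed only when the LAST
  // thread leaves the host, which is the inconsistency the issue describes.
  public synchronized void unblockAddr(String host) {
    int count = THREADS_PER_HOST_COUNT.get(host); // present for any paired call
    if (count == 1) {
      THREADS_PER_HOST_COUNT.remove(host);
      // If a second thread was always pointed at the host before the first
      // finished, this branch is never taken and the delay never applies.
      BLOCKED_ADDR_TO_TIME.put(host, System.currentTimeMillis() + crawlDelay);
    } else {
      THREADS_PER_HOST_COUNT.put(host, count - 1);
    }
  }
}
{code}

With maxThreadsPerHost = 2, as long as a second thread is always pointed at the host before the first finishes, every unblockAddr call sees count == 2, the count == 1 branch is never taken, and the crawl delay is never applied; conversely, once the host drains to zero threads, the delay is armed even though the host is idle. That is exactly the inconsistency reported above.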



--
This message was sent by Atlassian JIRA
(v6.2#6252)