Posted to dev@nutch.apache.org by "Chris Schneider (JIRA)" <ji...@apache.org> on 2006/10/11 20:40:34 UTC

[jira] Created: (NUTCH-385) Server delay feature conflicts with maxThreadsPerHost

Server delay feature conflicts with maxThreadsPerHost
-----------------------------------------------------

                 Key: NUTCH-385
                 URL: http://issues.apache.org/jira/browse/NUTCH-385
             Project: Nutch
          Issue Type: Bug
          Components: fetcher
            Reporter: Chris Schneider


For some time I've been puzzled by the interaction between two parameters that control how often the fetcher can access a particular host:

1) The server delay, which comes back from the remote server during our processing of the robots.txt file, and which can be limited by fetcher.max.crawl.delay.

2) The fetcher.threads.per.host value, particularly when this is greater than the default of 1.

According to my (limited) understanding of the code in HttpBase.java:

Suppose that fetcher.threads.per.host is 2, and that (by chance) the fetcher ends up keeping either 1 or 2 fetcher threads pointing at a particular host continuously. In other words, it never tries to point a third thread at the host, and it always points a second thread at the host before the first thread finishes accessing it. Since HttpBase.unblockAddr never gets called with (((Integer)THREADS_PER_HOST_COUNT.get(host)).intValue() == 1), it never puts System.currentTimeMillis() + crawlDelay into BLOCKED_ADDR_TO_TIME for the host. Thus, the server delay will never be used at all. The fetcher will be continuously retrieving pages from the host, often with 2 fetchers accessing the host simultaneously.

Suppose instead that the fetcher finally does allow the last thread to complete before it gets around to pointing another thread at the target host. When the last fetcher thread calls HttpBase.unblockAddr, it will now put System.currentTimeMillis() + crawlDelay into BLOCKED_ADDR_TO_TIME for the host. This, in turn, will prevent any threads from accessing this host until the delay is complete, even though zero threads are currently accessing the host.
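The behavior in these two scenarios boils down to something like the following sketch. This is hypothetical, simplified code, not the actual HttpBase source; the explicit "now" parameters and method signatures are mine, added so the timing can be demonstrated without sleeping:

```java
import java.util.HashMap;
import java.util.Map;

// Simplified, hypothetical sketch of the blocking logic described above --
// not the real HttpBase code. Explicit time parameters replace
// System.currentTimeMillis() so the behavior is reproducible.
public class HostBlocking {
    private final Map<String, Integer> threadsPerHostCount = new HashMap<>();
    private final Map<String, Long> blockedAddrToTime = new HashMap<>();
    private final int maxThreadsPerHost;
    private final long crawlDelay; // milliseconds

    public HostBlocking(int maxThreadsPerHost, long crawlDelay) {
        this.maxThreadsPerHost = maxThreadsPerHost;
        this.crawlDelay = crawlDelay;
    }

    // Try to claim a fetch slot for host at time 'now'; false means blocked.
    public synchronized boolean blockAddr(String host, long now) {
        Long until = blockedAddrToTime.get(host);
        if (until != null) {
            if (now < until) return false;          // still inside crawl delay
            blockedAddrToTime.remove(host);
        }
        int count = threadsPerHostCount.getOrDefault(host, 0);
        if (count >= maxThreadsPerHost) return false;
        threadsPerHostCount.put(host, count + 1);
        return true;
    }

    // Release a slot. The crawl delay is recorded ONLY when the last active
    // thread finishes; while any other thread is still fetching from the
    // host, no delay is ever scheduled.
    public synchronized void unblockAddr(String host, long now) {
        int count = threadsPerHostCount.get(host);
        if (count == 1) {
            threadsPerHostCount.remove(host);
            blockedAddrToTime.put(host, now + crawlDelay);
        } else {
            threadsPerHostCount.put(host, count - 1);
        }
    }
}
```

With two threads continuously overlapping on a host, unblockAddr never sees a count of 1, so BLOCKED_ADDR_TO_TIME is never populated; the moment the overlap breaks, the full delay is imposed even though the host is idle.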

I see this behavior as inconsistent. More importantly, the current implementation certainly doesn't seem to answer my original question about appropriate definitions for what appear to be conflicting parameters. 

In a nutshell, how could we possibly honor the server delay if we allow more than one fetcher thread to simultaneously access the host?

It would be one thing if whenever (fetcher.threads.per.host > 1), this trumped the server delay, causing the latter to be ignored completely. That is certainly not the case in the current implementation, as it will wait for server delay whenever the number of threads accessing a given host drops to zero.


-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] Commented: (NUTCH-385) Server delay feature conflicts with maxThreadsPerHost

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-385?page=comments#action_12441552 ] 
            
Doug Cutting commented on NUTCH-385:
------------------------------------

> It would be one thing if whenever (fetcher.threads.per.host > 1), this trumped the server delay [...]

Are any other alternatives useful?  I've always assumed that fetcher.threads.per.host was only useful when server.delay is zero, to speed things up when crawling servers that you control.  The concerns are:

1. Being polite.  The standard interpretation of "crawl-delay" is the pause between successive, single-threaded accesses.  So I don't think we could ever call ourselves polite when threads.per.host > 1.

2. Crawling quickly.  For sites that are under the control of the crawl operator, or are known not to care, one can ignore politeness.  In these cases multiple threads are warranted, and delays are not.

I don't see an interesting middle ground.




[jira] Commented: (NUTCH-385) Server delay feature conflicts with maxThreadsPerHost

Posted by "Chris Schneider (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-385?page=comments#action_12441528 ] 
            
Chris Schneider commented on NUTCH-385:
---------------------------------------

This comment was actually made by Andrzej in response to an email containing the analysis above that I sent him before creating this JIRA issue:

Let's start with defining the desired semantics of these two parameters together. In my opinion it's the following:

* if only 1 thread per host is allowed, at any given moment at most one thread should be accessing the host, and the interval between consecutive requests should be at least crawlDelay (whichever way we determine this value - from config, from robots.txt or external sources such as partner agreements).

* if two or more (for example N) threads per host are allowed, at any given moment at most N threads should be accessing the host, and the interval between consecutive requests should be at least crawlDelay - that is, the interval between when one of the threads finishes and when another starts requesting.

I.e.: for threads.per.host=2 and crawlDelay=3 seconds, if we start 3 threads trying to access the same host we should get something like this (time in [s] on the x axis, # - start request, + - request in progress, b - blocked in per-host limit, c - obeying crawlDelay):

===0         1         2
===01234567890123456789012345678
1: #+++cccbbccc#++++cccbb#++++++
2: #++++++++cccbcccbcc#+++cccccb
3: bbbbccc#+++++ccc#+++++ccc#+++

As you can see, at any given time we have at most 2 threads accessing the site, and the interval between consecutive requests is at least 3 seconds. Especially interesting in the above graph is the period between 17-18 seconds - thread 2 had to be delayed an additional 2 seconds to satisfy the crawl delay requirement, even though the threads.per.host requirement was satisfied.
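This model can be checked with a small, hypothetical discrete-event sketch (not Nutch code): a request starts only when a concurrency slot is free AND at least crawlDelay has elapsed since the most recent completion on the host.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical discrete-event sketch of the model drafted above (not Nutch
// code): at most maxThreads requests run at once, and a new request may
// start only once at least crawlDelay has passed since every earlier
// completion. Returns the start time of each request in queue order.
public class DelayModel {
    public static List<Integer> schedule(int[] durations, int maxThreads, int crawlDelay) {
        List<Integer> starts = new ArrayList<>();
        List<Integer> ends = new ArrayList<>();  // end times of started requests
        for (int d : durations) {
            int s = 0;
            boolean changed = true;
            while (changed) {   // iterate until both constraints hold at s
                changed = false;
                // Politeness: wait crawlDelay after any completion at or before s.
                for (int e : ends) {
                    if (e <= s && e + crawlDelay > s) { s = e + crawlDelay; changed = true; }
                }
                // Concurrency: if maxThreads requests are running at s, wait for one to end.
                int active = 0, earliestEnd = Integer.MAX_VALUE;
                for (int e : ends) {
                    if (e > s) { active++; earliestEnd = Math.min(earliestEnd, e); }
                }
                if (active >= maxThreads) { s = earliestEnd; changed = true; }
            }
            starts.add(s);
            ends.add(s + d);
        }
        return starts;
    }
}
```

For request durations 4, 9, 6, 5, 6 with threads.per.host=2 and crawlDelay=3, this yields starts at 0, 0, 7, 12, 16 - reproducing the first part of the graph above, including thread 1's second request waiting until second 12.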

[snip]

It's a question of priorities - in the model I drafted above the topmost priority is the observance of crawlDelay, sometimes at the cost of the number of concurrent threads (see seconds 17-18). In this model, the code should always put the delay in BLOCKED_ADDR_TO_TIME, in order to wait at least crawlDelay after _any_ thread finishes. We could use an alternative model, where crawlDelay is measured from the start of the request, and not from the end - see the graph below:

===0         1         2         3
===01234567890123456789012345678901234567
1: #+++cccccbbb#++++cccc#++++++cc#+++++++
2: ccc#++++++++cccccc#+++ccccc#++++c#++++
3: cccccc#+++++ccc#+++++ccc#+++++ccccccbb

but it seems to me that it's more complicated, gives fewer requests/sec, and stretches the interpretation of crawlDelay's meaning ...

[snip]



[jira] Commented: (NUTCH-385) Server delay feature conflicts with maxThreadsPerHost

Posted by "Mike Baranczak (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12794757#action_12794757 ] 

Mike Baranczak commented on NUTCH-385:
--------------------------------------

This is something that recently came up on a project that I'm working on (we're using 1.0). I'd actually be OK with leaving the functionality as it is - as long as it was explained properly in the config file. That is, make it clear that fetcher.server.delay is applied to each fetcher thread individually.




[jira] Commented: (NUTCH-385) Server delay feature conflicts with maxThreadsPerHost

Posted by "Chris Schneider (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-385?page=comments#action_12441529 ] 
            
Chris Schneider commented on NUTCH-385:
---------------------------------------

This comment was actually made by Ken Krugler, who was responding to Andrzej's comment above:

[with respect to Andrzej's definitions at the beginning of his comment - Ed.:]
I agree that this is one of two possible interpretations. The other is that there are N "virtual users", and the crawlDelay applies to each of these virtual users in isolation.

Using the same type of request data from above, I see a queue of requests with the following durations (in seconds):

4, 9, 6, 5, 6, 4, 7, 4

So with the virtual user model (where N = 2, thus "A" and "B" users), I get:

===0         1         2
===01234567890123456789012345678
A: 4+++ccc6+++++ccc6+++++ccc7++++++
B: 9++++++++ccc5++++ccc4+++ccc4+++

The numbers mark the start of each new request, and the total duration for the request.

This would seem to be less efficient than your approach, but somehow feels more in the nature of what threads.per.host really means.

Let's see, for N = 3 this would look like:

===0         1         2
===01234567890123456789012345678
A: 4+++ccc5++++ccc7++++++ccc
B: 9++++++++ccc4+++ccc
C: 6+++++ccc6+++++ccc4+++ccc

[snip]

To implement the virtual users model, each unique domain being actively fetched from would need to have N bits of state tracking the time of completion of the last request.
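A hypothetical sketch of that per-domain state (again, not Nutch code): one "next available" timestamp per virtual user, recording the previous completion time plus crawlDelay, with each request handed to whichever virtual user frees up first. It assumes a saturated request queue, so a request starts the moment its virtual user becomes available:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the virtual-users model (not Nutch code): each host
// keeps n timestamps, one per virtual user, recording when that user may
// issue its next request (previous completion + crawlDelay). Each request
// goes to whichever virtual user becomes available soonest.
public class VirtualUsers {
    public static List<Integer> schedule(int[] durations, int n, int crawlDelay) {
        int[] nextAvailable = new int[n];   // per-virtual-user earliest start time
        List<Integer> starts = new ArrayList<>();
        for (int d : durations) {
            int user = 0;                   // pick the soonest-available user
            for (int i = 1; i < n; i++) {
                if (nextAvailable[i] < nextAvailable[user]) user = i;
            }
            starts.add(nextAvailable[user]);
            nextAvailable[user] += d + crawlDelay;  // finish, then observe the delay
        }
        return starts;
    }
}
```

For the durations above (4, 9, 6, 5, 6, 4, 7, 4) with N=2 and crawlDelay=3, this reproduces the A/B graph exactly: starts at 0, 0, 7, 12, 16, 20, 25, 27.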

Anyway, just an alternative interpretation...


