Posted to dev@nutch.apache.org by "Doğacan Güney (JIRA)" <ji...@apache.org> on 2007/04/25 14:08:15 UTC

[jira] Created: (NUTCH-475) Adaptive crawl delay

Adaptive crawl delay
--------------------

                 Key: NUTCH-475
                 URL: https://issues.apache.org/jira/browse/NUTCH-475
             Project: Nutch
          Issue Type: Improvement
          Components: fetcher
            Reporter: Doğacan Güney
             Fix For: 1.0.0


The current fetcher implementation waits a default interval before making another request to the same server (if crawl-delay is not specified in robots.txt). IMHO, an adaptive implementation would be better. If the server is under little load and can serve requests quickly, then the fetcher can ask for more pages in a given interval. Similarly, if the server is suffering from heavy load, the fetcher can slow down (w.r.t. that host), easing the load on the server.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (NUTCH-475) Adaptive crawl delay

Posted by "Enis Soztutar (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12491882 ] 

Enis Soztutar commented on NUTCH-475:
-------------------------------------

We can use a formula like:

delay = alpha * delay + (1 - alpha) * (k * t)

where 0 < alpha <= 1

so that the waiting time is less sensitive to varying reply times of the server. 
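
A minimal Java sketch of this smoothing formula follows. It is not taken from the attached patch: the class name, the starting delay, and the sample response times are hypothetical, and t is assumed to be the last measured response time with k as its multiplier, as in the draft patch description. An alpha close to 1 makes the delay adapt slowly; with alpha = 1 the delay would never change.

```java
// Hypothetical illustration of: delay = alpha * delay + (1 - alpha) * (k * t)
class SmoothedDelay {
    private final double alpha; // weight on the previous delay, 0 < alpha <= 1
    private final double k;     // multiplier applied to the response time t
    private double delayMs;     // current smoothed delay, in milliseconds

    SmoothedDelay(double alpha, double k, double initialDelayMs) {
        if (alpha <= 0.0 || alpha > 1.0) {
            throw new IllegalArgumentException("alpha must satisfy 0 < alpha <= 1");
        }
        this.alpha = alpha;
        this.k = k;
        this.delayMs = initialDelayMs;
    }

    // Blend the previous delay with the new target k * t and return the result.
    double update(double responseTimeMs) {
        delayMs = alpha * delayMs + (1.0 - alpha) * (k * responseTimeMs);
        return delayMs;
    }

    public static void main(String[] args) {
        // Assumed values: alpha = 0.8, k = 10, initial delay of 5 seconds.
        SmoothedDelay delay = new SmoothedDelay(0.8, 10.0, 5000.0);
        double[] responseTimesMs = {200.0, 200.0, 2000.0, 200.0};
        for (double t : responseTimesMs) {
            System.out.printf("t=%.0f ms -> delay=%.1f ms%n", t, delay.update(t));
        }
    }
}
```

A single slow response (the 2000 ms sample) raises the delay only partially, which is the reduced sensitivity the formula is meant to provide.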



[jira] Updated: (NUTCH-475) Adaptive crawl delay

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/NUTCH-475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated NUTCH-475:
------------------------------------

    Fix Version/s:     (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9


[jira] Updated: (NUTCH-475) Adaptive crawl delay

Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/NUTCH-475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney updated NUTCH-475:
--------------------------------

    Attachment: adaptive-delay_draft.patch

Patch with a simple adaptive algorithm. It measures the last response time of the server (say t), then waits at least k * t (where k, by default, is 10) before making a new request. There are also lower and upper bounds for the wait interval (so the fetcher will wait at least a predetermined value even if k * t is smaller than that).

Note that this is only a draft, and has some rough edges. It updates the Fetcher2 code (but not the Fetcher code), so that one can benchmark it against the regular Fetcher to see the difference.
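
The rule described above (wait at least k * t, clamped between a lower and an upper bound) can be sketched in Java roughly as follows. This is an illustration only, not the code in adaptive-delay_draft.patch: the class and method names and the bound values of 1 and 30 seconds are assumptions; only the default k = 10 comes from the description.

```java
// Hypothetical illustration of a bounded adaptive delay; not the patch itself.
class BoundedAdaptiveDelay {

    // Returns k * t clamped to the range [minDelayMs, maxDelayMs].
    static long nextDelayMs(long lastResponseMs, double k, long minDelayMs, long maxDelayMs) {
        long proposed = Math.round(k * lastResponseMs);
        return Math.max(minDelayMs, Math.min(maxDelayMs, proposed));
    }

    public static void main(String[] args) {
        double k = 10.0;          // default multiplier mentioned in the description
        long minDelayMs = 1000;   // assumed lower bound
        long maxDelayMs = 30000;  // assumed upper bound
        long[] responseTimesMs = {50, 500, 10000};
        for (long t : responseTimesMs) {
            System.out.printf("t=%d ms -> wait %d ms%n",
                    t, nextDelayMs(t, k, minDelayMs, maxDelayMs));
        }
    }
}
```

With these assumed bounds, a very fast response still waits the lower bound, and a very slow one is capped at the upper bound.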


[jira] Commented: (NUTCH-475) Adaptive crawl delay

Posted by "Todd Lipcon (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12660383#action_12660383 ] 

Todd Lipcon commented on NUTCH-475:
-----------------------------------

Implemented this in NUTCH-669 as well, using a formula similar to the one Enis suggests. I used different values of alpha for increasing vs. decreasing request times so that there is less momentum when the server is slowing down.
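
One way to read the two-alpha idea is sketched below in Java. This is not the NUTCH-669 code: the class name, the alpha values, and the choice to compare the target k * t against the current delay are all assumptions made for illustration. A smaller alpha is used when the target delay rises (server slowing down), so the fetcher backs off quickly; a larger alpha is used when it falls, so the fetcher speeds up cautiously.

```java
// Hypothetical illustration of smoothing with direction-dependent alpha.
class AsymmetricSmoothedDelay {
    private final double alphaSlowing;    // weight on history when k * t exceeds the current delay
    private final double alphaRecovering; // weight on history when k * t is at or below it
    private final double k;               // multiplier applied to the response time t
    private double delayMs;               // current delay, in milliseconds

    AsymmetricSmoothedDelay(double alphaSlowing, double alphaRecovering,
                            double k, double initialDelayMs) {
        this.alphaSlowing = alphaSlowing;
        this.alphaRecovering = alphaRecovering;
        this.k = k;
        this.delayMs = initialDelayMs;
    }

    // Pick alpha by direction, then apply delay = alpha * delay + (1 - alpha) * (k * t).
    double update(double responseTimeMs) {
        double target = k * responseTimeMs;
        double alpha = target > delayMs ? alphaSlowing : alphaRecovering;
        delayMs = alpha * delayMs + (1.0 - alpha) * target;
        return delayMs;
    }

    public static void main(String[] args) {
        // Assumed values: react fast to slowdowns (0.25), recover slowly (0.75).
        AsymmetricSmoothedDelay delay = new AsymmetricSmoothedDelay(0.25, 0.75, 10.0, 2000.0);
        double[] responseTimesMs = {400.0, 150.0, 150.0};
        for (double t : responseTimesMs) {
            System.out.printf("t=%.0f ms -> delay=%.1f ms%n", t, delay.update(t));
        }
    }
}
```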
