Posted to dev@nutch.apache.org by "Doğacan Güney (JIRA)" <ji...@apache.org> on 2007/04/25 14:10:15 UTC

[jira] Updated: (NUTCH-475) Adaptive crawl delay

     [ https://issues.apache.org/jira/browse/NUTCH-475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney updated NUTCH-475:
--------------------------------

    Attachment: adaptive-delay_draft.patch

Patch with a simple adaptive algorithm. It measures the server's last response time (say t), then waits k * t (where k, by default, is 10) before making a new request. The wait interval also has lower and upper bounds, so the fetcher waits at least a predetermined minimum even if k * t is smaller than that, and no longer than a predetermined maximum even if k * t is larger.
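
To make the rule concrete, here is a minimal sketch of the delay computation as I read it from the description above. It is not taken from the patch: the class name, the method name, and the 1 s / 30 s bounds are assumptions made up for illustration. Only the factor k = 10 comes from the text.

```java
public class AdaptiveDelaySketch {

    // k from the description above; everything else here is hypothetical.
    public static final long DEFAULT_FACTOR = 10;

    /**
     * Computes how long to wait before the next request to the same host.
     *
     * @param lastResponseMs last measured response time of the server (t), in ms
     * @param factor         multiplier k applied to the response time
     * @param minDelayMs     lower bound for the wait interval, in ms
     * @param maxDelayMs     upper bound for the wait interval, in ms
     * @return k * t, clamped into [minDelayMs, maxDelayMs]
     */
    public static long nextDelayMs(long lastResponseMs, long factor,
                                   long minDelayMs, long maxDelayMs) {
        long proposed = factor * lastResponseMs;
        // Raise to the lower bound if too small, cap at the upper bound if too large.
        return Math.max(minDelayMs, Math.min(maxDelayMs, proposed));
    }

    public static void main(String[] args) {
        long min = 1000;   // illustrative lower bound (1 s)
        long max = 30000;  // illustrative upper bound (30 s)

        // Fast server: 10 * 50 ms = 500 ms, raised to the lower bound.
        System.out.println(nextDelayMs(50, DEFAULT_FACTOR, min, max));
        // Moderately loaded server: 10 * 400 ms = 4000 ms, used as is.
        System.out.println(nextDelayMs(400, DEFAULT_FACTOR, min, max));
        // Heavily loaded server: 10 * 5000 ms = 50000 ms, capped at the upper bound.
        System.out.println(nextDelayMs(5000, DEFAULT_FACTOR, min, max));
    }
}
```

With these made-up bounds, a host answering in 50 ms is still polled at most once per second, while a host taking 5 s to answer is left alone for 30 s between requests.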

Note that this is only a draft and has some rough edges. It updates the Fetcher2 code (but not the Fetcher code), so that one can benchmark it against the regular Fetcher to see the difference.

> Adaptive crawl delay
> --------------------
>
>                 Key: NUTCH-475
>                 URL: https://issues.apache.org/jira/browse/NUTCH-475
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>            Reporter: Doğacan Güney
>             Fix For: 1.0.0
>
>         Attachments: adaptive-delay_draft.patch
>
>
> The current fetcher implementation waits a default interval before making another request to the same server (if crawl-delay is not specified in robots.txt). IMHO, an adaptive implementation would be better. If the server is under little load and can serve requests quickly, then the fetcher can ask for more pages in a given interval. Similarly, if the server is suffering from heavy load, the fetcher can slow down (w.r.t. that host), easing the load on the server.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.