Posted to dev@nutch.apache.org by "Julien Nioche (JIRA)" <ji...@apache.org> on 2014/04/17 16:35:15 UTC

[jira] [Commented] (NUTCH-207) Bandwidth target for fetcher rather than a thread count

    [ https://issues.apache.org/jira/browse/NUTCH-207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13972984#comment-13972984 ] 

Julien Nioche commented on NUTCH-207:
-------------------------------------

I am starting to think that the cleanest way to implement this would be to make some radical changes to the way the Fetcher works and use the Executor framework. The ThreadPoolExecutor is quite a nice fit, as it defines a maximum number of threads to use, but it would require changing the logic in the Fetcher so that the queues push tasks to the Executor instead of having the FetcherThreads poll them for work. I will probably open a new issue for this.
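
To make the push model concrete, here is a minimal sketch of what that could look like. It is not Nutch code: PushFetcherSketch, FetchTask and fetchAll are hypothetical names, the task only counts URLs instead of downloading them, and only the java.util.concurrent calls are real API.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class PushFetcherSketch {

    // Hypothetical stand-in for fetching one URL; not an existing Nutch class.
    static final class FetchTask implements Runnable {
        private final String url;
        private final AtomicInteger fetched;

        FetchTask(String url, AtomicInteger fetched) {
            this.url = url;
            this.fetched = fetched;
        }

        @Override
        public void run() {
            // A real task would download and parse the page at 'url' here.
            fetched.incrementAndGet();
        }
    }

    // Pushes one task per URL to the executor and returns how many tasks ran.
    static int fetchAll(List<String> urls, int threads) {
        AtomicInteger fetched = new AtomicInteger();
        // Core and maximum pool size are both 'threads'; extra tasks wait in the queue.
        ThreadPoolExecutor executor = new ThreadPoolExecutor(
                threads, threads, 30L, TimeUnit.SECONDS, new LinkedBlockingQueue<>());
        for (String url : urls) {
            // The producer pushes work; no worker thread polls a fetch queue.
            executor.execute(new FetchTask(url, fetched));
        }
        executor.shutdown();
        try {
            executor.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return fetched.get();
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList(
                "http://a.example/", "http://b.example/", "http://c.example/");
        System.out.println("fetched: " + fetchAll(urls, 2));
    }
}
```

In a real Fetcher the per-host queues would still have to decide when a URL is allowed to be submitted (politeness delays and so on); the sketch leaves that out entirely.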

> Bandwidth target for fetcher rather than a thread count
> -------------------------------------------------------
>
>                 Key: NUTCH-207
>                 URL: https://issues.apache.org/jira/browse/NUTCH-207
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher
>    Affects Versions: 0.8
>            Reporter: Rod Taylor
>            Assignee: Julien Nioche
>             Fix For: 1.9
>
>         Attachments: ratelimit.patch
>
>
> Increases or decreases the number of threads from the starting value (fetcher.threads.fetch) up to a maximum (fetcher.threads.maximum) to achieve a target bandwidth (fetcher.threads.bandwidth).
> It seems to be able to keep within 10% of the target bandwidth even when large numbers of errors occur or when a number of large pages are encountered.
> To achieve more accurate tracking Nutch should keep track of protocol overhead as well as the volume of pages downloaded.
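
For illustration, the adjustment described in the issue could be sketched as below. This is not the logic of ratelimit.patch, which is not reproduced here: the one-thread step, the 10% dead band around the target and the minimum of one thread are all assumptions, and BandwidthThreadTuner and nextThreadCount are hypothetical names. The bandwidth unit is left open; measured and target only need to use the same one.

```java
class BandwidthThreadTuner {

    // Assumed rule: move one thread at a time, and do nothing while the
    // measured bandwidth is within 10% of the target.
    static int nextThreadCount(int current, int min, int max,
                               double measured, double target) {
        double tolerance = 0.10 * target;
        int next = current;
        if (measured < target - tolerance) {
            next = current + 1;   // below target: add a thread
        } else if (measured > target + tolerance) {
            next = current - 1;   // above target: drop a thread
        }
        // Never leave the [min, max] range (max would be fetcher.threads.maximum).
        return Math.max(min, Math.min(max, next));
    }

    public static void main(String[] args) {
        int threads = 10;          // starting value, as in fetcher.threads.fetch
        double target = 1000.0;    // target, as in fetcher.threads.bandwidth
        threads = nextThreadCount(threads, 1, 50, 400.0, target);
        System.out.println("threads: " + threads);
    }
}
```

A caller would invoke nextThreadCount periodically with a fresh bandwidth measurement and then resize its pool of fetcher threads to the returned value.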



--
This message was sent by Atlassian JIRA
(v6.2#6252)