Posted to dev@nutch.apache.org by "Julien Nioche (JIRA)" <ji...@apache.org> on 2014/04/05 10:31:16 UTC

[jira] [Commented] (NUTCH-1687) Pick queue in Round Robin

    [ https://issues.apache.org/jira/browse/NUTCH-1687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13961009#comment-13961009 ] 

Julien Nioche commented on NUTCH-1687:
--------------------------------------

I like the idea but am a bit concerned by the potential impact of calling:

it = Iterables.cycle(queues.keySet()).iterator();

whenever a new FetchItemQueue is added. It would be called a lot at the beginning of a fetch, when we create most of the queues, and we'd create loads of iterators that would be replaced straight away.

What about doing this lazily, and triggering the generation of a new iterator only if getFetchItem() is called and at least one FetchItemQueue has been added since the iterator was last created?

I agree that in the middle of a fetch, queues get added far less often than getFetchItem() is called, so not having to create an iterator on every call, as we currently do, would definitely be a plus.

In extreme cases, when there is a large diversity of hostnames / domains within a fetchlist, we could end up creating a new iterator for every new URL and would always start at the first queue. That is what we currently do, so the new approach would be no worse.
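
To make the lazy approach concrete, here is a minimal sketch in plain Java. It is not the Nutch code and not the attached patch: the class LazyRoundRobinQueues, its fields (dirty, rebuilds) and its methods are hypothetical stand-ins for FetchItemQueues and getFetchItem(), and the cycling is done by hand rather than with Guava. It also ignores removal of empty queues.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for FetchItemQueues; all names here are illustrative.
class LazyRoundRobinQueues {
    private final Map<String, Deque<String>> queues = new LinkedHashMap<>();
    private List<String> snapshot = new ArrayList<>(); // queue ids seen at last rebuild
    private Iterator<String> it = snapshot.iterator(); // position in the current cycle
    private boolean dirty = false;                     // true once a new queue was added
    int rebuilds = 0;                                  // counts iterator creations (for illustration)

    // Adding a URL never creates an iterator; a new queue only sets the flag.
    synchronized void add(String queueId, String url) {
        Deque<String> q = queues.get(queueId);
        if (q == null) {
            q = new ArrayDeque<>();
            queues.put(queueId, q);
            dirty = true;
        }
        q.add(url);
    }

    // Rough analogue of getFetchItem(): rebuild lazily, then continue round robin.
    synchronized String next() {
        if (dirty) {
            snapshot = new ArrayList<>(queues.keySet());
            it = snapshot.iterator(); // restarts from the first queue, as today
            dirty = false;
            rebuilds++;
        }
        // Visit each queue at most once per call, wrapping around at the end.
        for (int i = 0; i < snapshot.size(); i++) {
            if (!it.hasNext()) {
                it = snapshot.iterator();
            }
            Deque<String> q = queues.get(it.next());
            if (!q.isEmpty()) {
                return q.poll();
            }
        }
        return null; // nothing available in any queue
    }
}
```

With this shape, creating many queues at the start of a fetch costs one flag write each, and the iterator is rebuilt at most once per next() call. In the worst case (a new queue before every call) it rebuilds every time and starts from the first queue, which matches the current behaviour.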

What do you think?

Also, why not use Iterators.cycle() directly?

Thanks

> Pick queue in Round Robin
> -------------------------
>
>                 Key: NUTCH-1687
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1687
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>            Reporter: Tien Nguyen Manh
>            Priority: Minor
>             Fix For: 1.9
>
>         Attachments: NUTCH-1687.patch, NUTCH-1687.tejasp.v1.patch
>
>
> Currently we choose the queue to pick a URL from by starting at the head of the queues list, so queues at the start of the list have more chance of being picked first. That can cause a long-tail problem, where only a few queues holding many URLs remain at the end.
> public synchronized FetchItem getFetchItem() {
>       final Iterator<Map.Entry<String, FetchItemQueue>> it =
>         queues.entrySet().iterator(); // always reset, so the search restarts from the first queue
>       while (it.hasNext()) {
> ....
> I think it is better to pick queues in round robin. That can reduce the time needed to find an available queue and gives every queue its turn, and if we use TopN in the generator there are no long-tail queues left at the end.


