Posted to dev@nutch.apache.org by "behnam nikbakht (Created) (JIRA)" <ji...@apache.org> on 2012/03/04 07:34:02 UTC

[jira] [Created] (NUTCH-1297) it is better for fetchItemQueues to select items from greater queues first

it is better for fetchItemQueues to select items from greater queues first
--------------------------------------------------------------------------

                 Key: NUTCH-1297
                 URL: https://issues.apache.org/jira/browse/NUTCH-1297
             Project: Nutch
          Issue Type: Improvement
          Components: fetcher
    Affects Versions: 1.4
            Reporter: behnam nikbakht


When a fetch covers multiple hosts whose queues differ greatly in size, large hosts can wait a long time before getFetchItem() in the FetchItemQueues class selects a URL from them, so it would help to give larger queues higher priority.
For example, take 10 URLs from host1, 1000 URLs from host2, and 5 threads. If all threads select from host1 first, the overall fetch takes longer than if the threads select from host2 first and fall back to host1 only while host2 is busy.
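
A minimal Java sketch of the selection order proposed above, under a simplified model. The class LargestQueueFirst and its methods are hypothetical names used for illustration; they are not Nutch's FetchItemQueues API, and the attached patch may work differently. The sketch only shows picking the next URL from the largest queue whose host is not currently busy.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for per-host fetch queues; not Nutch's FetchItemQueues.
class LargestQueueFirst {
    private final Map<String, Deque<String>> queues = new LinkedHashMap<>();

    // Appends a URL to the queue of the given host, creating the queue if needed.
    void add(String host, String url) {
        queues.computeIfAbsent(host, h -> new ArrayDeque<>()).add(url);
    }

    // Number of URLs still waiting for the given host.
    int size(String host) {
        Deque<String> q = queues.get(host);
        return q == null ? 0 : q.size();
    }

    // Returns the next URL from the largest non-empty queue whose host is
    // not busy, or null when no queue is currently eligible.
    String next(Set<String> busyHosts) {
        Deque<String> best = null;
        for (Map.Entry<String, Deque<String>> e : queues.entrySet()) {
            Deque<String> q = e.getValue();
            if (q.isEmpty() || busyHosts.contains(e.getKey())) {
                continue;
            }
            if (best == null || q.size() > best.size()) {
                best = q;
            }
        }
        return best == null ? null : best.poll();
    }
}
```

With 10 URLs queued for host1 and 1000 for host2, next() with no busy hosts returns a host2 URL; while host2 is marked busy, it falls back to host1. The linear scan over all queues is a simplification chosen for readability.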

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (NUTCH-1297) it is better for fetchItemQueues to select items from greater queues first

Posted by "behnam nikbakht (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

behnam nikbakht updated NUTCH-1297:
-----------------------------------

    Attachment: NUTCH-1297.patch
    


        

[jira] [Commented] (NUTCH-1297) it is better for fetchItemQueues to select items from greater queues first

Posted by "Julien Nioche (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13256531#comment-13256531 ] 

Julien Nioche commented on NUTCH-1297:
--------------------------------------

Hi Ferdy,

Indeed, it is related but does not address the issue of prioritizing the queues. I had probably read the description too quickly; thanks for correcting me.

@Behnam: in the future, please respond to comments on your issues to make sure that your suggestions are understood correctly :-)
                


        

[jira] [Updated] (NUTCH-1297) it is better for fetchItemQueues to select items from greater queues first

Posted by "behnam nikbakht (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

behnam nikbakht updated NUTCH-1297:
-----------------------------------

    Attachment: NUTCH-1297.patch
    


        

[jira] [Commented] (NUTCH-1297) it is better for fetchItemQueues to select items from greater queues first

Posted by "Ferdy Galema (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13256485#comment-13256485 ] 

Ferdy Galema commented on NUTCH-1297:
-------------------------------------

@Julien
Are you sure that property addresses the issue described by Behnam? It seems this is about giving priority to queues that have more items in them. For example, when all queues are eligible for fetching but there are fewer fetcher threads than queues, the best strategy is to pick items from the biggest queues first. It is a way to reduce a possible long tail.
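
The long-tail effect can be illustrated with a toy simulation. This is a sketch under stated assumptions, not Nutch code: every fetch takes one tick, a host serves at most one URL per tick (a stand-in for politeness), and the queue sizes and thread count are made-up numbers. The class LongTailDemo and the method ticksToDrain are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model, not Nutch code: each tick, every thread fetches one URL,
// and a host serves at most one URL per tick.
class LongTailDemo {
    // Returns the number of ticks needed to drain all queues.
    static int ticksToDrain(int[] sizes, int threads, boolean largestFirst) {
        int[] remaining = sizes.clone();
        int ticks = 0;
        while (true) {
            // Collect the queues that still hold URLs.
            List<Integer> order = new ArrayList<>();
            for (int i = 0; i < remaining.length; i++) {
                if (remaining[i] > 0) {
                    order.add(i);
                }
            }
            if (order.isEmpty()) {
                return ticks;
            }
            // Order the candidate queues by size for this tick.
            order.sort((a, b) -> largestFirst
                    ? Integer.compare(remaining[b], remaining[a])
                    : Integer.compare(remaining[a], remaining[b]));
            // Each thread takes one URL from a distinct queue.
            for (int k = 0; k < Math.min(threads, order.size()); k++) {
                remaining[order.get(k)]--;
            }
            ticks++;
        }
    }

    public static void main(String[] args) {
        int[] sizes = {100, 10, 10, 10, 10}; // made-up queue sizes
        System.out.println(ticksToDrain(sizes, 2, true));  // prints 100
        System.out.println(ticksToDrain(sizes, 2, false)); // prints 120
    }
}
```

In this model, two threads serving the largest queues first finish in 100 ticks, the lower bound set by the biggest queue. Serving the smallest queues first leaves the big queue untouched for 20 ticks and finishes in 120.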
                


        

[jira] [Commented] (NUTCH-1297) it is better for fetchItemQueues to select items from greater queues first

Posted by "Julien Nioche (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13221908#comment-13221908 ] 

Julien Nioche commented on NUTCH-1297:
--------------------------------------

This can already be addressed by giving a larger value to this parameter:

{noformat} 
<property>
  <name>fetcher.queue.depth.multiplier</name>
  <value>50</value>
  <description>(EXPERT)The fetcher buffers the incoming URLs into queues based on the [host|domain|IP]
  (see param fetcher.queue.mode). The depth of the queue is the number of threads times the value of this parameter.
  A large value requires more memory but can improve the performance of the fetch when the order of the URLS in the fetch list
  is not optimal.
  </description>
</property>
{noformat} 
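
As a worked example of the formula in the description (depth = number of threads times the multiplier), here is a small Java sketch. The class FeederDepth and the method depth are hypothetical helpers; only the formula comes from the property description. The 5 threads are taken from the issue's example, and 200 is an arbitrary larger multiplier chosen for illustration.

```java
// Hypothetical helper; only the formula comes from the property description.
class FeederDepth {
    // Queue depth = number of fetcher threads times the depth multiplier.
    static int depth(int fetcherThreads, int depthMultiplier) {
        return fetcherThreads * depthMultiplier;
    }

    public static void main(String[] args) {
        System.out.println(depth(5, 50));  // prints 250
        System.out.println(depth(5, 200)); // prints 1000
    }
}
```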



                


        

[jira] [Updated] (NUTCH-1297) it is better for fetchItemQueues to select items from greater queues first

Posted by "behnam nikbakht (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

behnam nikbakht updated NUTCH-1297:
-----------------------------------

    Attachment:     (was: NUTCH-1297.patch)
    
