Posted to dev@nutch.apache.org by "behnam nikbakht (Created) (JIRA)" <ji...@apache.org> on 2012/02/15 07:08:03 UTC

[jira] [Created] (NUTCH-1278) Fetch Improvement in threads per host

Fetch Improvement in threads per host
-------------------------------------

                 Key: NUTCH-1278
                 URL: https://issues.apache.org/jira/browse/NUTCH-1278
             Project: Nutch
          Issue Type: New Feature
          Components: fetcher
    Affects Versions: 1.4
            Reporter: behnam nikbakht


The value of maxThreads is equal to fetcher.threads.per.host and is constant for every host.
It should be possible to use a dynamic value for each host, influenced by the number of blocked requests.
This means that if the number of blocked requests for a host increases, we must decrease this value and increase http.timeout.
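
As a rough illustration of the proposal (not the attached patch), the per-host thread limit could be tracked along these lines. The class and method names are hypothetical, and raising the limit again after a successful request is an assumption that the description does not state; the http.timeout side is left out here.

```java
import java.util.concurrent.ConcurrentHashMap;

/**
 * Hypothetical helper, not part of Nutch: keeps a per-host thread limit
 * that starts at the configured value and shrinks when requests are blocked.
 */
class AdaptiveThreadsPerHost {
    private final ConcurrentHashMap<String, Integer> maxThreads = new ConcurrentHashMap<>();
    private final int configuredMax; // e.g. the value of fetcher.threads.per.host

    AdaptiveThreadsPerHost(int configuredMax) {
        if (configuredMax < 1) {
            throw new IllegalArgumentException("configuredMax must be >= 1");
        }
        this.configuredMax = configuredMax;
    }

    /** Current limit for a host; hosts never seen before get the configured value. */
    int maxThreadsFor(String host) {
        return maxThreads.getOrDefault(host, configuredMax);
    }

    /** A request was blocked: allow one thread fewer, but never fewer than one. */
    int onBlocked(String host) {
        return maxThreads.compute(host,
            (h, cur) -> Math.max(1, (cur == null ? configuredMax : cur) - 1));
    }

    /** Assumption: a success lets the limit recover, up to the configured value. */
    int onSuccess(String host) {
        return maxThreads.compute(host,
            (h, cur) -> Math.min(configuredMax, (cur == null ? configuredMax : cur) + 1));
    }
}
```

The updates go through ConcurrentHashMap.compute, so several fetcher threads can report results for the same host without extra locking.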

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (NUTCH-1278) Fetch Improvement in threads per host

Posted by "behnam nikbakht (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

behnam nikbakht updated NUTCH-1278:
-----------------------------------

    Attachment: NUTCH-1278.zip
    

        

[jira] [Commented] (NUTCH-1278) Fetch Improvement in threads per host

Posted by "Lewis John McGibbney (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13208413#comment-13208413 ] 

Lewis John McGibbney commented on NUTCH-1278:
---------------------------------------------

Hi Behnam. Do you have a patch for trunk? Thank you
                

        

[jira] [Commented] (NUTCH-1278) Fetch Improvement in threads per host

Posted by "behnam nikbakht (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13221568#comment-13221568 ] 

behnam nikbakht commented on NUTCH-1278:
----------------------------------------

I edited this patch to make these changes:
It uses a class named HostsUtil that manages an XML file, named hosts_conf.xml, for maintaining permanent and temporary information about hosts in a multi-threaded environment. This class maintains variables such as the timeout for fetch and the hostcount for generate, ...
These variables can be used in fetch, generate and other parts of Nutch.
For an adaptive http.timeout in fetch, we simply change some parts of Fetcher.java, with some changes in Protocol.java and its implementations, without disturbing them.
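
For readers without the attachment, a thread-safe per-host store backed by an XML file might look roughly like the sketch below. This is not the HostsUtil class from the patch: the class name, the key layout and the use of the java.util.Properties XML format are all assumptions made for illustration, and the real layout of hosts_conf.xml is not shown in this thread.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.Properties;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical per-host settings store; NOT the HostsUtil class from the patch. */
class HostSettingsStore {
    // Keys are "host|name", e.g. "example.org|timeout" (assumed layout).
    private final ConcurrentHashMap<String, Integer> values = new ConcurrentHashMap<>();

    private static String key(String host, String name) {
        return host + "|" + name;
    }

    void put(String host, String name, int value) {
        values.put(key(host, name), value);
    }

    int get(String host, String name, int defaultValue) {
        return values.getOrDefault(key(host, name), defaultValue);
    }

    /** Writes all values to an XML file in the java.util.Properties format. */
    synchronized void save(Path file) throws IOException {
        Properties props = new Properties();
        for (Map.Entry<String, Integer> e : values.entrySet()) {
            props.setProperty(e.getKey(), Integer.toString(e.getValue()));
        }
        try (OutputStream out = Files.newOutputStream(file)) {
            props.storeToXML(out, "per-host settings");
        }
    }

    /** Reads values back from a file written by save(), overwriting matching keys. */
    synchronized void load(Path file) throws IOException {
        Properties props = new Properties();
        try (InputStream in = Files.newInputStream(file)) {
            props.loadFromXML(in);
        }
        for (String k : props.stringPropertyNames()) {
            values.put(k, Integer.parseInt(props.getProperty(k)));
        }
    }
}
```

Reads and writes of single values go through a ConcurrentHashMap, while save and load are synchronized so that two threads do not write the file at the same time.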

                

        

[jira] [Updated] (NUTCH-1278) Fetch Improvement in threads per host

Posted by "behnam nikbakht (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

behnam nikbakht updated NUTCH-1278:
-----------------------------------

    Attachment: NUTCH-1278-v.2.zip
    

        

[jira] [Commented] (NUTCH-1278) Fetch Improvement in threads per host

Posted by "Julien Nioche (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13208417#comment-13208417 ] 

Julien Nioche commented on NUTCH-1278:
--------------------------------------

yep, please use trunk as a basis for your contribution. The parameter you mention does not exist any more and has been replaced with 'fetcher.threads.per.queue'

                

        

[jira] [Commented] (NUTCH-1278) Fetch Improvement in threads per host

Posted by "behnam nikbakht (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13211294#comment-13211294 ] 

behnam nikbakht commented on NUTCH-1278:
----------------------------------------

Here is a preliminary patch that makes some changes to Fetcher.java, Protocol.java and its plugins such as lib-http.
I use a file on the local file system to maintain a hashtable that contains hosts and their http.timeout.
For each blocked response the timeout is incremented, and for each success it is decremented.
We can use different increment and decrement rates, so that we can strike a balance between the total time of the fetch job and the ratio of fetched to blocked requests. For example, it can be made configurable so that if 90% of the requests for some host succeed, there is no need to increase the timeout.
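
The increment/decrement scheme described above can be sketched as follows. This is only an illustration under stated assumptions, not the code in the zip: the class name, the parameter names, the upper cap on the timeout and the choice to never go below the base timeout are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of a per-host adaptive timeout; not the attached patch. */
class AdaptiveTimeout {
    private static final class Stats {
        long success;
        long blocked;
        int timeoutMs;
    }

    private final Map<String, Stats> hosts = new HashMap<>();
    private final int baseMs;     // starting timeout, e.g. the configured http.timeout
    private final int incMs;      // added after a blocked response
    private final int decMs;      // subtracted after a success
    private final int maxMs;      // assumed upper cap
    private final double okRatio; // e.g. 0.9: no increase while this share succeeds

    AdaptiveTimeout(int baseMs, int incMs, int decMs, int maxMs, double okRatio) {
        this.baseMs = baseMs;
        this.incMs = incMs;
        this.decMs = decMs;
        this.maxMs = maxMs;
        this.okRatio = okRatio;
    }

    /** Timeout to use for the next request to this host. */
    synchronized int timeoutFor(String host) {
        Stats s = hosts.get(host);
        return s == null ? baseMs : s.timeoutMs;
    }

    /** Records the outcome of one request and returns the updated timeout. */
    synchronized int record(String host, boolean success) {
        Stats s = hosts.computeIfAbsent(host, h -> {
            Stats n = new Stats();
            n.timeoutMs = baseMs;
            return n;
        });
        if (success) {
            s.success++;
            s.timeoutMs = Math.max(baseMs, s.timeoutMs - decMs);
        } else {
            s.blocked++;
            double ratio = (double) s.success / (s.success + s.blocked);
            // Only lengthen the timeout when the host's success ratio is too low.
            if (ratio < okRatio) {
                s.timeoutMs = Math.min(maxMs, s.timeoutMs + incMs);
            }
        }
        return s.timeoutMs;
    }
}
```

With an increment larger than the decrement, a host that blocks often quickly gets a longer timeout, while a host with an occasional blocked response among many successes keeps the base value.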
                

        

[jira] [Commented] (NUTCH-1278) Fetch Improvement in threads per host

Posted by "Ferdy Galema (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13225545#comment-13225545 ] 

Ferdy Galema commented on NUTCH-1278:
-------------------------------------

I noticed you used the diff command this time, but failed to include the new file in the patch. When you want the diff command to include new files, you simply add them to svn first. In the case of HostsUtil, this would be:

svn add src/java/org/apache/nutch/util/HostsUtil.java

When you execute the diff command afterwards, you will notice that it included the new file. Now you can simply upload this patch file only instead of a zip.

Good luck.
                

        

[jira] [Commented] (NUTCH-1278) Fetch Improvement in threads per host

Posted by "Lewis John McGibbney (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13211320#comment-13211320 ] 

Lewis John McGibbney commented on NUTCH-1278:
---------------------------------------------

Behnam, this looks interesting but there are a few problems here.
1) It would be much, much easier for us to apply, test and comment on your contribution if you included it in a simple .patch file. This can be done like so:
{code}
$ cd $NUTCH_HOME
$ svn diff > NUTCH-patch-name.patch
{code}
The current zip format for the patch(es), plus the fact that every class has been patched separately from its own directory, makes it really hard for us to work with this.
2) It doesn't appear that this patch actually applies against trunk? Maybe 1.4? You can check out trunk here [1]. I was getting errors when trying to apply HttpBase, then gave up and started writing this.
3) For a change to the fetcher of this scale, it would be really nice if you could provide a test within the test suite we already maintain [2].

As I said this looks really great, and sorry for the rather lengthy initial response, but for us to consider this for integration it would be great for your contributions to meet this minimum requirement as they are highly appreciated. Thank you

[1] https://svn.apache.org/repos/asf/nutch/trunk/
[2] https://svn.apache.org/viewvc/nutch/trunk/src/test/org/apache/nutch/fetcher/TestFetcher.java?view=markup    
                