You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@nutch.apache.org by "Roger Dunk (JIRA)" <ji...@apache.org> on 2009/03/17 21:29:50 UTC

[jira] Created: (NUTCH-721) Fetcher2 Slow

Fetcher2 Slow
-------------

                 Key: NUTCH-721
                 URL: https://issues.apache.org/jira/browse/NUTCH-721
             Project: Nutch
          Issue Type: Bug
          Components: fetcher
    Affects Versions: 1.0.0
         Environment: Fedora Core r6, Kernel 2.6.22-14,  jdk1.6.0_12
            Reporter: Roger Dunk


Fetcher2 fetches far more slowly than Fetcher1.

Config options:
fetcher.threads.fetch = 80
fetcher.threads.per.host = 80
fetcher.server.delay = 0
generate.max.per.host = 1

With a queue size of ~40,000, the result is:

activeThreads=80, spinWaiting=79, fetchQueues.totalSize=0

with maybe a download of 1 page per second.

Runing with -noParse makes little difference.

CPU load average is around 0.2. With Fetcher1 CPU load is around 2.0 - 3.0

Hosts already cached by local caching NS appear to download quickly upon a re-fetch, so possible issue relating to NS lookups, however all things being equal Fetcher1 runs fast without pre-caching hosts.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (NUTCH-721) Fetcher2 Slow

Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12741225#action_12741225 ] 

Doğacan Güney commented on NUTCH-721:
-------------------------------------

Thanks for the analysis, Julien! Can you make a patch for the conf changes so we can commit it with your name?

> Fetcher2 Slow
> -------------
>
>                 Key: NUTCH-721
>                 URL: https://issues.apache.org/jira/browse/NUTCH-721
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 1.0.0
>         Environment: Fedora Core r6, Kernel 2.6.22-14,  jdk1.6.0_12
>            Reporter: Roger Dunk
>         Attachments: crawl_generate.tar.gz, nutch-site.xml
>
>
> Fetcher2 fetches far more slowly than Fetcher1.
> Config options:
> fetcher.threads.fetch = 80
> fetcher.threads.per.host = 80
> fetcher.server.delay = 0
> generate.max.per.host = 1
> With a queue size of ~40,000, the result is:
> activeThreads=80, spinWaiting=79, fetchQueues.totalSize=0
> with maybe a download of 1 page per second.
> Runing with -noParse makes little difference.
> CPU load average is around 0.2. With Fetcher1 CPU load is around 2.0 - 3.0
> Hosts already cached by local caching NS appear to download quickly upon a re-fetch, so possible issue relating to NS lookups, however all things being equal Fetcher1 runs fast without pre-caching hosts.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (NUTCH-721) Fetcher2 Slow

Posted by "Roger Dunk (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12695600#action_12695600 ] 

Roger Dunk commented on NUTCH-721:
----------------------------------

Julien, yes, fetcher.threads.per.host.by.ip was set to false in the above tests. I have also tried it with true, which certainly didn't help the speed issue, but I can't comment on the hung threads as I didn't bother letting the fetch complete. I'd say there are two, likely unrelated problems with Fetcher2.

> Fetcher2 Slow
> -------------
>
>                 Key: NUTCH-721
>                 URL: https://issues.apache.org/jira/browse/NUTCH-721
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 1.0.0
>         Environment: Fedora Core r6, Kernel 2.6.22-14,  jdk1.6.0_12
>            Reporter: Roger Dunk
>         Attachments: crawl_generate.tar.gz, nutch-site.xml
>
>
> Fetcher2 fetches far more slowly than Fetcher1.
> Config options:
> fetcher.threads.fetch = 80
> fetcher.threads.per.host = 80
> fetcher.server.delay = 0
> generate.max.per.host = 1
> With a queue size of ~40,000, the result is:
> activeThreads=80, spinWaiting=79, fetchQueues.totalSize=0
> with maybe a download of 1 page per second.
> Runing with -noParse makes little difference.
> CPU load average is around 0.2. With Fetcher1 CPU load is around 2.0 - 3.0
> Hosts already cached by local caching NS appear to download quickly upon a re-fetch, so possible issue relating to NS lookups, however all things being equal Fetcher1 runs fast without pre-caching hosts.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Closed: (NUTCH-721) Fetcher2 Slow

Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney closed NUTCH-721.
-------------------------------

       Resolution: Fixed
    Fix Version/s: 1.1
         Assignee: Doğacan Güney

Code committed as of rev. 807485.

I am closing this issue. Of course, there may be other reasons why Fetcher2 is slow, so feel free to create new issues if so.

> Fetcher2 Slow
> -------------
>
>                 Key: NUTCH-721
>                 URL: https://issues.apache.org/jira/browse/NUTCH-721
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 1.0.0
>         Environment: Fedora Core r6, Kernel 2.6.22-14,  jdk1.6.0_12
>            Reporter: Roger Dunk
>            Assignee: Doğacan Güney
>             Fix For: 1.1
>
>         Attachments: crawl_generate.tar.gz, NUTCH-721.patch, nutch-site.xml
>
>
> Fetcher2 fetches far more slowly than Fetcher1.
> Config options:
> fetcher.threads.fetch = 80
> fetcher.threads.per.host = 80
> fetcher.server.delay = 0
> generate.max.per.host = 1
> With a queue size of ~40,000, the result is:
> activeThreads=80, spinWaiting=79, fetchQueues.totalSize=0
> with maybe a download of 1 page per second.
> Runing with -noParse makes little difference.
> CPU load average is around 0.2. With Fetcher1 CPU load is around 2.0 - 3.0
> Hosts already cached by local caching NS appear to download quickly upon a re-fetch, so possible issue relating to NS lookups, however all things being equal Fetcher1 runs fast without pre-caching hosts.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (NUTCH-721) Fetcher2 Slow

Posted by "Otis Gospodnetic (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12712494#action_12712494 ] 

Otis Gospodnetic commented on NUTCH-721:
----------------------------------------

Ken's thoughts: http://ken-blog.krugler.org/2009/05/19/performance-problems-with-verticalfocused-web-crawling/


> Fetcher2 Slow
> -------------
>
>                 Key: NUTCH-721
>                 URL: https://issues.apache.org/jira/browse/NUTCH-721
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 1.0.0
>         Environment: Fedora Core r6, Kernel 2.6.22-14,  jdk1.6.0_12
>            Reporter: Roger Dunk
>         Attachments: crawl_generate.tar.gz, nutch-site.xml
>
>
> Fetcher2 fetches far more slowly than Fetcher1.
> Config options:
> fetcher.threads.fetch = 80
> fetcher.threads.per.host = 80
> fetcher.server.delay = 0
> generate.max.per.host = 1
> With a queue size of ~40,000, the result is:
> activeThreads=80, spinWaiting=79, fetchQueues.totalSize=0
> with maybe a download of 1 page per second.
> Runing with -noParse makes little difference.
> CPU load average is around 0.2. With Fetcher1 CPU load is around 2.0 - 3.0
> Hosts already cached by local caching NS appear to download quickly upon a re-fetch, so possible issue relating to NS lookups, however all things being equal Fetcher1 runs fast without pre-caching hosts.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (NUTCH-721) Fetcher2 Slow

Posted by "Roger Dunk (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12695298#action_12695298 ] 

Roger Dunk commented on NUTCH-721:
----------------------------------

I did a -"topN 5000", so only a subset of the attached, but still only 1 URL per host. The following is with 20 threads and also no parsing.

[root@server1 trunk]# time bin/nutch org.apache.nutch.fetcher.Fetcher newcrawl/segments/20090402130655 -threads 20 -noParsing

[...]

Aborting with 20 hung threads.
Fetcher: done

real    60m14.926s
user    0m38.671s
sys     0m6.134s

> Fetcher2 Slow
> -------------
>
>                 Key: NUTCH-721
>                 URL: https://issues.apache.org/jira/browse/NUTCH-721
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 1.0.0
>         Environment: Fedora Core r6, Kernel 2.6.22-14,  jdk1.6.0_12
>            Reporter: Roger Dunk
>         Attachments: crawl_generate.tar.gz, nutch-site.xml
>
>
> Fetcher2 fetches far more slowly than Fetcher1.
> Config options:
> fetcher.threads.fetch = 80
> fetcher.threads.per.host = 80
> fetcher.server.delay = 0
> generate.max.per.host = 1
> With a queue size of ~40,000, the result is:
> activeThreads=80, spinWaiting=79, fetchQueues.totalSize=0
> with maybe a download of 1 page per second.
> Runing with -noParse makes little difference.
> CPU load average is around 0.2. With Fetcher1 CPU load is around 2.0 - 3.0
> Hosts already cached by local caching NS appear to download quickly upon a re-fetch, so possible issue relating to NS lookups, however all things being equal Fetcher1 runs fast without pre-caching hosts.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (NUTCH-721) Fetcher2 Slow

Posted by "Otis Gospodnetic (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12712492#action_12712492 ] 

Otis Gospodnetic commented on NUTCH-721:
----------------------------------------

Questions:
Has anyone tried profiling this? (may be relevant: http://markmail.org/message/4ixrnvfycpgmkdno )

Or maybe simply debugged/timed various blocks of code using something as simple as print statements and simple timers?

Or maybe running just a single thread and then doing kill -QUIT a number of times to simply try and spot the method where the code seems to spend a lot of its time?


> Fetcher2 Slow
> -------------
>
>                 Key: NUTCH-721
>                 URL: https://issues.apache.org/jira/browse/NUTCH-721
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 1.0.0
>         Environment: Fedora Core r6, Kernel 2.6.22-14,  jdk1.6.0_12
>            Reporter: Roger Dunk
>         Attachments: crawl_generate.tar.gz, nutch-site.xml
>
>
> Fetcher2 fetches far more slowly than Fetcher1.
> Config options:
> fetcher.threads.fetch = 80
> fetcher.threads.per.host = 80
> fetcher.server.delay = 0
> generate.max.per.host = 1
> With a queue size of ~40,000, the result is:
> activeThreads=80, spinWaiting=79, fetchQueues.totalSize=0
> with maybe a download of 1 page per second.
> Runing with -noParse makes little difference.
> CPU load average is around 0.2. With Fetcher1 CPU load is around 2.0 - 3.0
> Hosts already cached by local caching NS appear to download quickly upon a re-fetch, so possible issue relating to NS lookups, however all things being equal Fetcher1 runs fast without pre-caching hosts.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (NUTCH-721) Fetcher2 Slow

Posted by "Roger Dunk (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Roger Dunk updated NUTCH-721:
-----------------------------

    Attachment: nutch-site.xml
                crawl_generate.tar.gz

> Fetcher2 Slow
> -------------
>
>                 Key: NUTCH-721
>                 URL: https://issues.apache.org/jira/browse/NUTCH-721
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 1.0.0
>         Environment: Fedora Core r6, Kernel 2.6.22-14,  jdk1.6.0_12
>            Reporter: Roger Dunk
>         Attachments: crawl_generate.tar.gz, nutch-site.xml
>
>
> Fetcher2 fetches far more slowly than Fetcher1.
> Config options:
> fetcher.threads.fetch = 80
> fetcher.threads.per.host = 80
> fetcher.server.delay = 0
> generate.max.per.host = 1
> With a queue size of ~40,000, the result is:
> activeThreads=80, spinWaiting=79, fetchQueues.totalSize=0
> with maybe a download of 1 page per second.
> Runing with -noParse makes little difference.
> CPU load average is around 0.2. With Fetcher1 CPU load is around 2.0 - 3.0
> Hosts already cached by local caching NS appear to download quickly upon a re-fetch, so possible issue relating to NS lookups, however all things being equal Fetcher1 runs fast without pre-caching hosts.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (NUTCH-721) Fetcher2 Slow

Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12694708#action_12694708 ] 

Doğacan Güney commented on NUTCH-721:
-------------------------------------

OK, there is clearly a problem with the new fetcher. 

First, let's make sure that there is indeed a problem with the new fetcher and this is not the side effect of some other code we introduced between 0.9 and 1.0. So I suggest that we re-commit old fetcher back into trunk and do a side-by-side comparison to make sure that the problem is with the new fetcher. 

If it is with the new fetcher, then we may try to salvage Todd's work (I remember that he said that his fetcher was faster, right?).

> Fetcher2 Slow
> -------------
>
>                 Key: NUTCH-721
>                 URL: https://issues.apache.org/jira/browse/NUTCH-721
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 1.0.0
>         Environment: Fedora Core r6, Kernel 2.6.22-14,  jdk1.6.0_12
>            Reporter: Roger Dunk
>         Attachments: crawl_generate.tar.gz, nutch-site.xml
>
>
> Fetcher2 fetches far more slowly than Fetcher1.
> Config options:
> fetcher.threads.fetch = 80
> fetcher.threads.per.host = 80
> fetcher.server.delay = 0
> generate.max.per.host = 1
> With a queue size of ~40,000, the result is:
> activeThreads=80, spinWaiting=79, fetchQueues.totalSize=0
> with maybe a download of 1 page per second.
> Runing with -noParse makes little difference.
> CPU load average is around 0.2. With Fetcher1 CPU load is around 2.0 - 3.0
> Hosts already cached by local caching NS appear to download quickly upon a re-fetch, so possible issue relating to NS lookups, however all things being equal Fetcher1 runs fast without pre-caching hosts.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (NUTCH-721) Fetcher2 Slow

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-721:
--------------------------------

    Attachment: NUTCH-721.patch

Sets the default value for fetcher.threads.per.host.by.ip to false

> Fetcher2 Slow
> -------------
>
>                 Key: NUTCH-721
>                 URL: https://issues.apache.org/jira/browse/NUTCH-721
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 1.0.0
>         Environment: Fedora Core r6, Kernel 2.6.22-14,  jdk1.6.0_12
>            Reporter: Roger Dunk
>         Attachments: crawl_generate.tar.gz, NUTCH-721.patch, nutch-site.xml
>
>
> Fetcher2 fetches far more slowly than Fetcher1.
> Config options:
> fetcher.threads.fetch = 80
> fetcher.threads.per.host = 80
> fetcher.server.delay = 0
> generate.max.per.host = 1
> With a queue size of ~40,000, the result is:
> activeThreads=80, spinWaiting=79, fetchQueues.totalSize=0
> with maybe a download of 1 page per second.
> Runing with -noParse makes little difference.
> CPU load average is around 0.2. With Fetcher1 CPU load is around 2.0 - 3.0
> Hosts already cached by local caching NS appear to download quickly upon a re-fetch, so possible issue relating to NS lookups, however all things being equal Fetcher1 runs fast without pre-caching hosts.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (NUTCH-721) Fetcher2 Slow

Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12694986#action_12694986 ] 

Doğacan Güney commented on NUTCH-721:
-------------------------------------

I've committed nutch 0.9 fetcher as OldFetcher. So can you test with trunk and OldFetcher?

> Fetcher2 Slow
> -------------
>
>                 Key: NUTCH-721
>                 URL: https://issues.apache.org/jira/browse/NUTCH-721
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 1.0.0
>         Environment: Fedora Core r6, Kernel 2.6.22-14,  jdk1.6.0_12
>            Reporter: Roger Dunk
>         Attachments: crawl_generate.tar.gz, nutch-site.xml
>
>
> Fetcher2 fetches far more slowly than Fetcher1.
> Config options:
> fetcher.threads.fetch = 80
> fetcher.threads.per.host = 80
> fetcher.server.delay = 0
> generate.max.per.host = 1
> With a queue size of ~40,000, the result is:
> activeThreads=80, spinWaiting=79, fetchQueues.totalSize=0
> with maybe a download of 1 page per second.
> Runing with -noParse makes little difference.
> CPU load average is around 0.2. With Fetcher1 CPU load is around 2.0 - 3.0
> Hosts already cached by local caching NS appear to download quickly upon a re-fetch, so possible issue relating to NS lookups, however all things being equal Fetcher1 runs fast without pre-caching hosts.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Issue Comment Edited: (NUTCH-721) Fetcher2 Slow

Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12694986#action_12694986 ] 

Doğacan Güney edited comment on NUTCH-721 at 4/2/09 6:01 AM:
-------------------------------------------------------------

I've committed nutch 0.9 fetcher as OldFetcher. So can you test with trunk and OldFetcher so that we can find out if this is related to new fetcher or is the side effect of some other change?

      was (Author: dogacan):
    I've committed nutch 0.9 fetcher as OldFetcher. So can you test with trunk and OldFetcher?
  
> Fetcher2 Slow
> -------------
>
>                 Key: NUTCH-721
>                 URL: https://issues.apache.org/jira/browse/NUTCH-721
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 1.0.0
>         Environment: Fedora Core r6, Kernel 2.6.22-14,  jdk1.6.0_12
>            Reporter: Roger Dunk
>         Attachments: crawl_generate.tar.gz, nutch-site.xml
>
>
> Fetcher2 fetches far more slowly than Fetcher1.
> Config options:
> fetcher.threads.fetch = 80
> fetcher.threads.per.host = 80
> fetcher.server.delay = 0
> generate.max.per.host = 1
> With a queue size of ~40,000, the result is:
> activeThreads=80, spinWaiting=79, fetchQueues.totalSize=0
> with maybe a download of 1 page per second.
> Runing with -noParse makes little difference.
> CPU load average is around 0.2. With Fetcher1 CPU load is around 2.0 - 3.0
> Hosts already cached by local caching NS appear to download quickly upon a re-fetch, so possible issue relating to NS lookups, however all things being equal Fetcher1 runs fast without pre-caching hosts.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (NUTCH-721) Fetcher2 Slow

Posted by "Steven Denny (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12730242#action_12730242 ] 

Steven Denny commented on NUTCH-721:
------------------------------------

I've done some testing on this and looked at the number of pages being fed, as this obvioulsy limits the number of pages you can fetch:

2009-07-10 04:01:35,296 INFO fetcher.Fetcher - Fed 500 urls in 186 secs = 2.7url/s
2009-07-10 04:04:18,343 INFO fetcher.Fetcher - Fed 499 urls in 163 secs = 3.1url/s
2009-07-10 04:06:57,109 INFO fetcher.Fetcher - Fed 498 urls in 158 secs = 3.2url/s
2009-07-10 04:10:38,282 INFO fetcher.Fetcher - Fed 499 urls in 221 secs = 2.3url/s
2009-07-10 04:12:58,371 INFO fetcher.Fetcher - Fed 498 urls in 140 secs = 3.6url/s
2009-07-10 04:16:12,275 INFO fetcher.Fetcher - Fed 499 urls in 193 secs = 2.6url/s
2009-07-10 04:19:20,162 INFO fetcher.Fetcher - Fed 499 urls in 187 secs = 2.7url/s
2009-07-10 04:21:25,846 INFO fetcher.Fetcher - Fed 499 urls in 125 secs = 4.0url/s
2009-07-10 04:24:16,049 INFO fetcher.Fetcher - Fed 495 urls in 170 secs = 2.9url/s
2009-07-10 04:27:01,944 INFO fetcher.Fetcher - Fed 499 urls in 165 secs = 3.0url/s
2009-07-10 04:29:26,247 INFO fetcher.Fetcher - Fed 499 urls in 144 secs = 3.5url/s
2009-07-10 04:32:02,590 INFO fetcher.Fetcher - Fed 499 urls in 156 secs = 3.2url/s
2009-07-10 04:34:49,985 INFO fetcher.Fetcher - Fed 498 urls in 167 secs = 3.0url/s
2009-07-10 04:37:28,367 INFO fetcher.Fetcher - Fed 498 urls in 158 secs = 3.2url/s
2009-07-10 04:40:09,865 INFO fetcher.Fetcher - Fed 499 urls in 161 secs = 3.1url/s
2009-07-10 04:42:55,203 INFO fetcher.Fetcher - Fed 499 urls in 165 secs = 3.0url/s

That test was one a crawldb just initialised with 11,000 urls (unique sites).

However, on the next iteration where I'm feeding urls from non-unique sites, I see 5-7 times that rate. (My test system is a vm on our network, with nothing special in terms of DNS. Someof the look ups were taking 5-6 seconds).

> Fetcher2 Slow
> -------------
>
>                 Key: NUTCH-721
>                 URL: https://issues.apache.org/jira/browse/NUTCH-721
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 1.0.0
>         Environment: Fedora Core r6, Kernel 2.6.22-14,  jdk1.6.0_12
>            Reporter: Roger Dunk
>         Attachments: crawl_generate.tar.gz, nutch-site.xml
>
>
> Fetcher2 fetches far more slowly than Fetcher1.
> Config options:
> fetcher.threads.fetch = 80
> fetcher.threads.per.host = 80
> fetcher.server.delay = 0
> generate.max.per.host = 1
> With a queue size of ~40,000, the result is:
> activeThreads=80, spinWaiting=79, fetchQueues.totalSize=0
> with maybe a download of 1 page per second.
> Runing with -noParse makes little difference.
> CPU load average is around 0.2. With Fetcher1 CPU load is around 2.0 - 3.0
> Hosts already cached by local caching NS appear to download quickly upon a re-fetch, so possible issue relating to NS lookups, however all things being equal Fetcher1 runs fast without pre-caching hosts.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (NUTCH-721) Fetcher2 Slow

Posted by "Roger Dunk (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12712506#action_12712506 ] 

Roger Dunk commented on NUTCH-721:
----------------------------------

My tests were done on a segment with only 1 URL per host (generate.max.per.host = 1), so I don't believe what Ken has to say is the reason, at least in my case, for Fetcher2 performing slowly.

> Fetcher2 Slow
> -------------
>
>                 Key: NUTCH-721
>                 URL: https://issues.apache.org/jira/browse/NUTCH-721
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 1.0.0
>         Environment: Fedora Core r6, Kernel 2.6.22-14,  jdk1.6.0_12
>            Reporter: Roger Dunk
>         Attachments: crawl_generate.tar.gz, nutch-site.xml
>
>
> Fetcher2 fetches far more slowly than Fetcher1.
> Config options:
> fetcher.threads.fetch = 80
> fetcher.threads.per.host = 80
> fetcher.server.delay = 0
> generate.max.per.host = 1
> With a queue size of ~40,000, the result is:
> activeThreads=80, spinWaiting=79, fetchQueues.totalSize=0
> with maybe a download of 1 page per second.
> Runing with -noParse makes little difference.
> CPU load average is around 0.2. With Fetcher1 CPU load is around 2.0 - 3.0
> Hosts already cached by local caching NS appear to download quickly upon a re-fetch, so possible issue relating to NS lookups, however all things being equal Fetcher1 runs fast without pre-caching hosts.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (NUTCH-721) Fetcher2 Slow

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12695394#action_12695394 ] 

Julien Nioche commented on NUTCH-721:
-------------------------------------

The message about the "Aborted hung threads" looks like what I described in https://issues.apache.org/jira/browse/NUTCH-719 except that in this case there are active queues but fetchQueues.totalSize=0 

Roger : can you confirm that the parameter fetcher.threads.per.host.by.ip is set to false?



> Fetcher2 Slow
> -------------
>
>                 Key: NUTCH-721
>                 URL: https://issues.apache.org/jira/browse/NUTCH-721
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 1.0.0
>         Environment: Fedora Core r6, Kernel 2.6.22-14,  jdk1.6.0_12
>            Reporter: Roger Dunk
>         Attachments: crawl_generate.tar.gz, nutch-site.xml
>
>
> Fetcher2 fetches far more slowly than Fetcher1.
> Config options:
> fetcher.threads.fetch = 80
> fetcher.threads.per.host = 80
> fetcher.server.delay = 0
> generate.max.per.host = 1
> With a queue size of ~40,000, the result is:
> activeThreads=80, spinWaiting=79, fetchQueues.totalSize=0
> with maybe a download of 1 page per second.
> Runing with -noParse makes little difference.
> CPU load average is around 0.2. With Fetcher1 CPU load is around 2.0 - 3.0
> Hosts already cached by local caching NS appear to download quickly upon a re-fetch, so possible issue relating to NS lookups, however all things being equal Fetcher1 runs fast without pre-caching hosts.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (NUTCH-721) Fetcher2 Slow

Posted by "Roger Dunk (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12695170#action_12695170 ] 

Roger Dunk commented on NUTCH-721:
----------------------------------

For the following tests I've used the same segment containing 5000 URLs. I cleaned the named cache before the first two tests.

[root@server1 trunk]# time bin/nutch org.apache.nutch.fetcher.OldFetcher newcrawl/segments/20090402130655/

real    3m38.084s
user    2m20.887s
sys     0m7.470s

[root@server1 trunk]# time bin/nutch org.apache.nutch.fetcher.Fetcher newcrawl/segments/20090402130655/

[...]

Fetcher: done

real    53m44.800s
user    2m20.070s
sys     0m9.527s

For this next test, I used the same segment but didn't clear the named cache from the previous test, so all resolvable hosts should still be cached. This appeared to help greatly, as often times out of 80 active threads, only 60 were spinwaiting (as opposed to 79 in the non-cached test), but there were still plenty of times where at least 30 consecutive log entries showed 80 threads spinwaiting. And clearly as can be seen from the times below, still nowhere in the league of OldFetcher.

[root@server1 trunk]# time bin/nutch org.apache.nutch.fetcher.Fetcher newcrawl/segments/20090402130655/

[...]

Aborting with 80 hung threads.
Fetcher: done

real    22m5.420s
user    2m39.407s
sys     0m8.192s

> Fetcher2 Slow
> -------------
>
>                 Key: NUTCH-721
>                 URL: https://issues.apache.org/jira/browse/NUTCH-721
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 1.0.0
>         Environment: Fedora Core r6, Kernel 2.6.22-14,  jdk1.6.0_12
>            Reporter: Roger Dunk
>         Attachments: crawl_generate.tar.gz, nutch-site.xml
>
>
> Fetcher2 fetches far more slowly than Fetcher1.
> Config options:
> fetcher.threads.fetch = 80
> fetcher.threads.per.host = 80
> fetcher.server.delay = 0
> generate.max.per.host = 1
> With a queue size of ~40,000, the result is:
> activeThreads=80, spinWaiting=79, fetchQueues.totalSize=0
> with maybe a download of 1 page per second.
> Runing with -noParse makes little difference.
> CPU load average is around 0.2. With Fetcher1 CPU load is around 2.0 - 3.0
> Hosts already cached by local caching NS appear to download quickly upon a re-fetch, so possible issue relating to NS lookups, however all things being equal Fetcher1 runs fast without pre-caching hosts.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (NUTCH-721) Fetcher2 Slow

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12741092#action_12741092 ] 

Andrzej Bialecki  commented on NUTCH-721:
-----------------------------------------

+1. Current defaults are sub-optimal due to backward-compatibility issues with early Nutch 0.8. This should be no longer a concern.

> Fetcher2 Slow
> -------------
>
>                 Key: NUTCH-721
>                 URL: https://issues.apache.org/jira/browse/NUTCH-721
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 1.0.0
>         Environment: Fedora Core r6, Kernel 2.6.22-14,  jdk1.6.0_12
>            Reporter: Roger Dunk
>         Attachments: crawl_generate.tar.gz, nutch-site.xml
>
>
> Fetcher2 fetches far more slowly than Fetcher1.
> Config options:
> fetcher.threads.fetch = 80
> fetcher.threads.per.host = 80
> fetcher.server.delay = 0
> generate.max.per.host = 1
> With a queue size of ~40,000, the result is:
> activeThreads=80, spinWaiting=79, fetchQueues.totalSize=0
> with maybe a download of 1 page per second.
> Runing with -noParse makes little difference.
> CPU load average is around 0.2. With Fetcher1 CPU load is around 2.0 - 3.0
> Hosts already cached by local caching NS appear to download quickly upon a re-fetch, so possible issue relating to NS lookups, however all things being equal Fetcher1 runs fast without pre-caching hosts.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (NUTCH-721) Fetcher2 Slow

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12741082#action_12741082 ] 

Julien Nioche commented on NUTCH-721:
-------------------------------------

I had another look at this issue after applying the patch from Nutch-719. I can easily reproduce the situation from the original post by setting fetcher.threads.per.host.by.ip to true. The nutch-site file sent by Rodger does not specify it so it would rely on this value by default. Once setting it to false all threads are active and the fetching is much faster. 

I have used the first 5K URLs from the fetchlist sent by Rodger and compared the perfs with by.ip set to false :  

OldFetcher :  
real	32m26.003s
user	1m11.768s
sys	0m10.337s

OldFetcher :  
real	30m52.965s
user	1m10.696s
sys	0m10.425s

Fetcher :  
real	31m21.924s
user	1m12.725s
sys	0m10.797s

Fetcher :
real	30m3.017s
user	1m15.509s
sys	0m10.909s

I ran each step twice and as we can see the results are comparable.

This explanation is also compliant with Steven's observation that we get 5-7 times the rate as we would hit the DNS cache for subsequent calls for URLs from non unique sites. The IP resolution is done by the QueueFeeder which explains why it is slowing down the number of URLs being available for fetching.

I don't think that the oldFetcher allows to group URLs by IP for politeness in which case why not making fetcher.threads.per.host.by.ip default to false in the new fetcher?


> Fetcher2 Slow
> -------------
>
>                 Key: NUTCH-721
>                 URL: https://issues.apache.org/jira/browse/NUTCH-721
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 1.0.0
>         Environment: Fedora Core r6, Kernel 2.6.22-14,  jdk1.6.0_12
>            Reporter: Roger Dunk
>         Attachments: crawl_generate.tar.gz, nutch-site.xml
>
>
> Fetcher2 fetches far more slowly than Fetcher1.
> Config options:
> fetcher.threads.fetch = 80
> fetcher.threads.per.host = 80
> fetcher.server.delay = 0
> generate.max.per.host = 1
> With a queue size of ~40,000, the result is:
> activeThreads=80, spinWaiting=79, fetchQueues.totalSize=0
> with maybe a download of 1 page per second.
> Runing with -noParse makes little difference.
> CPU load average is around 0.2. With Fetcher1 CPU load is around 2.0 - 3.0
> Hosts already cached by local caching NS appear to download quickly upon a re-fetch, so possible issue relating to NS lookups, however all things being equal Fetcher1 runs fast without pre-caching hosts.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (NUTCH-721) Fetcher2 Slow

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12695233#action_12695233 ] 

Hudson commented on NUTCH-721:
------------------------------

Integrated in Nutch-trunk #772 (See [http://hudson.zones.apache.org/hudson/job/Nutch-trunk/772/])
     - Commit old fetcher as OldFetcher for now so that we can test Fetcher2 performance.


> Fetcher2 Slow
> -------------
>
>                 Key: NUTCH-721
>                 URL: https://issues.apache.org/jira/browse/NUTCH-721
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 1.0.0
>         Environment: Fedora Core r6, Kernel 2.6.22-14,  jdk1.6.0_12
>            Reporter: Roger Dunk
>         Attachments: crawl_generate.tar.gz, nutch-site.xml
>
>
> Fetcher2 fetches far more slowly than Fetcher1.
> Config options:
> fetcher.threads.fetch = 80
> fetcher.threads.per.host = 80
> fetcher.server.delay = 0
> generate.max.per.host = 1
> With a queue size of ~40,000, the result is:
> activeThreads=80, spinWaiting=79, fetchQueues.totalSize=0
> with maybe a download of 1 page per second.
> Runing with -noParse makes little difference.
> CPU load average is around 0.2. With Fetcher1 CPU load is around 2.0 - 3.0
> Hosts already cached by local caching NS appear to download quickly upon a re-fetch, so possible issue relating to NS lookups, however all things being equal Fetcher1 runs fast without pre-caching hosts.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (NUTCH-721) Fetcher2 Slow

Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12695277#action_12695277 ] 

Doğacan Güney commented on NUTCH-721:
-------------------------------------

Wow, 53 min vs 3 min !?

Thanks a lot for testing and that is indeed very worrying.

Which 5000 url set did you use? I think the crawl_generate you attached to this issue has 13K urls?

PS: One small thing new Fetcher requires less threads than OldFetcher. If you have time can you try with
smaller number of threads (say, 15-20)?

> Fetcher2 Slow
> -------------
>
>                 Key: NUTCH-721
>                 URL: https://issues.apache.org/jira/browse/NUTCH-721
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 1.0.0
>         Environment: Fedora Core r6, Kernel 2.6.22-14,  jdk1.6.0_12
>            Reporter: Roger Dunk
>         Attachments: crawl_generate.tar.gz, nutch-site.xml
>
>
> Fetcher2 fetches far more slowly than Fetcher1.
> Config options:
> fetcher.threads.fetch = 80
> fetcher.threads.per.host = 80
> fetcher.server.delay = 0
> generate.max.per.host = 1
> With a queue size of ~40,000, the result is:
> activeThreads=80, spinWaiting=79, fetchQueues.totalSize=0
> with maybe a download of 1 page per second.
> Runing with -noParse makes little difference.
> CPU load average is around 0.2. With Fetcher1 CPU load is around 2.0 - 3.0
> Hosts already cached by local caching NS appear to download quickly upon a re-fetch, so possible issue relating to NS lookups, however all things being equal Fetcher1 runs fast without pre-caching hosts.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (NUTCH-721) Fetcher2 Slow

Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12730402#action_12730402 ] 

Doğacan Güney commented on NUTCH-721:
-------------------------------------

Steven, if you have time/hardware, can you retry your use-case with OIdFetcher in trunk?

> Fetcher2 Slow
> -------------
>
>                 Key: NUTCH-721
>                 URL: https://issues.apache.org/jira/browse/NUTCH-721
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 1.0.0
>         Environment: Fedora Core r6, Kernel 2.6.22-14,  jdk1.6.0_12
>            Reporter: Roger Dunk
>         Attachments: crawl_generate.tar.gz, nutch-site.xml
>
>
> Fetcher2 fetches far more slowly than Fetcher1.
> Config options:
> fetcher.threads.fetch = 80
> fetcher.threads.per.host = 80
> fetcher.server.delay = 0
> generate.max.per.host = 1
> With a queue size of ~40,000, the result is:
> activeThreads=80, spinWaiting=79, fetchQueues.totalSize=0
> with maybe a download of 1 page per second.
> Runing with -noParse makes little difference.
> CPU load average is around 0.2. With Fetcher1 CPU load is around 2.0 - 3.0
> Hosts already cached by local caching NS appear to download quickly upon a re-fetch, so possible issue relating to NS lookups, however all things being equal Fetcher1 runs fast without pre-caching hosts.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.