Posted to dev@nutch.apache.org by "Doğacan Güney (JIRA)" <ji...@apache.org> on 2009/04/01 22:01:12 UTC

[jira] Commented: (NUTCH-721) Fetcher2 Slow

    [ https://issues.apache.org/jira/browse/NUTCH-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12694708#action_12694708 ] 

Doğacan Güney commented on NUTCH-721:
-------------------------------------

OK, there is clearly a problem with the new fetcher. 

First, let's make sure that the problem is indeed in the new fetcher and is not a side effect of some other code we introduced between 0.9 and 1.0. I suggest that we re-commit the old fetcher into trunk and do a side-by-side comparison to confirm that the problem is with the new fetcher.

If it is with the new fetcher, then we may try to salvage Todd's work (I remember that he said that his fetcher was faster, right?).

> Fetcher2 Slow
> -------------
>
>                 Key: NUTCH-721
>                 URL: https://issues.apache.org/jira/browse/NUTCH-721
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 1.0.0
>         Environment: Fedora Core r6, Kernel 2.6.22-14,  jdk1.6.0_12
>            Reporter: Roger Dunk
>         Attachments: crawl_generate.tar.gz, nutch-site.xml
>
>
> Fetcher2 fetches far more slowly than Fetcher1.
> Config options:
> fetcher.threads.fetch = 80
> fetcher.threads.per.host = 80
> fetcher.server.delay = 0
> generate.max.per.host = 1
> With a queue size of ~40,000, the result is:
> activeThreads=80, spinWaiting=79, fetchQueues.totalSize=0
> with maybe a download of 1 page per second.
> Running with -noParse makes little difference.
> CPU load average is around 0.2. With Fetcher1 CPU load is around 2.0 - 3.0
> Hosts already cached by the local caching NS appear to download quickly upon a re-fetch, so the issue is possibly related to NS lookups. However, all things being equal, Fetcher1 runs fast without pre-caching hosts.
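
For anyone trying to reproduce this, the config options quoted above would presumably be set in nutch-site.xml roughly as follows. This is only a sketch that assumes the usual Hadoop-style property format; it is not the contents of the attached nutch-site.xml, which may set other properties as well.

```xml
<?xml version="1.0"?>
<!-- Sketch only: reconstructed from the four options listed in the issue
     description, not copied from the attached nutch-site.xml. -->
<configuration>
  <property>
    <name>fetcher.threads.fetch</name>
    <value>80</value>
  </property>
  <property>
    <name>fetcher.threads.per.host</name>
    <value>80</value>
  </property>
  <property>
    <name>fetcher.server.delay</name>
    <value>0</value>
  </property>
  <property>
    <name>generate.max.per.host</name>
    <value>1</value>
  </property>
</configuration>
```

Running both fetchers against the same generated segment with this identical configuration should keep the side-by-side comparison fair.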
