Posted to user@nutch.apache.org by Roger Dunk <ro...@at.com.au> on 2009/02/03 04:10:13 UTC

Fetcher2 Slow

Hi all,

I'm having no luck whatsoever using Fetcher2, as even with 50 threads enabled and parsing disabled, I have 48 or 49 threads SpinWaiting, and 0 hosts in the queue. I do however have some 50,000 pages to fetch, the majority of which are from unique hosts.

The regular fetcher works as expected, fetching concurrently from 50 hosts.

I'm using the current SVN. Any thoughts?

Cheers...
Roger

Re: Fetcher2 Slow

Posted by Roger Dunk <ro...@at.com.au>.
Fetcher2 from 0.9 was renamed to Fetcher in 1.0. In both versions it runs 
more slowly for me than the original fetcher. There's no solution yet that 
I'm aware of.

Cheers...
Roger

--------------------------------------------------
From: "askNutch" <he...@126.com>
Sent: Wednesday, May 06, 2009 11:28 AM
To: <nu...@lucene.apache.org>
Subject: Re: Fetcher2 Slow

>
> Hi Roger,
> Are you using the new Fetcher in 1.0 or Fetcher2 in 0.9? Which one is very
> slow? And has the problem been solved?
> Thanks!
>
>
>
> -- 
> View this message in context: 
> http://www.nabble.com/Fetcher2-Slow-tp21803185p23398585.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
> 

Re: Fetcher2 Slow

Posted by Raymond Balmès <ra...@gmail.com>.
I'm not sure whether that problem is solved. I have it too and reported it in
a previous thread: extremely fast fetching at the beginning, then damn slow
fetching after a while.

Today I started a recrawl using the Wiki script, and I have had a steady fetch
speed for about an hour, so it looks good. The only difference this time may be
that I did NOT use the TopN option. I hope this helps in finding out what is
going wrong.

-Ray-


2009/5/6 askNutch <he...@126.com>

>
> Hi Roger,
> Are you using the new Fetcher in 1.0 or Fetcher2 in 0.9? Which one is very
> slow? And has the problem been solved?
> Thanks!
>
>
>
>
>

Re: Fetcher2 Slow

Posted by askNutch <he...@126.com>.
Hi Roger,
Are you using the new Fetcher in 1.0 or Fetcher2 in 0.9? Which one is very
slow? And has the problem been solved?
Thanks!





Re: Fetcher2 Slow

Posted by Roger Dunk <ro...@at.com.au>.
Hi Sami,

The machine has direct connectivity (no NAT) and is not running IPv6.

Cheers...
Roger

--------------------------------------------------
From: "Sami Siren" <ss...@gmail.com>
Sent: Monday, March 30, 2009 5:42 PM
To: <nu...@lucene.apache.org>
Subject: Re: Fetcher2 Slow

> Roger Dunk wrote:
>> Andrzej stated in NUTCH-669 that "some people reported performance issues 
>> with Fetcher2, i.e. that it doesn't use the available bandwidth. These 
>> reports are unconfirmed, and they may have been caused by suboptimal URL 
>> / host distribution in a fetchlist - but it would be good to review the 
>> synchronization and threading aspects of Fetcher2."
>>
>> To address this, I've tried just now generating a fetchlist using 
>> generate.max.per.host = 1 (which gave me 35,000 unique hosts) to 
>> guarantee unique hosts, but the problem still remains.
>>
>> Therefore, I believe it's clearly not an issue of suboptimal URL / host 
>> distribution. If you require any further information to confirm my 
>> report, you need only ask!
>
>
> I have so far seen two sources of slowness; I don't know if they are
> related to your case:
>
> 1. You are using Nutch from behind a NAT box. I experienced this problem
> when I did some test crawling from a machine sitting behind an ADSL router
> that did NAT. Soon after starting a crawl, the maximum number of NAT
> connections was reached in the router, and further connections could only
> be made after old ones timed out of the NAT table. These connections were
> mostly DNS connections.
>
> 2. Your machine has IPv6 enabled. I noticed this more recently when I was
> wondering about the relatively slow fetching speed on a box. After
> disabling IPv6 completely I was able to fetch 2-4 times faster without any
> other config changes.
>
> --
>  Sami Siren
> 

Re: Fetcher2 Slow

Posted by Sami Siren <ss...@gmail.com>.
Roger Dunk wrote:
> Andrzej stated in NUTCH-669 that "some people reported performance 
> issues with Fetcher2, i.e. that it doesn't use the available bandwidth. 
> These reports are unconfirmed, and they may have been caused by 
> suboptimal URL / host distribution in a fetchlist - but it would be good 
> to review the synchronization and threading aspects of Fetcher2."
> 
> To address this, I've tried just now generating a fetchlist using 
> generate.max.per.host = 1 (which gave me 35,000 unique hosts) to 
> guarantee unique hosts, but the problem still remains.
> 
> Therefore, I believe it's clearly not an issue of suboptimal URL / host 
> distribution. If you require any further information to confirm my 
> report, you need only ask!


I have so far seen two sources of slowness; I don't know if they are
related to your case:

1. You are using Nutch from behind a NAT box. I experienced this problem
when I did some test crawling from a machine sitting behind an ADSL router
that did NAT. Soon after starting a crawl, the maximum number of NAT
connections was reached in the router, and further connections could only
be made after old ones timed out of the NAT table. These connections were
mostly DNS connections.

2. Your machine has IPv6 enabled. I noticed this more recently when I was
wondering about the relatively slow fetching speed on a box. After
disabling IPv6 completely I was able to fetch 2-4 times faster without any
other config changes.
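
For the IPv6 point, a quick way to see whether the hosts in a fetchlist resolve to IPv6 addresses at all from the crawl machine is sketched below. This is only an illustration and not part of Nutch: `address_families` is a hypothetical helper, and the `resolver` argument exists only so the function can be tried without a network. On the JVM side, the standard networking property `java.net.preferIPv4Stack=true` is another way to prefer IPv4, though whether it helps in this case is untested here.

```python
import socket

def address_families(host, resolver=socket.getaddrinfo):
    """Return the set of address-family names ("IPv4", "IPv6") a host resolves to."""
    names = {socket.AF_INET: "IPv4", socket.AF_INET6: "IPv6"}
    families = set()
    # getaddrinfo returns 5-tuples: (family, type, proto, canonname, sockaddr)
    for family, _type, _proto, _canon, _addr in resolver(host, None):
        if family in names:
            families.add(names[family])
    return families
```

If a sample of hosts comes back with "IPv6" in the set while the machine has no working IPv6 route, that would be consistent with the slowdown described in point 2.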

--
  Sami Siren


Re: Fetcher2 Slow

Posted by Andrzej Bialecki <ab...@getopt.org>.
Roger Dunk wrote:
> Andrzej stated in NUTCH-669 that "some people reported performance 
> issues with Fetcher2, i.e. that it doesn't use the available bandwidth. 
> These reports are unconfirmed, and they may have been caused by 
> suboptimal URL / host distribution in a fetchlist - but it would be good 
> to review the synchronization and threading aspects of Fetcher2."
> 
> To address this, I've tried just now generating a fetchlist using 
> generate.max.per.host = 1 (which gave me 35,000 unique hosts) to 
> guarantee unique hosts, but the problem still remains.
> 
> Therefore, I believe it's clearly not an issue of suboptimal URL / host 
> distribution. If you require any further information to confirm my 
> report, you need only ask!

Thanks for reporting this. Yes, we need more information - it's best if 
you create a JIRA issue, because then it will be easier to send attachments.

What we need at this moment is:

* the fetchlist - just zip the crawl_generate and attach it.
* nutch-site.xml and hadoop-site.xml (if you run in a distributed mode).
* cmd-line parameters, specifically the number of threads and -noParsing
* information about your environment (OS, cpu/mem, heapsize, JVM version).


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Fetcher2 Slow

Posted by Roger Dunk <ro...@at.com.au>.
Andrzej stated in NUTCH-669 that "some people reported performance issues 
with Fetcher2, i.e. that it doesn't use the available bandwidth. These 
reports are unconfirmed, and they may have been caused by suboptimal URL / 
host distribution in a fetchlist - but it would be good to review the 
synchronization and threading aspects of Fetcher2."

To address this, I've tried just now generating a fetchlist using 
generate.max.per.host = 1 (which gave me 35,000 unique hosts) to guarantee 
unique hosts, but the problem still remains.

Therefore, I believe it's clearly not an issue of suboptimal URL / host 
distribution. If you require any further information to confirm my report, 
you need only ask!
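
For anyone who wants to double-check the host distribution of their own fetchlist, here is a minimal sketch. It assumes the URLs have already been dumped to a plain list, one per entry (how you get them out of crawl_generate is left out), and `urls_per_host` is a hypothetical helper name, not a Nutch tool.

```python
from collections import Counter
from urllib.parse import urlsplit

def urls_per_host(urls):
    """Count how many URLs fall on each host (hostnames are lower-cased)."""
    counts = Counter()
    for url in urls:
        host = urlsplit(url).hostname
        if host:  # skip entries that are not absolute URLs
            counts[host] += 1
    return counts
```

With generate.max.per.host = 1 every count should be 1, and `counts.most_common(10)` shows quickly whether a few hosts dominate the list.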

Cheers...
Roger

--------------------------------------------------
From: "Roger Dunk" <ro...@at.com.au>
Sent: Tuesday, March 17, 2009 7:10 PM
To: <nu...@lucene.apache.org>
Subject: Re: Fetcher2 Slow

> Now that the soon to be released v1 uses Fetcher2 as default (or as the 
> only fetcher available?), I would think that this slowness problem that is 
> facing a number of users might be addressed?
>
> In short the case for me is like this:
>
> Nutch trunk revision 755143
> JDK 1.6_12 on Linux
>
> Crawl list consists of ~40,000 URLs from dmoz, so naturally are well 
> distributed among hosts (i.e. mostly unique hosts).
>
> Config options:
> fetcher.threads.fetch = 80
> fetcher.threads.per.host = 80
> fetcher.server.delay = 0
>
> The result?
>
> Most of the time, something like this:
>
> activeThreads=80, spinWaiting=79, fetchQueues.totalSize=0
>
> If I'm lucky, it might fetch around 1 page per second (or less).
>
> What I have noticed is that if I let it run for a while, cancel the fetch, 
> and start it again from the beginning, it runs very quickly for a while 
> before it slows right down to a trickle again. My guess is that the hosts 
> that have been cached by my caching NS are fetched quickly, but new lookups are 
> taking an age and slowing things down. However, I don't believe my NS is 
> slow by any means. And furthermore, the old Fetcher1 never had this 
> problem.
>
> Any ideas where to look to track this down?
>
> Thanks,
> Roger
>
> --------------------------------------------------
> From: "Roger Dunk" <ro...@at.com.au>
> Sent: Thursday, February 05, 2009 2:16 PM
> To: <nu...@lucene.apache.org>
> Subject: Re: Fetcher2 Slow
>
>> It makes no difference if I set fetcher.threads.per.host to 1 or 100, 
>> which I assume is what you were suggesting?
>>
>> I also stated that the majority of pages to fetch were from unique hosts, 
>> so I believe the value of this parameter should not really come into 
>> play.
>>
>> Cheers...
>> Roger
>>
>> --------------------------------------------------
>> From: "Laurent Laborde" <ke...@gmail.com>
>> Sent: Tuesday, February 03, 2009 5:51 PM
>> To: <nu...@lucene.apache.org>
>> Subject: Re: Fetcher2 Slow
>>
>>> On Tue, Feb 3, 2009 at 4:10 AM, Roger Dunk <ro...@at.com.au> wrote:
>>>> Hi all,
>>>>
>>>> I'm having no luck whatsoever using Fetcher2, as even with 50 threads 
>>>> enabled and parsing disabled, I have 48 or 49 threads SpinWaiting, and 
>>>> 0 hosts in the queue. I do however have some 50,000 pages to fetch, the 
>>>> majority of which are from unique hosts.
>>>>
>>>> The regular fetcher works as expected, fetching concurrently from 50 
>>>> hosts.
>>>
>>> There is a configuration parameter limiting the number of concurrent
>>> fetcher threads per unique host.
>>>
>>> -- 
>>> F4FQM
>>> Kerunix Flan
>>> Laurent Laborde
>> 

Re: Fetcher2 Slow

Posted by Roger Dunk <ro...@at.com.au>.
Now that the soon to be released v1 uses Fetcher2 as default (or as the only 
fetcher available?), I would think that this slowness problem that is facing 
a number of users might be addressed?

In short the case for me is like this:

Nutch trunk revision 755143
JDK 1.6_12 on Linux

Crawl list consists of ~40,000 URLs from dmoz, so naturally are well 
distributed among hosts (i.e. mostly unique hosts).

Config options:
fetcher.threads.fetch = 80
fetcher.threads.per.host = 80
fetcher.server.delay = 0

The result?

Most of the time, something like this:

activeThreads=80, spinWaiting=79, fetchQueues.totalSize=0

If I'm lucky, it might fetch around 1 page per second (or less).
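
To follow these numbers over a whole run rather than eyeballing the log, the status line can be parsed with a few lines of Python. This is a sketch that assumes the line format is exactly as quoted above; `parse_status` is a hypothetical helper, and treating "active minus spin-waiting" as the threads doing real work is my reading of the counters, not something confirmed from the Fetcher2 source.

```python
import re

# Matches the three counters in a status line of the form quoted above.
STATUS_RE = re.compile(
    r"activeThreads=(\d+),\s*spinWaiting=(\d+),\s*fetchQueues\.totalSize=(\d+)"
)

def parse_status(line):
    """Return a dict of the counters, or None if the line does not match."""
    m = STATUS_RE.search(line)
    if m is None:
        return None
    active, spinning, queued = (int(g) for g in m.groups())
    return {
        "active": active,
        "spin_waiting": spinning,
        "queued": queued,
        # Assumption: threads that are active but not spin-waiting are working.
        "working": active - spinning,
    }
```

For the line above this gives one working thread out of 80, with nothing queued.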

What I have noticed is that if I let it run for a while, cancel the fetch, 
and start it again from the beginning, it runs very quickly for a while 
before it slows right down to a trickle again. My guess is that the hosts 
that have been cached by my caching NS are fetched quickly, but new lookups are 
taking an age and slowing things down. However, I don't believe my NS is 
slow by any means. And furthermore, the old Fetcher1 never had this problem.
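
One way to test the DNS guess outside of Nutch is to time name lookups for a sample of hosts from the fetchlist, once cold and once again straight afterwards. The sketch below uses only the Python standard library; `time_lookups` and `slowest` are hypothetical helper names, and the `resolver` argument is there only so the function can be exercised without a network.

```python
import socket
import time

def time_lookups(hosts, resolver=socket.gethostbyname):
    """Resolve each host once and return (host, seconds, ok) tuples."""
    results = []
    for host in hosts:
        start = time.perf_counter()
        try:
            resolver(host)
            ok = True
        except OSError:
            # socket.gaierror (failed lookup) is a subclass of OSError
            ok = False
        results.append((host, time.perf_counter() - start, ok))
    return results

def slowest(results, n=10):
    """Return the n slowest lookups, slowest first."""
    return sorted(results, key=lambda r: r[1], reverse=True)[:n]
```

If the first pass is slow and the second pass over the same hosts is fast, that would support the caching explanation; if both are fast, the name server is probably not the bottleneck.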

Any ideas where to look to track this down?

Thanks,
Roger

--------------------------------------------------
From: "Roger Dunk" <ro...@at.com.au>
Sent: Thursday, February 05, 2009 2:16 PM
To: <nu...@lucene.apache.org>
Subject: Re: Fetcher2 Slow

> It makes no difference if I set fetcher.threads.per.host to 1 or 100, 
> which I assume is what you were suggesting?
>
> I also stated that the majority of pages to fetch were from unique hosts, 
> so I believe the value of this parameter should not really come into play.
>
> Cheers...
> Roger
>
> --------------------------------------------------
> From: "Laurent Laborde" <ke...@gmail.com>
> Sent: Tuesday, February 03, 2009 5:51 PM
> To: <nu...@lucene.apache.org>
> Subject: Re: Fetcher2 Slow
>
>> On Tue, Feb 3, 2009 at 4:10 AM, Roger Dunk <ro...@at.com.au> wrote:
>>> Hi all,
>>>
>>> I'm having no luck whatsoever using Fetcher2, as even with 50 threads 
>>> enabled and parsing disabled, I have 48 or 49 threads SpinWaiting, and 0 
>>> hosts in the queue. I do however have some 50,000 pages to fetch, the 
>>> majority of which are from unique hosts.
>>>
>>> The regular fetcher works as expected, fetching concurrently from 50 
>>> hosts.
>>
>> There is a configuration parameter limiting the number of concurrent
>> fetcher threads per unique host.
>>
>> -- 
>> F4FQM
>> Kerunix Flan
>> Laurent Laborde
> 

Re: Fetcher2 Slow

Posted by Roger Dunk <ro...@at.com.au>.
It makes no difference if I set fetcher.threads.per.host to 1 or 100, which 
I assume is what you were suggesting?

I also stated that the majority of pages to fetch were from unique hosts, so 
I believe the value of this parameter should not really come into play.
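
To make that argument concrete, here is a deliberately simplified model of the per-host cap. It is not how Nutch actually schedules fetches; `max_parallel_fetches` is a hypothetical function that only computes an upper bound on simultaneous fetches, ignoring delays and DNS.

```python
def max_parallel_fetches(urls_per_host, threads, threads_per_host):
    """Upper bound on simultaneous fetches under a per-host thread cap.

    Simplified model: each host can occupy at most
    min(its queued URLs, threads_per_host) threads at once.
    """
    per_host_total = sum(min(n, threads_per_host) for n in urls_per_host)
    return min(threads, per_host_total)
```

With 50,000 hosts of one URL each and 50 threads, the bound is 50 whether the per-host value is 1 or 100, whereas a single host holding all 50,000 URLs with a per-host value of 1 would be limited to one fetch at a time.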

Cheers...
Roger

--------------------------------------------------
From: "Laurent Laborde" <ke...@gmail.com>
Sent: Tuesday, February 03, 2009 5:51 PM
To: <nu...@lucene.apache.org>
Subject: Re: Fetcher2 Slow

> On Tue, Feb 3, 2009 at 4:10 AM, Roger Dunk <ro...@at.com.au> wrote:
>> Hi all,
>>
>> I'm having no luck whatsoever using Fetcher2, as even with 50 threads 
>> enabled and parsing disabled, I have 48 or 49 threads SpinWaiting, and 0 
>> hosts in the queue. I do however have some 50,000 pages to fetch, the 
>> majority of which are from unique hosts.
>>
>> The regular fetcher works as expected, fetching concurrently from 50 
>> hosts.
>
> There is a configuration parameter limiting the number of concurrent
> fetcher threads per unique host.
>
> -- 
> F4FQM
> Kerunix Flan
> Laurent Laborde 


Re: Fetcher2 Slow

Posted by Laurent Laborde <ke...@gmail.com>.
On Tue, Feb 3, 2009 at 4:10 AM, Roger Dunk <ro...@at.com.au> wrote:
> Hi all,
>
> I'm having no luck whatsoever using Fetcher2, as even with 50 threads enabled and parsing disabled, I have 48 or 49 threads SpinWaiting, and 0 hosts in the queue. I do however have some 50,000 pages to fetch, the majority of which are from unique hosts.
>
> The regular fetcher works as expected, fetching concurrently from 50 hosts.

There is a configuration parameter limiting the number of concurrent
fetcher threads per unique host.

-- 
F4FQM
Kerunix Flan
Laurent Laborde