Posted to user@nutch.apache.org by caezar <ca...@gmail.com> on 2009/06/25 16:04:14 UTC

Nutch fetch performance

Hi All,

I have 15 machines in a Hadoop farm. While fetching, I get about 10 pages/s
(4000 kb/s) per machine, which seems very slow. I've set mapred.map.tasks
and mapred.reduce.tasks to 15. Is this correct? The HTTP timeout is 5 seconds,
max retries is 2, with 0.5 seconds between retries. fetcher.threads.fetch is
300. How can I tweak the performance? What other options may affect
performance? Should I provide any other information so you can help me?

Thanks
-- 
View this message in context: http://www.nabble.com/Nutch-fetch-performance-tp24203861p24203861.html
Sent from the Nutch - User mailing list archive at Nabble.com.


Re: Nutch fetch performance

Posted by Ken Krugler <kk...@transpac.com>.
>Ken Krugler wrote:
>>
>>  If this is http.timeout, that's the length of time an HTTP request
>>  will wait for a response before timing out. Which hopefully doesn't
>>  happen very often for you.
>>
>Yes, that's it.
>
>Ken Krugler wrote:
>>
>>  "Delay between retries" - what property name is this? Is it
>>  fetcher.server.delay? By default this is set to 5.0, so did you crank
>>  it down?
>>
>Yes, that's it.

And so you adjusted this from 5.0 to 0.5?

>Ken Krugler wrote:
>>
>>    -activeThreads=1000, spinWaiting=1000, fetchQueues.totalSize=2526
>>    - fetching http://home.swipnet.se/~w-147200/
>>
>What is spinWaiting number here?

Number of threads that can't do any work, because there are no 
available URLs to fetch (due to constraints from various settings).
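
As a rough illustration of reading that status line, here is a minimal sketch that pulls the three counters out of a log entry and reports what share of the threads are spin-waiting. The regular expression is written only against the sample line quoted in this thread; other Nutch versions may log a different format, and the function names are made up for the example.

```python
import re

# Pattern matched against the status line format quoted in this thread.
# Assumption: other Nutch versions may format this line differently.
STATUS_RE = re.compile(
    r"activeThreads=(\d+),\s*spinWaiting=(\d+),\s*fetchQueues\.totalSize=(\d+)"
)


def parse_status(line):
    """Return (active, spin_waiting, queued) or None if the line does not match."""
    match = STATUS_RE.search(line)
    if match is None:
        return None
    return tuple(int(group) for group in match.groups())


def idle_fraction(line):
    """Share of active threads that are spin-waiting, or None if unknown."""
    parsed = parse_status(line)
    if parsed is None or parsed[0] == 0:
        return None
    active, spinning, _queued = parsed
    return spinning / active


sample = "-activeThreads=1000, spinWaiting=1000, fetchQueues.totalSize=2526"
print(parse_status(sample))
print(idle_fraction(sample))
```

In the sample line every one of the 1000 threads is spin-waiting, so adding more threads would not be expected to raise the fetch rate.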

-- Ken
-- 
Ken Krugler
+1 530-210-6378

Re: Nutch fetch performance

Posted by caezar <ca...@gmail.com>.

Ken Krugler wrote:
> 
> If this is http.timeout, that's the length of time an HTTP request 
> will wait for a response before timing out. Which hopefully doesn't 
> happen very often for you.
> 
Yes, that's it.

Ken Krugler wrote:
> 
> "Delay between retries" - what property name is this? Is it 
> fetcher.server.delay? By default this is set to 5.0, so did you crank 
> it down?
> 
Yes, that's it.

Ken Krugler wrote:
> 
>   -activeThreads=1000, spinWaiting=1000, fetchQueues.totalSize=2526
>   - fetching http://home.swipnet.se/~w-147200/
> 
What is spinWaiting number here?




Re: Nutch fetch performance

Posted by Ken Krugler <kk...@transpac.com>.
>I don't think that they are waiting for 30 seconds. Http timeout is 5
>seconds.

If this is http.timeout, that's the length of time an HTTP request 
will wait for a response before timing out. Which hopefully doesn't 
happen very often for you.

>Number of retries is 2. Delay between retries is 0.5 seconds.

"Delay between retries" - what property name is this? Is it 
fetcher.server.delay? By default this is set to 5.0, so did you crank 
it down?

>So it
>should be at most 10.5 seconds.
>However, is there a way to see the number of active fetcher threads?

Check the log files. You should see entries like:

  -activeThreads=1000, spinWaiting=1000, fetchQueues.totalSize=2526
  - fetching http://home.swipnet.se/~w-147200/

-- Ken


>Ken Krugler wrote:
>>
>>  The real question is how many active fetches you have running
>>  simultaneously. If most fetcher threads are idle, waiting for 30
>>  seconds to pass before fetching the next page from a host, then 10
>>  pages/second might be an expected fetch rate.
>>



Re: Nutch fetch performance

Posted by Otis Gospodnetic <og...@yahoo.com>.
I remember seeing those in the logs, but it's been a while.

 Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



----- Original Message ----
> From: caezar <ca...@gmail.com>
> To: nutch-user@lucene.apache.org
> Sent: Friday, June 26, 2009 3:50:39 AM
> Subject: Re: Nutch fetch performance
> 
> 
> I don't think that they are waiting for 30 seconds. The HTTP timeout is 5
> seconds. The number of retries is 2. The delay between retries is 0.5
> seconds. So it should be at most 10.5 seconds.
> However, is there a way to see the number of active fetcher threads?
> 
> Ken Krugler wrote:
> > 
> > The real question is how many active fetches you have running 
> > simultaneously. If most fetcher threads are idle, waiting for 30 
> > seconds to pass before fetching the next page from a host, then 10 
> > pages/second might be an expected fetch rate.
> > 


Re: Nutch fetch performance

Posted by caezar <ca...@gmail.com>.
I don't think that they are waiting for 30 seconds. The HTTP timeout is 5
seconds. The number of retries is 2. The delay between retries is 0.5
seconds. So it should be at most 10.5 seconds.
However, is there a way to see the number of active fetcher threads?

Ken Krugler wrote:
> 
> The real question is how many active fetches you have running 
> simultaneously. If most fetcher threads are idle, waiting for 30 
> seconds to pass before fetching the next page from a host, then 10 
> pages/second might be an expected fetch rate.
> 



Re: Nutch fetch performance

Posted by Ken Krugler <kk...@transpac.com>.
>I've got 100,000 URLs on 13,575 hosts. Is this the case you are talking about?
>Is the fetch speed shown in the first post OK? (The 4000 kb/s figure is in kilobits.)

The real question is how many active fetches you have running 
simultaneously. If most fetcher threads are idle, waiting for 30 
seconds to pass before fetching the next page from a host, then 10 
pages/second might be an expected fetch rate.
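
To make that relationship concrete, here is a back-of-the-envelope sketch. It assumes a polite crawler sends one request at a time per host and waits a fixed delay between requests to the same host, and it ignores download time. The figure of 300 hosts is hypothetical, chosen only to show how a 30-second delay would line up with 10 pages/second.

```python
def max_polite_rate(hosts_with_urls, delay_seconds):
    """Upper bound on pages/second when each host is hit once per delay."""
    return hosts_with_urls / delay_seconds


def hosts_needed(target_pages_per_second, delay_seconds):
    """Hosts that must have queued URLs at the same time to reach the target rate."""
    return target_pages_per_second * delay_seconds


# Hypothetical: 300 hosts still have URLs queued, 30 s between requests per host.
print(max_polite_rate(300, 30.0))
print(hosts_needed(10, 30.0))
```

Under these assumptions the ceiling depends only on the number of hosts with work left and the per-host delay, not on the thread count.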

At 7:04 am -0700 6/25/09, caezar wrote:
>I have 15 machines in a Hadoop farm. While fetching, I get about 10 pages/s
>(4000 kb/s) per machine, which seems very slow. I've set mapred.map.tasks
>and mapred.reduce.tasks to 15. Is this correct? The HTTP timeout is 5 seconds,
>max retries is 2, with 0.5 seconds between retries. fetcher.threads.fetch is
>300. How can I tweak the performance? What other options may affect
>performance? Should I provide any other information so you can help me?

I'm not completely sure about how Nutch uses the 
fetcher.threads.fetch value, but I believe it's per map task. If so, 
you've got 15 * 15 * 300 = 67K+ fetchers. With only 13575 hosts 
that's way too high.
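
The arithmetic above, spelled out as a small sketch. It follows the same reading as the estimate: 15 machines, 15 map tasks on each, and fetcher.threads.fetch applied per map task. If the 15 map tasks are a cluster-wide total rather than per machine, the figure would be lower. The function name is made up for the example.

```python
def total_fetcher_threads(machines, map_tasks_per_machine, threads_per_task):
    """Total fetcher threads if the thread setting applies to every map task."""
    return machines * map_tasks_per_machine * threads_per_task


hosts = 13575
# Assumption: 15 map tasks on each of the 15 machines, 300 threads per task.
threads = total_fetcher_threads(15, 15, 300)
print(threads)                    # 67500
print(round(threads / hosts, 2))  # threads per host, about 4.97
```

With roughly five threads per host and only one polite request allowed per host at a time, most threads would have nothing to do.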

If you have a typical distribution of URLs/host, this is an 
exponential curve - so a few hosts (e.g. < 1000) will have the bulk 
of the URLs, and the remainder will have only a few each.

If that's the case, then you'd probably get near-equivalent 
performance from a single server setup - the 15 machines aren't 
buying you much.
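
One way to see why extra machines stop helping: with a fixed per-host delay, the busiest host sets a floor on the total crawl time however many threads are available. The sketch below assumes one request at a time per host and ignores download time; the host names and URL counts are invented for illustration.

```python
def min_crawl_seconds(urls_per_host, delay_seconds):
    """Lower bound on crawl time: the busiest host is fetched one URL at a time."""
    if not urls_per_host:
        return 0.0
    busiest = max(urls_per_host.values())
    # n URLs on one host need at least n - 1 waits between requests.
    return (busiest - 1) * delay_seconds


# Hypothetical skewed distribution: one large host, a long tail of small ones.
counts = {"big.example.com": 5000, "mid.example.org": 200, "small.example.net": 3}
print(min_crawl_seconds(counts, 5.0))  # 24995.0
```

Here the small hosts finish almost immediately, and the crawl then runs at the pace of the one large host.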

BTW, typically I set the number of mappers to the number of cores (on 
a server) * 5, and the number of reducers == the number of cores. Oh, 
and the number of threads to 200/# of mappers. But treat that as a 
random data point.
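
That rule of thumb, written out as a sketch. Reading "200/# of mappers" as the thread count per mapper is an interpretation, and the integer rounding and the floor of one thread are additions for the example.

```python
def rule_of_thumb(cores):
    """Return (mappers, reducers, threads_per_mapper) for a server with `cores` cores."""
    mappers = cores * 5
    reducers = cores
    # Assumption: about 200 fetcher threads in total, split across the mappers.
    threads_per_mapper = max(1, 200 // mappers)
    return mappers, reducers, threads_per_mapper


print(rule_of_thumb(4))  # (20, 4, 10)
```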

-- Ken


>Ken Krugler wrote:
>>
>>  See the previous discussion about how having relatively few unique
>>  domains can significantly limit the polite crawl rate. I'd also
>>  posted something along these lines at:


Re: Nutch fetch performance

Posted by caezar <ca...@gmail.com>.
I've got 100,000 URLs on 13,575 hosts. Is this the case you are talking about?
Is the fetch speed shown in the first post OK? (The 4000 kb/s figure is in kilobits.)

Ken Krugler wrote:
> 
> See the previous discussion about how having relatively few unique 
> domains can significantly limit the polite crawl rate. I'd also 
> posted something along these lines at:
> 



Re: Nutch fetch performance

Posted by Ken Krugler <kk...@transpac.com>.
>Besides, it seems that the number of fetcher threads does not affect anything.
>I get the same result for 20 and for 1000 threads.

See the previous discussion about how having relatively few unique 
domains can significantly limit the polite crawl rate. I'd also 
posted something along these lines at:

http://ken-blog.krugler.org/2009/05/19/performance-problems-with-verticalfocused-web-crawling/

-- Ken


>caezar wrote:
>>
>>  Hi All,
>>
>>  I have 15 machines in a Hadoop farm. While fetching, I get about 10
>>  pages/s (4000 kb/s) per machine, which seems very slow. I've set
>>  mapred.map.tasks and mapred.reduce.tasks to 15. Is this correct? The HTTP
>>  timeout is 5 seconds, max retries is 2, with 0.5 seconds between retries.
>>  fetcher.threads.fetch is 300. How can I tweak the performance? What other
>>  options may affect performance? Should I provide any other information
>>  so you can help me?


Re: Nutch fetch performance

Posted by caezar <ca...@gmail.com>.
Besides, it seems that the number of fetcher threads does not affect anything.
I get the same result for 20 and for 1000 threads.

caezar wrote:
> 
> Hi All,
> 
> I have 15 machines in a Hadoop farm. While fetching, I get about 10
> pages/s (4000 kb/s) per machine, which seems very slow. I've set
> mapred.map.tasks and mapred.reduce.tasks to 15. Is this correct? The HTTP
> timeout is 5 seconds, max retries is 2, with 0.5 seconds between retries.
> fetcher.threads.fetch is 300. How can I tweak the performance? What other
> options may affect performance? Should I provide any other information
> so you can help me?
> 
> Thanks
> 
