Posted to user@nutch.apache.org by kaveh minooie <ka...@plutoz.com> on 2012/02/14 23:00:57 UTC

fetcher.threads.per.queue and fetcher.server.delay

So I am trying to optimize fetch performance, and I think I am failing 
miserably, since I am not able to max out any of my resources (CPU, 
RAM, and more importantly bandwidth). Obviously I am not trying to max 
out all of them at the same time; I just want to find the bottleneck, 
and I can't see one at this point. So here are the questions:

1- How do fetcher.threads.per.queue and fetcher.server.delay affect 
each other? If threads.per.queue is 8 and server.delay is 2 seconds, 
does that mean I am going to get about 8 fetches every 2 seconds on 
each site? (I don't care about the distribution, only the maximum 
number of hits the target site will receive from my crawler; would 
that be about 8 every 2 seconds?)
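
For reference, this is how I have these two properties set in my 
conf/nutch-site.xml (a sketch with the values I described above, not 
recommendations):

```xml
<!-- conf/nutch-site.xml: the two properties in question,
     with the values from my setup -->
<property>
  <name>fetcher.threads.per.queue</name>
  <value>8</value>
</property>
<property>
  <name>fetcher.server.delay</name>
  <value>2.0</value>
</property>
```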

2. This is sort of a follow-up to the first one, but: I have 
fetcher.threads.fetch set to 16 and fetcher.threads.per.queue set to 
8, yet when I check the log files I see that most of the time 
(especially toward the end of the job, when only the sites with very 
many pages are left) 8 of the threads are always just spin-waiting. 
Wouldn't it be better to have the same value for threads.fetch and 
threads.per.queue?

3. nutch-default.xml says something about the fetcher running one map 
task per node (in the fetcher.threads.fetch description). I have 
quad-core CPUs, so I have set up Hadoop to run 4 tasks per node, with 
16 threads per map. Is there any downside to doing that? Is there an 
optimal value for this? You should also know that when I double the 
number of threads (to 32) I get timeout errors for roughly 80% of all 
the pages, WITHOUT seeing any real load on my bandwidth, as if the 
node just can't handle that many threads.
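
Just to make the arithmetic explicit, here is the per-node thread 
count I am describing (a back-of-envelope sketch, not Nutch code):

```python
# Back-of-envelope: concurrent fetch threads per node.
map_tasks_per_node = 4   # one per core on a quad-core box
threads_per_map = 16     # fetcher.threads.fetch

threads_per_node = map_tasks_per_node * threads_per_map
print(threads_per_node)  # 64 concurrent fetch threads per node

# Doubling fetcher.threads.fetch to 32 doubles that:
print(map_tasks_per_node * 32)  # 128 threads, where the timeouts start
```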


Any comment would be really appreciated, even if it's just to say that 
I am stupid. Thanks,

-- 
Kaveh Minooie

www.plutoz.com

Re: fetcher.threads.per.queue and fetcher.server.delay

Posted by Markus Jelsma <ma...@openindex.io>.
> So I am trying to optimize fetch performance, and I think I am failing
> miserably, since I am not able to max out any of my resources (CPU,
> RAM, and more importantly bandwidth). Obviously I am not trying to max
> out all of them at the same time; I just want to find the bottleneck,
> and I can't see one at this point. So here are the questions:
> 
> 1- How do fetcher.threads.per.queue and fetcher.server.delay affect
> each other? If threads.per.queue is 8 and server.delay is 2 seconds,
> does that mean I am going to get about 8 fetches every 2 seconds on
> each site? (I don't care about the distribution, only the maximum
> number of hits the target site will receive from my crawler; would
> that be about 8 every 2 seconds?)

Hmm, I don't really know what the behaviour is supposed to be with multiple 
threads per queue and a delay. In any case, multiple threads per queue is not 
advised, for politeness reasons. Also, a single thread at full speed can do A 
LOT of fetches per second.
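
To put "A LOT" in rough numbers (my own back-of-envelope; the average 
response time is an assumption and varies wildly per site):

```python
# Rough throughput of a single fetch thread, assuming an average
# response time per page of 200 ms (an assumed figure).
avg_fetch_seconds = 0.2

pages_per_second = 1 / avg_fetch_seconds
pages_per_hour = pages_per_second * 3600

print(pages_per_second)       # 5.0 pages/s from one thread
print(int(pages_per_hour))    # 18000 pages/hour from one thread
```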

> 
> 2. This is sort of a follow-up to the first one, but: I have
> fetcher.threads.fetch set to 16 and fetcher.threads.per.queue set to
> 8, yet when I check the log files I see that most of the time
> (especially toward the end of the job, when only the sites with very
> many pages are left) 8 of the threads are always just spin-waiting.
> Wouldn't it be better to have the same value for threads.fetch and
> threads.per.queue?

Check the wiki for optimizing the crawl. Fetchers that slow down at the end 
usually suffer from unevenly distributed fetch lists. You can get a better 
distribution if you don't generate too many records per queue.
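
For example, something along these lines in nutch-site.xml caps how many 
records the generator puts in one queue (a sketch; the values here are 
examples, pick ones that fit your own crawl):

```xml
<!-- Sketch: cap records generated per queue so fetch lists
     are more evenly distributed (example values only). -->
<property>
  <name>generate.max.count</name>
  <value>500</value>
</property>
<property>
  <name>generate.count.mode</name>
  <value>host</value>
</property>
```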
> 
> 3. nutch-default.xml says something about the fetcher running one map
> task per node (in the fetcher.threads.fetch description). I have
> quad-core CPUs, so I have set up Hadoop to run 4 tasks per node, with
> 16 threads per map. Is there any downside to doing that? Is there an
> optimal value for this? You should also know that when I double the
> number of threads (to 32) I get timeout errors for roughly 80% of all
> the pages, WITHOUT seeing any real load on my bandwidth, as if the
> node just can't handle that many threads.

The optimal value is hard to predict. It depends on hardware and OS settings 
(e.g. open files, timeouts, low-level network settings). Also keep in mind 
that fetching alone is not very CPU-intensive unless you're parsing during 
the fetch.
You should also check syslog, and certainly check ulimit for open files.
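
A quick way to inspect the open-file limits on each node (plain shell, 
nothing Nutch-specific; every fetch thread holds sockets, so a low limit 
shows up as timeouts):

```shell
# Soft and hard limits on open file descriptors for the current shell.
ulimit -Sn
ulimit -Hn

# On Linux, count descriptors currently held by a running fetcher task
# (<pid> is a placeholder; substitute the real task pid):
# ls /proc/<pid>/fd | wc -l
```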
> 
> 
> Any comment would be really appreciated, even if it's just to say that
> I am stupid. Thanks,
