Posted to user@nutch.apache.org by kaveh minooie <ka...@plutoz.com> on 2012/02/14 23:00:57 UTC
fetcher.threads.per.queue and fetcher.server.delay
So I am trying to optimize fetch performance, and I think I am failing
miserably, since I am not able to max out any of my resources (CPU,
RAM, and more importantly bandwidth). Obviously I am not trying to max
them all out at the same time; I just want to find the bottleneck, and
I can't see one at this point. So here are my questions:
1. How do fetcher.threads.per.queue and fetcher.server.delay affect
each other? If threads.per.queue is 8 and server.delay is 2 seconds,
does that mean I will get about 8 fetches every 2 seconds on each
site? (I don't care about the distribution, only the maximum number of
hits the target site will receive from my crawler. Would that be about
8 every 2 seconds?)
2. This is sort of a follow-up to the first one. I have my
fetcher.threads.fetch set to 16 and fetcher.threads.per.queue set to 8,
but when I check the log files I see that most of the time (especially
toward the end of the job, when only sites with very many pages are
left) 8 of the threads are always just spin-waiting. Wouldn't it be
better to use the same value for threads.fetch and threads.per.queue?
3. nutch-default says something about the fetcher running one map task
per node (in the fetcher.threads.fetch description). I have quad-core
CPUs, so I have set up Hadoop to run 4 tasks per node, with 16 threads
per map. Is there any downside to doing that? Is there an optimal value
for this? You should also know that when I double the number of
threads (to 32) I get timeout errors for roughly 80% of all pages
WITHOUT seeing any real load on my bandwidth, as if the node just can't
handle that many threads.
Any comment would be really appreciated, even if it's just to say that
I am being stupid. Thanks,
--
Kaveh Minooie
www.plutoz.com
Re: fetcher.threads.per.queue and fetcher.server.delay
Posted by Markus Jelsma <ma...@openindex.io>.
> So I am trying to optimize fetch performance, and I think I am failing
> miserably, since I am not able to max out any of my resources (CPU,
> RAM, and more importantly bandwidth). Obviously I am not trying to max
> them all out at the same time; I just want to find the bottleneck, and
> I can't see one at this point. So here are my questions:
>
> 1. How do fetcher.threads.per.queue and fetcher.server.delay affect
> each other? If threads.per.queue is 8 and server.delay is 2 seconds,
> does that mean I will get about 8 fetches every 2 seconds on each
> site? (I don't care about the distribution, only the maximum number of
> hits the target site will receive from my crawler. Would that be about
> 8 every 2 seconds?)
Hmm, I don't really know how the behaviour is supposed to be with multiple
threads per queue and a delay. In any case, multiple threads per queue is not
advised, for politeness reasons. Also, a single thread at full speed can do A
LOT of fetches per second.
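For what it's worth, the property descriptions in nutch-default.xml suggest that fetcher.server.delay only applies when a single thread per queue is used, and that fetcher.server.min.delay takes over once fetcher.threads.per.queue is greater than 1 — worth double-checking against your version. Under that reading, a rough upper bound on the hit rate one host sees can be sketched like this (a back-of-envelope model, not Nutch's actual scheduling code):

```python
# Back-of-envelope model (NOT Nutch source): upper bound on the request
# rate one host sees from a single fetch queue. Assumes each of the
# queue's fetch slots must wait `delay_s` seconds between requests.

def max_hits_per_second(threads_per_queue: int, delay_s: float) -> float:
    """Maximum requests/second against one host: each of the
    `threads_per_queue` slots can issue at most one request per
    `delay_s`-second window."""
    if threads_per_queue < 1 or delay_s <= 0:
        raise ValueError("need at least 1 thread and a positive delay")
    return threads_per_queue / delay_s

# The scenario from the question: 8 threads per queue, 2-second delay.
# That works out to at most ~8 requests per 2-second window.
rate = max_hits_per_second(8, 2.0)
print(f"upper bound: {rate} requests/second")
```

So yes, under these assumptions the asker's reading (about 8 hits per 2 seconds per site) is the worst case the target host would see — which is exactly why multiple threads per queue is discouraged for politeness.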
>
> 2. This is sort of a follow-up to the first one. I have my
> fetcher.threads.fetch set to 16 and fetcher.threads.per.queue set to 8,
> but when I check the log files I see that most of the time (especially
> toward the end of the job, when only sites with very many pages are
> left) 8 of the threads are always just spin-waiting. Wouldn't it be
> better to use the same value for threads.fetch and threads.per.queue?
Check the wiki for optimizing the crawl. Fetchers that slow down at the end
are due to unevenly distributed fetch lists. You can get better distribution
if you don't generate too many records per queue.
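The knobs for limiting records per queue are generate.max.count and generate.count.mode, set in nutch-site.xml. The cap of 100 below is only an illustrative starting point, not a recommendation:

```xml
<!-- nutch-site.xml: cap the records generated per host so fetch lists
     are spread more evenly and the tail of the job does not sit waiting
     on a few huge sites. The value 100 is an example only. -->
<property>
  <name>generate.count.mode</name>
  <value>host</value>
</property>
<property>
  <name>generate.max.count</name>
  <value>100</value>
</property>
```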
>
> 3. nutch-default says something about the fetcher running one map task
> per node (in the fetcher.threads.fetch description). I have quad-core
> CPUs, so I have set up Hadoop to run 4 tasks per node, with 16 threads
> per map. Is there any downside to doing that? Is there an optimal value
> for this? You should also know that when I double the number of
> threads (to 32) I get timeout errors for roughly 80% of all pages
> WITHOUT seeing any real load on my bandwidth, as if the node just can't
> handle that many threads.
The optimal value is hard to predict. It depends on hardware and OS settings
(e.g. open files, timeouts, low-level network settings). Also keep in mind
that fetching alone is not very CPU-intensive unless you're parsing during
the fetch.
You also need to check syslog, and certainly check ulimit for open files.
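A quick way to see the open-file limit a fetcher task actually runs under (a sketch; on Linux this mirrors `ulimit -n` for the current process):

```python
# Check the open-file limit (the `ulimit -n` value) from Python. A low
# soft limit is a common cause of mass timeouts when running many fetch
# threads: each thread holds sockets on top of Hadoop's own file handles.
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open files: soft={soft}, hard={hard}")
```

If the soft limit is still a distro default like 1024, 4 map tasks at 32 threads each can plausibly exhaust it, which would match the "80% timeouts with no bandwidth load" symptom.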
>
>
> Any comment would be really appreciated, even if it's just to say that
> I am being stupid. Thanks,