Posted to user@nutch.apache.org by Zaheed Haque <za...@gmail.com> on 2006/07/10 11:42:17 UTC

Re: .8 svn - fetcher performance..

On 6/28/06, Ken Krugler <kk...@transpac.com> wrote:
> Hi Doug,
>
> >Did you ever resolve your 0.8 vs 0.7 crawling performance question? I'm
> >running into a similar problem.
>
> We wound up dramatically increasing the number of threads, which
> seemed to help solve the bandwidth utilization problem. With Nutch
> 0.7 we were running about 200 threads per crawler, and with Nutch 0.8
> it's more like 2000+ threads...though you have to reduce the thread
> stack size in this type of configuration.

Hi Ken

Could you please give me some clue regarding the stack size at which
you are seeing the best bandwidth utilization? I have the following

core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
max nice                        (-e) 20
file size               (blocks, -f) unlimited
pending signals                 (-i) unlimited
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) unlimited
max rt priority                 (-r) unlimited
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) unlimited
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

What stack size should I play with? The default seems to be 8192 KB.
Also, are there any other parameters I should tweak? I often get the
"too many open files" problem, and I have never been able to use my
full bandwidth; I am using about 10% of it. I have played around with
ulimit -n "very high number", which solves the "too many open files"
problem, but it's still not utilizing all my bandwidth. Any help will
be very much appreciated.

Thanks
Zaheed


> -- Ken
> --
> Ken Krugler
> Krugle, Inc.
> +1 530-210-6378
> "Find Code, Find Answers"
>

Re: .8 svn - fetcher performance..

Posted by Zaheed Haque <za...@gmail.com>.

Ken:

Thank you very much for the info. I applied it to my testing environment
and I could see big changes in my bandwidth utilization. I have tried
it on a simple server and I could get a rather constant 25-29
pages/sec in a vertical crawl. Previously I was getting about 5-7
pages/sec.

Cheers
Zaheed


On 7/11/06, Ken Krugler <kk...@transpac.com> wrote:
> >On 6/28/06, Ken Krugler <kk...@transpac.com> wrote:
> >>Hi Doug,
> >>
> >>>Did you ever resolve your 0.8 vs 0.7 crawling performance question? I'm
> >>>running into a similar problem.
> >>
> >>We wound up dramatically increasing the number of threads, which
> >>seemed to help solve the bandwidth utilization problem. With Nutch
> >>0.7 we were running about 200 threads per crawler, and with Nutch 0.8
> >>it's more like 2000+ threads...though you have to reduce the thread
> >>stack size in this type of configuration.
> >
> >Hi Ken
> >
> >Could you please give me some clue regarding the stack size you are
> >seeing the best bandwidth utilization...
>
> Note that stack size twiddling is only done to allow for increasing
> the number of fetcher threads without running out of JVM or OS memory.
>
> >  I have the following
> >
> >core file size          (blocks, -c) 0
> >data seg size           (kbytes, -d) unlimited
> >max nice                        (-e) 20
> >file size               (blocks, -f) unlimited
> >pending signals                 (-i) unlimited
> >max locked memory       (kbytes, -l) unlimited
> >max memory size         (kbytes, -m) unlimited
> >open files                      (-n) 1024
> >pipe size            (512 bytes, -p) 8
> >POSIX message queues     (bytes, -q) unlimited
> >max rt priority                 (-r) unlimited
> >stack size              (kbytes, -s) 8192
> >cpu time               (seconds, -t) unlimited
> >max user processes              (-u) unlimited
> >virtual memory          (kbytes, -v) unlimited
> >file locks                      (-x) unlimited
> >
> >What stack size should I play with? The default seems to be 8192 KB.
>
> We use something like ulimit -s 512 to set a 512K stack size at the OS level.
>
> >Also, any other parameters I should tweak?
>
> We specify -Xss512K when running the fetch map-reduce task to set the
> stack size in the JVM. But I don't remember off the top of my head
> which of the many different config files this gets set in. Stefan?
> >
> >I often get too many open
> >files problem
>
> That's a separate issue.
>
> >and I never could use my full bandwidth.. I am using
> >about 10% of my bandwidth. I have played around with ulimit -n "very
> >high number" which solves the "too many open files" but it's not
> >utilizing all my bandwidth, any help will be very much appreciated.
>
> Try increasing the number of fetcher threads and reducing the stack
> size. With 10 high-end servers in a cluster, we were able to max out
> a 100 Mbps connection for brief periods, though as our crawl converged
> (because it's a vertical crawl) the max rate eventually dropped to
> about 50 Mbps.
>
> -- Ken
> --
> Ken Krugler
> Krugle, Inc.
> +1 530-210-6378
> "Find Code, Find Answers"
>

Re: .8 svn - fetcher performance..

Posted by Ken Krugler <kk...@transpac.com>.

>On 6/28/06, Ken Krugler <kk...@transpac.com> wrote:
>>Hi Doug,
>>
>>>Did you ever resolve your 0.8 vs 0.7 crawling performance question? I'm
>>>running into a similar problem.
>>
>>We wound up dramatically increasing the number of threads, which
>>seemed to help solve the bandwidth utilization problem. With Nutch
>>0.7 we were running about 200 threads per crawler, and with Nutch 0.8
>>it's more like 2000+ threads...though you have to reduce the thread
>>stack size in this type of configuration.
>
>Hi Ken
>
>Could you please give me some clue regarding the stack size you are
>seeing the best bandwidth utilization...

Note that stack size twiddling is only done to allow for increasing 
the number of fetcher threads without running out of JVM or OS memory.
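
To make the memory argument concrete, here is a rough back-of-the-envelope sketch in Python. It assumes every thread reserves its full configured stack, which is a simplification: this is reserved address space rather than resident memory, and the JVM's own default stack size may differ from the ulimit -s value. The 8192 KB and 512 KB figures come from this thread; the helper name is made up for illustration.

```python
def stack_reservation_mib(threads: int, stack_kib: int) -> float:
    """Address space reserved for thread stacks, in MiB.

    Assumes each thread reserves exactly stack_kib of stack.
    """
    return threads * stack_kib / 1024


# Thread counts and stack sizes mentioned in this thread.
for threads, stack_kib in [(200, 8192), (2000, 8192), (2000, 512)]:
    mib = stack_reservation_mib(threads, stack_kib)
    print(f"{threads:>5} threads x {stack_kib:>5} KiB stack = {mib:>6.0f} MiB")
```

Under that assumption, 2000 threads at 8192 KB would reserve roughly 16 times as much stack space as the same threads at 512 KB, which is why the stack size has to come down before the thread count can go up.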

>  I have the following
>
>core file size          (blocks, -c) 0
>data seg size           (kbytes, -d) unlimited
>max nice                        (-e) 20
>file size               (blocks, -f) unlimited
>pending signals                 (-i) unlimited
>max locked memory       (kbytes, -l) unlimited
>max memory size         (kbytes, -m) unlimited
>open files                      (-n) 1024
>pipe size            (512 bytes, -p) 8
>POSIX message queues     (bytes, -q) unlimited
>max rt priority                 (-r) unlimited
>stack size              (kbytes, -s) 8192
>cpu time               (seconds, -t) unlimited
>max user processes              (-u) unlimited
>virtual memory          (kbytes, -v) unlimited
>file locks                      (-x) unlimited
>
>What stack size should I play with? The default seems to be 8192 KB.

We use something like ulimit -s 512 to set a 512K stack size at the OS level.

>Also, any other parameters I should tweak?

We specify -Xss512K when running the fetch map-reduce task to set the 
stack size in the JVM. But I don't remember off the top of my head 
which of the many different config files this gets set in. Stefan?
>
>I often get too many open
>files problem

That's a separate issue.
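
Although the open-files limit is independent of the stack size, it may also need attention once the thread count grows, since each open socket or file consumes a descriptor. The sketch below is only an estimate: the one-descriptor-per-thread figure and the overhead of 100 are assumptions chosen for illustration, not measured Nutch values, and the helper name is hypothetical. The 1024 limit is the "open files" value from the ulimit output quoted in this thread.

```python
def fits_fd_limit(threads: int, limit: int,
                  fds_per_thread: int = 1, overhead: int = 100) -> bool:
    """Return True if the estimated descriptor demand fits under the limit.

    fds_per_thread and overhead are illustrative assumptions.
    """
    needed = threads * fds_per_thread + overhead
    return needed <= limit


# 1024 is the "open files" value from the ulimit output in this thread.
for threads in (200, 2000):
    verdict = "fits" if fits_fd_limit(threads, 1024) else "exceeds the limit"
    print(f"{threads} threads with a limit of 1024: {verdict}")
```

With these assumed numbers, 200 threads stay under a 1024-descriptor limit while 2000 threads do not, which is consistent with needing a higher ulimit -n at larger thread counts.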

>and I never could use my full bandwidth.. I am using
>about 10% of my bandwidth. I have played around with ulimit -n "very
>high number" which solves the "too many open files" but it's not
>utilizing all my bandwidth, any help will be very much appreciated.

Try increasing the number of fetcher threads and reducing the stack 
size. With 10 high-end servers in a cluster, we were able to max out 
a 100 Mbps connection for brief periods, though as our crawl converged
(because it's a vertical crawl) the max rate eventually dropped to
about 50 Mbps.

-- Ken
-- 
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"