Posted to user@nutch.apache.org by Murat Ali Bayir <mu...@agmlab.com> on 2006/08/03 10:07:53 UTC
-numFetchers in generate command
Hi everybody, although we give the number of fetchers in the generate
command, our system always produces a fixed number of parts in the reduce
process. What can be the reason for this? Do we have to change anything in
the Hadoop configuration file?
Re: -numFetchers in generate command
Posted by Andrzej Bialecki <ab...@getopt.org>.
Vishal Shah wrote:
> Hey Andrzej,
>
> Thanks a lot for the reply. That clears up a major doubt in my mind.
> FYI, I experimented using a single machine to crawl using Hadoop DFS,
> MapReduce. The largest experiment was to crawl around 300K pages from a
> few thousand hosts. I could push the crawler to a speed of around 27
> pages/sec when using 2000 threads. When I increased the number of
> threads to more than 3000, the jobs started failing.
>
Look into the logs - most probably fetching failed because of protocol
timeouts, which could indicate that you saturated your available
bandwidth. You can calculate the max throughput of your line and see
whether 27 pages/s is near that limit. If it is, then increasing the
number of threads or the number of machines won't speed things up.
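As a back-of-the-envelope version of that check (the 30 KB average page
size below is an assumed figure for illustration, not something measured
in your crawl):

```python
# Rough bandwidth estimate for a fetch rate of 27 pages/s.
# AVG_PAGE_KB is an assumption; substitute your own measured average.
PAGES_PER_SEC = 27
AVG_PAGE_KB = 30

kb_per_sec = PAGES_PER_SEC * AVG_PAGE_KB   # KB/s consumed by fetching
mbit_per_sec = kb_per_sec * 8 / 1000       # convert to Mbit/s

print(f"{kb_per_sec} KB/s ~= {mbit_per_sec:.1f} Mbit/s")
# prints: 810 KB/s ~= 6.5 Mbit/s
```

If that figure is close to your line's rated capacity, the bottleneck is
the network, not the fetcher.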
> I am now going to conduct a larger experiment on 3-4 machines. Will
> report the performance once I am done. In this case, since I know the
> optimal # of threads on 1 machine is 2000, should I scale the #threads
> linearly to, say, 6000 for 3 machines, or will just increasing the
> number of map/reduce tasks linearly take care of the scaling?
>
If you hit the max bandwidth available to you, then adding more
machines with the same number of threads will only cause more fetches to
fail because of timeouts - in that case you should decrease the number
of threads accordingly.
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
RE: -numFetchers in generate command
Posted by Vishal Shah <vi...@rediff.co.in>.
Hey Andrzej,
Thanks a lot for the reply. That clears up a major doubt in my mind.
FYI, I experimented using a single machine to crawl using Hadoop DFS,
MapReduce. The largest experiment was to crawl around 300K pages from a
few thousand hosts. I could push the crawler to a speed of around 27
pages/sec when using 2000 threads. When I increased the number of
threads to more than 3000, the jobs started failing.
I am now going to conduct a larger experiment on 3-4 machines. Will
report the performance once I am done. In this case, since I know the
optimal # of threads on 1 machine is 2000, should I scale the #threads
linearly to, say, 6000 for 3 machines, or will just increasing the
number of map/reduce tasks linearly take care of the scaling?
Thanks,
-vishal.
-----Original Message-----
From: Andrzej Bialecki [mailto:ab@getopt.org]
Sent: Friday, August 25, 2006 5:46 PM
To: nutch-user@lucene.apache.org
Subject: Re: -numFetchers in generate command
Vishal Shah wrote:
> Hi Andrzej,
>
> I am running some experiments to figure out what numThreads param to
> use while fetching on my machine. I made the mistake of putting the # of
> map/reduce tasks in hadoop-site.xml and not in mapred-default.xml,
> however I can clearly see a change in performance for different numbers
> of threads (I tested using 5 different options, ranging from 10 to
> 2000).
>
> I was wondering why I am seeing these performance changes even though
> the number of reduce parts is only 2 for all the experiments. Also, how
> is the number of fetcher threads param used during generate related to
> the numthreads param used during fetch?
>
Well, you will always run as many fetching (map) tasks as parts you
created when running the Generator's reduce phase. Now, each fetching
task can run multiple fetching threads in parallel, so as you increase
the number of threads your fetching performance will likely increase
(unless you hit some other limit, such as blocked addresses or your
available bandwidth).
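In concrete terms (a sketch using the Nutch 0.8-era command line; the
paths and numbers here are made-up examples, not from this thread):

```shell
# Generate fetchlists split into 4 parts; this later yields 4 fetch
# map tasks, one per part.
bin/nutch generate crawl/crawldb crawl/segments -numFetchers 4

# Fetch the generated segment (example timestamp directory); each of
# the 4 map tasks runs up to 100 threads, ~400 threads cluster-wide.
bin/nutch fetch crawl/segments/20060825120000 -threads 100
```

So -numFetchers controls parallelism across tasks, while -threads
controls parallelism within each task.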
--
Best regards,
Andrzej Bialecki <><
Re: -numFetchers in generate command
Posted by Andrzej Bialecki <ab...@getopt.org>.
Vishal Shah wrote:
> Hi Andrzej,
>
> I am running some experiments to figure out what numThreads param to
> use while fetching on my machine. I made the mistake of putting the # of
> map/reduce tasks in hadoop-site.xml and not in mapred-default.xml,
> however I can clearly see a change in performance for different numbers
> of threads (I tested using 5 different options, ranging from 10 to
> 2000).
>
> I was wondering why I am seeing these performance changes even though
> the number of reduce parts is only 2 for all the experiments. Also, how
> is the number of fetcher threads param used during generate related to
> the numthreads param used during fetch?
>
Well, you will always run as many fetching (map) tasks as parts you
created when running the Generator's reduce phase. Now, each fetching
task can run multiple fetching threads in parallel, so as you increase
the number of threads your fetching performance will likely increase
(unless you hit some other limit, such as blocked addresses or your
available bandwidth).
--
Best regards,
Andrzej Bialecki <><
RE: -numFetchers in generate command
Posted by Vishal Shah <vi...@rediff.co.in>.
Hi Andrzej,
I am running some experiments to figure out what numThreads param to
use while fetching on my machine. I made the mistake of putting the # of
map/reduce tasks in hadoop-site.xml and not in mapred-default.xml,
however I can clearly see a change in performance for different numbers
of threads (I tested using 5 different options, ranging from 10 to
2000).
I was wondering why I am seeing these performance changes even though
the number of reduce parts is only 2 for all the experiments. Also, how
is the number of fetcher threads param used during generate related to
the numthreads param used during fetch?
Thank you,
-vishal.
-----Original Message-----
From: Andrzej Bialecki [mailto:ab@getopt.org]
Sent: Thursday, August 03, 2006 8:43 PM
To: nutch-user@lucene.apache.org
Subject: Re: -numFetchers in generate command
Murat Ali Bayir wrote:
> Hi everybody, although we give the number of fetchers in the generate
> command, our system always produces a fixed number of parts in the
> reduce process. What can be the reason for this? Do we have to change
> anything in the Hadoop configuration file?
Most probably you put the numbers of map/reduce tasks in your
hadoop-site.xml, right? Move them to mapred-default.xml. Any property
that you put into hadoop-site.xml will override everything, even
job-specific settings.
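For reference, the properties in question would look something like this
in mapred-default.xml (a sketch for the Hadoop 0.x configuration layout
this thread assumes; the values are examples only):

```xml
<!-- mapred-default.xml: job-overridable defaults. Putting these in
     hadoop-site.xml instead forces them onto every job, clobbering
     per-job settings such as the partition count that
     generate -numFetchers requests. -->
<property>
  <name>mapred.map.tasks</name>
  <value>4</value>
</property>
<property>
  <name>mapred.reduce.tasks</name>
  <value>4</value>
</property>
```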
--
Best regards,
Andrzej Bialecki <><
Re: -numFetchers in generate command
Posted by Andrzej Bialecki <ab...@getopt.org>.
Murat Ali Bayir wrote:
> Hi everybody, although we give the number of fetchers in the generate
> command, our system always produces a fixed number of parts in the
> reduce process. What can be the reason for this? Do we have to change
> anything in the Hadoop configuration file?
Most probably you put the numbers of map/reduce tasks in your
hadoop-site.xml, right? Move them to mapred-default.xml. Any property
that you put into hadoop-site.xml will override everything, even
job-specific settings.
--
Best regards,
Andrzej Bialecki <><