Posted to user@nutch.apache.org by Murat Ali Bayir <mu...@agmlab.com> on 2006/08/03 10:07:53 UTC

-numFetchers in generate command

Hi everybody, although we give the number of fetchers in the generate
command, our system always produces a fixed number of parts in the reduce
process. What can be the reason for this? Do we have to change anything in
the configuration file of Hadoop?

Re: -numFetchers in generate command

Posted by Andrzej Bialecki <ab...@getopt.org>.
Vishal Shah wrote:
> Hey Andrzej,
>
>   Thanks a lot for the reply. That clears up a major doubt in my mind.
> FYI, I experimented using a single machine to crawl using Hadoop DFS and
> MapReduce. The largest experiment was to crawl around 300K pages from a
> few thousand hosts. I could push the crawler to a speed of around 27
> pages/sec when using 2000 threads. When I increased the number of
> threads to more than 3000, the jobs started failing.
>

Look into the logs - most probably the fetching failed because of protocol
timeouts, which would indicate that you have saturated your available
bandwidth. You can calculate the maximum throughput of your line and see
whether this 27 pages/s figure is near that limit. If it is, then increasing
the number of threads or the number of machines won't speed things up.
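As a rough sanity check (the ~20 KB average page size here is just an
illustrative assumption, not a figure from your crawl):

   27 pages/s * 20 KB/page = 540 KB/s, i.e. roughly 4.3 Mbit/s

If your line is in that range, the fetcher is already bandwidth-bound.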

> I am now going to conduct a larger experiment on 3-4 machines. I will
> report the performance once I am done. In this case, since I know the
> optimal number of threads on 1 machine is 2000, should I scale the number
> of threads linearly to, say, 6000 for 3 machines, or will just increasing
> the number of map/reduce tasks linearly take care of the scaling?
>

If you have hit the maximum bandwidth available to you, then adding more
machines with the same number of threads will only cause more fetches to
fail because of timeouts - in such a case you should decrease the number
of threads per machine accordingly.
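For example, you could keep the cluster-wide total roughly constant instead
of multiplying it (the thread count and segment path below are placeholders,
assuming the 0.8-style fetcher command line):

   # ~2000 threads total across 3 machines, i.e. about 670 each
   bin/nutch fetch crawl/segments/<segment> -threads 670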

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



RE: -numFetchers in generate command

Posted by Vishal Shah <vi...@rediff.co.in>.
Hey Andrzej,

  Thanks a lot for the reply. That clears up a major doubt in my mind.
FYI, I experimented using a single machine to crawl using Hadoop DFS and
MapReduce. The largest experiment was to crawl around 300K pages from a
few thousand hosts. I could push the crawler to a speed of around 27
pages/sec when using 2000 threads. When I increased the number of
threads to more than 3000, the jobs started failing.

I am now going to conduct a larger experiment on 3-4 machines. I will
report the performance once I am done. In this case, since I know the
optimal number of threads on 1 machine is 2000, should I scale the number
of threads linearly to, say, 6000 for 3 machines, or will just increasing
the number of map/reduce tasks linearly take care of the scaling?

Thanks,

-vishal.

-----Original Message-----
From: Andrzej Bialecki [mailto:ab@getopt.org] 
Sent: Friday, August 25, 2006 5:46 PM
To: nutch-user@lucene.apache.org
Subject: Re: -numFetchers in generate command

Vishal Shah wrote:
> Hi Andrzej,
>
>    I am running some experiments to figure out what numThreads param to
> use while fetching on my machine. I made the mistake of putting the
> number of map/reduce tasks in hadoop-site.xml and not in
> mapred-default.xml, however I can clearly see a change in performance
> for different numbers of threads (I tested using 5 different options,
> ranging from 10 to 2000).
>
>   I was wondering why I am seeing these performance changes even though
> the number of reduce parts is only 2 for all the experiments. Also, how
> is the number of fetcher threads param used during generate related to
> the numThreads param used during fetch?
>

Well, you will always run as many fetching (map) tasks as there are parts
created when running the Generator's reduce phase. Now, each fetching task
can run multiple fetching threads in parallel ... so, as you increase the
number of threads, your fetching performance will likely increase (unless
you hit some other limit, such as blocked addresses or your bandwidth).

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: -numFetchers in generate command

Posted by Andrzej Bialecki <ab...@getopt.org>.
Vishal Shah wrote:
> Hi Andrzej,
>
>    I am running some experiments to figure out what numThreads param to
> use while fetching on my machine. I made the mistake of putting the
> number of map/reduce tasks in hadoop-site.xml and not in
> mapred-default.xml, however I can clearly see a change in performance
> for different numbers of threads (I tested using 5 different options,
> ranging from 10 to 2000).
>
>   I was wondering why I am seeing these performance changes even though
> the number of reduce parts is only 2 for all the experiments. Also, how
> is the number of fetcher threads param used during generate related to
> the numThreads param used during fetch?
>

Well, you will always run as many fetching (map) tasks as there are parts
created when running the Generator's reduce phase. Now, each fetching task
can run multiple fetching threads in parallel ... so, as you increase the
number of threads, your fetching performance will likely increase (unless
you hit some other limit, such as blocked addresses or your bandwidth).
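A minimal sketch of how the two knobs fit together (0.8-style commands;
the paths and numbers below are placeholders, not values from this thread):

   # reduce phase writes 4 fetch lists -> 4 parallel fetching (map) tasks
   bin/nutch generate crawl/crawldb crawl/segments -topN 100000 -numFetchers 4

   # each task runs 500 threads -> up to 4 x 500 = 2000 threads in total
   bin/nutch fetch crawl/segments/<segment> -threads 500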

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



RE: -numFetchers in generate command

Posted by Vishal Shah <vi...@rediff.co.in>.
Hi Andrzej,

   I am running some experiments to figure out what numThreads param to
use while fetching on my machine. I made the mistake of putting the
number of map/reduce tasks in hadoop-site.xml and not in
mapred-default.xml, however I can clearly see a change in performance
for different numbers of threads (I tested using 5 different options,
ranging from 10 to 2000).

  I was wondering why I am seeing these performance changes even though
the number of reduce parts is only 2 for all the experiments. Also, how
is the number of fetcher threads param used during generate related to
the numThreads param used during fetch?

Thank you,

-vishal.

-----Original Message-----
From: Andrzej Bialecki [mailto:ab@getopt.org] 
Sent: Thursday, August 03, 2006 8:43 PM
To: nutch-user@lucene.apache.org
Subject: Re: -numFetchers in generate command

Murat Ali Bayir wrote:
> Hi everybody, although we give the number of fetchers in the generate
> command, our system always produces a fixed number of parts in the
> reduce process. What can be the reason for this? Do we have to change
> anything in the configuration file of Hadoop?

Most probably you put the numbers of map/reduce tasks in your
hadoop-site.xml, right? Move them to mapred-default.xml. Any property
that you put into hadoop-site.xml will override everything else, even
job-specific settings.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: -numFetchers in generate command

Posted by Andrzej Bialecki <ab...@getopt.org>.
Murat Ali Bayir wrote:
> Hi everbody, Although we give number of Fetchers in generate command, 
> our system always produce fixed number of part in reduce process? What 
> can be reason for this? Do we have to change anything in configuration 
> file of Hadoop?

Most probably you put the numbers of map/reduce tasks in your
hadoop-site.xml, right? Move them to mapred-default.xml. Any property
that you put into hadoop-site.xml will override everything else, even
job-specific settings.
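For example, a minimal mapred-default.xml carrying only the task counts
(the values are placeholders - pick what suits your cluster):

   <?xml version="1.0"?>
   <configuration>
     <!-- defaults that per-job settings (e.g. generate -numFetchers) may override -->
     <property>
       <name>mapred.map.tasks</name>
       <value>4</value>
     </property>
     <property>
       <name>mapred.reduce.tasks</name>
       <value>4</value>
     </property>
   </configuration>

Keep only cluster-wide, non-negotiable settings (e.g. fs.default.name,
mapred.job.tracker) in hadoop-site.xml.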

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com