Posted to common-user@hadoop.apache.org by C G <pa...@yahoo.com> on 2008/02/20 18:30:58 UTC

Questions regarding configuration parameters...

Hi All:
   
  The documentation for the configuration parameters mapred.map.tasks and mapred.reduce.tasks discusses these values in terms of the “number of available hosts” in the grid.  This description strikes me as a bit odd given that a “host” could be anything from a uniprocessor to an N-way box, where N could range from 2 to 16 or more.  The documentation is also vague about computing the actual value.  For example, for mapred.map.tasks the doc says “…a prime number several times greater…”.  I’m curious about how people are interpreting these descriptions and what values they are using.  Specifically, I’m wondering if I should be using “core count” instead of “host count” to set these values.
   
  In the specific case of my system, we have 24 hosts where each host is a 4-way system (i.e. 96 cores total).  For mapred.map.tasks I chose the value 173, as that is a prime number just above 7*24 = 168.  For mapred.reduce.tasks I chose 23, since that is a prime number close to 24.  Is this what was intended?
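  To make that concrete, here is a minimal sketch of how I am expressing those values per job with the old mapred API (my own interpretation, not something from the docs; the class name is made up).  Whether these belong in the job driver or in hadoop-site.xml is part of what I am asking:

import org.apache.hadoop.mapred.JobConf;

public class GridSizedJob {
  // Hypothetical driver fragment: per-job task counts sized to a 24-host, 96-core grid.
  public static JobConf configure() {
    JobConf conf = new JobConf(GridSizedJob.class);
    conf.setNumMapTasks(173);    // treated as a hint by the framework; 7 * 24 = 168, 173 is a prime just above that
    conf.setNumReduceTasks(23);  // a prime close to the 24-host count
    return conf;
  }
}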
   
  Beyond curiosity, I’m concerned about setting these values and other configuration parameters correctly because I am pursuing some performance issues where it takes a very long time to process small amounts of data.  I am hoping that some amount of tuning will resolve the problems.
   
  Any thoughts and insights most appreciated.
   
  Thanks,
  C G
   

       

RE: Questions regarding configuration parameters...

Posted by C G <pa...@yahoo.com>.
Guys:
   
  Thanks for the information...I've gotten some pretty good results twiddling some parameters.  I've also reminded myself about the pitfalls of oversubscribing resources (like number of reducers).  Here's what I learned, written up here to hopefully help somebody later...
   
  I set up one of my apps on a 4-node test grid.  Each grid member is a 4-way box.  The configuration had default values (2) for mapred.tasktracker.(map,reduce).tasks.maximum.  The values for mapred.map.tasks and mapred.reduce.tasks were 29 and 3 respectively (using the prime number recommendations in the docs).
   
  The initial run took 00:23:21...not so good.  I changed (map,reduce).tasks.maximum to 4 and the time fell to 19:40.  Then I tried 7 and it fell to 14:37.  So far so good.
   
  I then looked at my code and realized that I was specifying 32 for the number of reducers (damned hard-coded constants...I bop myself on the head and call myself a moron).  The large value was based on running on a much larger grid.  
   
  So I backed that value down to 3, and my execution time fell to 09:17.  Then I changed (map,reduce).tasks.maximum from 7 to 4 and ran again in 06:48.  w00t!
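  In case anybody else has the same hard-coded constant lurking, a rough sketch of the fix (hypothetical class and argument names, not my actual driver) is to take the reducer count from the command line at submission time instead of baking it in:

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class MyJobDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(MyJobDriver.class);
    // was: conf.setNumReduceTasks(32);   <- hard-coded for the big grid
    int reducers = Integer.parseInt(args[0]);   // e.g. 3 on the 4-node test grid
    conf.setNumReduceTasks(reducers);
    // ... set mapper/reducer classes, input/output paths, etc. ...
    JobClient.runJob(conf);
  }
}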
   
  Bottom line:  Carefully setting configuration parameters and paying attention to map/reduce task values relative to the size of the grid is VERY important to achieving good performance.
   
  Thanks,
  C G


RE: Questions regarding configuration parameters...

Posted by Tim Wintle <ti...@teamrubber.com>.
I have had exactly the same problem using the command line to cat
files - they can take ages, although I don't know why.  Network
utilisation does not seem to be the bottleneck, though.

(Running 0.15.3)

Is the slow part of the reduce while you are waiting for the map data to
copy over to the reducers?  I believe there was a bug prior to 0.16.0
that could leave you waiting for a long time if mappers had been too
slow to respond to previous requests (even if they were completely free
now).


On Thu, 2008-02-21 at 21:51 -0800, C G wrote:
> My performance problems fall into 2 categories:
>    
>   1.  Extremely slow reduce phases - our map phases march along at impressive speed, but during reduce phases most nodes go idle...the active machines mostly clunk along at 10-30% CPU.  Compare this to the map phase where I get all grid nodes cranking away at > 100% CPU.  This is a vague explanation I realize.
>    
>   2.  Pregnant pauses during dfs -copyToLocal and -cat operations.  Frequently I'll be iterating over a list of HDFS files cat-ing them into one file to bulk load into a database.  Many times I'll see one of the copies/cats sit for anywhere from 2-5 minutes.  During that time no data is transferred, all nodes are idle, and absolutely nothing is written to any of the logs.  The file sizes being copied are relatively small...less than 1G each in most cases.
>    


RE: Questions regarding configuration parameters...

Posted by C G <pa...@yahoo.com>.
My performance problems fall into 2 categories:
   
  1.  Extremely slow reduce phases - our map phases march along at impressive speed, but during reduce phases most nodes go idle...the active machines mostly clunk along at 10-30% CPU.  Compare this to the map phase, where I get all grid nodes cranking away at > 100% CPU.  This is a vague explanation, I realize.
   
  2.  Pregnant pauses during dfs -copyToLocal and -cat operations.  Frequently I'll be iterating over a list of HDFS files, cat-ing them into one file to bulk load into a database.  Many times I'll see one of the copies/cats sit for anywhere from 2-5 minutes.  During that time no data is transferred, all nodes are idle, and absolutely nothing is written to any of the logs.  The file sizes being copied are relatively small...less than 1G each in most cases.
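  For reference, here is a rough sketch of that concatenation loop written against the FileSystem API instead of repeated hadoop dfs -cat calls (an equivalent I put together for illustration, not my actual script; the class name and argument handling are made up):

import java.io.FileOutputStream;
import java.io.OutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCat {
  // Usage: HdfsCat <local output file> <hdfs path> [<hdfs path> ...]
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    OutputStream out = new FileOutputStream(args[0]);   // local file to bulk load
    byte[] buf = new byte[64 * 1024];
    for (int i = 1; i < args.length; i++) {
      FSDataInputStream in = fs.open(new Path(args[i]));
      int n;
      while ((n = in.read(buf)) != -1) {
        out.write(buf, 0, n);
      }
      in.close();
    }
    out.close();
  }
}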
   
  Both of these issues persist in 0.16.0 and definitely have me puzzled.  I'm sure that I'm doing something wrong/non-optimal w/r/t the slow reduce phases, but the long pauses during a dfs command-line operation seem like a bug to me.  Unfortunately I've not seen anybody else report this.
   
  Any thoughts/ideas most welcome...
   
  Thanks,
  C G
  


RE: Questions regarding configuration parameters...

Posted by Joydeep Sen Sarma <js...@facebook.com>.
> The default values are 2, so you might only see 2 cores used by Hadoop
> per node/host.

That's 2 each for map and reduce, so theoretically one could fully utilize a 4-core box with this setting.  In practice, a little bit of oversubscription (3 each on a 4-core box) seems to be working out well for us - maybe overlapping some compute and I/O, but mostly we are trading per-job latency for a higher number of concurrent jobs.

It's unlikely that these settings are causing slowness in processing small amounts of data.  Send more details - what's slow (map/shuffle/reduce)?  Check CPU consumption while a map task is running, etc.




Re: Questions regarding configuration parameters...

Posted by Andy Li <an...@gmail.com>.
Try these 2 parameters to utilize all the cores per node/host.

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>7</value>
  <description>The maximum number of map tasks that will be run
  simultaneously by a task tracker.
  </description>
</property>

<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>7</value>
  <description>The maximum number of reduce tasks that will be run
  simultaneously by a task tracker.
  </description>
</property>

The default values are 2, so you might only see 2 cores used by Hadoop
per node/host.  If each system/machine has 4 cores (dual dual-core), then
you can change them to 3.

Hope this works for you.

-Andy

