
Posted to common-user@hadoop.apache.org by dgoker <go...@gmail.com> on 2009/11/12 18:17:04 UTC

Help !! Hadoop installation to One machine has 24 CPU 16 disk (Each one 2 TB)

Hi

I installed Hadoop on a single server with the following configuration:

 24 CPU, 
 72 GB RAM
 17 Disk (2 TB)

 
All Hadoop and Pig configuration is at the default settings. In order
to run jobs efficiently, what should the following configuration settings
be? The settings I find on forums are usually for 4-CPU machines in
clustered systems.


Which values do you suggest for the following settings?


mapred.tasktracker.reduce.tasks.maximum   ?
mapred.map.tasks ?
mapred.reduce.tasks ?
dfs.datanode.handler.count ?



-- 
View this message in context: http://old.nabble.com/Help-%21%21-Hadoop-installation-to-One-machine-has-24-CPU-16-disk-%28Each-one-2-TB%29-tp26322618p26322618.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.

Re: Help !! Hadoop installation to One machine has 24 CPU 16 disk (Each one 2 TB)

Posted by Edward Capriolo <ed...@gmail.com>.

On Thu, Nov 12, 2009 at 12:17 PM, dgoker <go...@gmail.com> wrote:
>
> Hi
>
> I installed Hadoop on a single server with the following configuration:
>
>  24 CPU,
>  72 GB RAM
>  17 Disk (2 TB)
>
>
> All Hadoop and Pig configuration is at the default settings. In order
> to run jobs efficiently, what should the following configuration settings
> be? The settings I find on forums are usually for 4-CPU machines in
> clustered systems.
>
>
> Which values do you suggest for the following settings?
>
>
> mapred.tasktracker.reduce.tasks.maximum   ?
> mapred.map.tasks ?
> mapred.reduce.tasks ?
> dfs.datanode.handler.count ?
>
>
>
> --
> View this message in context: http://old.nabble.com/Help-%21%21-Hadoop-installation-to-One-machine-has-24-CPU-16-disk-%28Each-one-2-TB%29-tp26322618p26322618.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>
>

I see that no one took a stab at this one yet so I will give it a go.
You have a very interesting configuration here that is not often seen
by hadoop.

First, there are some subtle issues with single-node deployments.
One issue is memory contention (another is the inability to
replicate data).

The Hadoop NameNode wants to keep its inode table in main memory. Your
TaskTracker wants to chew up lots of memory for its map/reduce tasks.
These two processes (and others) will fight for memory, and everyone
will lose.

Your first consideration may be using ulimit or some other system for
resource control. You would want to use this on the DataNode and
TaskTracker to try to keep them under control.

You could accomplish more fine grained control with linux vserver,
http://www.linux-vserver.org/, or solaris zones, or (token virt
product here).

(I am guessing by 24 CPU you mean 24 cores?)
From there you can do some offhand calculations like:

2x cores NameNode 2GB-8GB RAM
2x cores JobTracker 2GB-8GB RAM
2x cores SecondaryNameNode 2GB-8GB RAM

24 cores - 6 cores = 18 cores
72 GB - (6 GB to 24 GB) = 66 GB to 48 GB

So you have roughly 18 cores and 48 GB of RAM to dedicate to the DataNode
and TaskTracker. Punch these numbers into some of the other tuning guides
on the internet.
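
The subtraction above can be sketched in a few lines of Python. This is
only a rough illustration of the back-of-envelope estimate: the function
name is made up for this example, and the 2 cores and 2 GB to 8 GB per
daemon are the ballpark figures from this message, not measured values.

```python
def remaining_resources(total_cores, total_ram_gb, daemons,
                        cores_per_daemon, ram_per_daemon_gb):
    """Return (cores, ram_gb) left after reserving resources for the daemons."""
    reserved_cores = daemons * cores_per_daemon
    reserved_ram_gb = daemons * ram_per_daemon_gb
    return total_cores - reserved_cores, total_ram_gb - reserved_ram_gb

# Three daemons at 2 cores each, with 2 GB each (light) or 8 GB each (heavy).
light = remaining_resources(24, 72, daemons=3, cores_per_daemon=2, ram_per_daemon_gb=2)
heavy = remaining_resources(24, 72, daemons=3, cores_per_daemon=2, ram_per_daemon_gb=8)
print(light)  # (18, 66)
print(heavy)  # (18, 48)
```

Planning around the heavy case leaves some headroom for the operating
system and for the file system cache.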

Usually
mapred.tasktracker.reduce.tasks.maximum
and
mapred.tasktracker.map.tasks.maximum

are calculated from the number of cores a system has. The exact
formula varies depending on the guide you are reading.
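
As one illustration of such a formula, here is a minimal Python sketch.
The rule itself is an assumption made for this example and does not come
from any particular guide: give about two thirds of the available cores
to map slots and the rest to reduce slots, then shrink both if the task
heaps would not fit in RAM. The function name and the 1 GB heap per task
are hypothetical.

```python
def suggest_slots(cores, ram_gb, heap_per_task_gb=1.0, map_share=2 / 3):
    """Hypothetical rule of thumb for per-TaskTracker slot counts."""
    # Split the cores between map and reduce slots.
    map_slots = max(1, round(cores * map_share))
    reduce_slots = max(1, cores - map_slots)
    # Cap the total number of slots by how many task heaps fit in RAM.
    max_tasks = int(ram_gb // heap_per_task_gb)
    total = map_slots + reduce_slots
    if total > max_tasks:
        scale = max_tasks / total
        map_slots = max(1, int(map_slots * scale))
        reduce_slots = max(1, int(reduce_slots * scale))
    return {
        "mapred.tasktracker.map.tasks.maximum": map_slots,
        "mapred.tasktracker.reduce.tasks.maximum": reduce_slots,
    }

print(suggest_slots(18, 48))
```

With 18 cores and 48 GB this suggests 12 map slots and 6 reduce slots.
Treat the result as a starting point to test against your own workload.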

You can also just plug the values into cloudera-configuration and see
what it comes up with.
http://www.cloudera.com/hadoop-config-faq

http://www.cloudera.com/blog/2009/03/30/configuration-parameters-what-can-you-just-ignore/

Configuration is more of an iterative process, and your workload changes
what works best. Depending on how much time you have for development, you
should start as simple as possible and evolve your configuration.

Re: I thought map and reduce could not overlap?

Posted by David Howell <de...@gmail.com>.

The first 2/3 of the reduce phase (as reported by the progress meters)
are all about getting the map results from the map tasktracker to the
reduce tasktracker and sorting them. The real reduce happens in the
last third, and that part won't start until all of the maps are done.
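
To make the thirds concrete, here is a small Python sketch of how a
single reduce percentage could be built from the three stages. It
assumes each stage (copy, sort, reduce) counts for exactly one third of
the meter; the function name is made up for this example and is not a
Hadoop API.

```python
def reduce_progress(copy_frac, sort_frac, reduce_frac):
    """Combine three stage fractions (each 0.0 to 1.0) into one percentage."""
    for frac in (copy_frac, sort_frac, reduce_frac):
        if not 0.0 <= frac <= 1.0:
            raise ValueError("each stage fraction must be between 0 and 1")
    return 100.0 * (copy_frac + sort_frac + reduce_frac) / 3.0

# Copy stage 60% done while maps are still running: the meter shows about 20%.
print(round(reduce_progress(0.6, 0.0, 0.0)))  # 20
# Copy and sort finished, reduce function not yet started: about 67%.
print(round(reduce_progress(1.0, 1.0, 0.0)))  # 67
```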

On Sat, Nov 14, 2009 at 10:05 AM, Raymond Jennings III
<ra...@yahoo.com> wrote:
> I thought there was a barrier that ensured the map phase would finish before the reduce phase started but I see on the sample hadoop word count app:
>
> 09/11/14 10:58:50 INFO mapred.JobClient:  map 79% reduce 18%
> 09/11/14 10:58:54 INFO mapred.JobClient:  map 79% reduce 19%
> 09/11/14 10:58:55 INFO mapred.JobClient:  map 80% reduce 19%
> 09/11/14 10:58:58 INFO mapred.JobClient:  map 80% reduce 20%
> 09/11/14 10:59:00 INFO mapred.JobClient:  map 81% reduce 20%
> 09/11/14 10:59:04 INFO mapred.JobClient:  map 82% reduce 20%
> 09/11/14 10:59:05 INFO mapred.JobClient:  map 82% reduce 21%
> 09/11/14 10:59:08 INFO mapred.JobClient:  map 82% reduce 22%
>
> That looks like they are overlapping?
>
>
>
>
>

Re: I thought map and reduce could not overlap?

Posted by Tim Robertson <ti...@gmail.com>.

My understanding is the following:
As map tasks finish, Hadoop starts to copy their output to the
reducer machines, but it does not run the reduce yet.  During this
stage, if you look at the running reducers, you will see them say
something like "copying 4 of 45".  Once all the maps have finished and
their output has been copied, you will see Reduce at 33%.  After that
comes the sorting, and then the reduce starts.

Basically this overlap is just it beginning to copy the data that is
ready onto the reducer machines.

Cheers

Tim

On Sat, Nov 14, 2009 at 5:05 PM, Raymond Jennings III
<ra...@yahoo.com> wrote:
> I thought there was a barrier that ensured the map phase would finish before the reduce phase started but I see on the sample hadoop word count app:
>
> 09/11/14 10:58:50 INFO mapred.JobClient:  map 79% reduce 18%
> 09/11/14 10:58:54 INFO mapred.JobClient:  map 79% reduce 19%
> 09/11/14 10:58:55 INFO mapred.JobClient:  map 80% reduce 19%
> 09/11/14 10:58:58 INFO mapred.JobClient:  map 80% reduce 20%
> 09/11/14 10:59:00 INFO mapred.JobClient:  map 81% reduce 20%
> 09/11/14 10:59:04 INFO mapred.JobClient:  map 82% reduce 20%
> 09/11/14 10:59:05 INFO mapred.JobClient:  map 82% reduce 21%
> 09/11/14 10:59:08 INFO mapred.JobClient:  map 82% reduce 22%
>
> That looks like they are overlapping?
>
>
>
>
>

Re: I thought map and reduce could not overlap?

Posted by Kevin Weil <ke...@gmail.com>.

The first third of the reduce phase is really the shuffle, where map
outputs get sent to and collected at their respective reducers. You'll
see this transfer happening, and the "reduce" creeping up towards 33%,
towards the end of your map phase.  The 33% mark is where the real
barrier is.
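
The log lines from the question can be checked against that 33% mark.
This is a minimal Python sketch, assuming the JobClient line format
quoted in this thread and treating 33 as the displayed limit of the
shuffle third; the function name is made up for this example.

```python
import re

# Matches the progress part of a JobClient line, e.g. "map 82% reduce 22%".
LINE_RE = re.compile(r"map (\d+)% reduce (\d+)%")

def respects_barrier(log_line, shuffle_limit=33):
    """True if reduce progress stays within the shuffle third until maps hit 100%."""
    match = LINE_RE.search(log_line)
    if match is None:
        raise ValueError("no progress figures found in line")
    map_pct, reduce_pct = int(match.group(1)), int(match.group(2))
    return map_pct == 100 or reduce_pct <= shuffle_limit

line = "09/11/14 10:59:08 INFO mapred.JobClient:  map 82% reduce 22%"
print(respects_barrier(line))  # True
```

Every line quoted in the question passes this check, so what looks like
overlap is only the shuffle running alongside the maps.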

Kevin

On Nov 14, 2009, at 8:05 AM, Raymond Jennings III  
<ra...@yahoo.com> wrote:

> I thought there was a barrier that ensured the map phase would  
> finish before the reduce phase started but I see on the sample  
> hadoop word count app:
>
> 09/11/14 10:58:50 INFO mapred.JobClient:  map 79% reduce 18%
> 09/11/14 10:58:54 INFO mapred.JobClient:  map 79% reduce 19%
> 09/11/14 10:58:55 INFO mapred.JobClient:  map 80% reduce 19%
> 09/11/14 10:58:58 INFO mapred.JobClient:  map 80% reduce 20%
> 09/11/14 10:59:00 INFO mapred.JobClient:  map 81% reduce 20%
> 09/11/14 10:59:04 INFO mapred.JobClient:  map 82% reduce 20%
> 09/11/14 10:59:05 INFO mapred.JobClient:  map 82% reduce 21%
> 09/11/14 10:59:08 INFO mapred.JobClient:  map 82% reduce 22%
>
> That looks like they are overlapping?
>
>
>
>

I thought map and reduce could not overlap?

Posted by Raymond Jennings III <ra...@yahoo.com>.

I thought there was a barrier that ensured the map phase would finish before the reduce phase started but I see on the sample hadoop word count app:

09/11/14 10:58:50 INFO mapred.JobClient:  map 79% reduce 18%
09/11/14 10:58:54 INFO mapred.JobClient:  map 79% reduce 19%
09/11/14 10:58:55 INFO mapred.JobClient:  map 80% reduce 19%
09/11/14 10:58:58 INFO mapred.JobClient:  map 80% reduce 20%
09/11/14 10:59:00 INFO mapred.JobClient:  map 81% reduce 20%
09/11/14 10:59:04 INFO mapred.JobClient:  map 82% reduce 20%
09/11/14 10:59:05 INFO mapred.JobClient:  map 82% reduce 21%
09/11/14 10:59:08 INFO mapred.JobClient:  map 82% reduce 22%

That looks like they are overlapping?

RE: Is the job tracker a master node?

Posted by zjffdu <zj...@gmail.com>.

The conf/masters file lists the secondary NameNode, not the master node
(the file name is a bit confusing).

You can configure your name node in core-site.xml and configure your job
tracker in mapred-site.xml
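
As a rough sketch, the Python below prints the two property entries
involved. It assumes the 0.20-era property names fs.default.name and
mapred.job.tracker; the host names and ports are placeholders, and the
helper function is made up for this example.

```python
import xml.etree.ElementTree as ET

def site_xml(properties):
    """Render a dict of property names and values as a Hadoop *-site.xml body."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")

# Placeholder hosts and ports; replace them with your own machines.
core_site = site_xml({"fs.default.name": "hdfs://namenode.example.com:9000"})
mapred_site = site_xml({"mapred.job.tracker": "jobtracker.example.com:9001"})
print(core_site)    # goes into conf/core-site.xml
print(mapred_site)  # goes into conf/mapred-site.xml
```

Both files should carry the same values on every node, so that each
daemon and client knows where the NameNode and JobTracker live.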


Jeff Zhang



-----Original Message-----
From: Raymond Jennings III [mailto:raymondjiii@yahoo.com] 
Sent: November 13, 2009 9:05
To: common-user@hadoop.apache.org
Subject: Is the job tracker a master node?

I am running with the NameNode and JobTracker on separate machines.  Does
the JobTracker node need to be specified in the conf/masters file?  I am not
running it as a slave node so I do not have it in the conf/slaves file.
Thanks!

Is the job tracker a master node?

Posted by Raymond Jennings III <ra...@yahoo.com>.

I am running with the NameNode and JobTracker on separate machines.  Does the JobTracker node need to be specified in the conf/masters file?  I am not running it as a slave node so I do not have it in the conf/slaves file.  Thanks!