Posted to common-user@hadoop.apache.org by Avi Vaknin <av...@gmail.com> on 2011/08/21 13:57:16 UTC

Hadoop cluster optimization

Hi all !
How are you?

My name is Avi and I have been fascinated by Apache Hadoop for the last few
months.
I have spent the last two weeks trying to optimize my configuration files
and environment.
I have gone through many of Hadoop's configuration properties, and none of
them seems to make a big difference (about 3 minutes off the total job run time).

By Hadoop standards my cluster is extremely small (260 GB of
text files, while every job goes through only about 8 GB).
I have one server acting as "NameNode and JobTracker", and another 5 servers
acting as "DataNodes and TaskTrackers".
Right now Hadoop's configuration is set to the defaults, apart from the DFS block
size, which is set to 256 MB since every file on my cluster is 155 - 250 MB.
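(For reference, that one change is just the block-size property in hdfs-site.xml;
a minimal sketch, with the value in bytes, assuming the 0.20-era property name:)

    <property>
      <name>dfs.block.size</name>
      <value>268435456</value>   <!-- 268435456 bytes = 256 MB -->
    </property>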

All of the servers are identical, with the following hardware and software:
1.7 GB memory
1 Intel(R) Xeon(R) CPU E5507 @ 2.27GHz
Ubuntu Server 10.10, 32-bit platform
Cloudera CDH3, manual Hadoop installation
(for those familiar with Amazon Web Services, these are Small EC2 instances)

Total job run time is about 15 minutes (about 50 files/blocks/map tasks of up to
250 MB each, and 10 reduce tasks).

Based on the above, can anyone recommend a best-practice configuration?
When dealing with such a small cluster, processing such a small amount of data,
is it even possible to optimize jobs so they run much faster?

By the way, it seems that none of the nodes has a hardware performance issue
(CPU/memory) while running the job, unless I have a bottleneck somewhere else
(network bandwidth does not seem to be the issue).
What confuses me is that the NameNode process and the JobTracker process should
each allocate 1 GB of memory, which would make my hardware starting point
insufficient; so why am I not seeing full memory utilization with the 'top'
command on the NameNode/JobTracker server?
How would you recommend measuring/monitoring Hadoop to find out where the
bottleneck is?
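Is something like the following on each node (plus the web UIs, assuming the
default ports and standard Linux tools) a reasonable way to look, or is there
a better approach?

    top            # per-process CPU and memory
    iostat -x 5    # disk utilization and wait times (from the sysstat package)
    vmstat 5       # swapping and run-queue length
    # web UIs, assuming default ports:
    #   http://<namenode>:50070   and   http://<jobtracker>:50030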

Thanks for your help!!

Avi

 


RE: Hadoop cluster optimization

Posted by st...@emc.com.
Hi Avi,

I'm also learning Hadoop now. There's a tool named "nmon" that can track server usage: memory, CPU, disk and network on each machine. It's very easy to use, and there's an nmon analyser that can generate Excel charts based on the recorded nmon data.
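Roughly, on Ubuntu (package name and flags from memory, so double-check the man page):

    sudo apt-get install nmon    # Ubuntu package
    nmon                         # interactive view: CPU, memory, disk, network
    nmon -f -s 30 -c 240         # record 240 samples at 30s intervals to a .nmon file
    # open the .nmon file with the nmon analyser spreadsheet to get the charts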

Hope this helps





Re: Hadoop cluster optimization

Posted by Allen Wittenauer <aw...@apache.org>.
On Aug 22, 2011, at 3:00 AM, אבי ווקנין wrote:
> I assumed that the 1.7 GB of RAM would be the bottleneck in my environment; that's why I am trying to change it now.
> 
> I shut down the 4 datanodes with 1.7 GB of RAM (Amazon EC2 small instances) and replaced them with 2 datanodes with 7.5 GB of RAM (Amazon EC2 large instances).

	This should allow you to bump up the memory and/or increase the task count.

> Is it OK that the datanodes are 64 bit while the namenode is still 32 bit?

	I've run several instances where the NN was 64-bit and the DNs were 32-bit.  I can't think of a reason the reverse wouldn't work.  The only thing that really matters is whether they are the same CPU architecture (which, if you are running on EC2, will likely always be the case).

> Based on the new hardware I'm using, are there any suggestions regarding the
> Hadoop configuration parameters?

	It really depends upon how much memory you need per task.  That's why the task spill rate is important... :)
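	(In this generation of Hadoop the map-side spill behaviour is governed by the sort-buffer settings below, set in mapred-site.xml or per job; the values shown are just the usual defaults, listed only to name the knobs:)

	<property>
	  <name>io.sort.mb</name>
	  <value>100</value>    <!-- MB of map-side sort buffer; 100 is the default -->
	</property>
	<property>
	  <name>io.sort.spill.percent</name>
	  <value>0.80</value>   <!-- start spilling at 80% of the buffer; default -->
	</property>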

> One more thing, you asked: "Are your tasks spilling?"
> 
> How can I check if my tasks are spilling?

	Check the task logs.  
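	(One rough way to do that, assuming a default log layout; the exact path depends on the install:)

	# look for spill messages in the map task attempt logs on a tasktracker
	# (log directory shown is an assumption; adjust for your install)
	grep -i spill /var/log/hadoop*/userlogs/attempt_*/syslog
	# or compare the "Spilled Records" counter with "Map output records"
	# on the job's page in the JobTracker web UI (port 50030 by default)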

	If you aren't spilling, then you'll likely want to set task count = core count - 1, unless memory is exhausted first (i.e., tasks * per-task memory should be less than available memory).


RE: Hadoop cluster optimization

Posted by Ian Michael Gumby <mi...@hotmail.com>.
Avi, 
You can run with 1.7 GB of RAM, but that means you're going to have 1 M/R slot per node.

With 4 cores, figure 1 core for the DN and 1 core for the TT; hyper-threading on the remaining 2 cores (2 threads per core) gives you 4 virtual cores, so you could run 4 slots per node (3 mappers, 1 reducer).
That works out to 1 + 1 + 4 GB, assuming your jobs use 1 GB of heap or less.
Note this leaves some headroom on the machine for OS tasks and later fine tuning, so you're not overcommitting your machines.
Also, if your files are 256 MB and your block size is 256 MB, you get one map task per file, so there is no parallelism within a file; it only comes from processing multiple files in an M/R job.
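A rough sketch of that slot layout in mapred-site.xml on one of the 7.5 GB nodes (0.20-era property names; the numbers are illustrative, not a tested tuning):

    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>3</value>           <!-- 3 map slots per node -->
    </property>
    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>1</value>           <!-- 1 reduce slot per node -->
    </property>
    <property>
      <name>mapred.child.java.opts</name>
      <value>-Xmx1024m</value>   <!-- ~1 GB heap per task -->
    </property>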

As to running a 32-bit JT/NN with 64-bit DN/TTs: I haven't tried it, and I don't think you'll have a problem, but I wouldn't recommend it. You're adding a variable that you don't need, and I'm not sure the cost difference is worth the risk. (You said you're running this on EC2...) Note: I'm not an expert on EC2; I guess I'm lucky that I have my own sandbox to play in... :-)

HTH

-Mike



Re: Hadoop cluster optimization

Posted by אבי ווקנין <av...@gmail.com>.
Hi Allen/Michel ,


First, thanks a lot for your reply.

I assumed that the 1.7 GB of RAM would be the bottleneck in my environment;
that's why I am trying to change it now.

I shut down the 4 datanodes with 1.7 GB of RAM (Amazon EC2 small instances) and
replaced them with 2 datanodes with 7.5 GB of RAM (Amazon EC2 large instances).

Is it OK that the datanodes are 64-bit while the namenode is still 32-bit?
Based on the new hardware I'm using, are there any suggestions regarding the
Hadoop configuration parameters?

One more thing, you asked: "Are your tasks spilling?"

How can I check if my tasks are spilling?

Thanks.

Avi




RE: Hadoop cluster optimization

Posted by Avi Vaknin <av...@gmail.com>.
Hi Allen/Michel ,
First, thanks a lot for your reply.

I assumed that the 1.7 GB of RAM would be the bottleneck in my environment;
that's why I am trying to change it now.
I shut down the 4 datanodes with 1.7 GB of RAM (Amazon EC2 small instances) and
replaced them with 2 datanodes with 7.5 GB of RAM (Amazon EC2 large instances).

Is it OK that the datanodes are 64-bit while the namenode is still 32-bit?
Based on the new hardware I'm using, are there any suggestions regarding the
Hadoop configuration parameters?

One more thing, you asked: "Are your tasks spilling?"
How can I check if my tasks are spilling?

Thanks.

Avi




Re: Hadoop cluster optimization

Posted by Allen Wittenauer <aw...@apache.org>.
On Aug 21, 2011, at 7:17 PM, Michel Segel wrote:

> Avi,
> First, why a 32-bit OS?
> You have a 64-bit processor with 4 hyper-threaded cores, which looks like 8 CPUs.

	With only 1.7 GB of memory, there likely isn't much of a reason to use a 64-bit OS.  The machines (as you point out) are already tight on memory; 64-bit is only going to make it worse.

>> 
>> 1.7 GB memory
>> 1 Intel(R) Xeon(R) CPU E5507 @ 2.27GHz
>> Ubuntu Server 10.10 , 32-bit platform
>> Cloudera CDH3 Manual Hadoop Installation
>> (for those familiar with Amazon Web Services, these are Small EC2 instances)
>> 
>> Total job run time is about 15 minutes (about 50 files/blocks/map tasks of up to
>> 250 MB each, and 10 reduce tasks).
>> 
>> Based on the above, can anyone recommend a best-practice configuration?

	How many spindles?  Are your tasks spilling?


>> When dealing with such a small cluster, processing such a small amount of data,
>> is it even possible to optimize jobs so they run much faster?

	Most of the time, performance issues are with the algorithm, not Hadoop.


Re: Hadoop cluster optimization

Posted by Michel Segel <mi...@hotmail.com>.
Avi,
First, why a 32-bit OS?
You have a 64-bit processor with 4 hyper-threaded cores, which looks like 8 CPUs.

With only 1.7 GB you're going to be limited in the number of slots you can configure.
I'd say run Ganglia, but that would take resources away from you.  It sounds like the default parameters are a pretty good fit.
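(If you do want to try it anyway, on Ubuntu it is roughly the following; the package names are from memory, so double-check them:)

    sudo apt-get install ganglia-monitor                # gmond, on every node
    sudo apt-get install gmetad ganglia-webfrontend     # collector + web UI, on one box only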


Sent from a remote device. Please excuse any typos...

Mike Segel
