Posted to common-user@hadoop.apache.org by Oleg Ruchovets <or...@gmail.com> on 2012/10/02 00:02:21 UTC

Which hardware to choose

Hi,
  We are at a very early stage of our Hadoop project and want to do a POC.

We have ~5-6 terabytes of raw data, and we are going to run some
aggregations over it.

We plan to use 8-10 machines.

Questions:

  1) Which hardware should we use?
    a) How many disks, and which kind of disks should we use?
    b) How much RAM?
    c) How many CPUs?

  2) Please share best practices and tips/tricks for making good use of
hardware in Hadoop projects.

Thanks in advance
Oleg.

Re: Which hardware to choose

Posted by Michael Segel <mi...@hotmail.com>.
Ah, that's the $64,000 (USD) question...

I tend to be conservative, so this should be a good starting point.

You start with two things: the amount of memory available and the number of physical cores.

Subtract a core for each main process, e.g. DN, TT, and RS if you're running HBase.
Take the remaining cores and, if you're running on Intel with Hyper-Threading, multiply them by 2.
That's the max number of slots you should use when configuring Hadoop.

Note: For each slot, you should have at least 1GB of memory.
You may want to plan on 2GB so your child JVM opts can go up to 2GB before you have to reduce the number of slots.

So if you have dual hex-core CPUs and run HBase, it looks like the following:
12 cores less DN, TT, and RS = 9 cores; 9 * 2 = 18 slots that can be a mix of mappers and reducers.
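
If you want to plug your own numbers into that rule of thumb, here's a minimal sketch; the reserved daemons and the GB-per-slot figure are just the guidelines above, nothing Hadoop enforces:

```python
# Rough slot-count estimate following the rule of thumb above.
# Assumptions: one physical core reserved per daemon (DN, TT, and RS if
# HBase is present), Hyper-Threading doubles usable threads, and 1-2 GB
# of RAM per child task JVM.

def estimate_slots(physical_cores, ram_gb, hyperthreading=True, hbase=True,
                   gb_per_slot=2):
    daemons = 3 if hbase else 2                   # DN + TT (+ RS with HBase)
    slots_by_cpu = (physical_cores - daemons) * (2 if hyperthreading else 1)
    # Keep a few GB back for the OS and the daemons before dividing the rest.
    slots_by_ram = (ram_gb - 4 - 2 * daemons) // gb_per_slot
    return min(slots_by_cpu, slots_by_ram)

# Dual hex-core, 48 GB, HBase running: min(18 by CPU, 19 by RAM) -> 18 slots
print(estimate_slots(12, 48))
```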

That's a good starting position and you can ramp it up based on what you observe. 

YMMV of course. 

Note: When I run HBase, I don't want any swapping, so you have to pay attention to the amount of memory on the system and how it's being allocated. 

:-)


On Oct 2, 2012, at 8:57 PM, Marcos Ortiz <ml...@uci.cu> wrote:

> Which is a reasonable number in this hardware?
> 
> On 10/02/2012 09:40 PM, Michael Segel wrote:
>> I think he's saying that its 24 maps 8 reducers per node and at 48GB that could be too many mappers. 
>> Especially if they want to run HBase. 
>> 
>> On Oct 2, 2012, at 8:14 PM, hadoopman <ha...@gmail.com> wrote:
>> 
>>> Only 24 map and 8 reduce tasks for 38 data nodes?  are you sure that's right?  Sounds VERY low for a cluster that size.
>>> 
>>> We have only 10 c2100's and are running I believe 140 map and 70 reduce slots so far with pretty decent performance.
>>> 
>>> 
>>> 
>>> On 10/02/2012 12:55 PM, Alexander Pivovarov wrote:
>>>> 38 data nodes + 2 Name Nodes
>>>>>  >
>>>>>  >  Data Node:
>>>>>  >  Dell PowerEdge C2100 series
>>>>>  >  2 x XEON x5670
>>>>>  >  48 GB RAM ECC  (12x4GB 1333MHz)
>>>>>  >  12 x 2 TB  7200 RPM SATA HDD (with hot swap)  JBOD
>>>>>  >  Intel Gigabit ET Dual port PCIe x4
>>>>>  >  Redundant Power Supply
>>>>>  >  Hadoop CDH3
>>>>>  >  max map tasks 24
>>>>>  >  max reduce tasks 8
>>> 
>> 
> 
> -- 
> Marcos Luis Ortíz Valmaseda
> Data Engineer && Sr. System Administrator at UCI
> about.me/marcosortiz
> My Blog
> Tumblr's blog
> @marcosluis2186 
> 
>  
> 


Re: Which hardware to choose

Posted by Marcos Ortiz <ml...@uci.cu>.
What would be a reasonable number on this hardware?

On 10/02/2012 09:40 PM, Michael Segel wrote:
> I think he's saying that its 24 maps 8 reducers per node and at 48GB that could be too many mappers.
> Especially if they want to run HBase.
>
> On Oct 2, 2012, at 8:14 PM, hadoopman <ha...@gmail.com> wrote:
>
>> Only 24 map and 8 reduce tasks for 38 data nodes?  are you sure that's right?  Sounds VERY low for a cluster that size.
>>
>> We have only 10 c2100's and are running I believe 140 map and 70 reduce slots so far with pretty decent performance.
>>
>>
>>
>> On 10/02/2012 12:55 PM, Alexander Pivovarov wrote:
>>> 38 data nodes + 2 Name Nodes
>>>>   >
>>>>   >  Data Node:
>>>>   >  Dell PowerEdge C2100 series
>>>>   >  2 x XEON x5670
>>>>   >  48 GB RAM ECC  (12x4GB 1333MHz)
>>>>   >  12 x 2 TB  7200 RPM SATA HDD (with hot swap)  JBOD
>>>>   >  Intel Gigabit ET Dual port PCIe x4
>>>>   >  Redundant Power Supply
>>>>   >  Hadoop CDH3
>>>>   >  max map tasks 24
>>>>   >  max reduce tasks 8
>>
>

-- 

Marcos Luis Ortíz Valmaseda
*Data Engineer && Sr. System Administrator at UCI*
about.me/marcosortiz <http://about.me/marcosortiz>
My Blog <http://marcosluis2186.posterous.com>
Tumblr's blog <http://marcosortiz.tumblr.com/>
@marcosluis2186 <http://twitter.com/marcosluis2186>




Re: Which hardware to choose

Posted by Michael Segel <mi...@hotmail.com>.
Well... 

If you're not running HBase, you're less harmed by minimal swapping, so you could push the number of slots and oversubscribe. 
The only thing I would suggest is that you monitor your system closely as you adjust the number of slots.
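
For reference, the knobs you'd be adjusting are the per-TaskTracker slot limits and the child JVM heap. A small sketch that prints the relevant mapred-site.xml properties (the property names are the standard Hadoop 0.20 / CDH3 ones; the values are only example placeholders):

```python
# The slot counts discussed in this thread map onto these standard
# Hadoop 0.20 / CDH3 mapred-site.xml properties. The values below are
# placeholders to adjust while you watch the cluster.
mapred_site = {
    "mapred.tasktracker.map.tasks.maximum": 12,    # map slots per TaskTracker
    "mapred.tasktracker.reduce.tasks.maximum": 6,  # reduce slots per TaskTracker
    "mapred.child.java.opts": "-Xmx2048m",         # heap per task JVM ("child opts")
}

# Render the <property> blocks that go into mapred-site.xml.
for name, value in mapred_site.items():
    print(f"<property><name>{name}</name><value>{value}</value></property>")
```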

You have to admit though, it's fun to tune the cluster. :-)

On Oct 3, 2012, at 12:09 PM, J. Rottinghuis <jr...@gmail.com> wrote:

> Of course it all depends...
> But something like this could work:
> 
> Leave 1-2 GB for the kernel, pagecache, tools, overhead etc.
> Plan 3-4 GB for Datanode and Tasktracker each
> 
> Plan 2.5-3 GB per slot. Depending on the kinds of jobs, you may need more
> or less memory per slot.
> Have 2-3 times as many mappers as reducers (depending on the kinds of jobs
> you run).
> 
> As Micheal pointed out the ratio of cores (hyperthreads) per disk matters.
> 
> With those initial rules of thumb you'd arrive somewhere between
> 10 mappers + 5 reducers
> and
> 9 mappers + 4 reducers
> 
> Try, test, measure, adjust, rinse, repeat.
> 
> Cheers,
> 
> Joep
> 
> On Tue, Oct 2, 2012 at 8:42 PM, Alexander Pivovarov <ap...@gmail.com>wrote:
> 
>> All configs are per node.
>> No HBase, only Hive and Pig installed
>> 
>> On Tue, Oct 2, 2012 at 9:40 PM, Michael Segel <michael_segel@hotmail.com
>>> wrote:
>> 
>>> I think he's saying that its 24 maps 8 reducers per node and at 48GB that
>>> could be too many mappers.
>>> Especially if they want to run HBase.
>>> 
>>> On Oct 2, 2012, at 8:14 PM, hadoopman <ha...@gmail.com> wrote:
>>> 
>>>> Only 24 map and 8 reduce tasks for 38 data nodes?  are you sure that's
>>> right?  Sounds VERY low for a cluster that size.
>>>> 
>>>> We have only 10 c2100's and are running I believe 140 map and 70 reduce
>>> slots so far with pretty decent performance.
>>>> 
>>>> 
>>>> 
>>>> On 10/02/2012 12:55 PM, Alexander Pivovarov wrote:
>>>>> 38 data nodes + 2 Name Nodes
>>>>>>> 
>>>>>>> Data Node:
>>>>>>> Dell PowerEdge C2100 series
>>>>>>> 2 x XEON x5670
>>>>>>> 48 GB RAM ECC  (12x4GB 1333MHz)
>>>>>>> 12 x 2 TB  7200 RPM SATA HDD (with hot swap)  JBOD
>>>>>>> Intel Gigabit ET Dual port PCIe x4
>>>>>>> Redundant Power Supply
>>>>>>> Hadoop CDH3
>>>>>>> max map tasks 24
>>>>>>> max reduce tasks 8
>>>> 
>>>> 
>>> 
>>> 
>> 


Re: Which hardware to choose

Posted by "J. Rottinghuis" <jr...@gmail.com>.
Of course it all depends...
But something like this could work:

Leave 1-2 GB for the kernel, page cache, tools, overhead, etc.
Plan 3-4 GB each for the DataNode and TaskTracker.

Plan 2.5-3 GB per slot. Depending on the kinds of jobs, you may need more
or less memory per slot.
Have 2-3 times as many mappers as reducers (depending on the kinds of jobs
you run).

As Michael pointed out, the ratio of cores (hyperthreads) per disk matters.

With those initial rules of thumb you'd arrive somewhere between
10 mappers + 5 reducers
and
9 mappers + 4 reducers
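
Here is the same budget as a small sketch, so you can vary the figures for your own nodes (all the GB values are just the suggested ranges above, not Hadoop defaults):

```python
# Memory budget for one 48 GB node, using the rules of thumb above.

def slot_budget(ram_gb=48, os_gb=2, dn_gb=4, tt_gb=4,
                gb_per_slot=2.5, map_to_reduce_ratio=2):
    usable = ram_gb - os_gb - dn_gb - tt_gb        # RAM left for task JVMs
    slots = int(usable // gb_per_slot)             # total map + reduce slots
    reducers = slots // (map_to_reduce_ratio + 1)
    return slots - reducers, reducers              # (mappers, reducers)

print(slot_budget(gb_per_slot=2.5))                # -> (10, 5)
print(slot_budget(os_gb=1, gb_per_slot=3.0))       # -> (9, 4)
```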

Try, test, measure, adjust, rinse, repeat.

Cheers,

Joep

On Tue, Oct 2, 2012 at 8:42 PM, Alexander Pivovarov <ap...@gmail.com>wrote:

> All configs are per node.
> No HBase, only Hive and Pig installed
>
> On Tue, Oct 2, 2012 at 9:40 PM, Michael Segel <michael_segel@hotmail.com
> >wrote:
>
> > I think he's saying that its 24 maps 8 reducers per node and at 48GB that
> > could be too many mappers.
> > Especially if they want to run HBase.
> >
> > On Oct 2, 2012, at 8:14 PM, hadoopman <ha...@gmail.com> wrote:
> >
> > > Only 24 map and 8 reduce tasks for 38 data nodes?  are you sure that's
> > right?  Sounds VERY low for a cluster that size.
> > >
> > > We have only 10 c2100's and are running I believe 140 map and 70 reduce
> > slots so far with pretty decent performance.
> > >
> > >
> > >
> > > On 10/02/2012 12:55 PM, Alexander Pivovarov wrote:
> > >> 38 data nodes + 2 Name Nodes
> > >> >  >
> > >> >  >  Data Node:
> > >> >  >  Dell PowerEdge C2100 series
> > >> >  >  2 x XEON x5670
> > >> >  >  48 GB RAM ECC  (12x4GB 1333MHz)
> > >> >  >  12 x 2 TB  7200 RPM SATA HDD (with hot swap)  JBOD
> > >> >  >  Intel Gigabit ET Dual port PCIe x4
> > >> >  >  Redundant Power Supply
> > >> >  >  Hadoop CDH3
> > >> >  >  max map tasks 24
> > >> >  >  max reduce tasks 8
> > >
> > >
> >
> >
>

Re: Which hardware to choose

Posted by Alexander Pivovarov <ap...@gmail.com>.
All configs are per node.
No HBase, only Hive and Pig installed

On Tue, Oct 2, 2012 at 9:40 PM, Michael Segel <mi...@hotmail.com>wrote:

> I think he's saying that its 24 maps 8 reducers per node and at 48GB that
> could be too many mappers.
> Especially if they want to run HBase.
>
> On Oct 2, 2012, at 8:14 PM, hadoopman <ha...@gmail.com> wrote:
>
> > Only 24 map and 8 reduce tasks for 38 data nodes?  are you sure that's
> right?  Sounds VERY low for a cluster that size.
> >
> > We have only 10 c2100's and are running I believe 140 map and 70 reduce
> slots so far with pretty decent performance.
> >
> >
> >
> > On 10/02/2012 12:55 PM, Alexander Pivovarov wrote:
> >> 38 data nodes + 2 Name Nodes
> >> >  >
> >> >  >  Data Node:
> >> >  >  Dell PowerEdge C2100 series
> >> >  >  2 x XEON x5670
> >> >  >  48 GB RAM ECC  (12x4GB 1333MHz)
> >> >  >  12 x 2 TB  7200 RPM SATA HDD (with hot swap)  JBOD
> >> >  >  Intel Gigabit ET Dual port PCIe x4
> >> >  >  Redundant Power Supply
> >> >  >  Hadoop CDH3
> >> >  >  max map tasks 24
> >> >  >  max reduce tasks 8
> >
> >
>
>

Re: Which hardware to choose

Posted by Michael Segel <mi...@hotmail.com>.
I think he's saying that it's 24 maps and 8 reducers per node, and at 48GB that could be too many mappers. 
Especially if they want to run HBase. 

On Oct 2, 2012, at 8:14 PM, hadoopman <ha...@gmail.com> wrote:

> Only 24 map and 8 reduce tasks for 38 data nodes?  are you sure that's right?  Sounds VERY low for a cluster that size.
> 
> We have only 10 c2100's and are running I believe 140 map and 70 reduce slots so far with pretty decent performance.
> 
> 
> 
> On 10/02/2012 12:55 PM, Alexander Pivovarov wrote:
>> 38 data nodes + 2 Name Nodes
>> >  >
>> >  >  Data Node:
>> >  >  Dell PowerEdge C2100 series
>> >  >  2 x XEON x5670
>> >  >  48 GB RAM ECC  (12x4GB 1333MHz)
>> >  >  12 x 2 TB  7200 RPM SATA HDD (with hot swap)  JBOD
>> >  >  Intel Gigabit ET Dual port PCIe x4
>> >  >  Redundant Power Supply
>> >  >  Hadoop CDH3
>> >  >  max map tasks 24
>> >  >  max reduce tasks 8
> 
> 


Re: Which hardware to choose

Posted by hadoopman <ha...@gmail.com>.
Had to ask :D


On 10/02/2012 07:19 PM, Russell Jurney wrote:
> I believe he means per node.
>
> Russell Jurney http://datasyndrome.com
>
> On Oct 2, 2012, at 6:15 PM, hadoopman<ha...@gmail.com>  wrote:
>
>> Only 24 map and 8 reduce tasks for 38 data nodes?  are you sure that's right?  Sounds VERY low for a cluster that size.
>>
>> We have only 10 c2100's and are running I believe 140 map and 70 reduce slots so far with pretty decent performance.
>>
>>
>>
>> On 10/02/2012 12:55 PM, Alexander Pivovarov wrote:
>>> 38 data nodes + 2 Name Nodes
>>>>   >
>>>>   >   Data Node:
>>>>   >   Dell PowerEdge C2100 series
>>>>   >   2 x XEON x5670
>>>>   >   48 GB RAM ECC  (12x4GB 1333MHz)
>>>>   >   12 x 2 TB  7200 RPM SATA HDD (with hot swap)  JBOD
>>>>   >   Intel Gigabit ET Dual port PCIe x4
>>>>   >   Redundant Power Supply
>>>>   >   Hadoop CDH3
>>>>   >   max map tasks 24
>>>>   >   max reduce tasks 8


Re: Which hardware to choose

Posted by Russell Jurney <ru...@gmail.com>.
I believe he means per node.

Russell Jurney http://datasyndrome.com

On Oct 2, 2012, at 6:15 PM, hadoopman <ha...@gmail.com> wrote:

> Only 24 map and 8 reduce tasks for 38 data nodes?  are you sure that's right?  Sounds VERY low for a cluster that size.
>
> We have only 10 c2100's and are running I believe 140 map and 70 reduce slots so far with pretty decent performance.
>
>
>
> On 10/02/2012 12:55 PM, Alexander Pivovarov wrote:
>> 38 data nodes + 2 Name Nodes
>> >  >
>> >  >  Data Node:
>> >  >  Dell PowerEdge C2100 series
>> >  >  2 x XEON x5670
>> >  >  48 GB RAM ECC  (12x4GB 1333MHz)
>> >  >  12 x 2 TB  7200 RPM SATA HDD (with hot swap)  JBOD
>> >  >  Intel Gigabit ET Dual port PCIe x4
>> >  >  Redundant Power Supply
>> >  >  Hadoop CDH3
>> >  >  max map tasks 24
>> >  >  max reduce tasks 8
>

Re: Which hardware to choose

Posted by hadoopman <ha...@gmail.com>.
Only 24 map and 8 reduce tasks for 38 data nodes? Are you sure that's 
right? Sounds VERY low for a cluster that size.

We have only 10 C2100s and are running, I believe, 140 map and 70 reduce 
slots so far with pretty decent performance.



On 10/02/2012 12:55 PM, Alexander Pivovarov wrote:
> 38 data nodes + 2 Name Nodes
> >  >
> >  >  Data Node:
> >  >  Dell PowerEdge C2100 series
> >  >  2 x XEON x5670
> >  >  48 GB RAM ECC  (12x4GB 1333MHz)
> >  >  12 x 2 TB  7200 RPM SATA HDD (with hot swap)  JBOD
> >  >  Intel Gigabit ET Dual port PCIe x4
> >  >  Redundant Power Supply
> >  >  Hadoop CDH3
> >  >  max map tasks 24
> >  >  max reduce tasks 8


Re: Which hardware to choose

Posted by Alexander Pivovarov <ap...@gmail.com>.
Not sure.

The following options are available:
Integrated ICH10R on motherboard
LSI® 6Gb SAS2008 daughtercard
Dell PERC H200
Dell PERC H700
LSI MegaRAID® SAS 9260-8i

http://www.dell.com/us/enterprise/p/poweredge-c2100/pd

On Tue, Oct 2, 2012 at 10:59 AM, Oleg Ruchovets <or...@gmail.com>wrote:

> Great ,
>
> Thank you for the such detailed information,
>
> By the way what type of Disk Controller do you use?
>
> Thanks
> Oleg.
>
>
> On Tue, Oct 2, 2012 at 6:34 AM, Alexander Pivovarov <apivovarov@gmail.com
> >wrote:
>
> > Privet Oleg
> >
> > Cloudera and Dell setup the following cluster for my company
> > Company receives 1.5 TB raw data per day
> >
> > 38 data nodes + 2 Name Nodes
> >
> > Data Node:
> > Dell PowerEdge C2100 series
> > 2 x XEON x5670
> > 48 GB RAM ECC  (12x4GB 1333MHz)
> > 12 x 2 TB  7200 RPM SATA HDD (with hot swap)  JBOD
> > Intel Gigabit ET Dual port PCIe x4
> > Redundant Power Supply
> > Hadoop CDH3
> > max map tasks 24
> > max reduce tasks 8
> >
> > Name Node and Secondary Name Node are the similar but
> > 96GB RAM  (not sure why)
> > 6x600Gb 15 RPM Serial SCSI
> > RAID10
> >
> >
> > another config is here
> > page 298
> >
> >
> http://books.google.com/books?id=Wu_xeGdU4G8C&pg=PA298&lpg=PA298&dq=hadoop+jbod&source=bl&ots=i7xVQBPb_w&sig=8mhq-MtpkRcTiRB1ioKciMxIasg&hl=en&sa=X&ei=AGtqUMK6D8T10gHD4ICQAQ&ved=0CEMQ6AEwAg#v=onepage&q=hadoop%20jbod&f=false
> >
> >
> > you probably need just 1 computer with 10 x 2 TB SATA HDD
> >
> >
> >
> > On Mon, Oct 1, 2012 at 6:02 PM, Oleg Ruchovets <or...@gmail.com>
> > wrote:
> >
> > > Hi ,
> > >   We are on a very early stage of our hadoop project and want to do a
> > POC.
> > >
> > > We have ~ 5-6 terabytes of row data and we are going to execute some
> > > aggregations.
> > >
> > > We plan to use  8 - 10 machines
> > >
> > > Questions:
> > >
> > >   1)  Which hardware should we use:
> > >     a) How many discs , what discs is better to use?
> > >     b) How many RAM?
> > >     c) How many CPUs?
> > >
> > >
> > >    2) Please share best practices and tips / tricks related to utilise
> > > hardware using for hadoop projects.
> > >
> > > Thanks in advance
> > > Oleg.
> > >
> >
>

Re: Which hardware to choose

Posted by Oleg Ruchovets <or...@gmail.com>.
Great,

Thank you for the detailed information.

By the way, what type of disk controller do you use?

Thanks
Oleg.


On Tue, Oct 2, 2012 at 6:34 AM, Alexander Pivovarov <ap...@gmail.com>wrote:

> Privet Oleg
>
> Cloudera and Dell setup the following cluster for my company
> Company receives 1.5 TB raw data per day
>
> 38 data nodes + 2 Name Nodes
>
> Data Node:
> Dell PowerEdge C2100 series
> 2 x XEON x5670
> 48 GB RAM ECC  (12x4GB 1333MHz)
> 12 x 2 TB  7200 RPM SATA HDD (with hot swap)  JBOD
> Intel Gigabit ET Dual port PCIe x4
> Redundant Power Supply
> Hadoop CDH3
> max map tasks 24
> max reduce tasks 8
>
> Name Node and Secondary Name Node are the similar but
> 96GB RAM  (not sure why)
> 6x600Gb 15 RPM Serial SCSI
> RAID10
>
>
> another config is here
> page 298
>
> http://books.google.com/books?id=Wu_xeGdU4G8C&pg=PA298&lpg=PA298&dq=hadoop+jbod&source=bl&ots=i7xVQBPb_w&sig=8mhq-MtpkRcTiRB1ioKciMxIasg&hl=en&sa=X&ei=AGtqUMK6D8T10gHD4ICQAQ&ved=0CEMQ6AEwAg#v=onepage&q=hadoop%20jbod&f=false
>
>
> you probably need just 1 computer with 10 x 2 TB SATA HDD
>
>
>
> On Mon, Oct 1, 2012 at 6:02 PM, Oleg Ruchovets <or...@gmail.com>
> wrote:
>
> > Hi ,
> >   We are on a very early stage of our hadoop project and want to do a
> POC.
> >
> > We have ~ 5-6 terabytes of row data and we are going to execute some
> > aggregations.
> >
> > We plan to use  8 - 10 machines
> >
> > Questions:
> >
> >   1)  Which hardware should we use:
> >     a) How many discs , what discs is better to use?
> >     b) How many RAM?
> >     c) How many CPUs?
> >
> >
> >    2) Please share best practices and tips / tricks related to utilise
> > hardware using for hadoop projects.
> >
> > Thanks in advance
> > Oleg.
> >
>

Re: Which hardware to choose

Posted by Alexander Pivovarov <ap...@gmail.com>.
Privet Oleg,

Cloudera and Dell set up the following cluster for my company.
The company receives 1.5 TB of raw data per day.

38 data nodes + 2 Name Nodes

Data Node:
Dell PowerEdge C2100 series
2 x XEON x5670
48 GB RAM ECC  (12x4GB 1333MHz)
12 x 2 TB  7200 RPM SATA HDD (with hot swap)  JBOD
Intel Gigabit ET Dual port PCIe x4
Redundant Power Supply
Hadoop CDH3
max map tasks 24
max reduce tasks 8

The Name Node and Secondary Name Node are similar, but with:
96 GB RAM (not sure why)
6 x 600 GB 15K RPM Serial SCSI
RAID10


Another config is here, on page 298:
http://books.google.com/books?id=Wu_xeGdU4G8C&pg=PA298&lpg=PA298&dq=hadoop+jbod&source=bl&ots=i7xVQBPb_w&sig=8mhq-MtpkRcTiRB1ioKciMxIasg&hl=en&sa=X&ei=AGtqUMK6D8T10gHD4ICQAQ&ved=0CEMQ6AEwAg#v=onepage&q=hadoop%20jbod&f=false


You probably need just one computer with 10 x 2 TB SATA HDDs.
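
Either way, here's a back-of-the-envelope capacity check for the ~5-6 TB of input. The 3x is the HDFS default replication factor, and the 25% headroom for shuffle/temp output is just an assumption (on a single test box you'd drop replication to 1):

```python
# Back-of-the-envelope HDFS capacity check for ~5-6 TB of input data.

def disk_needed_tb(data_tb, replication=3, temp_headroom=0.25):
    # Replicated blocks plus some working space for shuffle/temp output.
    return data_tb * replication * (1 + temp_headroom)

def cluster_capacity_tb(nodes, disks_per_node, tb_per_disk):
    return nodes * disks_per_node * tb_per_disk

need = disk_needed_tb(6)                                               # ~22.5 TB
have = cluster_capacity_tb(nodes=10, disks_per_node=2, tb_per_disk=2)  # 40 TB
print(f"need ~{need:.1f} TB, have {have} TB, enough: {have >= need}")
```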



On Mon, Oct 1, 2012 at 6:02 PM, Oleg Ruchovets <or...@gmail.com> wrote:

> Hi ,
>   We are on a very early stage of our hadoop project and want to do a POC.
>
> We have ~ 5-6 terabytes of row data and we are going to execute some
> aggregations.
>
> We plan to use  8 - 10 machines
>
> Questions:
>
>   1)  Which hardware should we use:
>     a) How many discs , what discs is better to use?
>     b) How many RAM?
>     c) How many CPUs?
>
>
>    2) Please share best practices and tips / tricks related to utilise
> hardware using for hadoop projects.
>
> Thanks in advance
> Oleg.
>