Posted to common-user@hadoop.apache.org by Patrick Angeles <pa...@gmail.com> on 2009/05/27 16:43:35 UTC

hadoop hardware configuration

Sorry for cross-posting, I realized I sent the following to the hbase list
when it's really more a Hadoop question.

---------- Forwarded message ----------
From: Patrick Angeles <pa...@gmail.com>
Date: Wed, May 27, 2009 at 9:50 AM
Subject: hadoop hardware configuration
To: hbase-user@hadoop.apache.org


Hey all,

I'm trying to find some up-to-date hardware advice for building a Hadoop
cluster. I've only been able to dig up the following links. Given Moore's
law, these are already out of date:

http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200811.mbox/%3CA47C361B-D19B-4A61-8DC1-41D4C0975EC8@cse.unl.edu%3E
http://wiki.apache.org/hadoop/MachineScaling

We expect to be taking in roughly 50GB of log data per day. In the early
going, we can choose to retain the logs for only a short period after
processing, so we can start with a small cluster (around 6 task nodes).
However, at some point, we will want to retain up to a year's worth of raw
data (~14TB per year).

We will likely be using Hive/Pig and Mahout for cluster analysis.

Given this, I'd like to run by the following machine specs to see what
everyone thinks:

2 x Hadoop Master (and Secondary NameNode)

   - 2 x 2.3GHz Quad Core (Low Power Opteron -- 2376 HE @ 55W)
   - 16GB DDR2-800 Registered ECC Memory
   - 4 x 1TB 7200rpm SATA II Drives
   - Hardware RAID controller
   - Redundant Power Supply
   - Approx. 390W power draw (1.9 amps @ 208V)
   - Approx. $4000 per unit

6 x Hadoop Task Nodes

   - 1 x 2.3GHz Quad Core (Opteron 1356)
   - 8GB DDR2-800 Registered ECC Memory
   - 4 x 1TB 7200rpm SATA II Drives
   - No RAID (JBOD)
   - Non-Redundant Power Supply
   - Approx. 210W power draw (1.0 amps @ 208V)
   - Approx. $2000 per unit

I had some specific questions regarding this configuration...

   1. Is hardware RAID necessary for the master node?
   2. What is a good processor-to-storage ratio for a task node with 4TB of
   raw storage? (The config above has 1 core per 1TB of raw storage.)
   3. Am I better off using dual quads for a task node, with a higher power
   draw? Dual quad task node with 16GB RAM and 4TB storage costs roughly $3200,
   but draws almost 2x as much power. The tradeoffs are:
      1. I will get more CPU per dollar and per watt.
      2. I will only be able to fit half as many dual-quad machines into a
      rack.
      3. I will get 1/2 the storage capacity per watt.
      4. I will get less I/O throughput overall (fewer spindles per core)
   4. In planning storage capacity, how much spare disk space should I take
   into account for 'scratch'? For now, I'm assuming 1x the input data size
   (rough sketch below).
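
Spelling out that assumption (a rough back-of-envelope in Python; the 3x
factor is the HDFS replication default, and treating scratch as 1x the
stored input is just my assumption above):

    # Back-of-envelope HDFS sizing. Assumes the default 3x replication
    # and a 1x scratch allowance for intermediate data.
    yearly_data_tb = 14       # ~1 year of retained raw logs, per above
    replication = 3           # HDFS default (dfs.replication)
    scratch_factor = 1.0      # temp/intermediate space, 1x input size

    required_raw_tb = yearly_data_tb * (replication + scratch_factor)
    print("raw disk needed for a year: ~%d TB" % required_raw_tb)  # ~56 TB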

Thanks in advance,

- P

Re: hadoop hardware configuration

Posted by Patrick Angeles <pa...@gmail.com>.
On Thu, May 28, 2009 at 6:02 AM, Steve Loughran <st...@apache.org> wrote:

> That really depends on the work you are doing...the bytes in/out to CPU
> work, and the size of any memory structures that are built up over the run.
>
> With 1 core per physical disk, you get the bandwidth of a single disk per
> CPU; for some IO-intensive work you can make the case for two disks/CPU -one
> in, one out, but then you are using more power, and if/when you want to add
> more storage, you have to pull out the disks to stick in new ones. If you go
> for more CPUs, you will probably need more RAM to go with it.
>

Just to throw a wrench in the works, Intel's Nehalem architecture takes DDR3
memory, which is installed in sets of three. So for a dual quad-core rig, you
can get either 6 x 2GB (12GB) or 6 x 4GB (24GB) for an extra $500. That's a
big step up in price for extra memory in a slave node. 12GB probably won't be
enough, because the mid-range Nehalems support hyper-threading, so you
actually get up to 16 threads running on a dual quad setup.
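
A quick sanity check on memory per concurrent task (a sketch in Python; the
one-task-slot-per-hardware-thread mapping and the ~2GB OS/daemon overhead
are my assumptions, not measured figures):

    # Memory per task slot on a dual quad-core Nehalem with HT enabled.
    threads = 16                  # 2 sockets x 4 cores x 2 HT threads
    daemon_overhead_gb = 2        # assumed: OS + DataNode + TaskTracker
    for total_gb in (12, 24):
        per_slot = (total_gb - daemon_overhead_gb) / float(threads)
        print("%2dGB node: ~%.2fGB per task slot" % (total_gb, per_slot))
    # 12GB leaves ~0.6GB per slot -- tight for typical JVM task heaps.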


> Then there is the question of where your electricity comes from, what the
> limits for the room are, whether you are billed on power drawn or quoted PSU
> draw, what the HVAC limits are, what the maximum allowed weight per rack is,
> etc, etc.


We're going to start with cabinets in a co-location. Most can provide 40
amps per cabinet (with up to 80% load), so you could fit around 30
single-socket servers or 15 dual-socket servers in a single rack.
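
In numbers (a sketch in Python, using the approximate per-node draws from the
specs earlier in the thread):

    # Rack power budget at a typical co-location cabinet.
    breaker_amps = 40
    volts = 208
    budget_watts = breaker_amps * 0.8 * volts   # 80% continuous load limit

    print(int(budget_watts / 210))   # ~31 single-socket nodes (~210W each)
    print(int(budget_watts / 420))   # ~15 dual-socket nodes (~2x the draw)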


>
> I'm a fan of low Joule work, though we don't have any benchmarks yet of the
> power efficiency of different clusters; the number of MJ used to do a
> terasort. I'm debating doing some single-cpu tests for this on my laptop, as
> the battery knows how much gets used up by some work.
>
>>   4. In planning storage capacity, how much spare disk space should I take
>>   into account for 'scratch'? For now, I'm assuming 1x the input data
>> size.
>>
>
> That you should probably be able to determine on experimental work on
> smaller datasets. Some maps can throw out a lot of data, most reduces do
> actually reduce the final amount.
>
>
> -Steve
>
> (Disclaimer: I'm not making any official recommendations for hardware here,
> just making my opinions known. If you do want an official recommendation
> from HP, talk to your reseller or account manager, someone will look at your
> problem in more detail and make some suggestions. If you have any code/data
> that could be shared for benchmarking, that would help validate those
> suggestions)
>
>

Re: hadoop hardware configuration

Posted by Brian Bockelman <bb...@cse.unl.edu>.
On May 28, 2009, at 10:32 AM, Ian Soboroff wrote:

> Brian Bockelman <bb...@cse.unl.edu> writes:
>
>> Despite my trying, I've never been able to come even close to pegging
>> the CPUs on our NN.
>>
>> I'd recommend going for the fastest dual-cores which are affordable --
>> latency is king.
>
> Clue?
>
> Surely the latencies in Hadoop that dominate are not cured with faster
> processors, but with more RAM and faster disks?
>
> I've followed your posts for a while, so I know you are very  
> experienced
> with this stuff... help me out here.

Actually, that's more of a gut feeling than an informed decision.
Because the locking is rather coarse-grained, having many CPUs isn't
going to win anything -- I'd rather have any CPU-related portions go as
fast as possible.  Under the highest load, I think we've been able to
get up to 25% CPU utilization: thus, I'm guessing any CPU-related
improvements will come from faster cores, not more of them.

For my cluster, if I had a lot of money, I'd spend it on a hot-spare  
machine.  Then, I'd spend it on upgrading the RAM, followed by disks,  
followed by CPU.

Then again, for the cluster in the original email, I'd save money on  
the namenode and buy more datanodes.  We've got about 200 nodes and  
probably have a comparable NN.

Brian

Re: hadoop hardware configuration

Posted by Ian Soboroff <ia...@nist.gov>.
Brian Bockelman <bb...@cse.unl.edu> writes:

> Despite my trying, I've never been able to come even close to pegging
> the CPUs on our NN.
>
> I'd recommend going for the fastest dual-cores which are affordable -- 
> latency is king.

Clue?

Surely the latencies in Hadoop that dominate are not cured with faster
processors, but with more RAM and faster disks?

I've followed your posts for a while, so I know you are very experienced
with this stuff... help me out here.

Ian

Re: hadoop hardware configuration

Posted by stephen mulcahy <st...@deri.org>.
Brian Bockelman wrote:
> I'm not a hardware guy anymore, but I'd personally prefer a software 
> RAID.  I've seen mirrored disks go down because the RAID controller 
> decided to puke.

+1 on this. I've seen a number of hardware RAID failures in the last 2 
years and in each case the controller mangled the disks before finally 
giving up the ghost.

I'm very much inclined to prefer software RAID in light of this (and the 
fact that low-end RAID controllers have poor performance).

-stephen

-- 
Stephen Mulcahy, DI2, Digital Enterprise Research Institute,
NUI Galway, IDA Business Park, Lower Dangan, Galway, Ireland
http://di2.deri.ie    http://webstar.deri.ie    http://sindice.com

Re: hadoop hardware configuration

Posted by Brian Bockelman <bb...@cse.unl.edu>.
On May 28, 2009, at 2:00 PM, Patrick Angeles wrote:

> On Thu, May 28, 2009 at 10:24 AM, Brian Bockelman <bbockelm@cse.unl.edu>
> wrote:
>
>>
>> We do both -- push the disk image out to NFS and have mirrored SAS hard
>> drives on the namenode.  The SAS drives appear to be overkill.
>>
>
> This sounds like a nice approach, taking into account hardware, labor and
> downtime costs... $700 for a RAID controller seems reasonable to minimize
> maintenance due to a disk failure. Alex's suggestion to go JBOD and write to
> all volumes would work as well, but is slightly more labor-intensive.

Remember though that disk failure downtime is actually rather rare.
The question is "how tight is your hardware budget": if $700 is worth
the extra day of uptime a year, then spend it.  I come from an
academic background where (a) we don't lose money if things go down
and (b) jobs move to another site in the US if things are down.  That
perhaps gives you some insight into my somewhat relaxed attitude.

I'm not a hardware guy anymore, but I'd personally prefer a software  
RAID.  I've seen mirrored disks go down because the RAID controller  
decided to puke.

>
>
>>>  2. What is a good processor-to-storage ratio for a task node with 4TB of
>>>  raw storage? (The config above has 1 core per 1TB of raw storage.)
>>>
>>
>> We're data hungry locally -- I'd put in bigger hard drives.  The 1.5TB
>> Seagate drives seem to have passed their teething issues, and are at a
>> pretty sweet price point.  They will only scale up to 60 IOPS, so make sure
>> your workflows don't have lots of random I/O.
>>
>
> I haven't seen too many vendors offering the 1.5TB option. What type  
> of data
> are you working with? At what volumes? I sense that at 50GB/day, we  
> are
> higher than average in terms of data volume over time.
>

We have just short of 300TB of raw disk; our daily downloads range  
from a few GB to 10TB.

We bought 1.5TB drives separately from the nodes and sent students
with screwdrivers to the cluster.

>
>> As Steve mentions below, the rest is really up to your algorithm.   
>> Do you
>> need 1 CPU second / byte?  If so, buy more CPUs.  Do you need .1  
>> CPU second
>> / MB?  If so, buy more disks.
>>
>
> Unfortunately, we won't know until we have a cluster to test on.  
> Classic
> catch-22. We are going to experiment with a small cluster and a  
> small data
> set, with plans to buy more appropriately sized slave nodes based on  
> what we
> learn.
>

In that case, you're probably good!  24TB probably formats out to  
20TB.  With 2x replication at 50GB a day, you've got enough room for  
about half a year of data.  Hope your procurement process isn't too  
slow!
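
Spelling that estimate out (a sketch in Python):

    # Retention estimate for the proposed 6-node cluster.
    formatted_tb = 20.0        # 24TB raw, ~20TB after formatting
    replication = 2            # 2x, as assumed above
    daily_intake_gb = 50.0

    usable_tb = formatted_tb / replication        # 10 TB of unique data
    days = usable_tb * 1000 / daily_intake_gb     # 200 days
    print("~%d days of retention" % days)         # about half a year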

Brian


Re: hadoop hardware configuration

Posted by Patrick Angeles <pa...@gmail.com>.
On Thu, May 28, 2009 at 10:24 AM, Brian Bockelman <bb...@cse.unl.edu> wrote:

>
> We do both -- push the disk image out to NFS and have mirrored SAS hard
> drives on the namenode.  The SAS drives appear to be overkill.
>

This sounds like a nice approach, taking into account hardware, labor and
downtime costs... $700 for a RAID controller seems reasonable to minimize
maintenance due to a disk failure. Alex's suggestion to go JBOD and write to
all volumes would work as well, but is slightly more labor-intensive.


>>   2. What is a good processor-to-storage ratio for a task node with 4TB of
>>   raw storage? (The config above has 1 core per 1TB of raw storage.)
>>
>
> We're data hungry locally -- I'd put in bigger hard drives.  The 1.5TB
> Seagate drives seem to have passed their teething issues, and are at a
> pretty sweet price point.  They will only scale up to 60 IOPS, so make sure
> your workflows don't have lots of random I/O.
>

I haven't seen too many vendors offering the 1.5TB option. What type of data
are you working with? At what volumes? I sense that at 50GB/day, we are
higher than average in terms of data volume over time.


> As Steve mentions below, the rest is really up to your algorithm.  Do you
> need 1 CPU second / byte?  If so, buy more CPUs.  Do you need .1 CPU second
> / MB?  If so, buy more disks.
>

Unfortunately, we won't know until we have a cluster to test on. Classic
catch-22. We are going to experiment with a small cluster and a small data
set, with plans to buy more appropriately sized slave nodes based on what we
learn.

- P

Re: hadoop hardware configuration

Posted by Brian Bockelman <bb...@cse.unl.edu>.
On May 28, 2009, at 5:02 AM, Steve Loughran wrote:

> Patrick Angeles wrote:
>> Sorry for cross-posting, I realized I sent the following to the  
>> hbase list
>> when it's really more a Hadoop question.
>
> This is an interesting question. Obviously as an HP employee you  
> must assume that I'm biased when I say HP DL160 servers are good   
> value for the workers, though our blade systems are very good for a  
> high physical density -provided you have the power to fill up the  
> rack.

:)

As an HP employee, can you provide us with coupons?

>
>> 2 x Hadoop Master (and Secondary NameNode)
>>   - 2 x 2.3GHz Quad Core (Low Power Opteron -- 2376 HE @ 55W)
>>   - 16GB DDR2-800 Registered ECC Memory
>>   - 4 x 1TB 7200rpm SATA II Drives
>>   - Hardware RAID controller
>>   - Redundant Power Supply
>>   - Approx. 390W power draw (1.9 amps @ 208V)
>>   - Approx. $4000 per unit
>
> I do not know what the advantages of that many cores are on a  
> NN. Someone needs to do some experiments. I do know you need enough  
> RAM to hold the index in memory, and you may want to go for a bigger  
> block size to keep the index size down.
>

Despite my trying, I've never been able to come even close to pegging  
the CPUs on our NN.

I'd recommend going for the fastest dual-cores which are affordable --  
latency is king.

Of course, with the size of your cluster, I'd spend a little less  
money here and get more disk space.

>
>> 6 x Hadoop Task Nodes
>>   - 1 x 2.3GHz Quad Core (Opteron 1356)
>>   - 8GB DDR2-800 Registered ECC Memory
>>   - 4 x 1TB 7200rpm SATA II Drives
>>   - No RAID (JBOD)
>>   - Non-Redundant Power Supply
>>   - Approx. 210W power draw (1.0 amps @ 208V)
>>   - Approx. $2000 per unit
>> I had some specific questions regarding this configuration...
>
>
>>   1. Is hardware RAID necessary for the master node?
>
>
> You need a good story to ensure that loss of a disk on the master
> doesn't lose the filesystem. I like RAID there, but the alternative
> is to push the stuff out over the network to other storage you
> trust. That could be NFS-mounted RAID storage, it could be NFS-mounted
> JBOD. Whatever your chosen design, test that it works before you go
> live: run the cluster, simulate different failures, and see how well
> the hardware/ops team handles them.
>
> Keep an eye on where that data goes, because when the NN runs out of
> file storage, the consequences can be pretty dramatic (i.e. the
> cluster doesn't come up unless you edit the editlog by hand)

We do both -- push the disk image out to NFS and have mirrored SAS
hard drives on the namenode.  The SAS drives appear to be overkill.
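
(For reference, the NFS half of that is just a matter of listing more than
one directory in dfs.name.dir -- the NN writes its image and edits to every
directory listed. A sketch, with placeholder paths, in 0.20-style
hdfs-site.xml -- hadoop-site.xml on older releases:)

    <!-- One local directory plus one NFS mount gives an off-machine
         copy of the namespace. Paths are illustrative only. -->
    <property>
      <name>dfs.name.dir</name>
      <value>/local/hadoop/name,/mnt/nfs-server/hadoop/name</value>
    </property>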

>
>>   2. What is a good processor-to-storage ratio for a task node with 4TB of
>>   raw storage? (The config above has 1 core per 1TB of raw storage.)


We're data hungry locally -- I'd put in bigger hard drives.  The 1.5TB  
Seagate drives seem to have passed their teething issues, and are at a  
pretty sweet price point.  They will only scale up to 60 IOPS, so make  
sure your workflows don't have lots of random I/O.

As Steve mentions below, the rest is really up to your algorithm.  Do  
you need 1 CPU second / byte?  If so, buy more CPUs.  Do you need .1  
CPU second / MB?  If so, buy more disks.
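
Concretely, that decision rule might look like this (a sketch in Python; the
totals are placeholders you'd pull from your own job's counters):

    # Classify a job as CPU- or IO-hungry from measured totals.
    cpu_seconds = 3600.0       # placeholder: total CPU time the job used
    input_mb = 100 * 1024.0    # placeholder: total input read, in MB

    ratio = cpu_seconds / input_mb
    if ratio >= 0.1:           # the .1 CPU-second/MB mark above
        print("~%.3f CPU-s/MB: CPU hungry -- buy more CPUs" % ratio)
    else:
        print("~%.3f CPU-s/MB: IO hungry -- buy more disks" % ratio)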

Brian

>
> That really depends on the work you are doing...the bytes in/out to  
> CPU work, and the size of any memory structures that are built up  
> over the run.
>
> With 1 core per physical disk, you get the bandwidth of a single  
> disk per CPU; for some IO-intensive work you can make the case for  
> two disks/CPU -one in, one out, but then you are using more power,  
> and if/when you want to add more storage, you have to pull out the  
> disks to stick in new ones. If you go for more CPUs, you will  
> probably need more RAM to go with it.
>
>>   3. Am I better off using dual quads for a task node, with a  
>> higher power
>>   draw? Dual quad task node with 16GB RAM and 4TB storage costs  
>> roughly $3200,
>>   but draws almost 2x as much power. The tradeoffs are:
>>      1. I will get more CPU per dollar and per watt.
>>      2. I will only be able to fit half as many dual-quad machines into a
>>      rack.
>>      3. I will get 1/2 the storage capacity per watt.
>>      4. I will get less I/O throughput overall (fewer spindles per core)
>
> First there is the algorithm itself, and whether you are IO or CPU
> bound. Most MR jobs that I've encountered are fairly IO bound -without
> indexes, every lookup has to stream through all the data, so it's power
> inefficient and IO limited. But if you are trying to do higher-level
> stuff than just lookup, then you will be doing more CPU work
>
> Then there is the question of where your electricity comes from,  
> what the limits for the room are, whether you are billed on power  
> drawn or quoted PSU draw, what the HVAC limits are, what the maximum  
> allowed weight per rack is, etc, etc.
>
> I'm a fan of low Joule work, though we don't have any benchmarks yet
> of the power efficiency of different clusters; the number of MJ used
> to do a terasort. I'm debating doing some single-cpu tests for this
> on my laptop, as the battery knows how much gets used up by some work.
>
>>   4. In planning storage capacity, how much spare disk space should  
>> I take
>>   into account for 'scratch'? For now, I'm assuming 1x the input  
>> data size.
>
> That you should probably be able to determine on experimental work  
> on smaller datasets. Some maps can throw out a lot of data, most  
> reduces do actually reduce the final amount.
>
>
> -Steve
>
> (Disclaimer: I'm not making any official recommendations for  
> hardware here, just making my opinions known. If you do want an  
> official recommendation from HP, talk to your reseller or account  
> manager, someone will look at your problem in more detail and make  
> some suggestions. If you have any code/data that could be shared for  
> benchmarking, that would help validate those suggestions)


Re: hadoop hardware configuration

Posted by Steve Loughran <st...@apache.org>.
Patrick Angeles wrote:
> Sorry for cross-posting, I realized I sent the following to the hbase list
> when it's really more a Hadoop question.

This is an interesting question. Obviously as an HP employee you must 
assume that I'm biased when I say HP DL160 servers are good  value for 
the workers, though our blade systems are very good for a high physical 
density -provided you have the power to fill up the rack.

> 
> 2 x Hadoop Master (and Secondary NameNode)
> 
>    - 2 x 2.3GHz Quad Core (Low Power Opteron -- 2376 HE @ 55W)
>    - 16GB DDR2-800 Registered ECC Memory
>    - 4 x 1TB 7200rpm SATA II Drives
>    - Hardware RAID controller
>    - Redundant Power Supply
>    - Approx. 390W power draw (1.9 amps @ 208V)
>    - Approx. $4000 per unit

I do not know what the advantages of that many cores are on a NN. 
Someone needs to do some experiments. I do know you need enough RAM to 
hold the index in memory, and you may want to go for a bigger block size 
to keep the index size down.
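
A rough way to see the block-size effect (a sketch in Python; the ~150 bytes
of NN heap per block is a commonly quoted rule of thumb, not an exact figure,
and it ignores per-file and per-directory objects):

    # NameNode index size vs. HDFS block size for ~14TB of data.
    data_bytes = 14e12
    bytes_per_block_object = 150   # rule-of-thumb NN heap cost per block
    for block_mb in (64, 128, 256):
        blocks = data_bytes / (block_mb * 1024 * 1024)
        heap_mb = blocks * bytes_per_block_object / 1e6
        print("%3dMB blocks: ~%dk blocks, ~%dMB of NN heap" %
              (block_mb, blocks / 1000, heap_mb))
    # Doubling the block size halves the number of blocks the NN tracks.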


> 6 x Hadoop Task Nodes
> 
>    - 1 x 2.3GHz Quad Core (Opteron 1356)
>    - 8GB DDR2-800 Registered ECC Memory
>    - 4 x 1TB 7200rpm SATA II Drives
>    - No RAID (JBOD)
>    - Non-Redundant Power Supply
>    - Approx. 210W power draw (1.0 amps @ 208V)
>    - Approx. $2000 per unit
> 
> I had some specific questions regarding this configuration...


>    1. Is hardware RAID necessary for the master node?


You need a good story to ensure that loss of a disk on the master 
doesn't lose the filesystem. I like RAID there, but the alternative is 
to push the stuff out over the network to other storage you trust. That 
could be NFS-mounted RAID storage, it could be NFS-mounted JBOD. 
Whatever your chosen design, test that it works before you go live: run 
the cluster, simulate different failures, and see how well the 
hardware/ops team handles them.

Keep an eye on where that data goes, because when the NN runs out of 
file storage, the consequences can be pretty dramatic (i.e. the cluster 
doesn't come up unless you edit the editlog by hand)

>    2. What is a good processor-to-storage ratio for a task node with 4TB of
>    raw storage? (The config above has 1 core per 1TB of raw storage.)

That really depends on the work you are doing...the bytes in/out to CPU 
work, and the size of any memory structures that are built up over the run.

With 1 core per physical disk, you get the bandwidth of a single disk 
per CPU; for some IO-intensive work you can make the case for two 
disks/CPU -one in, one out, but then you are using more power, and 
if/when you want to add more storage, you have to pull out the disks to 
stick in new ones. If you go for more CPUs, you will probably need more 
RAM to go with it.

>    3. Am I better off using dual quads for a task node, with a higher power
>    draw? Dual quad task node with 16GB RAM and 4TB storage costs roughly $3200,
>    but draws almost 2x as much power. The tradeoffs are:
>       1. I will get more CPU per dollar and per watt.
>       2. I will only be able to fit half as many dual-quad machines into a
>       rack.
>       3. I will get 1/2 the storage capacity per watt.
>       4. I will get less I/O throughput overall (fewer spindles per core)

First there is the algorithm itself, and whether you are IO or CPU 
bound. Most MR jobs that I've encountered are fairly IO bound -without 
indexes, every lookup has to stream through all the data, so it's power 
inefficient and IO limited. But if you are trying to do higher-level 
stuff than just lookup, then you will be doing more CPU work

Then there is the question of where your electricity comes from, what 
the limits for the room are, whether you are billed on power drawn or 
quoted PSU draw, what the HVAC limits are, what the maximum allowed 
weight per rack is, etc, etc.

I'm a fan of low Joule work, though we don't have any benchmarks yet of 
the power efficiency of different clusters; the number of MJ used to do 
a terasort. I'm debating doing some single-cpu tests for this on my 
laptop, as the battery knows how much gets used up by some work.

>    4. In planning storage capacity, how much spare disk space should I take
>    into account for 'scratch'? For now, I'm assuming 1x the input data size.

That you should probably be able to determine on experimental work on 
smaller datasets. Some maps can throw out a lot of data, most reduces do 
actually reduce the final amount.


-Steve

(Disclaimer: I'm not making any official recommendations for hardware 
here, just making my opinions known. If you do want an official 
recommendation from HP, talk to your reseller or account manager, 
someone will look at your problem in more detail and make some 
suggestions. If you have any code/data that could be shared for 
benchmarking, that would help validate those suggestions)