Posted to common-user@hadoop.apache.org by ylx_admin <ne...@hotmail.com> on 2009/09/29 19:57:36 UTC

Advice on new Datacenter Hadoop Cluster?

Hey all, 

I'm pretty new to Hadoop in general, and I've been tasked with building out a
datacenter cluster of Hadoop servers to process logfiles. We currently use
Amazon, but our heavy usage is starting to justify running our own servers.
I'm aiming for less than $1k per box, and of course trying to economize on
power/rack. Can anyone give me some advice on what to pay attention to when
building these server nodes?

TIA,
Kevin
-- 
View this message in context: http://www.nabble.com/Advice-on-new-Datacenter-Hadoop-Cluster--tp25667905p25667905.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.


Re: Advice on new Datacenter Hadoop Cluster?

Posted by Steve Loughran <st...@apache.org>.
Brian Bockelman wrote:


> * When one disk goes out, the datanode shuts down - meaning that 48 
> disks go out.  This is to be fixed in 0.21.0, I think.

That's right, though the NN doesn't report it, and I think once offline, 
that disk stays offline.

There's been discussion on a new JIRA issue regarding hotswap support: 
can I pull a disk out, plug in a new one, and expect the replacement to 
be brought in? You don't want to shut down 48TB just to replace one disk.

To be done properly, you really need a way to tell HDFS what you are 
doing: make sure that any underreplicated data is pulled off that HDD 
onto the remaining disks, pause, and kill the tasktracker. Once the new 
disk is inserted and mounted, it needs to be repopulated with blocks.

-steve

Re: Advice on new Datacenter Hadoop Cluster?

Posted by Brian Bockelman <bb...@cse.unl.edu>.
On Oct 1, 2009, at 7:13 AM, Steve Loughran wrote:

> Ryan Smith wrote:
>> I have a question that i feel i should ask on this thread.  Lets say you
>> want to build a cluster where you will be doing very little map/reduce,
>> storage and replication of data only on hdfs.  What would the hardware
>> requirements be?  No quad core? less ram?
>
> Servers with more HDD per CPU, and less RAM. CPUs are a big slice not
> just of capital, but of your power budget. If you are running a big
> datacentre, you will care about that electricity bill.
>
> Assuming you go for 1U with 6 HDD in a 1U box, you could have 6 or 12 TB
> per U, then perhaps a 2-core or 4-core server with "enough" ECC RAM.
>
> * with less M/R work, you could allocate most of that TB to work, leave
> a few hundred GB for OS and logs
>
> * you'd better estimate external load; if the cluster is storing data
> then total network bandwidth will be 3X the data ingress (for
> replication = 3), read costs are that of the data itself. Also, 5
> threads on 3 different machines handling the write-and-forward process.
>
> * I don't know how much load the datanode JVM would take with, say
> 11 TB of managed storage underneath; that's memory and CPU time.


Datanode load is a function of the number of IOPS. Basically, going
from 6 x 12TB nodes to 3 x 24TB nodes, you double the number of IOPS
per node.

If you're using HDFS solely for backup, then the number of IOPS is so  
small you can assume it's zero.  We use HDFS for a non-mapreduce  
physics application, and our particular application mix is such that I  
target 1 batch system core per usable HDFS TB.
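Brian's density point can be sketched numerically. This is only an illustration with a made-up aggregate IOPS figure, not a measured result:

```python
def iops_per_node(total_iops, node_count):
    """Spread a fixed aggregate IOPS demand evenly over the datanodes."""
    return total_iops / node_count

# Same 72TB of raw storage, same hypothetical 6000-IOPS workload:
small_nodes = iops_per_node(6000, 6)  # 6 x 12TB nodes -> 1000 IOPS each
big_nodes = iops_per_node(6000, 3)    # 3 x 24TB nodes -> 2000 IOPS each

assert big_nodes == 2 * small_nodes   # denser nodes see double the IOPS
```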

>
> Is anyone out there running big datanodes? What do they see?
>

Our biggest is 48TB:
* They go offline for 5 minutes during the block reports.  We use rack  
awareness to make sure that both copies are not on big data nodes.   
Fixed in future releases (0.20.0 even, maybe).
* When one disk goes out, the datanode shuts down - meaning that 48  
disks go out.  This is to be fixed in 0.21.0, I think.
* The CPUs (4 cores) are pegged when the system is under full load.   
If I had a chance, I'd give it more CPU horsepower.

As usual, everyone's application is different enough that any anecdote  
is possibly not applicable.

Brian




Re: Advice on new Datacenter Hadoop Cluster?

Posted by Steve Loughran <st...@apache.org>.
Ryan Smith wrote:
> I have a question that i feel i should ask on this thread.  Lets say you
> want to build a cluster where you will be doing very little map/reduce,
> storage and replication of data only on hdfs.  What would the hardware
> requirements be?  No quad core? less ram?
> 

Servers with more HDD per CPU, and less RAM. CPUs are a big slice not 
just of capital, but of your power budget. If you are running a big 
datacentre, you will care about that electricity bill.

Assuming you go for 1U with 6 HDD in a 1U box, you could have 6 or 12 TB 
per U, then perhaps a 2-core or 4-core server with "enough" ECC RAM.

* with less M/R work, you could allocate most of that TB to work, leave 
a few hundred GB for OS and logs

* you'd better estimate external load; if the cluster is storing data 
then total network bandwidth will be 3X the data ingress (for 
replication = 3), read costs are that of the data itself. Also, 5 
threads on 3 different machines handling the write-and-forward process.

* I don't know how much load the datanode JVM would take with, say 11 TB 
of managed storage underneath; that's memory and CPU time.
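The 3X bandwidth figure can be put into a back-of-the-envelope model (a sketch assuming dfs.replication = 3 and ignoring protocol overhead; the ingress rate is hypothetical):

```python
def write_traffic(ingress_mb_s, replication=3):
    """Aggregate network traffic generated by writes: each incoming
    byte crosses the network roughly once per replica (client -> DN1,
    DN1 -> DN2, DN2 -> DN3 in the write pipeline)."""
    return ingress_mb_s * replication

# 100 MB/s of log ingest with replication = 3 means roughly
# 300 MB/s of aggregate write traffic across the cluster.
print(write_traffic(100))  # 300
```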

Is anyone out there running big datanodes? What do they see?

-steve


Re: Advice on new Datacenter Hadoop Cluster?

Posted by Ryan Smith <ry...@gmail.com>.
I have a question that I feel I should ask on this thread. Let's say you
want to build a cluster where you will be doing very little map/reduce:
only storage and replication of data on HDFS. What would the hardware
requirements be? No quad core? Less RAM?

Thanks
-Ryan


Re: Advice on new Datacenter Hadoop Cluster?

Posted by tim robertson <ti...@gmail.com>.
Disclaimer: I am pretty useless when it comes to hardware

I had a lot of issues with non ECC memory when running 100's millions
inserts from MapReduce into HBase on a dev cluster.  The errors were
checksum errors, and the consensus was the memory was causing the
issues and all advice was to ensure ECC memory.  The same cluster ran
without (any apparent) error for simple counting operations on tab
delimited files.

Cheers,
Tim


Re: Advice on new Datacenter Hadoop Cluster?

Posted by Steve Loughran <st...@apache.org>.
Kevin Sweeney wrote:
> I really appreciate everyone's input. We've been going back and forth on the
> server size issue here. There are a few reasons we shot for the $1k price,
> one because we wanted to be able to compare our datacenter costs vs. the
> cloud costs. Another is that we have spec'd out a fast Intel node with
> over-the-counter parts. We have a hard time justifying the dual-processor
> costs and really don't see the need for the big server extras like
> out-of-band management and redundancy. This is our proposed config, feel
> free to criticize :)
> Supermicro 512L-260 Chassis $90
> Supermicro X8SIL                  $160
> Heatsink                                $22
> Intel 3460 Xeon                      $350
> Samsung 7200 RPM SATA2   2x$85
> 2GB Non-ECC DIMM              4x$65
> 
> This totals $1052. Doesn't this seem like a reasonable setup? Isn't the
> purpose of a hadoop cluster to build cheap,fast, replaceable nodes?

Disclaimer 1: I work for a server vendor so may be biased. I will 
attempt to avoid this by not pointing you at HP DL180 or SL170z servers.

Disclaimer 2: I probably don't know what I'm talking about. As far as 
Hadoop is concerned, I'm not sure anyone knows what "the right" 
configuration is.

* I'd consider ECC RAM. On a large cluster, over time, errors occur -you 
either notice them or propagate the effects.

* Worry about power, cooling and rack weight.

* Include network costs, power budget. That's your own switch costs, 
plus bandwidth in and out.

* There are some good arguments in favour of fewer, higher end machines 
over many smaller ones.  Less network traffic, often a higher density.

The cloud-hosted vs. owned question is an interesting one; I suspect 
the spreadsheet there is pretty complex.

* Estimate how much data you will want to store over time. On S3, those 
costs ramp up fast; in your own rack you can maybe plan to stick in 
an extra 2TB HDD a year from now (space, power, cooling and weight 
permitting), paying next year's prices for next year's capacity.

* Virtual machine management costs are different from physical 
management costs, especially if you don't invest time upfront on 
automating your datacentre software provisioning (custom RPMs, PXE 
preboot, kickstart, etc.). With VMMs you can almost hand-manage an 
image (naughty, but possible), as long as you have a single image or 
two to push out. Even then, I'd automate, but at a higher level, 
creating images on demand as load/availability sees fit.

-Steve



Re: Advice on new Datacenter Hadoop Cluster?

Posted by Ted Dunning <te...@gmail.com>.
Depending on your needs and the size of your cluster, the out-of-band
management can be of significant interest.  It is a pretty simple
cost/benefit analysis that trades your sysops time (which is probably about
the equivalent of $50-150 per hour fully loaded and accounting for
opportunity cost) versus the cost of IPMI cards.  If it takes an extra hour
of time to actually go to the data center per event and possibly another
hour of time because the data center is a lousy place to work, then the IPMI
card is probably about break-even.  In our case, it is more than an hour of
inconvenience, and our systems guy has lots of things to do, so the boards
are a no-brainer.
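Ted's trade-off can be written down as a one-line break-even model. The dollar figures below are hypothetical, chosen from the ranges he mentions:

```python
def events_to_break_even(card_cost, hourly_rate, hours_saved_per_event):
    """How many remote-hands incidents before an IPMI card pays for
    itself, given the sysop's fully loaded hourly rate."""
    return card_cost / (hourly_rate * hours_saved_per_event)

# A ~$100 card, a $100/hour sysop, ~2 hours saved per incident:
print(events_to_break_even(100, 100, 2))  # 0.5 -> pays off on the first event
```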

You don't say here what size the disks are.  Dual disks are a good idea for
any number of reasons.  I just saw a price this morning of about $170 for a
2TB drive and about half that for a 1TB drive so make sure you are doing at
least that well.

You are specifying only 4GB of RAM.  I would account that as severely
underpowering your machine.  My own preference is to put 4-8x that much RAM
on a machine with one or two quad core CPU's and four drives.  That still
fits in a 1U chassis and will out-perform several of the boxes that you are
describing, although perhaps not exactly on a $/cycle even trade-off.

There are also some very sweet twin setups where you get two beefy machines
in a single 1U slot.  Very nice.  For instance, you can put two dual-CPU
quad-core Nehalem boards with 48GB each and a bunch of disk into 1U for about
$14K, including paying somebody to set up the machine and a 3-year
maintenance contract.  You should be able to do this yourself for $12K or
less, and this is equivalent to somewhere between 6 and 30 of the nodes
that you are spec'ing (2 x 2 x 4 cores vs 4 cores = 4x, but round up because
of fancier processors; 96GB vs 4GB = 24x).  Knock off another $1K or two
because this is an older quote and the 2TB drives are suddenly much cheaper
as well.
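The core and RAM ratios in the paragraph above work out as follows (a sketch of the arithmetic only; it ignores per-core speed differences, which is why the prose rounds the range upward):

```python
# One 1U "twin": two boards, each with 2 quad-core Nehalem sockets and 48GB.
twin_cores = 2 * 2 * 4    # 16 cores total
twin_ram_gb = 2 * 48      # 96 GB total

# The single-socket node being spec'd: 4 cores, 4GB.
node_cores = 4
node_ram_gb = 4

print(twin_cores // node_cores)    # 4  -> 4x the cores
print(twin_ram_gb // node_ram_gb)  # 24 -> 24x the RAM
```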




-- 
Ted Dunning, CTO
DeepDyve

Re: Advice on new Datacenter Hadoop Cluster?

Posted by Patrick Angeles <pa...@gmail.com>.
I wouldn't spec the worker nodes just to facilitate cloud cost comparison.
There's enough variability out there, and you'd have to deal with storage,
network bandwidth and I/O. Not to mention that a similarly spec'd virtual
cloud server will never perform as well as a physical server, because you
don't get data locality. Unless you have something like Amazon's EBS, but
then that jacks up your costs.
Also, you shouldn't assume that 'big server' will include out-of-band
management or redundancy.

Also take into account performance per watt. Dual-socket machines do better
here. Just like you, I wouldn't go with high-GHz ('faster') Intel procs,
because they are power hungry and generate lots of heat for the incremental
speed bump that you get. (After all, you're not building a gaming rig.)
However, you can go dual-socket with lower-speed processors. I think the
lowest-GHz Nehalems that support hyper-threading are good value. For
example, compare the Xeon 3460 @ 2.8GHz ($360) to the 3440 @ 2.53GHz ($240).
That's about a 10% speed bump for a 50% price increase, and that's without
factoring in the power consumption. Granted, you need to take into account
the cost of the entire server, not just the processor.
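The 3460-vs-3440 comparison works out as follows (prices as quoted above; a rough clock-only comparison that ignores turbo and memory effects):

```python
xeon_3460 = {"ghz": 2.80, "price": 360}
xeon_3440 = {"ghz": 2.53, "price": 240}

speed_bump = xeon_3460["ghz"] / xeon_3440["ghz"] - 1   # ~0.107
price_bump = xeon_3460["price"] / xeon_3440["price"] - 1  # 0.50

print(f"{speed_bump:.0%} faster for {price_bump:.0%} more money")
```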



Re: Advice on new Datacenter Hadoop Cluster?

Posted by Kevin Sweeney <ke...@yieldex.com>.
I really appreciate everyone's input. We've been going back and forth on the
server size issue here. There are a few reasons we shot for the $1k price,
one because we wanted to be able to compare our datacenter costs vs. the
cloud costs. Another is that we have spec'd out a fast Intel node with
over-the-counter parts. We have a hard time justifying the dual-processor
costs and really don't see the need for the big server extras like
out-of-band management and redundancy. This is our proposed config, feel
free to criticize :)
Supermicro 512L-260 Chassis   $90
Supermicro X8SIL              $160
Heatsink                      $22
Intel 3460 Xeon               $350
Samsung 7200 RPM SATA2        2 x $85
2GB Non-ECC DIMM              4 x $65

This totals $1052. Doesn't this seem like a reasonable setup? Isn't the
purpose of a Hadoop cluster to build cheap, fast, replaceable nodes?
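A quick check of the arithmetic, with the prices as listed above:

```python
parts = {
    "Supermicro 512L-260 chassis": 90,
    "Supermicro X8SIL":            160,
    "Heatsink":                    22,
    "Intel 3460 Xeon":             350,
    "Samsung 7200 RPM SATA2 (x2)": 2 * 85,
    "2GB non-ECC DIMM (x4)":       4 * 65,
}
print(sum(parts.values()))  # 1052
```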






-- 
Kevin Sweeney
Systems Engineer
Yieldex -- www.yieldex.com
(303) 999-7045

Re: Advice on new Datacenter Hadoop Cluster?

Posted by stephen mulcahy <st...@deri.org>.
One other data point here - you may find that some vendors will supply 
servers using "enterprise" drives rather than "consumer" drives (see 
http://www.wdc.com/en/products/Products.asp?DriveID=503 versus 
http://www.wdc.com/en/products/Products.asp?DriveID=576 for an example).

You'll find the enterprise drives are generally more expensive and don't 
follow the same price-drop curve as the consumer drives. I can't 
definitively say whether the enterprise drives are worth the extra spend; 
the vendors will tell you that the consumer-class drives aren't 
designed/manufactured to run 24x7 like the enterprise drives (apparently 
the enterprise drives are also designed to be more vibration resistant, 
a feature aimed at the situation where you have large numbers of these 
drives in RAID arrays).

In some unscientific performance comparisons between drives, I have 
found the RE3s to have "better" performance than the consumer WDC green 
drives (but a more rigorous comparison from someone would be most welcome).

-stephen



-- 
Stephen Mulcahy, DI2, Digital Enterprise Research Institute,
NUI Galway, IDA Business Park, Lower Dangan, Galway, Ireland
http://di2.deri.ie    http://webstar.deri.ie    http://sindice.com

Re: Advice on new Datacenter Hadoop Cluster?

Posted by Ted Dunning <te...@gmail.com>.
2TB drives are just now dropping to parity with 1TB on a $/GB basis.

If you want space rather than speed, this is a good option.  If you want
speed rather than space, more spindles and smaller disks are better.
Ironically, 500GB drives now often cost more than 1TB drives (that is $, not
$/GB).




-- 
Ted Dunning, CTO
DeepDyve

Re: Advice on new Datacenter Hadoop Cluster?

Posted by Patrick Angeles <pa...@gmail.com>.
We went with 2 x Nehalems, 4 x 1TB drives, and 24GB RAM. The RAM might be
overkill... but it's DDR3, so the practical choices were 12 or 24GB, and
each box has 16 virtual cores, so 12GB might not have been enough. These
boxes are around $4k each, but they easily outperform any $1k box dollar
for dollar (and in performance per watt).

If you're extremely I/O bound, you can get single-socket configurations
with the same number of drive spindles for really cheap (~$2k for single
proc, 8-12GB RAM, 4x1TB drives).

On Wed, Sep 30, 2009 at 10:19 AM, stephen mulcahy <st...@deri.org> wrote:

> [...]

Re: Advice on new Datacenter Hadoop Cluster?

Posted by stephen mulcahy <st...@deri.org>.
Todd Lipcon wrote:
> Most people building new clusters at this point seem to be leaning towards
> dual quad core Nehalem with 4x1TB 7200RPM SATA and at least 8G RAM.

We went with a similar configuration for a recently purchased cluster 
but opted for quad-core Opterons (Shanghai) rather than Nehalems and 
invested the difference in more memory per node (16GB). Nehalems seem 
to perform very well on some benchmarks, but that performance comes at 
a premium. It depends on your planned use of the cluster, but in a lot 
of cases the money may be better spent on memory, especially if you 
plan on running things like HBase on the cluster as well (which we do).

-stephen

-- 
Stephen Mulcahy, DI2, Digital Enterprise Research Institute,
NUI Galway, IDA Business Park, Lower Dangan, Galway, Ireland
http://di2.deri.ie    http://webstar.deri.ie    http://sindice.com

Re: Advice on new Datacenter Hadoop Cluster?

Posted by Amandeep Khurana <am...@gmail.com>.
Also, if you plan to run HBase (now or in the future), you'll need more
RAM per node. Take that into account too.
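A quick per-node memory budget shows why. The line items and figures
below are illustrative assumptions for a co-located setup, not
recommendations from this thread:

```python
# Hypothetical RAM budget (in GB) for one node running HDFS, MapReduce,
# and an HBase region server side by side. All figures are assumed.
budget_gb = {
    "os_and_page_cache": 2,
    "datanode_heap": 1,
    "tasktracker_heap": 1,
    "task_heaps": 6,            # e.g. 6 task slots x 1GB heap each
    "hbase_regionserver": 4,
}

total = sum(budget_gb.values())
print(f"total: {total}GB")      # -> total: 14GB
```

Even with modest assumptions the total lands well past an 8GB node,
which is the argument for 16GB or more if HBase is in the picture.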


Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz


On Tue, Sep 29, 2009 at 10:59 AM, Todd Lipcon <to...@cloudera.com> wrote:

> [...]

Re: Advice on new Datacenter Hadoop Cluster?

Posted by Todd Lipcon <to...@cloudera.com>.
Hi Kevin,

Less than $1k/box is unrealistic and won't be your best price/performance.

Most people building new clusters at this point seem to be leaning towards
dual quad-core Nehalems with 4x1TB 7200RPM SATA drives and at least 8GB of
RAM.

You're better off starting with a small cluster of these nicer machines than
3x as many $1k machines, assuming you can afford at least 4-5 of them.
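To make the sizing concrete, here is a rough back-of-the-envelope slot
calculation for a node like that. The heuristic (roughly one task slot
per core, bounded by the RAM left after daemon overhead) is a common
rule of thumb, not something prescribed by Hadoop itself, and the
per-task and overhead figures are assumptions:

```python
def task_slots(cores, ram_gb, per_task_gb=1.0, daemon_overhead_gb=2.0):
    """Rough upper bound on concurrent map+reduce slots for one node.

    Bounded by whichever runs out first: cores, or the RAM left over
    after OS, DataNode, and TaskTracker overhead (assumed figures).
    """
    by_ram = int((ram_gb - daemon_overhead_gb) / per_task_gb)
    return max(1, min(cores, by_ram))

# Dual quad-core node (8 cores) with 8GB vs. 16GB of RAM:
print(task_slots(cores=8, ram_gb=8))   # -> 6  (RAM-bound)
print(task_slots(cores=8, ram_gb=16))  # -> 8  (core-bound)
```

With 8GB the node runs out of memory before it runs out of cores, which
is why other replies in this thread push toward 16GB or more per node.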

-Todd

On Tue, Sep 29, 2009 at 10:57 AM, ylx_admin <ne...@hotmail.com> wrote:

> [...]