Posted to user@hbase.apache.org by Miles Spielberg <mi...@box.net> on 2011/05/12 01:22:30 UTC

Hardware configuration for a pure-HBase cluster

We're planning out our first HBase cluster, and we'd like to get some feedback on our proposed hardware configuration. We're intending to use this cluster purely for HBase; it will not generally be running MapReduce jobs, nor will we be using HDFS for other storage tasks. In addition, our projected total dataset size is <1 TB. Our workload is still unclear, but will likely be roughly a 1:1 read:write ratio, with cell sizes <1 KB and significant use of increment().
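
For concreteness, here is roughly what the increment() traffic would look
like; a minimal sketch against the 0.90-era Java client (the table, family,
and qualifier names below are made up for illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.util.Bytes;

public class CounterSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "events");      // hypothetical table
        // Atomically bump a per-row counter by 1; returns the new value.
        long views = table.incrementColumnValue(
                Bytes.toBytes("user#1234"),             // row key (made up)
                Bytes.toBytes("c"),                     // column family
                Bytes.toBytes("views"),                 // qualifier
                1L);                                    // delta
        System.out.println("counter is now " + views);
        table.close();
    }
}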

Here's our current front-runner:
2U, 2-socket, 12-core (with HyperThreading for 24 OS-visible threads), probably E5645 (2.4~2.67 GHz) or X5675 (3.06~3.46 GHz)
48 GB RAM
2x 300 GB 10k SAS in RAID-1 for OS
12x 600 GB 15k SAS as JBOD for DataNode

We are thinking of putting in 4 of these as DataNode/HRegionServer machines, with another pair minus the 600GB drives as head nodes. The motivation behind the high-end disks and capacious RAM is that we anticipate being I/O bound, but we're concerned that we may be overspending, and/or selling ourselves short on total capacity. Still, this is a long way from the "commodity hardware" mantra, and we're considering whether we should go with 7200 RPM drives for more capacity and lower cost. It's also a big unit of failure for when the unexpected happens and takes down a node.

What's the current thinking on disk vs. CPU for pure HBase usage on modern hardware? How much disk can one core comfortably service? 1x 7200? 2x 7200? 2x 15k?
Do we want to lean towards more, cheaper nodes? It would also give us more network throughput per disk, which would be nice to speed up re-replication on node failure.
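
Rough math behind the re-replication point (assuming ~110 MB/s of usable
1GbE per node and ignoring HDFS's replication throttling): if a dead node
held D bytes of blocks, the S surviving nodes have to absorb roughly
D / (S x 110 MB/s). With only 3 survivors, 1 TB of blocks works out to
about 50 minutes; with 9 survivors it drops to under 20.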

One possibility is to use the same chassis, but leave it half-populated: 1-socket, 6-core, 24 GB RAM, 6x data drives. The question of fast disks vs. big disks and how many still applies.

Another possibility is to go with 1U units with 4x 1TB drives each, although this would likely mean giving up on RAID-1 for the OS. These would probably be 6-core E5645, with 24 GB RAM. We'd be able to get 10 or so of these. I'm concerned that 4 7200 RPM drives would not be able to keep a 6-core CPU fed, especially with OS load on one of the drives effectively reducing data spindles to ~3.5.

I expect that we won't really understand our workload until we have the cluster deployed and loaded, but we'd like to make our first pass more than a shot in the dark. Any feedback you may have is most appreciated.

-- 
Miles Spielberg


Re: Hardware configuration for a pure-HBase cluster

Posted by Jean-Daniel Cryans <jd...@apache.org>.
Inline.

J-D

On Wed, May 11, 2011 at 4:22 PM, Miles Spielberg <mi...@box.net> wrote:
> We're planning out our first HBase cluster, and we'd like to get some feedback on our proposed hardware configuration. We're intending to use this cluster purely for HBase; it will not generally be running MapReduce jobs, nor will we be using HDFS for other storage tasks. In addition, our projected total dataset size is <1 TB. Our workload is still unclear, but will likely be roughly a 1:1 read:write ratio, with cell sizes <1 KB and significant use of increment().

If your workload is unclear then planning for a whole cluster is a
risky business... unless you overcommit resources.

>
> Here's our current front-runner:
> 2U, 2-socket, 12-core (with HyperThreading for 24 OS-visible threads), probably E5645 (2.4~2.67 GHz) or X5675 (3.06~3.46 GHz)
> 48 GB RAM
> 2x 300 GB 10k SAS in RAID-1 for OS
> 12x 600 GB 15k SAS as JBOD for DataNode
>
> We are thinking of putting in 4 of these as DataNode/HRegionServer machines, with another pair minus the 600GB drives as head nodes. The motivation behind the high-end disks and capacious RAM is that we anticipate being I/O bound, but we're concerned that we may be overspending, and/or selling ourselves short on total capacity. Still, this is a long way from the "commodity hardware" mantra, and we're considering whether we should go with 7200 RPM drives for more capacity and lower cost. It's also a big unit of failure for when the unexpected happens and takes down a node.

We prefer to stay on SATA, but your mileage may vary. Again, testing
actual hardware against your usage pattern would really help you make
good decisions. Also, with such a low number of nodes, any failure
will have a huge impact.

>
> What's the current thinking on disk vs. CPU for pure HBase usage on modern hardware? How much disk can one core comfortably service? 1x 7200? 2x 7200? 2x 15k?
> Do we want to lean towards more, cheaper nodes? It would also give us more network throughput per disk, which would be nice to speed up re-replication on node failure.

With 12 SAS disks you'll be bound on the network, unless you go with
10GE. We bought new hardware recently with much lower-end CPUs (more
energy efficient, a big concern we have at our size) and we have 6
disks per node (one disk also holds the root partition) using 1GE.
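
Back of the envelope: 12 15k SAS spindles can each stream on the order
of 100+ MB/s, so well over 1 GB/s aggregate, while a single 1GbE NIC
tops out around 110-120 MB/s. For streaming reads and writes the NIC
saturates roughly an order of magnitude before the disks do (a
random-read-heavy workload is a different story).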

>
> One possibility is to use the same chassis, but leave it half-populated: 1-socket, 6-core, 24 GB RAM, 6x data drives. The question of fast disks vs. big disks and how many still applies.

The HW we got looks like this (each node is 2x L5630, 48 GB RAM, 6x 2TB):

SM 2U Twin 2 System (Includes 2 motherboards, 1 Chassis with Single
1400W PS, Quick-Quick rail kits, Backplane)
INTEL XEON 4C 2.13GHZ L5630 12M QPI DDR3 S1366
8GB Dual Rank 1333MHZ ECC REG DDR3 - Motherboard compatible
Hitachi 2TB 7200RPM 64MB SATA 6Gbps HDD

So this packs basically 2 nodes in one 2U box. We are going to use
this in both MR and live environments.

>
> Another possibility is to go with 1U units with 4x 1TB drives each, although this would likely mean giving up on RAID-1 for the OS. These would probably be 6-core E5645, with 24 GB RAM. We'd be able to get 10 or so of these. I'm concerned that 4 7200 RPM drives would not be able to keep a 6-core CPU fed, especially with OS load on one of the drives effectively reducing data spindles to ~3.5.

I understand the RAID 1 concern, but with more machines you won't be
affected as much when one dies, so maybe bite the bullet, ditch the 2
extra disks, and get more machines?

FWIW we are able to max out our current 2x E5520s with 4 disks using
big MR jobs.

>
> I expect that we won't really understand our workload until we have the cluster deployed and loaded, but we'd like to make our first pass more than a shot in the dark. Any feedback you may have is most appreciated.

The workload is really, really important, but so is a high number of
nodes. Otherwise you might as well be using MySQL, since you won't
benefit from HBase's strengths and will only suffer its weaknesses.

Hope I helped!
