Posted to common-user@hadoop.apache.org by "S. Nunes" <sn...@gmail.com> on 2008/03/05 16:16:22 UTC

Hardware Details for a Small Cluster

Hi,

I'm trying to deploy a small Hadoop cluster for our research lab.
We are in the process of selecting the hardware for this cluster. We
are aiming at a 12 CPU, 5 TB cluster. This is obviously a very rough
estimate.

I have a few questions and I would greatly appreciate your feedback.

Which is better: a cluster based on many low-performance nodes, or a
cluster with fewer but higher-performance nodes? For instance, should I
bet on a cluster with 4 nodes (1 CPU + 100 GB each) or on a cluster
with 2 nodes (2 CPU + 200 GB each)?

What should be considered regarding node homogeneity? I understand
that a very unbalanced cluster would result in a "long tailed"
performance - slower nodes would penalize the overall performance.
However, how critical is that? Do you have performance numbers to
support our decision?

Finally, do you recommend any specific hardware configuration for
starting a cluster (rack, blade, tower...)?

Thanks in advance for your comments,

--
Sérgio Nunes

Re: Hardware Details for a Small Cluster

Posted by "S. Nunes" <sn...@gmail.com>.
Just found this document that seems to answer all my initial questions.

http://wiki.apache.org/hadoop/MachineScaling

Thanks anyway,
--
Sérgio Nunes

Re: Hardware Details for a Small Cluster

Posted by "S. Nunes" <sn...@gmail.com>.
Thanks for your comments!
Our research lab is mostly focused on NLP and IR, so we are aiming at
good throughput and also a reasonable storage capacity (~3-4 TB).

--
Sérgio Nunes

On Wed, Mar 5, 2008 at 6:51 PM, Ted Dunning <td...@veoh.com> wrote:

Re: Hardware Details for a Small Cluster

Posted by Ted Dunning <td...@veoh.com>.
The right answer really depends on your workload and what your needs and
goals are.

You say that this is a research lab.  If you are researching parallel
algorithms, then I would recommend much higher parallelism.

If you are working on problems where you want throughput, then the answer
may be a bit different.  In that case, the two major considerations are
aggregate disk speed (proportional to number of drives/interfaces) and
aggregate CPU speed.  Much of my workload is disk limited, so I find having
more machines, each with a disk to saturate, is a good idea.  Having
completely anemic CPUs is not very helpful, however.
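
A rough sketch of that disk-versus-CPU trade-off, applied to the two layouts from the original question. The per-drive and per-core rates (drive_mb_s, core_mb_s) are made-up placeholders rather than measurements, and the model assumes one drive per node and a job that simply streams through the data once:

```python
from dataclasses import dataclass


@dataclass
class NodeSpec:
    drives: int  # disks per machine
    cores: int   # CPU cores per machine


def scan_seconds(nodes, spec, data_gb, drive_mb_s=60.0, core_mb_s=40.0):
    """Estimate seconds to stream data_gb once across identical nodes.

    drive_mb_s and core_mb_s are hypothetical per-drive and per-core
    processing rates; measure your own workload before trusting them.
    """
    disk_rate = nodes * spec.drives * drive_mb_s  # aggregate disk MB/s
    cpu_rate = nodes * spec.cores * core_mb_s     # aggregate CPU MB/s
    bottleneck = min(disk_rate, cpu_rate)         # the slower side limits the job
    return data_gb * 1024.0 / bottleneck


# 4 nodes x (1 CPU, 1 drive) versus 2 nodes x (2 CPUs, 1 drive), 400 GB total
many_small = scan_seconds(4, NodeSpec(drives=1, cores=1), data_gb=400)
few_big = scan_seconds(2, NodeSpec(drives=1, cores=2), data_gb=400)
print(many_small, few_big)
```

With these placeholder rates both layouts have the same aggregate CPU, but the four-node layout has twice the spindles, so it is the two-node layout that ends up disk bound.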

Assuming that you are only concerned with purchase cost, I would tend to
recommend single CPU, dual core machines with a decently fast 64-bit CPU
(Opteron or Xeon), each with 500GB drives.  Depending on your luck, you may
be able to get dual CPUs for 4 cores per box for a similar price.  Getting
two slightly smaller disks would probably give you better throughput for
very slightly higher cost.

If you are considering life-cycle costs then you may come up with slightly
different configurations due to rack density.  Blades don't generally have
very large disks so to get to 5TB, you may require a lot of blades.
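
The blade arithmetic can be sketched as below. The 146GB blade disk is a hypothetical figure chosen only for illustration, and the thread does not say whether the 5TB target is raw or usable space. HDFS keeps three copies of each block by default, so the replication argument shows what happens if 5TB is meant as usable capacity:

```python
import math


def nodes_for_capacity(target_gb, disk_gb_per_node, replication=1):
    """Smallest node count whose combined disks hold target_gb of data
    stored `replication` times over."""
    if disk_gb_per_node <= 0:
        raise ValueError("disk_gb_per_node must be positive")
    return math.ceil(target_gb * replication / disk_gb_per_node)


TARGET_GB = 5000  # the 5TB goal, in decimal GB

blades = nodes_for_capacity(TARGET_GB, 146)  # hypothetical small blade disk
towers = nodes_for_capacity(TARGET_GB, 500)  # one 500GB drive per box
towers_x3 = nodes_for_capacity(TARGET_GB, 500, replication=3)  # 5TB usable
print(blades, towers, towers_x3)
```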

Homogeneity is not a huge issue.  I have a cluster with 4 pretty hot Xeon
cores on some boxes and 2 lousy cores on other boxes and the long tail
phenomenon does not come up all that much because the file splits are fine
enough that it all works out in the end.
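
A toy simulation of why fine splits hide the speed difference. It is an idealized model, not a description of Hadoop's actual scheduler: each worker pulls the next split as soon as it is free, the relative speeds are invented, and there is no per-task startup cost or data-locality effect:

```python
import heapq


def makespan(speeds, num_splits, work_per_split):
    """Finish time when free workers repeatedly pull the next split."""
    free_at = [(0.0, i) for i in range(len(speeds))]  # (time free, worker)
    heapq.heapify(free_at)
    finish = 0.0
    for _ in range(num_splits):
        t, i = heapq.heappop(free_at)         # earliest available worker
        done = t + work_per_split / speeds[i]
        finish = max(finish, done)
        heapq.heappush(free_at, (done, i))
    return finish


def overhead(speeds, num_splits, total_work=1000.0):
    """Fraction by which the job overshoots a perfectly balanced run."""
    ideal = total_work / sum(speeds)
    actual = makespan(speeds, num_splits, total_work / num_splits)
    return actual / ideal - 1.0


# hypothetical mix: three fast boxes (speed 4) and three slow ones (speed 2)
speeds = [4.0, 4.0, 4.0, 2.0, 2.0, 2.0]
coarse = overhead(speeds, num_splits=6)    # one big split per box
fine = overhead(speeds, num_splits=600)    # many small splits
print(coarse, fine)
```

In this model, one split per box makes the job finish about 50% later than a perfectly balanced run, because everything waits on the slow boxes. With 600 splits the overshoot is at most the time a slow box needs for one split, which is a couple of percent here.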

If this is for learning about parallel processing and if somebody else is
paying your power bills you should consider getting used machines or
machines from a place like Dell outlet.  Many of the machines that you get
that way are considerably lower cost and would provide comparable
disk/network bandwidth to new machines.  eBay has bunches of Dell 1850s for
sale, for instance.
