Posted to common-user@hadoop.apache.org by tim robertson <ti...@gmail.com> on 2009/04/02 16:02:27 UTC

Hardware - please sanity check?

Hi all,

I am not a hardware guy but am about to set up a 10-node cluster for some
processing of (mostly) tab files, generating various indexes and
researching HBase, Mahout, Pig, Hive etc.

Could someone please sanity check that these specs look sensible?
[I know 4 drives would be better but price is a factor (second hand is
not an option, and neither is hosting, as there is very good bandwidth
provided)]

Something along the lines of:

Dell R200 (8GB is max memory)
Quad Core Intel® Xeon® X3360, 2.83GHz, 2x6MB Cache, 1333MHz FSB
8GB Memory, DDR2, 800MHz (4x2GB Dual Ranked DIMMs)
2x 500GB 7,200 rpm 3.5-inch SATA Hard Drive


Dell R300 (can be expanded to 24GB RAM)
Quad Core Intel® Xeon® X3363, 2.83GHz, 2x6MB Cache, 1333MHz FSB
8GB Memory, DDR2, 667MHz (2x4GB Dual Ranked DIMMs)
2x 500GB 7,200 rpm 3.5-inch SATA Hard Drive


If there is a major flaw, please let me know.

Thanks,

Tim
(not a hardware guy ;o)

Re: Hardware - please sanity check?

Posted by Philip Zeyliger <ph...@cloudera.com>.
> I've been assuming that RAID is generally a good idea (disks fail quite
> often, and it's cheaper to hotswap a drive than to rebuild an entire box).
>

Hadoop data nodes are often configured without RAID (i.e., as "JBOD" = Just a
Bunch of Disks), since HDFS already provides the data redundancy.  Also, if
you stripe across disks, you're liable to be as slow as the slowest of your
disks, so data nodes are typically configured to point to multiple
independent disks instead.

-- Philip
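
To put rough numbers on the striping point above, here is a minimal sketch
in Python. It assumes a deliberately simplified model (a stripe advances at
the pace of its slowest member, while independent disks each serve their own
blocks). The function names and the MB/s figures are hypothetical, not
measurements from any real cluster.

```python
def striped_throughput(disk_mb_per_s):
    """Simplified RAID-0 model: every disk is held to the slowest disk's pace."""
    return len(disk_mb_per_s) * min(disk_mb_per_s)


def jbod_throughput(disk_mb_per_s):
    """Simplified JBOD model: each disk serves its own blocks at its own speed."""
    return sum(disk_mb_per_s)


# Hypothetical per-disk sequential rates in MB/s; the last disk is degraded.
disks = [80, 80, 80, 40]

print("striped:", striped_throughput(disks), "MB/s")
print("jbod:   ", jbod_throughput(disks), "MB/s")
```

In Hadoop releases of this era the data node's disks are listed as a
comma-separated set of directories in the dfs.data.dir property, typically
one directory per physical disk.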

Re: Hardware - please sanity check?

Posted by Patrick Angeles <pa...@gmail.com>.
I had a similar curiosity, but more regarding disk speed.
Can I assume linear improvement between 7200rpm -> 10k rpm -> 15k rpm? How
much of a bottleneck is disk access?

Another question is regarding hardware redundancy. What is the relative
value of the following:
- RAID / hot-swappable drives
- dual NICs
- redundant backplane
- redundant power supply
- UPS

I've been assuming that RAID is generally a good idea (disks fail quite
often, and it's cheaper to hotswap a drive than to rebuild an entire box).
Dual NICs are also good, as both can be used at the same time. Everything
else is not necessary in a Hadoop cluster.
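
On the rpm question, one part can be worked out directly: average
rotational latency is half a revolution, so it scales with 1/rpm. The sketch
below (Python, with a hypothetical helper name) shows only that component.
Seek time and sequential transfer rate depend on the specific drive and are
not modelled here, so this alone does not show that overall disk performance
improves linearly.

```python
def avg_rotational_latency_ms(rpm):
    """Average wait for a sector: half of one full revolution, in milliseconds."""
    return 0.5 * 60_000.0 / rpm


for rpm in (7200, 10_000, 15_000):
    print(f"{rpm:>6} rpm: {avg_rotational_latency_ms(rpm):.2f} ms")
```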

Re: Hardware - please sanity check?

Posted by tim robertson <ti...@gmail.com>.
Thanks Miles,

Thus far most of my work has been on EC2 large instances and *mostly*
my code is not memory intensive (I sometimes do joins against polygons
and hold geospatial indexes in memory, but am aware of keeping things
within the -Xmx for this).
I am mostly looking to move routine data processing and
transformation (lots of distinct, count and group by operations) off a
chunky MySQL DB (200 million rows and growing) which gets all locked
up.

We have gigabit switches.

Cheers

Tim




Re: Hardware - please sanity check?

Posted by Miles Osborne <mi...@inf.ed.ac.uk>.
Make sure you also have a fast switch, since you will be transmitting
data across your network and a slow one will come back to bite you.

(Roughly, you need one core per Hadoop-related process: each mapper, task
tracker, etc.  The per-core memory may be too small if you are doing
anything memory-intensive.  We have 8-core boxes with 50 -- 33 GB RAM
and 8 x 1 TB disks on each one;  one box, however, has just 16 GB of RAM
and it routinely falls over when we run jobs on it.)

Miles

-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.
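
The one-core-per-process rule of thumb from Miles's reply can be applied to
the proposed 4-core, 8 GB boxes with a little arithmetic. The following is a
rough sketch in Python, not a sizing recommendation. It assumes two worker
daemons per node (a DataNode and a TaskTracker) and 1 GB held back for the
operating system; both figures and both function names are assumptions made
for illustration.

```python
def task_slots(cores, daemons=2):
    """Cores left for map/reduce tasks after one core per daemon.

    daemons=2 is an assumption (DataNode + TaskTracker on a worker node).
    """
    return max(cores - daemons, 0)


def memory_per_process_gb(ram_gb, cores, os_reserved_gb=1.0):
    """RAM per core-bound process after an assumed operating system reserve."""
    return (ram_gb - os_reserved_gb) / cores


cores, ram_gb = 4, 8  # the R200 / R300 configurations from the original post

print("task slots per node:", task_slots(cores))
print("GB per process:     ", memory_per_process_gb(ram_gb, cores))
```

Under these assumptions each process gets well under 2 GB, which is the
kind of per-core figure to compare against the -Xmx the jobs actually need.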