Posted to mapreduce-user@hadoop.apache.org by David Parks <da...@yahoo.com> on 2013/05/11 08:30:11 UTC

What's the best disk configuration for Hadoop? SSDs, RAID levels, etc.?

We've got a cluster of 10 nodes, each with 8 cores and 24 GB of RAM and
currently one 4 TB disk (3 disk slots max). They chug away OK for now, only
slightly I/O bound on average.

I'm going to upgrade the disk configuration at some point (we do need more
space on HDFS) and I'm thinking about what's best hardware-wise:

- Would it be wise to use one of the three disk slots for a 1 TB SSD? I
  wouldn't use it for HDFS, but for map output and sorting it might make a
  big difference, no?

- If I put in either 1 or 2 more 4 TB disks for HDFS, should I RAID-0 them
  for speed, or will HDFS balance well across multiple partitions on its
  own?

- Would anyone suggest three 4 TB disks in a RAID-5 configuration, to guard
  against disk failures and replacements, over the above options?

Dave


Re: What's the best disk configuration for Hadoop? SSDs, RAID levels, etc.?

Posted by Mirko Kämpf <mi...@gmail.com>.
Hi David,


1.) The rule of thumb is "one disk per CPU core (or per 1.5 to 2 cores)". In
your case, more parallel I/O would be possible with more disks. But as you
wrote, your processing is only slightly I/O bound. Things might change, and
an SSD could speed up the shuffle & sort phase, but I suggest doing some
benchmarking with "Starfish". That can show you the difference between a
standard disk and an SSD.

2.) I would not recommend RAID in a Hadoop box. Hadoop handles the
distribution and the redundancy itself (as RAID does), but spread across
multiple machines.

3.) You can configure how many disks are allowed to fail before the whole
node is taken out of service. So you can replace the drive, and block
redistribution happens automatically as part of the self-healing features
of HDFS.
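
To make points 2 and 3 concrete, here is a minimal sketch (not a tested
cluster configuration) that renders an hdfs-site.xml fragment with one data
directory per physical disk and a tolerated-failure count. The mount points
are hypothetical. The property names assume Hadoop 2.x
(dfs.datanode.data.dir; Hadoop 1.x calls it dfs.data.dir), and the range
check is only a sanity check added for this sketch.

```python
import xml.etree.ElementTree as ET


def hadoop_property(name, value):
    """Build one <property> element in the Hadoop *-site.xml layout."""
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = name
    ET.SubElement(prop, "value").text = str(value)
    return prop


def jbod_hdfs_site(data_dirs, failed_volumes_tolerated):
    """Return hdfs-site.xml text that lists each disk as its own data dir."""
    # Sanity check for this sketch: tolerate fewer failures than there are disks.
    if not 0 <= failed_volumes_tolerated < len(data_dirs):
        raise ValueError("tolerated failures must be fewer than the data dirs")
    root = ET.Element("configuration")
    # Comma-separated list: the DataNode spreads blocks across these directories.
    root.append(hadoop_property("dfs.datanode.data.dir", ",".join(data_dirs)))
    root.append(hadoop_property("dfs.datanode.failed.volumes.tolerated",
                                failed_volumes_tolerated))
    return ET.tostring(root, encoding="unicode")


# Hypothetical mount points, one per physical disk (no RAID underneath).
xml_text = jbod_hdfs_site(
    ["/data/1/dfs/dn", "/data/2/dfs/dn", "/data/3/dfs/dn"], 1)
print(xml_text)
```

With three plain disks and a tolerance of 1, a single failed drive should
not take the DataNode offline; whether that holds on your version is worth
verifying on a test node.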

Best wishes
Mirko



2013/5/11 David Parks <da...@yahoo.com>

> We’ve got a cluster of 10x 8core/24gb nodes, currently with 1 4TB disk (3
> disk slots max), they chug away ok currently, only slightly IO bound on
> average.
>
> I’m going to upgrade the disk configuration at some point (we do need more
> space on HDFS) and I’m thinking about what’s best hardware-wise:
>
> - Would it be wise to use one of the three disk slots for a
>   1TB SSD?  I wouldn’t use it for HDFS, but for map-output and sorting it
>   might make a big difference no?
>
> - If I put in either 1 or 2 more 4TB disks for HDFS, should I
>   RAID-0 them for speed, or will HDFS balance well across multiple
>   partitions on its own?
>
> - Would anyone suggest 3 4TB disks and a RAID-5 configuration
>   to guard against disk replacements over the above options?
>
> Dave

Re: What's the best disk configuration for Hadoop? SSDs, RAID levels, etc.?

Posted by Ted Dunning <td...@maprtech.com>.
This sounds (with no real evidence) like you are a bit light on memory for
that number of cores.  That could cause you to be spilling map outputs
early and very much slowing things down.
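
For reference, the per-node ratios under discussion can be worked out with a
quick sketch. The function names are made up for illustration. The disk
range simply applies the "one disk per 1 to 2 cores" rule of thumb quoted
earlier in the thread, and the memory figure is raw RAM per core before the
OS and Hadoop daemons take their share.

```python
import math


def memory_per_core_gb(memory_gb, cores):
    """Raw RAM per core, ignoring OS and daemon overhead."""
    return memory_gb / cores


def disks_suggested(cores):
    """(min, max) disk count implied by 'one disk per 1 to 2 cores'."""
    return math.ceil(cores / 2), cores


# Node spec from the original post: 8 cores, 24 GB RAM, 3 disk slots max.
gb_per_core = memory_per_core_gb(24, 8)
low, high = disks_suggested(8)
print(f"{gb_per_core:.1f} GB RAM per core")
print(f"rule of thumb suggests {low} to {high} disks; the chassis holds 3")
```

So even with all three slots filled, the nodes stay below the disk count the
rule of thumb suggests, and each core has about 3 GB of RAM to share between
map and reduce tasks.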


On Fri, May 10, 2013 at 11:30 PM, David Parks <da...@yahoo.com> wrote:

> We’ve got a cluster of 10x 8core/24gb nodes, currently with 1 4TB disk (3
> disk slots max), they chug away ok currently, only slightly IO bound on
> average.
>
> I’m going to upgrade the disk configuration at some point (we do need more
> space on HDFS) and I’m thinking about what’s best hardware-wise:
>
> - Would it be wise to use one of the three disk slots for a
>   1TB SSD?  I wouldn’t use it for HDFS, but for map-output and sorting it
>   might make a big difference no?
>
> - If I put in either 1 or 2 more 4TB disks for HDFS, should I
>   RAID-0 them for speed, or will HDFS balance well across multiple
>   partitions on its own?
>
> - Would anyone suggest 3 4TB disks and a RAID-5 configuration
>   to guard against disk replacements over the above options?
>
> Dave
