Posted to common-user@hadoop.apache.org by "David B. Ritch" <da...@gmail.com> on 2009/01/11 22:23:01 UTC

RAID vs. JBOD

How well does Hadoop handle multiple independent disks per node?

I have a cluster with 4 identical disks per node.  I plan to use one
disk for OS and temporary storage, and dedicate the other three to
HDFS.  Our IT folks have some disagreement as to whether the three disks
should be striped, or treated by HDFS as three independent disks.  Could
someone with more HDFS experience comment on the relative advantages and
disadvantages to each approach?

Here are some of my thoughts.  It's a bit easier to manage a 3-disk
striped partition, and we wouldn't have to worry about balancing files
between them.  Single-file I/O should be considerably faster.  On the
other hand, I would expect typical use to involve multiple file reads
or writes happening simultaneously.  I would expect Hadoop to be able
to manage reads and writes to the disks independently.  Managing 3
streams to 3 independent devices would likely result in less disk head
movement, and therefore better performance.  I would expect Hadoop to
balance load between the disks fairly well.  Availability doesn't really
differentiate between the two approaches - if a single disk dies, the
striped array would go down, but all its data should be replicated on
another datanode anyway.  And besides, I understand that the datanode
will shut down a node even if only one of its 3 independent disks crashes.

So - anyone want to agree or disagree with these thoughts?  Anyone have
any other ideas, or - better - benchmarks and experience with layouts
like these two?

Thanks!

David

Re: RAID vs. JBOD

Posted by Brian Vargas <br...@ardvaark.net>.
David,

As I understand it, you will theoretically get better performance from a
JBOD configuration than from a RAID configuration.  In a RAID
configuration, you have to wait for the slowest disk in the array to
complete before the entire I/O operation can complete, making the
average I/O time equivalent to that of the slowest disk.  In a JBOD
configuration, operations on the faster disks complete independently of
the slowest disk, making the average I/O time for the node necessarily
better than that of the slowest disk (unless all disks are equally slow).
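The argument above can be sketched with a toy model (hypothetical
per-request service times, not a real benchmark):

```python
# Toy latency model for the RAID vs. JBOD argument. Numbers are made
# up: three disks, one of them slower than the others.
disk_times_ms = [10.0, 10.0, 15.0]

# RAID-0 stripes each request across all disks, so the request
# completes only when the slowest member disk does.
raid0_request_ms = max(disk_times_ms)

# JBOD sends each request to a single disk, so requests on the
# faster disks complete independently of the slow one.
jbod_average_ms = sum(disk_times_ms) / len(disk_times_ms)

print(raid0_request_ms)  # 15.0
print(jbod_average_ms)   # ~11.7
```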

Whether it would be a noticeable gain is questionable, though.  I doubt
it would make enough difference to provide a good reason to depart from
whichever you feel is easiest to manage.

And you don't need the redundancy of RAID, since HDFS does that using
replication between nodes, so there's no loss there.

Brian



Re: RAID vs. JBOD

Posted by "David B. Ritch" <da...@gmail.com>.
Thank you - yes, I'm fairly confident that it will work either way.  I'm
trying to find out whether there is an established best practice, and
what the performance impact of the decision between RAID 0 and JBOD is.
I'll check out noatime and nodiratime for their effect on our
performance - thanks for that suggestion, as well.

David


Re: RAID vs. JBOD

Posted by Jason Venner <ja...@attributor.com>.
If you set your dfs.data.dir to a comma-separated list of directories,
you will do fine.

<property>
  <name>dfs.data.dir</name>
  <!-- Default shown; for JBOD, list one directory per disk, e.g.
       /data-a/dfs/data,/data-b/dfs/data,/data-c/dfs/data -->
  <value>${hadoop.tmp.dir}/dfs/data</value>
  <description>Determines where on the local filesystem a DFS data node
  should store its blocks.  If this is a comma-delimited
  list of directories, then data will be stored in all named
  directories, typically on different devices.
  Directories that do not exist are ignored.
  </description>
</property>
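As a hypothetical illustration, the comma-delimited value can be built
from the per-disk mount points (mount names here are made up; HDFS
itself ignores directories that do not exist):

```python
import os

# Hypothetical helper: build a comma-delimited dfs.data.dir value
# from a list of per-disk mount points.
def dfs_data_dir_value(mounts, subdir="dfs/data"):
    return ",".join(os.path.join(m, subdir) for m in mounts)

value = dfs_data_dir_value(["/data-a", "/data-b", "/data-c"])
# value == "/data-a/dfs/data,/data-b/dfs/data,/data-c/dfs/data"
```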

The namenode does a lot of small writes, so RAID 1 or RAID 10 is a
better fit for it.

Also, mounting the dfs.data.dir filesystems with noatime and nodiratime
makes a significant performance difference.


Re: RAID vs. JBOD

Posted by John Kane <jo...@gmail.com>.
We have been using 2U boxes with 12x1TB disks. The first disk is used for
OS/scratch/laziness; the other 11 disks are formatted as individual (~900GB)
volumes and then mounted separately. We have /data-[a-k] mounted and
configured in our cluster and have not had any issues with unbalanced
loading. We have files of varied sizes (from small to really large), and
Hadoop just seems to figure it out for us.

We have had single drive failures. We just bounce the datanode software and
all is happy. When we get around to replacing the failed drive (I'm just
about to go do one), we format it, mount it, and then bounce the datanode.
That replacement volume is then not very well balanced, but that is not
typically an issue for us; we add lots of data every day, so it does get
filled up. We have run the rebalancer to address huge disparities in node
utilization (like when we add a new node). That just made us feel better
more than anything else.

Cheers


Re: RAID vs. JBOD

Posted by David Ritch <da...@gmail.com>.
Thank you!  I'm glad to hear that you have actually tested this.

I believe that a failure of any disk - even with JBOD - will cause the
datanode to bring the node down.  Presumably, we could bring it right
back up, but this does somewhat diminish the availability argument for JBOD.

Sounds like it's basically a toss-up.  I'm a bit concerned about the
potential for uneven distribution - both of amount of data, and of transfer
load - across the spindles.  Unless I hear otherwise, I will probably go
with RAID-0.


Re: RAID vs. JBOD

Posted by Colin Evans <co...@metaweb.com>.
Currently, Hadoop does round-robin allocation of blocks and data across
multiple JBOD disks.  We did some testing and found that there weren't
significant differences between RAID-0 and JBOD.  We went with JBOD
because we figured that RAID-0 has a higher failure rate than JBOD --
any disk failure in a 3-disk RAID-0 configuration causes the whole node
to go down, but if there is a single disk failure in a JBOD
configuration, Hadoop will go on serving from the other disks.
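The round-robin behavior described above can be sketched as follows (a
hypothetical simplification of the datanode's internal volume selection,
not its actual code; the mount names are made up):

```python
from itertools import cycle

# Hypothetical sketch: spread new block writes across several JBOD
# data directories in round-robin order.
class RoundRobinVolumePicker:
    def __init__(self, volumes):
        self._next = cycle(volumes)

    def pick(self):
        # Each new block goes to the next volume in rotation.
        return next(self._next)

picker = RoundRobinVolumePicker(["/data-a", "/data-b", "/data-c"])
placements = [picker.pick() for _ in range(6)]
# placements == ["/data-a", "/data-b", "/data-c",
#                "/data-a", "/data-b", "/data-c"]
```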





Re: RAID vs. JBOD

Posted by Runping Qi <ru...@yahoo-inc.com>.
Yes, all the machines in the tests were new, with the same spec.
The 30% to 50% throughput variations were observed among disks on the
same machines.

Runping





Re: RAID vs. JBOD

Posted by Steve Loughran <st...@apache.org>.

This is really interesting. Thank you for sharing these results!

Presumably the servers were all set up with "nominally" homogeneous 
hardware? And yet the variations still existed. That would be something 
to experiment with on new versus old clusters, to see if it gets worse 
over time.

Here we have a batch of desktop workstations all bought at the same 
time, to the same spec, but one of them, "lucky", is more prone to race 
conditions than any of the others. We don't know why, and assume it's to 
do with the (multiple) Xeon CPU chips being at different ends of the 
bell curve or something. All we know is: test on that box before 
shipping, to find race conditions early.

-steve

Re: RAID vs. JBOD

Posted by Runping Qi <ru...@yahoo-inc.com>.
Hi,

We at Yahoo did some Hadoop benchmarking experiments on clusters with JBOD
and RAID0. We found that under heavy loads (such as gridmix), JBOD cluster
performed better.

Gridmix tests:

Load: gridmix2
Cluster size: 190 nodes
Test results:

RAID0: 75 minutes
JBOD:  67 minutes
Difference: 10%

Tests on HDFS write performance

We ran map-only jobs writing data to dfs concurrently on different clusters.
The overall dfs write throughput on the jbod cluster was 30% (with a
58-node cluster) and 50% (with an 18-node cluster) better than that on the
raid0 cluster, respectively.

To understand why, we did some file-level benchmarking on both clusters.
We found that the file write throughput on a JBOD machine is 30% higher than
that on a comparable machine with RAID0. This performance difference may be
explained by the fact that the throughputs of different disks can vary 30%
to 50%. With such variations, the overall throughput of a raid0 system may
be bottlenecked by the slowest disk.
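The bottleneck effect can be put in rough numbers (illustrative
throughputs only, not the measured values from these tests):

```python
# Illustrative per-disk sequential write throughputs (MB/s), with the
# kind of 30-50% spread observed between disks on one machine.
throughputs = [100.0, 85.0, 70.0]

# RAID-0 stripes every write across all members, so the array runs at
# roughly n times the slowest disk's speed.
raid0_aggregate = len(throughputs) * min(throughputs)

# JBOD lets each disk stream at its own pace, so aggregate throughput
# is simply the sum.
jbod_aggregate = sum(throughputs)

print(raid0_aggregate)  # 210.0
print(jbod_aggregate)   # 255.0
```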
 

-- Runping






