Posted to mapreduce-user@hadoop.apache.org by Ulul <ha...@ulul.org> on 2014/10/01 23:01:47 UTC

Hadoop and RAID 5

Dear hadoopers,

Has anyone been confronted with deploying a cluster in a traditional IT 
shop whose admins handle thousands of servers?
They traditionally use SAN or NAS storage for app data, rely on RAID 1 
for system disks, and in the few cases where internal disks are used, 
they configure them with RAID 5 provided by the internal HW controller.

Using a JBOD setup, as advised in each and every Hadoop doc I ever laid 
my hands on, means that each HDD failure will require, on top of the 
physical replacement of the drive, that an admin perform at least an mkfs.
Since these operations will also become more frequent as more internal 
disks are used, this can be perceived as an annoying disruption in the 
industrial handling of numerous servers.
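
To make the burden concrete, the per-disk procedure we would have to 
script looks roughly like this (a minimal sketch in Python; the device, 
mount point, ownership and restart command are just placeholders, and it 
only prints the commands instead of running them):

#!/usr/bin/env python3
"""Minimal dry-run sketch of the per-disk steps after a JBOD drive swap."""

DEVICE = "/dev/sdX1"          # placeholder: partition on the replaced drive
MOUNT_POINT = "/data/disk07"  # placeholder: one entry of dfs.datanode.data.dir

steps = [
    f"mkfs -t ext4 {DEVICE}",                    # recreate the filesystem
    f"mount -o noatime {DEVICE} {MOUNT_POINT}",  # remount the data directory
    f"chown hdfs:hadoop {MOUNT_POINT}",          # placeholder datanode owner
    "systemctl restart hadoop-hdfs-datanode",    # placeholder restart command
]

for cmd in steps:
    print(cmd)  # dry run: an admin or config management would execute these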

In Tom White's guide there is a discussion of RAID 0, stating that Yahoo 
benchmarks showed a 10% loss in performance, so we can expect even worse 
performance with RAID 5, but I found no figures.

I also found a Hortonworks interview with StackIQ, who provide software 
to automate such failure fix-ups. But it would be rather painful to go 
straight to another solution, contract and so on, while just starting with Hadoop.

Please share your experiences with RAID for redundancy (1, 5 or other) 
in Hadoop configurations.

Thank you
Ulul



Re: Hadoop and RAID 5

Posted by Ulul <ha...@ulul.org>.
Yes, I also read in the P420 user guide that it is RAID only. We'll live 
with it, I guess...
Thanks for the HP/Cloudera link, it's valuable reading!

On 07/10/2014 08:52, Travis wrote:
>
>
> On Sun, Oct 5, 2014 at 4:17 PM, Ulul <hadoop@ulul.org> wrote:
>
>     Hi Travis
>
>     Thank you for your detailed answer and for honoring my question
>     with a blog entry :-)
>
>
> No problem.  I had been meaning to write something up. Thanks for the 
> prod. :-)
>
>
>     I will look into bus quiescing with the admins but I'm under the
>     impression that nothing special is done, the HW RAID controller
>     taking care of everything; the HP doc states that inserting a
>     hot-pluggable disk induces a one- or two-second pause in disk
>     activity. I'll check whether this is handled through the
>     controller cache and/or done out of business hours for safety.
>
>     I'll ask for internal benchmarking, hoping it will convince
>     everyone to accept the JBOD model and automate what's necessary
>     so it doesn't disrupt operations.
>
>
> Make sure your HP disk controllers can actually do JBOD.  Last I 
> looked (admittedly, this was ~3 years ago), you could only simulate it 
> with multiple single-disk RAID0 LUNs. Operationally, these were one 
> level more annoying than plain JBOD because now you had to 
> remove/destroy the RAID0 LUN when replacing the disk before you could 
> recreate the filesystem.
>
> At the very least, HP's reference architecture for Cloudera Hadoop 
> 5.X, updated in August 2014, shows that this is still the case.
>
> From their document:
>
>         Drives should use a Just a Bunch of Disks (JBOD) configuration,
>         which can be achieved with the HP Smart Array P420i controller
>         by configuring each individual disk as a separate RAID 0 volume.
>         Additionally, array acceleration features on the P420i should be
>         turned off for the RAID 0 data volumes. The first two positions
>         on the drive cage allow the OS drive to be placed in RAID1.
>
>
> http://h20195.www2.hp.com/V2/GetDocument.aspx?docname=4AA5-3257ENW&cc=us&lc=en
>
> If you've already got equipment, then probably not a big deal.  If 
> you're in the process of evaluating new stuff, I'd ask your HP 
> Var/Reseller if there was a different, non-RAID option, especially if 
> they have something similar to the LSI-9207-8i card which uses the 
> LSI2308 SAS chip.  I know Dell validated this with their 12G equipment 
> for use with Hadoop earlier this year.  Definitely a great card so far.
>
> Travis
> -- 
> Travis Campbell
> travis@ghostar.org


Re: Hadoop and RAID 5

Posted by Travis <hc...@ghostar.org>.
On Sun, Oct 5, 2014 at 4:17 PM, Ulul <ha...@ulul.org> wrote:

>  Hi Travis
>
> Thank you for your detailed answer and for honoring my question with a
> blog entry :-)
>

No problem.  I had been meaning to write something up.  Thanks for the
prod. :-)


>
> I will look into bus quiescing with the admins but I'm under the impression
> that nothing special is done, the HW RAID controller taking care of
> everything; the HP doc states that inserting a hot-pluggable disk induces a
> one- or two-second pause in disk activity. I'll check whether this is handled
> through the controller cache and/or done out of business hours for safety.
>
> I'll ask for internal benchmarking, hoping it will convince everyone to
> accept the JBOD model and automate what's necessary so it doesn't disrupt
> operations.
>

Make sure your HP disk controllers can actually do JBOD.  Last I looked
(admittedly, this was ~3 years ago), you could only simulate it with
multiple single-disk RAID0 LUNs.  Operationally, these were one level
more annoying than plain JBOD because now you had to remove/destroy the
RAID0 LUN when replacing the disk before you could recreate the filesystem.
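
If you're stuck with that model, a small helper like the following makes it
easier to spot the freshly re-created, still-unformatted LUN before running
mkfs (just a sketch that parses lsblk's JSON output, not a vetted ops tool):

#!/usr/bin/env python3
"""Sketch: list block devices with no filesystem, partitions or mountpoint."""
import json
import subprocess

out = subprocess.run(
    ["lsblk", "-J", "-o", "NAME,FSTYPE,SIZE,MOUNTPOINT"],
    check=True, capture_output=True, text=True,
).stdout

for dev in json.loads(out)["blockdevices"]:
    # A freshly re-created logical drive has no fstype, no mountpoint and
    # no child partitions yet.
    if not dev.get("fstype") and not dev.get("mountpoint") and not dev.get("children"):
        print(f"candidate for mkfs: /dev/{dev['name']} ({dev.get('size')})")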

At the very least, HP's reference architecture for Cloudera Hadoop 5.X,
updated in August 2014, shows that this is still the case.

From their document:

> Drives should use a Just a Bunch of Disks (JBOD) configuration, which can
> be achieved with the HP Smart Array P420i controller by configuring each
> individual disk as a separate RAID 0 volume. Additionally, array
> acceleration features on the P420i should be turned off for the RAID 0 data
> volumes. The first two positions on the drive cage allow the OS drive to be
> placed in RAID1.
http://h20195.www2.hp.com/V2/GetDocument.aspx?docname=4AA5-3257ENW&cc=us&lc=en

If you've already got equipment, then probably not a big deal.  If you're
in the process of evaluating new stuff, I'd ask your HP Var/Reseller if
there was a different, non-RAID option, especially if they have something
similar to the LSI-9207-8i card which uses the LSI2308 SAS chip.  I know
Dell validated this with their 12G equipment for use with Hadoop earlier
this year.  Definitely a great card so far.

Travis
-- 
Travis Campbell
travis@ghostar.org

Re: Hadoop and RAID 5

Posted by Ulul <ha...@ulul.org>.
Hi Travis

Thank you for your detailed answer and for honoring my question with a 
blog entry :-)

I will look into bus quiescing with the admins but I'm under the impression 
that nothing special is done, the HW RAID controller taking care of 
everything; the HP doc states that inserting a hot-pluggable disk induces 
a one- or two-second pause in disk activity. I'll check whether this is 
handled through the controller cache and/or done out of business hours 
for safety.

I'll ask for internal benchmarking, hoping it will convince everyone to 
accept the JBOD model and automate what's necessary so it doesn't 
disrupt operations.
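
Something along these lines should do for a first rough per-disk number 
(a minimal sketch; the mount points and sizes are placeholders, and a real 
comparison would use something like fio or TestDFSIO instead):

#!/usr/bin/env python3
"""Rough per-disk streaming-write check; indicative numbers only."""
import os
import time

MOUNT_POINTS = ["/data/disk01", "/data/disk02"]  # placeholder JBOD mounts
SIZE_MB = 2048                                   # placeholder test size
BLOCK = b"\0" * (1 << 20)                        # 1 MiB write buffer

for mp in MOUNT_POINTS:
    path = os.path.join(mp, "stream_test.tmp")
    start = time.time()
    with open(path, "wb") as f:
        for _ in range(SIZE_MB):
            f.write(BLOCK)
        f.flush()
        os.fsync(f.fileno())  # make sure the data actually reached the disk
    elapsed = time.time() - start
    os.remove(path)
    print(f"{mp}: ~{SIZE_MB / elapsed:.0f} MB/s sequential write")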

Thanks again
Ulul

On 02/10/2014 00:25, Travis wrote:
>
> On Wed, Oct 1, 2014 at 4:01 PM, Ulul <hadoop@ulul.org> wrote:
>
>     Dear hadoopers,
>
>     Has anyone been confronted with deploying a cluster in a traditional
>     IT shop whose admins handle thousands of servers?
>     They traditionally use SAN or NAS storage for app data, rely on
>     RAID 1 for system disks and in the few cases where internal disks
>     are used, they configure them with RAID 5 provided by the internal
>     HW controller.
>
>
> Yes.  I've been on both sides of this discussion.
>
> The key is to help them understand that you don't need redundancy 
> within a system because Hadoop provides redundancy across the entire 
> cluster via replication. This then leaves the problem as a performance 
> one, in which case you show them benchmarks on the hardware they 
> provide in both RAID (RAID0, RAID1, and RAID5) and JBOD modes.
>
>     Using a JBOD setup, as advised in each and every Hadoop doc I
>     ever laid my hands on, means that each HDD failure will imply, on
>     top of the physical replacement of the drive, that an admin
>     performs at least an mkfs.
>     Added to the fact that these operations will become more frequent
>     since more internal disks will be used, it can be perceived as an
>     annoying disruption in industrial handling of numerous servers.
>
>
> I fail to see how this is really any different than the process of 
> having to deal with a failed drive in an array.  Depending on your 
> array type, you may still have to do things to quiesce the bus before 
> doing any drive operation, such as adding or removing the drive, you 
> may still have to trigger the rebuild yourself, and so on.
>
> I have a few thousand disks in my cluster.  We lose about 3-5 a 
> quarter.  I don't find it any more work to re-mkfs the drive after 
> it's been swapped out and have built tools around the process to make 
> sure it's consistently done by our DC staff (and yes, I did it before 
> the DC staff was asked to).  If you're concerned about the high-touch 
> aspect of swapping disks out, then you can always configure the 
> datanode to be tolerant of multiple disk failures (something you 
> cannot do with RAID5) and then just take the whole machine out of the 
> cluster to do swaps when you've reached a particular threshold of bad 
> disks.
>
>
>     In Tom White's guide there is a discussion of RAID 0, stating that
>     Yahoo benchmarks showed a 10% loss in performance so we can expect
>     even worse perf with RAID 5 but I found no figures.
>
>
> I had to re-read that section for reference.  My apologies if the 
> following is a little long-winded and rambling.
>
> I'm going to assume that Tom is not talking about single-disk RAID0 
> volumes, which is a common way of doing JBOD with a RAID controller 
> that doesn't have JBOD support.
>
> In general, performance is going to depend upon how many active 
> streams of I/O you have going on the system.
>
> With JBOD, as Tom discusses, every spindle is its own unique snowflake, 
> and if your drive controller can keep up, you can write as fast 
> as that drive can handle reading off the bus.  Performance is going to 
> depend upon how many active reading/writing streams you have accessing 
> each spindle in the systems.
>
> If I had one stream, I would only get the performance of one spindle 
> in the JBOD. If I had twelve spindles, I'm going to get maximum 
> performance with at least twelve streams. With RAID0, you're taking 
> your one stream, cutting it up into multiple parts and either reading 
> it or writing it to all disks, taking advantage of the performance of 
> all spindles.
>
> The problem arises when you start adding more streams in parallel to 
> the RAID0 environment.  Each parallel I/O operation begins competing 
> with each other from the controller's standpoint.  Sometimes things 
> start to stack up as the controller has to wait for competing I/O 
> operations on a single spindle.  For example, having to wait for a 
> write to complete before a read can be done.
>
> At a certain point, the performance of RAID0 begins to hit a knee as 
> the number of I/O requests goes up because the controller becomes the 
> bottleneck.  RAID0 is going to be the closest in performance, but with 
> the risk that if you lose a single disk, you lose the entire RAID.  
> With JBOD, if you lose a single disk, you only lose the data on that disk.
>
> Now, with RAID5, you're going to have even worse performance because 
> you're dealing with not only the parity calculation, but also with the 
> fact that you incur a performance penalty during reads and writes due 
> to how the data is laid out across all disks in the RAID.  You can 
> read more about this here: 
> http://theithollow.com/2012/03/understanding-raid-penalty/
>
> To put this in perspective, I use 12 7200rpm NLSAS disks in a system 
> connected to an LSI9207 SAS controller. This is configured for JBOD.  
> I have benchmarked streaming reads and writes in this environment to 
> be between 1.6 and 1.8GBytes/sec using 1 i/o stream per spindle for a 
> total of 12 i/o streams occurring on the system.  Btw, this benchmark 
> has held stable at this rate for at least 3 i/o streams per spindle; I 
> haven't tested higher yet.
>
> Now, I might get this performance with RAID0, but why should I 
> tolerate the risk of losing all data on the system vs just the data on 
> a single drive?  Going with RAID0 means that not only do I have to 
> replace the disk, but now I have to have Hadoop rebalance/redistribute 
> data to the entire system, not just dealing with the small amount of 
> data missing from one spindle.  And since Hadoop is already handling 
> my redundancy via replication of data, why should I tolerate the 
> performance penalty associated with RAID5?  I don't need redundancy in 
> a *single* system, I need redundancy across the entire cluster.
>
>
>
>     I also found a Hortonworks interview of StackIQ who provides
>     software to automate such failure fix up. But it would be rather
>     painful to go straight to another solution, contract and so on
>     while starting with Hadoop.
>
>     Please share your experiences around RAID for redundancy (1, 5 or
>     other) in Hadoop conf.
>
>
> I can't see any situation that we would use RAID for the data drives 
> in our Hadoop cluster.  We only use RAID1 for the OS drives, simply 
> because we want to reduce the recovery period associated with a system 
> failure.  No reason to re-install a system and have to replicate data 
> back onto it if we don't have to.
>
> Cheers,
> Travis
> -- 
> Travis Campbell
> travis@ghostar.org


Re: Hadoop and RAID 5

Posted by Travis <hc...@ghostar.org>.
On Wed, Oct 1, 2014 at 4:01 PM, Ulul <ha...@ulul.org> wrote:

>  Dear hadoopers,
>
> Has anyone been confronted with deploying a cluster in a traditional IT shop
> whose admins handle thousands of servers?
> They traditionally use SAN or NAS storage for app data, rely on RAID 1 for
> system disks and in the few cases where internal disks are used, they
> configure them with RAID 5 provided by the internal HW controller.
>
>
Yes.  I've been on both sides of this discussion.

The key is to help them understand that you don't need redundancy within a
system because Hadoop provides redundancy across the entire cluster via
replication.  This then leaves the problem as a performance one, in which
case you show them benchmarks on the hardware they provide in both RAID
(RAID0, RAID1, and RAID5) and JBOD modes.


> Using a JBOD setup, as advised in each and every Hadoop doc I ever laid
> my hands on, means that each HDD failure will imply, on top of the physical
> replacement of the drive, that an admin performs at least an mkfs.
> Added to the fact that these operations will become more frequent since
> more internal disks will be used, it can be perceived as an annoying
> disruption in industrial handling of numerous servers.
>
>
I fail to see how this is really any different from the process of having
to deal with a failed drive in an array.  Depending on your array type, you
may still have to do things to quiesce the bus before doing any drive
operation, such as adding or removing the drive; you may still have to
trigger the rebuild yourself; and so on.

I have a few thousand disks in my cluster.  We lose about 3-5 a quarter.  I
don't find it any more work to re-mkfs the drive after it's been swapped
out and have built tools around the process to make sure it's consistently
done by our DC staff (and yes, I did it before the DC staff was asked to).
If you're concerned about the high-touch aspect of swapping disks out, then
you can always configure the datanode to be tolerant of multiple disk
failures (something you cannot do with RAID5) and then just take the whole
machine out of the cluster to do swaps when you've reached a particular
threshold of bad disks.
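
For reference, the knob for that is dfs.datanode.failed.volumes.tolerated in
hdfs-site.xml; it defaults to 0, i.e. the datanode shuts down on the first
failed volume.  A quick way to see what a node is actually running with (a
sketch only; the config path is a placeholder for your distribution's layout):

#!/usr/bin/env python3
"""Report how many failed data volumes the local datanode will tolerate."""
import xml.etree.ElementTree as ET

HDFS_SITE = "/etc/hadoop/conf/hdfs-site.xml"  # placeholder path
PROP = "dfs.datanode.failed.volumes.tolerated"

tolerated = 0  # Hadoop's default when the property is absent
for prop in ET.parse(HDFS_SITE).getroot().findall("property"):
    if prop.findtext("name") == PROP:
        tolerated = int(prop.findtext("value"))

print(f"{PROP} = {tolerated}")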



> In Tom White's guide there is a discussion of RAID 0, stating that Yahoo
> benchmarks showed a 10% loss in performance so we can expect even worse
> perf with RAID 5 but I found no figures.
>

I had to re-read that section for reference.  My apologies if the following
is a little long-winded and rambling.

I'm going to assume that Tom is not talking about single-disk RAID0
volumes, which is a common way of doing JBOD with a RAID controller that
doesn't have JBOD support.

In general, performance is going to depend upon how many active streams of
I/O you have going on the system.

With JBOD, as Tom discusses, every spindle is its own unique snowflake,
and if your drive controller can keep up, you can write as fast as that
drive can handle reading off the bus.  Performance is going to depend upon
how many active reading/writing streams you have accessing each spindle in
the systems.

If I had one stream, I would only get the performance of one spindle in the
JBOD. If I had twelve spindles, I'm going to get maximum performance with
at least twelve streams. With RAID0, you're taking your one stream, cutting
it up into multiple parts and either reading it or writing it to all disks,
taking advantage of the performance of all spindles.

The problem arises when you start adding more streams in parallel to the
RAID0 environment.  Each parallel I/O operation begins competing with each
other from the controller's standpoint.  Sometimes things start to stack up
as the controller has to wait for competing I/O operations on a single
spindle.  For example, having to wait for a write to complete before a read
can be done.

At a certain point, the performance of RAID0 begins to hit a knee as the
number of I/O requests goes up because the controller becomes the
bottleneck.  RAID0 is going to be the closest in performance, but with the
risk that if you lose a single disk, you lose the entire RAID.  With JBOD,
if you lose a single disk, you only lose the data on that disk.

Now, with RAID5, you're going to have even worse performance because you're
dealing with not only the parity calculation, but also with the fact that
you incur a performance penalty during reads and writes due to how the data
is laid out across all disks in the RAID.  You can read more about this
here: http://theithollow.com/2012/03/understanding-raid-penalty/
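
To put numbers on it, the usual rule of thumb is a write penalty of 1 back-end
I/O per write for JBOD/RAID0, 2 for RAID1/10, 4 for RAID5 (read data, read
parity, write data, write parity) and 6 for RAID6.  A back-of-the-envelope
sketch, with assumed spindle count and per-disk IOPS rather than measured
values:

#!/usr/bin/env python3
"""Effective random-write IOPS under the classic RAID write-penalty rule."""

SPINDLES = 12
IOPS_PER_DISK = 75  # assumption: rough figure for a 7200rpm NL-SAS drive
WRITE_PENALTY = {"JBOD/RAID0": 1, "RAID1/10": 2, "RAID5": 4, "RAID6": 6}

raw_iops = SPINDLES * IOPS_PER_DISK
for level, penalty in WRITE_PENALTY.items():
    print(f"{level:10} ~{raw_iops // penalty:4} effective write IOPS "
          f"(raw {raw_iops}, write penalty {penalty})")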

To put this in perspective, I use 12 7200rpm NLSAS disks in a system
connected to an LSI9207 SAS controller.  This is configured for JBOD.  I
have benchmarked streaming reads and writes in this environment to be
between 1.6 and 1.8GBytes/sec using 1 i/o stream per spindle for a total of
12 i/o streams occurring on the system.  Btw, this benchmark has held
stable at this rate for at least 3 i/o streams per spindle; I haven't
tested higher yet.
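
That lines up with simple arithmetic, assuming a streaming rate of roughly
140 MB/s per 7200rpm NL-SAS spindle (an assumption, not a measurement):

spindles = 12
mb_per_s_per_spindle = 140  # assumed streaming rate per 7200rpm NL-SAS disk
print(f"~{spindles * mb_per_s_per_spindle / 1000:.1f} GB/s aggregate streaming")
# -> ~1.7 GB/s, consistent with the 1.6-1.8 GBytes/sec measured above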

Now, I might get this performance with RAID0, but why should I tolerate the
risk of losing all data on the system vs just the data on a single drive?
Going with RAID0 means that not only do I have to replace the disk, but now
I have to have Hadoop rebalance/redistribute data to the entire system, rather
than just dealing with the small amount of data missing from one spindle.  And
since Hadoop is already handling my redundancy via replication of data, why
should I tolerate the performance penalty associated with RAID5?  I don't
need redundancy in a *single* system, I need redundancy across the entire
cluster.
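
For completeness, that cluster-level redundancy is just the standard HDFS
replication factor, set in hdfs-site.xml (3 is the stock default, shown
here purely for illustration):

  <property>
    <name>dfs.replication</name>
    <!-- Stock default: each block is stored on 3 different datanodes. -->
    <value>3</value>
  </property>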



> I also found an Hortonworks interview of StackIQ who provides software to
> automate such failure fix up. But it would be rather painful to go straight
> to another solution, contract and so on while starting with Hadoop.
>
> Please share your experiences around RAID for redundancy (1, 5 or other)
> in Hadoop conf.
>
>
I can't see any situation in which we would use RAID for the data drives
in our Hadoop cluster.  We only use RAID1 for the OS drives, simply
because we want to reduce the recovery time after a system failure: no
reason to re-install a system and replicate data back onto it if we don't
have to.

Cheers,
Travis
-- 
Travis Campbell
travis@ghostar.org
