Posted to common-user@hadoop.apache.org by praveenesh kumar <pr...@gmail.com> on 2011/09/29 08:50:53 UTC

Is SAN storage a good option for Hadoop?

Hi,

I want to know whether we can use SAN storage for a Hadoop cluster setup.
If yes, what are the best practices?

Is it a good approach, considering that "the underlying power of Hadoop
is co-locating the processing power (CPU) with the data storage, and thus it
must be local storage to be effective"?
*But also, is it right to say “local is better” in a situation where I
have a single local 5400 RPM IDE drive, which would be dramatically slower
than SAN storage striped across many drives spinning at 10k RPM and
accessed via Fibre Channel?*

Thanks,
Praveenesh

Re: Is SAN storage a good option for Hadoop?

Posted by Paul Ingles <pa...@oobaloo.co.uk>.

Our Hadoop journey included a brief stint running on our own virtualised infrastructure. Our pre-Hadoop application was already running on the VM infrastructure so we set up a small cluster as virtual machines on the SAN.

It worked OK for a while, but as our usage grew we ditched it for a couple of reasons:

1) Performance was inconsistent because the infrastructure was multi-tenanted (the VM hosts served other applications, and the SAN backed most storage for the company). This became an issue as jobs we'd expect to complete within a few minutes or hours would take most of a morning.

2) Cost of growth was a stepped line. When we started we had plenty of space, but as we began to view HDFS (and Hadoop) as a fantastic place to store structured and unstructured data, our storage growth accelerated. We could buy more expensive disks to grow capacity a bit, or we'd need to buy a whole new controller. This actually turned out to be the reason to just buy our own physical boxes: the cost of buying 5 decently specced machines was significantly less than the cost of another SAN.

For us SAN storage would never have worked out: we're now at around 90TB of capacity (probably small compared to some on this list :) but that would have cost us a small fortune on a SAN.

I can't say much about SAN performance vs. physical performance other than physical was drastically better for us. However, it was the limitations above that caused us to make the leap and it's been well worth it!

Re: Is SAN storage a good option for Hadoop?

Posted by Steve Loughran <st...@apache.org>.

On 29/09/11 13:28, Brian Bockelman wrote:
>
> Hi Praveenesh,
>
> Two things:
> 1) If the choice is a single 5400 RPM IDE drive (you can still buy those?) versus a high-end SAN, the high-end SAN is going to win. That's often a false comparison, though: the real question is usually "What can I buy for $50k?" In that case (setting aside organizational politics), you can buy more spindles in the "traditional" Hadoop setup than in the SAN.
>    - Also, if you're latency-limited, you're likely working against yourself. The best thing I ever did for my organization was make our software work just as well with 100ms latency as with 1ms latency.
> 2) As Paul pointed out, you have to ask yourself whether the SAN is shared or dedicated. Many SANs don't have the ability to strongly partition workloads between users.
>
> Brian
>

One more: a SAN is a SPOF. [Gray05] includes the impact of a SAN outage on
MS TerraServer, while [Jiang08] provides evidence that entry-level
Fibre Channel storage is less reliable than SATA due to interconnects.

Anyone who criticises the NameNode for being a SPOF and relies on a SAN 
instead is missing something obvious.

[Gray05] Empirical Measurements of Disk Failure Rates and Error Rates
[Jiang08] Are disks the dominant contributor for storage failures?

Re: Is SAN storage a good option for Hadoop?

Posted by Brian Bockelman <bb...@cse.unl.edu>.

Hi Praveenesh,

Two things:
1) If the choice is a single 5400 RPM IDE drive (you can still buy those?) versus a high-end SAN, the high-end SAN is going to win. That's often a false comparison, though: the real question is usually "What can I buy for $50k?" In that case (setting aside organizational politics), you can buy more spindles in the "traditional" Hadoop setup than in the SAN.
  - Also, if you're latency-limited, you're likely working against yourself. The best thing I ever did for my organization was make our software work just as well with 100ms latency as with 1ms latency.
2) As Paul pointed out, you have to ask yourself whether the SAN is shared or dedicated. Many SANs don't have the ability to strongly partition workloads between users.
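
A rough way to put numbers on point 1 is to work out how many spindles, and how much raw sequential bandwidth, a fixed budget buys in each setup. The sketch below is only an illustration: every price, disk count and throughput figure in it is a made-up placeholder rather than anything quoted in this thread or by a vendor, and the function names are hypothetical. It also models a shared SAN fabric as a single bandwidth cap, which is a simplifying assumption.

```python
# All prices and throughput figures below are invented placeholders.
# Substitute real quotes before drawing any conclusion.

def spindles_for_budget(budget, unit_cost, disks_per_unit):
    """Number of disks a budget buys when disks come in fixed-size units."""
    units = budget // unit_cost  # whole units only (a node, or a SAN shelf)
    return units * disks_per_unit

def aggregate_mb_per_s(spindles, mb_per_s_per_disk, shared_link_cap=None):
    """Raw sequential bandwidth, optionally capped by a shared interconnect."""
    raw = spindles * mb_per_s_per_disk
    return raw if shared_link_cap is None else min(raw, shared_link_cap)

BUDGET = 50_000  # the "$50k" figure from point 1

# Hypothetical commodity node: $4,000 with 4 SATA disks at ~100 MB/s each.
local_disks = spindles_for_budget(BUDGET, unit_cost=4_000, disks_per_unit=4)
local_bw = aggregate_mb_per_s(local_disks, 100)

# Hypothetical SAN shelf: $25,000 with 12 disks at ~150 MB/s each, behind
# a fabric assumed to top out at 1,600 MB/s for all clients combined.
san_disks = spindles_for_budget(BUDGET, unit_cost=25_000, disks_per_unit=12)
san_bw = aggregate_mb_per_s(san_disks, 150, shared_link_cap=1_600)

print(f"local: {local_disks} spindles, {local_bw} MB/s aggregate")
print(f"SAN:   {san_disks} spindles, {san_bw} MB/s aggregate")
```

The outcome depends entirely on the inputs, so the useful part is the shape of the comparison (whole units per budget, then a cap for anything shared), not the placeholder numbers.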

Brian