Posted to common-user@hadoop.apache.org by "Pamecha, Abhishek" <ap...@x.com> on 2012/10/16 20:28:18 UTC

HDFS using SAN

Hi

I have read scattered documentation across the net, most of which says HDFS doesn't go well with a SAN being used to store data, while some say it is an emerging trend. I would love to know if any tests have been performed that hint at the aspects in which direct storage excels or falls behind a SAN.

We are investigating whether a direct storage option is better than SAN storage for a modest cluster with data in 100 TBs in steady state. The SAN can of course support an order of magnitude more IOPS than we care about for now, but given that it is shared infrastructure and we may expand our data size, that may not be an advantage in the future.

Another thing I am interested in: for MR jobs, where data locality is the key driver, how does that pan out when using a SAN instead of direct storage?

And of course, on the subjective topics of availability and reliability when using a SAN for data storage in HDFS, I would love to receive your views.

Thanks,
Abhishek

RE: HDFS using SAN

Posted by "Pamecha, Abhishek" <ap...@x.com>.

In a SAN? Would it be a concern if I am relying on HDFS to do the replication and using the SAN only as a dumb storage tier? In that case, the only difference is remote vs. local access.

Reliability may actually be even better in a SAN, because I would assume any reasonable SAN provides decent fault tolerance when its controller(s) fail.

Thanks,
Abhishek

From: Mohamed Riadh Trad [mailto:Mohamed.trad@inria.fr]
Sent: Wednesday, October 17, 2012 6:37 AM
To: user@hadoop.apache.org
Subject: Re: HDFS using SAN

Back up your data!

On Oct 17, 2012, at 15:25, Kevin O'dell wrote:

You may want to take a look at the NetApp white paper on this. They have a SAN solution as their Hadoop offering.

http://www.netapp.com/templates/mediaView?m=tr-3969.pdf&cc=us&wid=130618138&mid=56872393

On Tue, Oct 16, 2012 at 7:28 PM, Pamecha, Abhishek <ap...@x.com> wrote:
Yes, for MR, my impression is that network utilization is typically next to none during map and reduce tasks but jumps during shuffle. With a SAN, I would assume there is no such separation: there will be network activity throughout the job's time window, with shuffle probably doing more than it should.

Moreover, I hear that SANs typically split data across different physical disks by default [even w/o RAID], so contiguity is lost. I have no idea whether that is a good thing or a bad one. It looks bad on the surface, but probably depends on how efficiently a SAN can parallelize data fetches from multiple physical disks. Any comments on this aspect?

And yes, when the dataset volume increases and one basically needs to do full-table-scan equivalents, I am assuming the network needs to support moving that entire dataset from the SAN to the data nodes, all in parallel to different mappers.
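
To put rough numbers on the full-scan concern above, here is a minimal back-of-the-envelope sketch. The function name and the example figures (a 100 TB dataset and 64 Gb/s of usable aggregate SAN bandwidth) are assumptions chosen for illustration, not measurements from this thread, and the estimate ignores protocol overhead and contention from shuffle traffic.

```python
def full_scan_seconds(dataset_tb: float, aggregate_gbps: float) -> float:
    """Lower bound on the time to move a whole dataset off the SAN.

    dataset_tb     -- dataset size in decimal terabytes (hypothetical input)
    aggregate_gbps -- usable SAN bandwidth in gigabits per second, shared
                      by all data nodes (hypothetical input)
    """
    if aggregate_gbps <= 0:
        raise ValueError("aggregate_gbps must be positive")
    dataset_bits = dataset_tb * 1e12 * 8          # TB -> bytes -> bits
    return dataset_bits / (aggregate_gbps * 1e9)  # bits / (bits per second)


if __name__ == "__main__":
    seconds = full_scan_seconds(dataset_tb=100, aggregate_gbps=64)
    print(f"full scan takes at least {seconds / 3600:.2f} hours")
```

With these assumed figures the scan cannot finish in less than about 3.5 hours, however many mappers run in parallel, because the shared pipe rather than the disks sets the floor.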

So what I am gathering is that although storing data on a SAN is possible for a Hadoop installation, map-shuffle-reduce may not be the best way to process data in that environment. Is this conclusion correct?

<3 way Replication and RAID suggestions are great.

Thanks,
Abhishek

From: lohit [mailto:lohit.vijayarenu@gmail.com]
Sent: Tuesday, October 16, 2012 3:26 PM
To: user@hadoop.apache.org
Subject: Re: HDFS using SAN

Adding to this: locality is very important for MapReduce applications. One might not see much of a difference for small MapReduce jobs running on direct-attached storage vs. SAN, but when your cluster grows or you find jobs which are heavy on IO, you would see quite a bit of difference. Another obvious difference is cost. The argument for that has been that SAN storage is much more reliable, so you do not need the default 3-way replication factor you would use on direct-attached storage.
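
The cost side of the replication argument can be sketched with simple arithmetic. This is only an illustration: the 100 TB figure is taken loosely from the original question, the function name is made up, and the replication factor corresponds to the HDFS `dfs.replication` setting (default 3).

```python
def raw_capacity_tb(logical_tb: float, replication: int) -> float:
    """Raw storage HDFS consumes for a given amount of logical data.

    Every block is stored `replication` times, so raw usage scales
    linearly with the replication factor.
    """
    if replication < 1:
        raise ValueError("replication must be at least 1")
    return logical_tb * replication


if __name__ == "__main__":
    logical_tb = 100  # assumed steady-state data size
    for replication in (1, 2, 3):
        raw = raw_capacity_tb(logical_tb, replication)
        print(f"replication={replication}: {raw:.0f} TB raw")
```

Dropping from 3 replicas to 1 cuts raw capacity from 300 TB to 100 TB in this example, which is the saving the reliability argument is trying to buy. Whether SAN pricing per TB leaves that a net win is a separate question.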

2012/10/16 Jeffrey Buell <jb...@vmware.com>
It will be difficult to make a SAN work well for Hadoop, but not impossible.  I have done direct comparisons (but not published them yet).  Direct local storage is likely to have much more capacity and more total bandwidth.  But you can do pretty well with a SAN if you stuff it with the highest-capacity disks and provide an independent 8 Gb (FC) or 10 GbE connection for every host.  Watch out for overall SAN bandwidth limits (which may well be much less than the sum of the capacities of the wires connected to it).  There will definitely be a hard limit on how many hosts you can connect to a single SAN.  Scaling to larger clusters will require multiple SANs.
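
The bandwidth warning above can be made concrete with a small sketch. It assumes the SAN's aggregate limit is shared evenly when every host reads at once. The function names and the figures (8 Gb/s FC links, a 64 Gb/s SAN ceiling) are hypothetical, and real arrays will not share bandwidth this cleanly.

```python
def per_host_bandwidth_gbps(hosts: int, link_gbps: float,
                            san_aggregate_gbps: float) -> float:
    """Bandwidth each host can expect when all hosts read concurrently.

    A host is capped by its own link and by its fair share of the
    SAN's overall limit, whichever is smaller.
    """
    if hosts <= 0:
        raise ValueError("hosts must be positive")
    fair_share = san_aggregate_gbps / hosts
    return min(link_gbps, fair_share)


def max_hosts_at_full_link_speed(link_gbps: float,
                                 san_aggregate_gbps: float) -> int:
    """Largest host count for which every link can still be saturated."""
    return int(san_aggregate_gbps // link_gbps)


if __name__ == "__main__":
    link, san = 8.0, 64.0  # assumed figures, in Gb/s
    print("hosts at full link speed:", max_hosts_at_full_link_speed(link, san))
    for hosts in (4, 8, 20):
        share = per_host_bandwidth_gbps(hosts, link, san)
        print(f"{hosts} hosts -> {share:.1f} Gb/s each")
```

Past the point where the links add up to the SAN ceiling (8 hosts in this example), each added host reduces everyone's share, which is one way to read the "hard limit" on hosts per SAN.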

Locality is an issue.  Even though each host has direct physical access to all the data, a "remote" access in HDFS will still have to go over the network to the host that owns the data.  "Local" access is fine with the constraints above.

RAID is not good for Hadoop performance on either local or SAN storage, so you'll want to configure one LUN for each physical disk in the SAN.  If you do have mirroring or RAID on the SAN, you may be tempted to use that to replace Hadoop replication.  But while the data is protected, access to the data is lost if the datanode goes down.  You can get around that by running the datanode in a VM which is stored on the SAN and using VMware HA to automatically restart the VM on another host in case of a failure.  Hortonworks has demonstrated this use case, but this strategy is a bit bleeding-edge.

Jeff

--
Have a Nice Day!
Lohit

--
Kevin O'Dell
Customer Operations Engineer, Cloudera

Re: HDFS using SAN

Posted by Mohamed Riadh Trad <Mo...@inria.fr>.

Back up your data!

Re: HDFS using SAN

Posted by Mohamed Riadh Trad <Mo...@inria.fr>.

Sauvegarde tes données!

Le 17 oct. 2012 à 15:25, Kevin O'dell a écrit :

> You may want to take a look at the Netapp White Paper on this.  They have a SAN solution as their Hadoop offering.
> 
> http://www.netapp.com/templates/mediaView?m=tr-3969.pdf&cc=us&wid=130618138&mid=56872393
> 
> On Tue, Oct 16, 2012 at 7:28 PM, Pamecha, Abhishek <ap...@x.com> wrote:
> Yes, for MR, my impression is typically the n/w utilization is next to none during map and reduce tasks but jumps during shuffle.  With a SAN, I would assume there is no such separation. There will be network activity all over the job’s time window with shuffle probably doing more than what it should.
> 
>  
> 
> Moreover, I hear typically SANs by default, would split data in different physical disks [even w/o RAID], so contiguity is lost. But I have no idea on if that is a good thing or bad. Looks bad on the surface, but probably depends on how much parallelized data fetches from multiple physical disks can be done by a SAN efficiently. Any comments on this aspect?
> 
>  
> 
> And yes, when the dataset volume increases and one needs to basically do full table scan equivalents, I am assuming the n/w needs to support that entire data move from SAN to the data node all in parallel to different mappers.
> 
>  
> 
> So what I am gathering is  although storing data over SAN is possible for a Hadoop installation, Map-shuffle-reduce may not be the best way to process data in that env. Is this conclusion correct?
> 
>  
> 
> <3 way Replication and RAID suggestions are great.
> 
>  
> 
> Thanks,
> 
> Abhishek
> 
>  
> 
> From: lohit [mailto:lohit.vijayarenu@gmail.com] 
> Sent: Tuesday, October 16, 2012 3:26 PM
> To: user@hadoop.apache.org
> Subject: Re: HDFS using SAN
> 
>  
> 
> Adding to this. Locality is very important for MapReduce applications. One might not see much of a difference for small MapReduce jobs running on direct attached storage vs SAN, but when you cluster grows or you find jobs which are heavy on IO, you would see quite a bit of difference. One thing which is obviously is also cost difference. Argument for that has been that SAN storage is much more reliable so you do not need default of 3 way replication factor you would do on direct attached storage. 
> 
>  
> 
> 2012/10/16 Jeffrey Buell <jb...@vmware.com>
> 
> It will be difficult to make a SAN work well for Hadoop, but not impossible.  I have done direct comparisons (but not published them yet).  Direct local storage is likely to have much more capacity and more total bandwidth.  But you can do pretty well with a SAN if you stuff it with the highest-capacity disks and provide an independent 8 gb (FC) or 10 GbE connection for every host.  Watch out for overall SAN bandwidth limits (which may well be much less than the sum of the capacity of the wires connected to it).  There will definitely be a hard limit to how many hosts you connect to a single SAN.  Scaling to larger clusters will require multiple SANs.
> 
>  
> 
> Locality is an issue.  Even though each host has a direct physical access to all the data, a “remote” access in HDFS will still have to go over the network to the host that owns the data.  “Local” access is fine with the constraints above.
> 
>  
> 
> RAID is not good for Hadoop performance for both local and SAN storage, so you’ll want to configure one LUN for each physical disk in the SAN.  If you do have mirroring or RAID on the SAN, you may be tempted to use that to replace Hadoop replication.  But while the data is protected, access to the data is lost if the datanode goes down.  You can get around that by running the datanode in a VM which is stored on the SAN and using VMware HA to automatically restart the VM on another host in case of a failure.  Hortonworks has demonstrated this use-case but this strategy is a bit bleeding-edge.
> 
>  
> 
> Jeff
> 
>  
> 
> From: Pamecha, Abhishek [mailto:apamecha@x.com] 
> Sent: Tuesday, October 16, 2012 11:28 AM
> To: user@hadoop.apache.org
> Subject: HDFS using SAN
> 
>  
> 
> Hi
> 
>  
> 
> I have read scattered documentation across the net which mostly says HDFS doesn't go well with a SAN being used to store data, while some say it is an emerging trend. I would love to know if there have been any tests performed which hint at the aspects in which direct storage excels or falls behind a SAN.
> 
>  
> 
> We are investigating whether a direct storage option is better than SAN storage for a modest cluster with data in 100 TBs in steady state. The SAN of course can support an order of magnitude more IOPS than we care about for now, but given it is a shared infrastructure and we may expand our data size, it may not be an advantage in the future.
> 
>  
> 
> Another thing I am interested in: for MR jobs, where data locality is the key driver, how does that pan out when using a SAN instead of direct storage?
> 
>  
> 
> And of course, on the subjective topics of availability and reliability when using a SAN for data storage in HDFS, I would love to receive your views.
> 
>  
> 
> Thanks,
> 
> Abhishek
> 
>  
> 
> 
> 
> 
>  
> 
> -- 
> Have a Nice Day!
> Lohit
> 
> 
> 
> 
> -- 
> Kevin O'Dell
> Customer Operations Engineer, Cloudera
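
A rough way to put numbers on Jeff's bandwidth warning quoted above is to compare the aggregate throughput of local disks with what a shared SAN can actually deliver. The sketch below is only a back-of-envelope model in Python; the function names and every figure in the usage note (host count, disks per host, per-disk and per-link rates, array limit) are hypothetical inputs, not measurements from this thread.

```python
# Back-of-envelope comparison of aggregate read bandwidth:
# direct-attached disks vs. a shared SAN.
# All inputs are hypothetical planning figures, not measured values.

def das_bandwidth_mbps(hosts, disks_per_host, disk_mbps):
    """Aggregate MB/s when every host reads from its own local disks."""
    return hosts * disks_per_host * disk_mbps


def san_bandwidth_mbps(hosts, link_mbps, san_backend_mbps):
    """Aggregate MB/s through a SAN.

    The result is capped by whichever is smaller: the sum of the
    per-host links or the array's own overall bandwidth limit.
    """
    return min(hosts * link_mbps, san_backend_mbps)


def full_scan_seconds(dataset_tb, aggregate_mbps):
    """Seconds needed to stream the whole dataset once at the given rate."""
    dataset_mb = dataset_tb * 1_000_000  # decimal TB -> MB
    return dataset_mb / aggregate_mbps
```

For example, with 20 hosts of 12 disks each at an assumed 100 MB/s per disk, das_bandwidth_mbps(20, 12, 100) gives 24000 MB/s, while san_bandwidth_mbps(20, 800, 6000) is capped at 6000 MB/s by the assumed array limit, even though the host links add up to 16000 MB/s. Feeding either figure into full_scan_seconds shows how the cap stretches a full-table-scan style job.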

Re: HDFS using SAN

Posted by Mohamed Riadh Trad <Mo...@inria.fr>.

Back up your data!

On 17 Oct 2012, at 15:25, Kevin O'dell wrote:

> You may want to take a look at the Netapp White Paper on this.  They have a SAN solution as their Hadoop offering.
> 
> http://www.netapp.com/templates/mediaView?m=tr-3969.pdf&cc=us&wid=130618138&mid=56872393
> 
> On Tue, Oct 16, 2012 at 7:28 PM, Pamecha, Abhishek <ap...@x.com> wrote:
> Yes, for MR, my impression is typically the n/w utilization is next to none during map and reduce tasks but jumps during shuffle.  With a SAN, I would assume there is no such separation. There will be network activity all over the job’s time window with shuffle probably doing more than what it should.
> 
>  
> 
> Moreover, I hear that SANs typically, by default, split data across different physical disks [even w/o RAID], so contiguity is lost. But I have no idea whether that is a good thing or bad. It looks bad on the surface, but it probably depends on how efficiently a SAN can parallelize data fetches from multiple physical disks. Any comments on this aspect?
> 
>  
> 
> And yes, when the dataset volume increases and one needs to basically do full table scan equivalents, I am assuming the n/w needs to support that entire data move from SAN to the data node all in parallel to different mappers.
> 
>  
> 
> So what I am gathering is that, although storing data on a SAN is possible for a Hadoop installation, map-shuffle-reduce may not be the best way to process data in that environment. Is this conclusion correct?
> 
>  
> 
> <3 way Replication and RAID suggestions are great.
> 
>  
> 
> Thanks,
> 
> Abhishek
> 
>  
> 
> From: lohit [mailto:lohit.vijayarenu@gmail.com] 
> Sent: Tuesday, October 16, 2012 3:26 PM
> To: user@hadoop.apache.org
> Subject: Re: HDFS using SAN
> 
>  
> 
> Adding to this: locality is very important for MapReduce applications. One might not see much of a difference for small MapReduce jobs running on direct-attached storage vs. SAN, but when your cluster grows or you find jobs which are heavy on IO, you would see quite a bit of difference. Another obvious difference is cost. The argument for that has been that SAN storage is much more reliable, so you do not need the default 3-way replication factor you would use on direct-attached storage.
> 
>  
> 
> 2012/10/16 Jeffrey Buell <jb...@vmware.com>
> 
> It will be difficult to make a SAN work well for Hadoop, but not impossible.  I have done direct comparisons (but not published them yet).  Direct local storage is likely to have much more capacity and more total bandwidth.  But you can do pretty well with a SAN if you stuff it with the highest-capacity disks and provide an independent 8 Gb (FC) or 10 GbE connection for every host.  Watch out for overall SAN bandwidth limits (which may well be much less than the sum of the capacity of the wires connected to it).  There will definitely be a hard limit to how many hosts you connect to a single SAN.  Scaling to larger clusters will require multiple SANs.
> 
>  
> 
> Locality is an issue.  Even though each host has a direct physical access to all the data, a “remote” access in HDFS will still have to go over the network to the host that owns the data.  “Local” access is fine with the constraints above.
> 
>  
> 
> RAID is not good for Hadoop performance for both local and SAN storage, so you’ll want to configure one LUN for each physical disk in the SAN.  If you do have mirroring or RAID on the SAN, you may be tempted to use that to replace Hadoop replication.  But while the data is protected, access to the data is lost if the datanode goes down.  You can get around that by running the datanode in a VM which is stored on the SAN and using VMware HA to automatically restart the VM on another host in case of a failure.  Hortonworks has demonstrated this use-case but this strategy is a bit bleeding-edge.
> 
>  
> 
> Jeff
> 
>  
> 
> From: Pamecha, Abhishek [mailto:apamecha@x.com] 
> Sent: Tuesday, October 16, 2012 11:28 AM
> To: user@hadoop.apache.org
> Subject: HDFS using SAN
> 
>  
> 
> Hi
> 
>  
> 
> I have read scattered documentation across the net which mostly says HDFS doesn't go well with a SAN being used to store data, while some say it is an emerging trend. I would love to know if there have been any tests performed which hint at the aspects in which direct storage excels or falls behind a SAN.
> 
>  
> 
> We are investigating whether a direct storage option is better than SAN storage for a modest cluster with data in 100 TBs in steady state. The SAN of course can support an order of magnitude more IOPS than we care about for now, but given it is a shared infrastructure and we may expand our data size, it may not be an advantage in the future.
> 
>  
> 
> Another thing I am interested in: for MR jobs, where data locality is the key driver, how does that pan out when using a SAN instead of direct storage?
> 
>  
> 
> And of course, on the subjective topics of availability and reliability when using a SAN for data storage in HDFS, I would love to receive your views.
> 
>  
> 
> Thanks,
> 
> Abhishek
> 
>  
> 
> 
> 
> 
>  
> 
> -- 
> Have a Nice Day!
> Lohit
> 
> 
> 
> 
> -- 
> Kevin O'Dell
> Customer Operations Engineer, Cloudera

Re: HDFS using SAN

Posted by Kevin O'dell <ke...@cloudera.com>.

You may want to take a look at the Netapp White Paper on this.  They have a
SAN solution as their Hadoop offering.

http://www.netapp.com/templates/mediaView?m=tr-3969.pdf&cc=us&wid=130618138&mid=56872393

On Tue, Oct 16, 2012 at 7:28 PM, Pamecha, Abhishek <ap...@x.com> wrote:

> Yes, for MR, my impression is typically the n/w utilization is next to
> none during map and reduce tasks but jumps during shuffle.  With a SAN, I
> would assume there is no such separation. There will be network activity
> all over the job’s time window with shuffle probably doing more than what
> it should.
>
> Moreover, I hear that SANs typically, by default, split data across
> different physical disks [even w/o RAID], so contiguity is lost. But I have
> no idea whether that is a good thing or bad. It looks bad on the surface,
> but it probably depends on how efficiently a SAN can parallelize data
> fetches from multiple physical disks. Any comments on this aspect?
>
> And yes, when the dataset volume increases and one needs to basically do
> full table scan equivalents, I am assuming the n/w needs to support that
> entire data move from SAN to the data node all in parallel to different
> mappers.
>
> So what I am gathering is that, although storing data on a SAN is possible
> for a Hadoop installation, map-shuffle-reduce may not be the best way to
> process data in that environment. Is this conclusion correct?
>
> <3 way Replication and RAID suggestions are great.
>
> Thanks,
>
> Abhishek
>
> *From:* lohit [mailto:lohit.vijayarenu@gmail.com]
> *Sent:* Tuesday, October 16, 2012 3:26 PM
> *To:* user@hadoop.apache.org
> *Subject:* Re: HDFS using SAN
>
> Adding to this: locality is very important for MapReduce applications. One
> might not see much of a difference for small MapReduce jobs running on
> direct-attached storage vs. SAN, but when your cluster grows or you find
> jobs which are heavy on IO, you would see quite a bit of difference. Another
> obvious difference is cost. The argument for that has been that SAN storage
> is much more reliable, so you do not need the default 3-way replication
> factor you would use on direct-attached storage.
>
> 2012/10/16 Jeffrey Buell <jb...@vmware.com>
>
> It will be difficult to make a SAN work well for Hadoop, but not
> impossible.  I have done direct comparisons (but not published them yet).
> Direct local storage is likely to have much more capacity and more total
> bandwidth.  But you can do pretty well with a SAN if you stuff it with the
> highest-capacity disks and provide an independent 8 Gb (FC) or 10 GbE
> connection for every host.  Watch out for overall SAN bandwidth limits
> (which may well be much less than the sum of the capacity of the wires
> connected to it).  There will definitely be a hard limit to how many hosts
> you connect to a single SAN.  Scaling to larger clusters will require
> multiple SANs.
>
> Locality is an issue.  Even though each host has direct physical access
> to all the data, a “remote” access in HDFS will still have to go over the
> network to the host that owns the data.  “Local” access is fine with the
> constraints above.
>
> RAID is not good for Hadoop performance on either local or SAN storage, so
> you’ll want to configure one LUN for each physical disk in the SAN.  If you
> do have mirroring or RAID on the SAN, you may be tempted to use that to
> replace Hadoop replication.  But while the data is protected, access to the
> data is lost if the datanode goes down.  You can get around that by running
> the datanode in a VM which is stored on the SAN and using VMware HA to
> automatically restart the VM on another host in case of a failure.
> Hortonworks has demonstrated this use-case but this strategy is a bit
> bleeding-edge.
>
> Jeff
>
> *From:* Pamecha, Abhishek [mailto:apamecha@x.com]
> *Sent:* Tuesday, October 16, 2012 11:28 AM
> *To:* user@hadoop.apache.org
> *Subject:* HDFS using SAN
>
> Hi
>
> I have read scattered documentation across the net which mostly says HDFS
> doesn't go well with a SAN being used to store data, while some say it is
> an emerging trend. I would love to know if there have been any tests
> performed which hint at the aspects in which direct storage excels or falls
> behind a SAN.
>
> We are investigating whether a direct storage option is better than SAN
> storage for a modest cluster with data in 100 TBs in steady state. The SAN
> of course can support an order of magnitude more IOPS than we care about
> for now, but given it is a shared infrastructure and we may expand our data
> size, it may not be an advantage in the future.
>
> Another thing I am interested in: for MR jobs, where data locality is the
> key driver, how does that pan out when using a SAN instead of direct
> storage?
>
> And of course, on the subjective topics of availability and reliability
> when using a SAN for data storage in HDFS, I would love to receive your
> views.
>
> Thanks,
>
> Abhishek
>
> --
> Have a Nice Day!
> Lohit



-- 
Kevin O'Dell
Customer Operations Engineer, Cloudera
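
The replication trade-off that Lohit and Jeff describe in the quoted text can be made concrete with a little arithmetic. The Python sketch below is an illustration only: the helper names, the 100 TB figure reused from the original question, and the mirroring factor of 2.0 are assumptions for the example, not a sizing recommendation.

```python
# Rough capacity and availability arithmetic for HDFS replication
# on plain disks vs. RAID-protected SAN storage.
# Helper names and example figures are hypothetical.

def raw_capacity_tb(logical_tb, hdfs_replication, raid_factor=1.0):
    """Raw disk needed to hold logical_tb of data.

    hdfs_replication: copies HDFS keeps of each block (dfs.replication).
    raid_factor: raw-to-usable ratio of the underlying storage,
                 e.g. 1.0 for plain disks, 2.0 for mirroring.
    """
    if hdfs_replication < 1 or raid_factor < 1.0:
        raise ValueError("replication and raid_factor must be >= 1")
    return logical_tb * hdfs_replication * raid_factor


def surviving_copies(hdfs_replication, datanodes_down):
    """Copies of a block still reachable through HDFS when that many of
    the datanodes holding it are down.

    RAID inside the SAN does not add to this count: HDFS can only reach
    a copy through the datanode that owns it.
    """
    return max(hdfs_replication - datanodes_down, 0)
```

With these assumptions, raw_capacity_tb(100, 3) is 300.0 TB on plain direct-attached disks, while raw_capacity_tb(100, 1, 2.0) is 200.0 TB on a mirrored SAN with HDFS replication turned down to 1. The second layout saves raw capacity, but surviving_copies(1, 1) is 0, which is Jeff's point that the data stays protected yet becomes unreachable while its datanode is down.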

Re: HDFS using SAN

Posted by Kevin O'dell <ke...@cloudera.com>.

You may want to take a look at the Netapp White Paper on this.  They have a
SAN solution as their Hadoop offering.

http://www.netapp.com/templates/mediaView?m=tr-3969.pdf&cc=us&wid=130618138&mid=56872393

On Tue, Oct 16, 2012 at 7:28 PM, Pamecha, Abhishek <ap...@x.com> wrote:

>  Yes, for MR, my impression is typically the n/w utilization is next to
> none during map and reduce tasks but jumps during shuffle.  With a SAN, I
> would assume there is no such separation. There will be network activity
> all over the job’s time window with shuffle probably doing more than what
> it should. ****
>
> ** **
>
> Moreover, I hear typically SANs by default, would split data in different
> physical disks [even w/o RAID], so contiguity is lost. But I have no idea
> on if that is a good thing or bad. Looks bad on the surface, but probably
> depends on how much parallelized data fetches from multiple physical disks
> can be done by a SAN efficiently. Any comments on this aspect?****
>
> ** **
>
> And yes, when the dataset volume increases and one needs to basically do
> full table scan equivalents, I am assuming the n/w needs to support that
> entire data move from SAN to the data node all in parallel to different
> mappers.****
>
> ** **
>
> So what I am gathering is  although storing data over SAN is possible for
> a Hadoop installation, Map-shuffle-reduce may not be the best way to
> process data in that env. Is this conclusion correct? ****
>
> ** **
>
> <3 way Replication and RAID suggestions are great. ****
>
> ** **
>
> Thanks,****
>
> Abhishek****
>
> ** **
>
> *From:* lohit [mailto:lohit.vijayarenu@gmail.com]
> *Sent:* Tuesday, October 16, 2012 3:26 PM
> *To:* user@hadoop.apache.org
> *Subject:* Re: HDFS using SAN****
>
> ** **
>
> Adding to this. Locality is very important for MapReduce applications. One
> might not see much of a difference for small MapReduce jobs running on
> direct attached storage vs SAN, but when you cluster grows or you find jobs
> which are heavy on IO, you would see quite a bit of difference. One thing
> which is obviously is also cost difference. Argument for that has been that
> SAN storage is much more reliable so you do not need default of 3 way
> replication factor you would do on direct attached storage.
>
> 2012/10/16 Jeffrey Buell <jb...@vmware.com>
>
> It will be difficult to make a SAN work well for Hadoop, but not
> impossible.  I have done direct comparisons (but not published them yet).
> Direct local storage is likely to have much more capacity and more total
> bandwidth.  But you can do pretty well with a SAN if you stuff it with the
> highest-capacity disks and provide an independent 8 gb (FC) or 10 GbE
> connection for every host.  Watch out for overall SAN bandwidth limits
> (which may well be much less than the sum of the capacity of the wires
> connected to it).  There will definitely be a hard limit to how many hosts
> you connect to a single SAN.  Scaling to larger clusters will require
> multiple SANs.
>
> Locality is an issue.  Even though each host has a direct physical access
> to all the data, a “remote” access in HDFS will still have to go over the
> network to the host that owns the data.  “Local” access is fine with the
> constraints above.
>
> RAID is not good for Hadoop performance for both local and SAN storage, so
> you’ll want to configure one LUN for each physical disk in the SAN.  If you
> do have mirroring or RAID on the SAN, you may be tempted to use that to
> replace Hadoop replication.  But while the data is protected, access to the
> data is lost if the datanode goes down.  You can get around that by running
> the datanode in a VM which is stored on the SAN and using VMware HA to
> automatically restart the VM on another host in case of a failure.
> Hortonworks has demonstrated this use-case but this strategy is a bit
> bleeding-edge.
>
> Jeff
>
> From: Pamecha, Abhishek [mailto:apamecha@x.com]
> Sent: Tuesday, October 16, 2012 11:28 AM
> To: user@hadoop.apache.org
> Subject: HDFS using SAN
>
> Hi
>
> I have read scattered documentation across the net which mostly say HDFS
> doesn't go well with SAN being used to store data. While some say, it is an
> emerging trend. I would love to know if there have been any tests performed
> which hint on what aspects does a direct storage excels/falls behind a SAN.
>
> We are investigating whether a direct storage option is better than a SAN
> storage for a modest cluster with data in 100 TBs in steady state. The SAN
> of course can support order of magnitude more of iops we care about for
> now, but given it is a shared infrastructure and we may expand our data
> size, it may not be an advantage in the future.
>
> Another thing I am interested in: for MR jobs, where data locality is the
> key driver, how does that span out when using a SAN instead of direct
> storage?
>
> And of course on the subjective topics of availability and reliability on
> using a SAN for data storage in HDFS, I would love to receive your views.
>
> Thanks,
> Abhishek
>
> --
> Have a Nice Day!
> Lohit
>



-- 
Kevin O'Dell
Customer Operations Engineer, Cloudera

RE: HDFS using SAN

Posted by "Pamecha, Abhishek" <ap...@x.com>.

Yes, for MR, my impression is that network utilization is typically next to none during map and reduce tasks but jumps during shuffle.  With a SAN, I would assume there is no such separation: there will be network activity throughout the job's time window, with shuffle probably moving more than it should.

Moreover, I hear that SANs typically split data across different physical disks by default [even w/o RAID], so contiguity is lost. I have no idea whether that is good or bad. It looks bad on the surface, but it probably depends on how efficiently a SAN can parallelize data fetches from multiple physical disks. Any comments on this aspect?

And yes, when the dataset volume increases and one basically needs to do full-table-scan equivalents, I am assuming the n/w needs to support moving that entire dataset from the SAN to the data nodes, all in parallel to different mappers.
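
To put rough numbers on that concern, here is a minimal back-of-envelope sketch in Python. Every figure in it (host count, disks per host, per-disk throughput, per-host link speed, overall SAN ceiling) is a hypothetical assumption chosen for illustration, not a measurement from this thread; only the 100 TB dataset size comes from the original question.

```python
def scan_time_hours(dataset_tb: float, aggregate_gb_per_s: float) -> float:
    """Hours needed to read dataset_tb terabytes at aggregate_gb_per_s GB/s."""
    dataset_gb = dataset_tb * 1000  # decimal units: 1 TB = 1000 GB
    return dataset_gb / aggregate_gb_per_s / 3600


# Hypothetical cluster, for illustration only.
DATASET_TB = 100           # dataset size mentioned in the thread
HOSTS = 20                 # assumed number of datanodes
DISKS_PER_HOST = 12        # assumed local disks per datanode
DISK_GB_PER_S = 0.1        # assumed ~100 MB/s sequential read per disk
HOST_LINK_GB_PER_S = 1.0   # assumed usable bandwidth of one host-to-SAN link
SAN_LIMIT_GB_PER_S = 5.0   # assumed overall ceiling of the SAN itself

# Direct attached storage: every local disk streams in parallel.
local_bw = HOSTS * DISKS_PER_HOST * DISK_GB_PER_S

# SAN: bounded by the smaller of (sum of host links) and the SAN's own limit.
san_bw = min(HOSTS * HOST_LINK_GB_PER_S, SAN_LIMIT_GB_PER_S)

print(f"local disks: {local_bw:.1f} GB/s, full scan in "
      f"{scan_time_hours(DATASET_TB, local_bw):.2f} h")
print(f"SAN:         {san_bw:.1f} GB/s, full scan in "
      f"{scan_time_hours(DATASET_TB, san_bw):.2f} h")
```

With these assumed inputs the SAN ceiling, not the host links, is the limiting term; changing `SAN_LIMIT_GB_PER_S` or `HOSTS` shows how quickly the balance shifts.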

So what I am gathering is that although storing data over a SAN is possible for a Hadoop installation, map-shuffle-reduce may not be the best way to process data in that environment. Is this conclusion correct?

<3 way Replication and RAID suggestions are great.

Thanks,
Abhishek

From: lohit [mailto:lohit.vijayarenu@gmail.com]
Sent: Tuesday, October 16, 2012 3:26 PM
To: user@hadoop.apache.org
Subject: Re: HDFS using SAN

Adding to this. Locality is very important for MapReduce applications. One might not see much of a difference for small MapReduce jobs running on direct attached storage vs SAN, but when you cluster grows or you find jobs which are heavy on IO, you would see quite a bit of difference. One thing which is obviously is also cost difference. Argument for that has been that SAN storage is much more reliable so you do not need default of 3 way replication factor you would do on direct attached storage.

2012/10/16 Jeffrey Buell <jb...@vmware.com>
It will be difficult to make a SAN work well for Hadoop, but not impossible.  I have done direct comparisons (but not published them yet).  Direct local storage is likely to have much more capacity and more total bandwidth.  But you can do pretty well with a SAN if you stuff it with the highest-capacity disks and provide an independent 8 gb (FC) or 10 GbE connection for every host.  Watch out for overall SAN bandwidth limits (which may well be much less than the sum of the capacity of the wires connected to it).  There will definitely be a hard limit to how many hosts you connect to a single SAN.  Scaling to larger clusters will require multiple SANs.

Locality is an issue.  Even though each host has a direct physical access to all the data, a “remote” access in HDFS will still have to go over the network to the host that owns the data.  “Local” access is fine with the constraints above.

RAID is not good for Hadoop performance for both local and SAN storage, so you’ll want to configure one LUN for each physical disk in the SAN.  If you do have mirroring or RAID on the SAN, you may be tempted to use that to replace Hadoop replication.  But while the data is protected, access to the data is lost if the datanode goes down.  You can get around that by running the datanode in a VM which is stored on the SAN and using VMware HA to automatically restart the VM on another host in case of a failure.  Hortonworks has demonstrated this use-case but this strategy is a bit bleeding-edge.
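
As an illustration of the "one LUN per physical disk" advice, the following Python sketch renders the two relevant `hdfs-site.xml` properties. `dfs.data.dir` (the Hadoop 1.x name; later releases call it `dfs.datanode.data.dir`) and `dfs.replication` are real HDFS properties, but the mount points `/data/lun1` ... `/data/lun4`, the `dfs/data` subdirectory, and the helper name `hdfs_site_properties` are hypothetical and would need to match the actual LUN layout.

```python
from xml.sax.saxutils import escape


def hdfs_site_properties(mount_points, replication=3):
    """Render <property> elements listing one data directory per mounted LUN."""
    props = {
        # Comma-separated list: HDFS spreads blocks across these directories,
        # so each LUN (ideally one physical disk) is used independently.
        "dfs.data.dir": ",".join(f"{m}/dfs/data" for m in mount_points),
        "dfs.replication": str(replication),
    }
    lines = []
    for name, value in props.items():
        lines.append("<property>")
        lines.append(f"  <name>{escape(name)}</name>")
        lines.append(f"  <value>{escape(value)}</value>")
        lines.append("</property>")
    return "\n".join(lines)


# Hypothetical mount points, one per LUN.
mounts = [f"/data/lun{i}" for i in range(1, 5)]
print(hdfs_site_properties(mounts))
```

The output is meant to be pasted inside the `<configuration>` element of `hdfs-site.xml`. Keeping `replication` at 3 reflects the point above that SAN-side mirroring protects the data but not access to it when a datanode goes down.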

Jeff

From: Pamecha, Abhishek [mailto:apamecha@x.com]
Sent: Tuesday, October 16, 2012 11:28 AM
To: user@hadoop.apache.org
Subject: HDFS using SAN

Hi

I have read scattered documentation across the net which mostly say HDFS doesn't go well with SAN being used to store data. While some say, it is an emerging trend. I would love to know if there have been any tests performed which hint on what aspects does a direct storage excels/falls behind a SAN.

We are investigating whether a direct storage option is better than a SAN storage for a modest cluster with data in 100 TBs in steady state. The SAN of course can support order of magnitude more of iops we care about for now, but given it is a shared infrastructure and we may expand our data size, it may not be an advantage in the future.

Another thing I am interested in: for MR jobs, where data locality is the key driver, how does that span out when using a SAN instead of direct storage?

And of course on the subjective topics of availability and reliability on using a SAN for data storage in HDFS, I would love to receive your views.

Thanks,
Abhishek




--
Have a Nice Day!
Lohit

Re: HDFS using SAN

Posted by lohit <lo...@gmail.com>.

Adding to this: locality is very important for MapReduce applications. One
might not see much of a difference for small MapReduce jobs running on
direct attached storage vs. SAN, but when your cluster grows or you find jobs
that are heavy on I/O, you would see quite a bit of difference. Another
obvious factor is the cost difference. The argument for SAN has been that
SAN storage is much more reliable, so you do not need the default 3-way
replication factor you would use on direct attached storage.
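
The cost argument can be made concrete with a small calculation. The Python sketch below compares raw capacity needed for the same usable data under different replication factors; the 100 TB figure is from the original question, while the replication factor of 2 and the 20% RAID overhead on the SAN side are assumptions for illustration only.

```python
def raw_capacity_tb(usable_tb, replication, raid_overhead=0.0):
    """Raw TB to provision: usable * replication / (1 - raid_overhead).

    raid_overhead is the fraction of raw capacity spent on parity or mirrors
    (0.0 means no RAID, e.g. plain disks under HDFS).
    """
    if not 0.0 <= raid_overhead < 1.0:
        raise ValueError("raid_overhead must be in [0, 1)")
    return usable_tb * replication / (1.0 - raid_overhead)


USABLE_TB = 100  # steady-state data size from the original question

# Direct attached storage with the default 3-way HDFS replication, no RAID.
das_raw = raw_capacity_tb(USABLE_TB, replication=3)

# SAN with a reduced replication factor (assumed 2) on top of RAID that
# spends an assumed 20% of raw capacity on parity.
san_raw = raw_capacity_tb(USABLE_TB, replication=2, raid_overhead=0.2)

print(f"DAS raw capacity: {das_raw:.0f} TB")
print(f"SAN raw capacity: {san_raw:.0f} TB")
```

Multiplying each result by a per-TB price for the respective storage type gives a first-order cost comparison; the price per TB is deliberately left out because the thread gives no figures for it.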

2012/10/16 Jeffrey Buell <jb...@vmware.com>

> It will be difficult to make a SAN work well for Hadoop, but not
> impossible.  I have done direct comparisons (but not published them yet).
> Direct local storage is likely to have much more capacity and more total
> bandwidth.  But you can do pretty well with a SAN if you stuff it with the
> highest-capacity disks and provide an independent 8 gb (FC) or 10 GbE
> connection for every host.  Watch out for overall SAN bandwidth limits
> (which may well be much less than the sum of the capacity of the wires
> connected to it).  There will definitely be a hard limit to how many hosts
> you connect to a single SAN.  Scaling to larger clusters will require
> multiple SANs.
>
> Locality is an issue.  Even though each host has a direct physical access
> to all the data, a “remote” access in HDFS will still have to go over the
> network to the host that owns the data.  “Local” access is fine with the
> constraints above.
>
> RAID is not good for Hadoop performance for both local and SAN storage, so
> you’ll want to configure one LUN for each physical disk in the SAN.  If you
> do have mirroring or RAID on the SAN, you may be tempted to use that to
> replace Hadoop replication.  But while the data is protected, access to the
> data is lost if the datanode goes down.  You can get around that by running
> the datanode in a VM which is stored on the SAN and using VMware HA to
> automatically restart the VM on another host in case of a failure.
> Hortonworks has demonstrated this use-case but this strategy is a bit
> bleeding-edge.
>
> Jeff
>
> From: Pamecha, Abhishek [mailto:apamecha@x.com]
> Sent: Tuesday, October 16, 2012 11:28 AM
> To: user@hadoop.apache.org
> Subject: HDFS using SAN
>
> Hi
>
> I have read scattered documentation across the net which mostly say HDFS
> doesn't go well with SAN being used to store data. While some say, it is an
> emerging trend. I would love to know if there have been any tests performed
> which hint on what aspects does a direct storage excels/falls behind a SAN.
>
> We are investigating whether a direct storage option is better than a SAN
> storage for a modest cluster with data in 100 TBs in steady state. The SAN
> of course can support order of magnitude more of iops we care about for
> now, but given it is a shared infrastructure and we may expand our data
> size, it may not be an advantage in the future.
>
> Another thing I am interested in: for MR jobs, where data locality is the
> key driver, how does that span out when using a SAN instead of direct
> storage?
>
> And of course on the subjective topics of availability and reliability on
> using a SAN for data storage in HDFS, I would love to receive your views.
>
> Thanks,
> Abhishek
>



-- 
Have a Nice Day!
Lohit

Re: HDFS using SAN

Posted by lohit <lo...@gmail.com>.

Adding to this: locality is very important for MapReduce applications. One
might not see much of a difference for small MapReduce jobs running on
direct attached storage vs. SAN, but when your cluster grows or you find jobs
that are heavy on I/O, you would see quite a bit of difference. Another
obvious factor is the cost difference. The argument for SAN has been that
SAN storage is much more reliable, so you do not need the default 3-way
replication factor you would use on direct attached storage.

2012/10/16 Jeffrey Buell <jb...@vmware.com>

> It will be difficult to make a SAN work well for Hadoop, but not
> impossible.  I have done direct comparisons (but not published them yet).
> Direct local storage is likely to have much more capacity and more total
> bandwidth.  But you can do pretty well with a SAN if you stuff it with the
> highest-capacity disks and provide an independent 8 gb (FC) or 10 GbE
> connection for every host.  Watch out for overall SAN bandwidth limits
> (which may well be much less than the sum of the capacity of the wires
> connected to it).  There will definitely be a hard limit to how many hosts
> you connect to a single SAN.  Scaling to larger clusters will require
> multiple SANs.
>
> Locality is an issue.  Even though each host has a direct physical access
> to all the data, a “remote” access in HDFS will still have to go over the
> network to the host that owns the data.  “Local” access is fine with the
> constraints above.
>
> RAID is not good for Hadoop performance for both local and SAN storage, so
> you’ll want to configure one LUN for each physical disk in the SAN.  If you
> do have mirroring or RAID on the SAN, you may be tempted to use that to
> replace Hadoop replication.  But while the data is protected, access to the
> data is lost if the datanode goes down.  You can get around that by running
> the datanode in a VM which is stored on the SAN and using VMware HA to
> automatically restart the VM on another host in case of a failure.
> Hortonworks has demonstrated this use-case but this strategy is a bit
> bleeding-edge.
>
> Jeff
>
> From: Pamecha, Abhishek [mailto:apamecha@x.com]
> Sent: Tuesday, October 16, 2012 11:28 AM
> To: user@hadoop.apache.org
> Subject: HDFS using SAN
>
> Hi
>
> I have read scattered documentation across the net which mostly say HDFS
> doesn't go well with SAN being used to store data. While some say, it is an
> emerging trend. I would love to know if there have been any tests performed
> which hint on what aspects does a direct storage excels/falls behind a SAN.
>
> We are investigating whether a direct storage option is better than a SAN
> storage for a modest cluster with data in 100 TBs in steady state. The SAN
> of course can support order of magnitude more of iops we care about for
> now, but given it is a shared infrastructure and we may expand our data
> size, it may not be an advantage in the future.
>
> Another thing I am interested in: for MR jobs, where data locality is the
> key driver, how does that span out when using a SAN instead of direct
> storage?
>
> And of course on the subjective topics of availability and reliability on
> using a SAN for data storage in HDFS, I would love to receive your views.
>
> Thanks,
> Abhishek
>



-- 
Have a Nice Day!
Lohit

RE: HDFS using SAN

Posted by Jeffrey Buell <jb...@vmware.com>.

It will be difficult to make a SAN work well for Hadoop, but not impossible. I have done direct comparisons (but not published them yet). Direct local storage is likely to have much more capacity and more total bandwidth. But you can do pretty well with a SAN if you stuff it with the highest-capacity disks and provide an independent 8 Gb FC or 10 GbE connection for every host. Watch out for overall SAN bandwidth limits, which may well be much less than the sum of the capacities of the links connected to it. There will definitely be a hard limit on how many hosts you can connect to a single SAN. Scaling to larger clusters will require multiple SANs.
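
As a back-of-the-envelope illustration of the bandwidth limit, here is a minimal sketch. It assumes every host reads at once and the array shares its throughput equally; the 40 Gb/s aggregate figure and the function name are hypothetical, not measurements.

```python
def per_host_bandwidth_gbps(hosts: int, link_gbps: float, san_aggregate_gbps: float) -> float:
    """Best-case storage bandwidth per host when every host reads at once."""
    if hosts < 1:
        raise ValueError("hosts must be at least 1")
    # Each host is capped by its own link and by an equal share of the
    # array's total throughput, whichever is smaller.
    return min(link_gbps, san_aggregate_gbps / hosts)


# Hypothetical figures: 8 Gb/s FC links and an array that sustains 40 Gb/s in total.
for hosts in (4, 8, 16, 32):
    share = per_host_bandwidth_gbps(hosts, link_gbps=8.0, san_aggregate_gbps=40.0)
    print(f"{hosts:2d} hosts: {share:.2f} Gb/s per host")
```

With these made-up numbers the links stop being the bottleneck past five hosts, and each added host after that reduces everyone's share.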

Locality is an issue. Even though each host has direct physical access to all the data, a "remote" access in HDFS still has to go over the network to the host that owns the data. "Local" access is fine, subject to the constraints above.

RAID is not good for Hadoop performance on either local or SAN storage, so you'll want to configure one LUN for each physical disk in the SAN. If you do have mirroring or RAID on the SAN, you may be tempted to use that to replace Hadoop replication. But while the data is protected, access to it is lost if the datanode goes down. You can get around that by running the datanode in a VM that is stored on the SAN and using VMware HA to automatically restart the VM on another host after a failure. Hortonworks has demonstrated this use case, but the strategy is a bit bleeding-edge.
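
One way the one-LUN-per-disk layout could be fed to the datanode is sketched below. The datanode takes a comma-separated list of directories in dfs.data.dir (dfs.datanode.data.dir in Hadoop 2) and spreads blocks across them. The mount points, the hdfs/data subdirectory, and the helper name are all hypothetical.

```python
def data_dir_value(mount_points):
    """Join one data directory per LUN mount into a comma-separated property value."""
    return ",".join(f"{mount.rstrip('/')}/hdfs/data" for mount in mount_points)


# Hypothetical mount points, one per LUN (and so one per physical disk).
mounts = [f"/mnt/lun{i}" for i in range(4)]
print(data_dir_value(mounts))
```

The printed string is what would go into the property value in hdfs-site.xml on that datanode.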

Jeff

From: Pamecha, Abhishek [mailto:apamecha@x.com]
Sent: Tuesday, October 16, 2012 11:28 AM
To: user@hadoop.apache.org
Subject: HDFS using SAN

Hi

I have read scattered documentation across the net which mostly say HDFS doesn't go well with SAN being used to store data. While some say, it is an emerging trend. I would love to know if there have been any tests performed which hint on what aspects does a direct storage excels/falls behind a SAN.

We are investigating whether a direct storage option is better than a SAN storage for a modest cluster with data in 100 TBs in steady state. The SAN of course can support order of magnitude more of iops we care about for now, but given it is a shared infrastructure and we may expand our data size, it may not be an advantage in the future.

Another thing I am interested in: for MR jobs, where data locality is the key driver, how does that span out when using a SAN instead of direct storage?

And of course on the subjective topics of availability and reliability on using a SAN for data storage in HDFS, I would love to receive your views.

Thanks,
Abhishek
