Posted to dev@cloudstack.apache.org by Chiradeep Vittal <Ch...@citrix.com> on 2013/06/11 22:06:51 UTC

Re: Hadoop cluster running in cloudstack

Taking it to dev@ to see if there is any interest.


It is a good and interesting requirement. I can see hacking 'pre-setup'
storage with tags to achieve this, but it is going to be a fragile hack.
I believe GCE also has the concept of some instance types having dedicated
spindles.


On 6/6/13 11:14 AM, "David Ortiz" <dp...@outlook.com> wrote:

>Chiradeep,
>     Currently I am working with KVM hypervisor nodes.  The use case of
>having 4 spindles and assigning one to each node is exactly what I would
>like to do.  For the moment I have all four spindles configured in a RAID
>with the cloudstack local storage pointed at it.
>Shanker,
>      I had not seen that slideshow yet, so thank you for pointing me to
>it.  As of now, the hadoop resources I am using are statically allocated
>between 4 hosts.  As it stands now, I am constrained to those resources
>without the ability to add any additional storage cluster (or additional
>storage to my current shared storage appliance), or additional nodes.
>Fortunately, my use cases don't require any kind of reallocation of the
>hadoop nodes.  It's more clients for the cluster as well as web service
>nodes that run clients that are being dynamically spun up and down.  I
>have found that I can get through my jobs alright, they just take a lot
>of extra time to run since I have the storage acting as a bottleneck
>right now.
>Thanks,
>David Ortiz
>
>> From: runseb@gmail.com
>> Subject: Re: Hadoop cluster running in cloudstack
>> Date: Thu, 6 Jun 2013 10:23:50 -0400
>> To: users@cloudstack.apache.org
>> 
>> 
>> On Jun 6, 2013, at 4:05 AM, Shanker Balan <sh...@shapeblue.com>
>>wrote:
>> 
>> > On 05-Jun-2013, at 12:13 AM, David Ortiz <dp...@outlook.com> wrote:
>> > 
>> >> Hello,
>> >>    Has anyone tried running a hadoop cluster in a cloudstack
>>environment?  I have set one up, but I am finding that I am having some
>>IO contention between slave nodes on each host since they all share one
>>local storage pool.  As I understand it, there is not currently a method
>>for using multiple local storage pools with VMs through cloudstack.  Has
>>anyone found a workaround for this by any chance?
>> > 
>> > 
>> > Hi David,
>> > 
>> > Have you seen Seb's
>>http://www.slideshare.net/sebastiengoasguen/cloudstack-and-bigdata
>>slides yet?
>> 
>> As a quick disclaimer, the various configurations I highlight in this
>>deck are a bit hand wavy and I did not test them. I just made a guess
>>about how one might want to use the baremetal functionality in
>>cloudstack. The main distinction being between using a "big data" store
>>as storage backends of cloudstack and using cloudstack to provision a
>>bigdata store on-demand.
>> 
>> -sebastien
>> 
>> > 
>> > In my experience running Hadoop (100+ nodes) on traditional servers,
>>it's going to be really hard to scale up Hadoop workloads using local
>>storage and HDFS on a cloud.
>> > 
>> > I ran out of IOPS very quickly. There was enough CPU headroom but
>>could not add more slots as disk became the bottleneck. Every time there
>>was a node/disk failure, rebalancing was a nightmare with a 3x HDFS
>>replication factor.
>> > 
>> > If I were to run Hadoop on an IaaS cloud, I would do it very similar
>>to Amazon AWS EMR - instances backed by a "Storage As A Service" layer
>>(S3) for big data instead of HDFS.
>> > 
>> > The system would work as below:
>> > 
>> > - Create a dedicated big data storage tier using a distributed
>>filesystem like Gluster/Ceph/Isilon. Most of the vendors now provide S3
>>compat connectors for Hadoop.
>> > 
>> > http://ceph.com/docs/master/cephfs/hadoop/
>> > http://gluster.org/community/documentation/index.php/Hadoop
>> > http://www.emc.com/big-data/scale-out-storage-hadoop.htm
>> > 
>> > - Hadoop instances are spun up on bare metal or on hypervisors. The
>>service offerings for "big data" instances could run on dedicated
>>hypervisors (via tags) with high bandwidth network connectivity to the
>>storage service.
>> > 
>> > - Hadoop instances use Local storage for run time data.
>> > 
>> > - Hadoop VMs connect to the storage tier via connectors for permanent
>>storage
>> > 
>> > Benefits:
>> > 
>> > - Spinning up/down VMs doesn't cause HDFS rebalancing as there is no
>>HDFS anywhere.
>> > 
>> > - Scale out VMs independently of storage. Add more spindles / nodes
>>to the storage cluster to scale out IOPS and capacity
>> > 
>> > - Easy upgrade of Hadoop releases without risk to data
>> > 
>> > Regards.
>> > @shankerbalan
>> > 
>> > -- 
>> > Shanker Balan
>> > Managing Consultant
>> > 
>> > 
>> > 
>> > M: +91 98860 60539
>> > shanker.balan@shapeblue.com | www.shapeblue.com | Twitter:@shapeblue
>> > ShapeBlue India, 22nd floor, Unit 2201A, World Trade Centre,
>>Bangalore - 560 055
>> > 
>> 
>

RE: Hadoop cluster running in cloudstack

Posted by Alex Huang <Al...@citrix.com>.

I thought about this a bit yesterday after Chiradeep talked to me.

The first fix is definitely to allow multiple local storage pools per host.  That requires some work on CloudStack, but I don't see it as a big problem.

Then a storage-pool allocator can be written such that it always allocates separate local storage pools to VMs on the same host.  That should be minimal work and can be taken on as a side project.

--Alex
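
For illustration only, here is a minimal sketch of the allocator idea above, written in Python rather than as CloudStack code. Everything in it is an assumption made for the example: the `Host` class, the `allocate_local_pool` function, the `exclusive` flag and the pool names are hypothetical and do not correspond to any existing CloudStack interface. It assumes a host already exposes several local storage pools (for example one per spindle) and hands each new VM on that host the least-used pool.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Host:
    """Hypothetical host record: its local pools and current VM placements."""
    name: str
    local_pools: List[str]  # e.g. one pool per spindle
    placements: Dict[str, str] = field(default_factory=dict)  # vm -> pool


def allocate_local_pool(host: Host, vm: str, exclusive: bool = True) -> Optional[str]:
    """Pick a local pool on `host` for `vm`, preferring the least-used pool.

    With exclusive=True a pool is never shared between VMs, so the call
    returns None once every pool on the host already holds a VM.
    """
    # Count how many VMs are currently placed on each pool.
    usage = {pool: 0 for pool in host.local_pools}
    for pool in host.placements.values():
        usage[pool] += 1

    # min() returns the first minimum, so ties go to the first declared pool.
    pool = min(host.local_pools, key=lambda p: usage[p], default=None)
    if pool is None or (exclusive and usage[pool] > 0):
        return None

    host.placements[vm] = pool
    return pool


if __name__ == "__main__":
    host = Host("kvm1", ["sda", "sdb", "sdc", "sdd"])
    for i in range(5):
        print(f"vm{i}", allocate_local_pool(host, f"vm{i}"))
```

With four pools and `exclusive=True`, the first four VMs each land on their own pool and a fifth is refused on that host; passing `exclusive=False` would instead fall back to sharing the least-loaded pool. Which behaviour is preferable depends on whether dedicated spindles are a hard requirement of the service offering.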
