Posted to dev@cloudstack.apache.org by Chiradeep Vittal <Ch...@citrix.com> on 2013/06/11 22:06:51 UTC
Re: Hadoop cluster running in cloudstack
Taking it to dev@ to see if there is any interest.
It is a good and interesting requirement. I can see hacking 'pre-setup'
storage with tags to achieve this, but it is going to be a fragile hack.
I believe GCE also has the concept of some instance types having dedicated
spindles.
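
For readers unfamiliar with the tag approach, here is a minimal sketch of how such a hack might behave. It assumes that a pool is eligible for a disk offering only when the pool carries every tag the offering asks for. The names StoragePool and eligible_pools, the pool names, and the tags are all hypothetical and are not CloudStack APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoragePool:
    """Hypothetical stand-in for a pre-setup storage pool and its tags."""
    name: str
    tags: frozenset


def eligible_pools(pools, offering_tags):
    """Return the pools that carry every tag requested by the offering.

    Assumption: tag matching is a subset check (offering tags must all
    be present on the pool).
    """
    wanted = frozenset(offering_tags)
    return [pool for pool in pools if wanted <= pool.tags]


# Hypothetical layout: one pool per spindle, each with its own tag.
pools = [
    StoragePool("host1-disk-a", frozenset({"spindle-a"})),
    StoragePool("host1-disk-b", frozenset({"spindle-b"})),
    StoragePool("host1-disk-c", frozenset({"spindle-c"})),
    StoragePool("host1-disk-d", frozenset({"spindle-d"})),
]

# An offering tagged "spindle-b" can only land on the matching pool.
matches = eligible_pools(pools, ["spindle-b"])
```

The fragility shows up in the sketch: nothing stops two VMs from being created with the same tagged offering, so keeping one VM per spindle depends on the operator choosing a different offering each time.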
On 6/6/13 11:14 AM, "David Ortiz" <dp...@outlook.com> wrote:
>Chiradeep,
> Currently I am working with KVM hypervisor nodes. The use case of
>having 4 spindles and assigning one to each node is exactly what I would
>like to do. For the moment I have all four spindles configured in a RAID
>with the cloudstack local storage pointed at it.
>Shanker,
> I had not seen that slideshow yet, so thank you for pointing me to
>it. As of now, the hadoop resources I am using are statically allocated
>between 4 hosts. As it stands now, I am constrained to those resources
>without the ability to add any additional storage cluster (or additional
>storage to my current shared storage appliance), or additional nodes.
>Fortunately, my use cases don't require any kind of reallocation of the
>hadoop nodes. It's more clients for the cluster as well as web service
>nodes that run clients that are being dynamically spun up and down. I
>have found that I can get through my jobs alright, but they take a lot
>of extra time to run since I have the storage acting as a bottleneck
>right now.
>Thanks, David Ortiz
>
>> From: runseb@gmail.com
>> Subject: Re: Hadoop cluster running in cloudstack
>> Date: Thu, 6 Jun 2013 10:23:50 -0400
>> To: users@cloudstack.apache.org
>>
>>
>> On Jun 6, 2013, at 4:05 AM, Shanker Balan <sh...@shapeblue.com>
>>wrote:
>>
>> > On 05-Jun-2013, at 12:13 AM, David Ortiz <dp...@outlook.com> wrote:
>> >
>> >> Hello,
>> >> Has anyone tried running a hadoop cluster in a cloudstack
>>environment? I have set one up, but I am finding that I am having some
>>IO contention between slave nodes on each host since they all share one
>>local storage pool. As I understand it, there is not currently a method
>>for using multiple local storage pools with VMs through cloudstack. Has
>>anyone found a workaround for this by any chance?
>> >
>> >
>> > Hi David,
>> >
>> > Have you seen Seb's
>>http://www.slideshare.net/sebastiengoasguen/cloudstack-and-bigdata
>>slides yet?
>>
>> As a quick disclaimer, the various configurations I highlight in this
>>deck are a bit hand wavy and I did not test them. I just made a guess
>>about how one might want to use the baremetal functionality in
>>cloudstack. The main distinction is between using a "big data" store
>>as a storage backend for cloudstack and using cloudstack to provision a
>>bigdata store on demand.
>>
>> -sebastien
>>
>> >
>> > In my experience running Hadoop (100+ nodes) on traditional servers,
>>it's going to be really hard to scale up Hadoop workloads using local
>>storage and HDFS on a cloud.
>> >
>> > I ran out of IOPS very quickly. There was enough CPU headroom but
>>I could not add more slots as disk became the bottleneck. Every time there
>>was a node/disk failure, rebalancing was a nightmare with a 3x HDFS
>>replication factor.
>> >
>> > If I were to run Hadoop on an IaaS cloud, I would do it very similar
>>to Amazon AWS EMR - instances backed by a "Storage As A Service" layer
>>(S3) for big data instead of HDFS.
>> >
>> > The system would work as below:
>> >
>> > - Create a dedicated big data storage tier using a distributed
>>filesystem like Gluster/Ceph/Isilon. Most of the vendors now provide S3
>>compat connectors for Hadoop.
>> >
>> > http://ceph.com/docs/master/cephfs/hadoop/
>> > http://gluster.org/community/documentation/index.php/Hadoop
>> > http://www.emc.com/big-data/scale-out-storage-hadoop.htm
>> >
>> > - Hadoop instances are spun up on bare metal or on hypervisors. The
>>service offerings for "big data" instances could run on dedicated
>>hypervisors (via tags) with high bandwidth network connectivity to the
>>storage service.
>> >
>> > - Hadoop instances use Local storage for run time data.
>> >
>> > - Hadoop VMs connect to the storage tier via connectors for permanent
>>storage
>> >
>> > Benefits:
>> >
>> > - Spinning up/down VMs doesn't cause HDFS rebalancing as there is no
>>HDFS anywhere.
>> >
>> > - Scale out VMs independently of storage. Add more spindles / nodes
>>to the storage cluster to scale out IOPS and capacity
>> >
>> > - Easy upgrade of Hadoop releases without risk to data
>> >
>> > Regards.
>> > @shankerbalan
>> >
>> > --
>> > Shanker Balan
>> > Managing Consultant
>> >
>> >
>> >
>> > M: +91 98860 60539
>> > shanker.balan@shapeblue.com | www.shapeblue.com | Twitter:@shapeblue
>> > ShapeBlue India, 22nd floor, Unit 2201A, World Trade Centre,
>>Bangalore - 560 055
>> >
>>
>
RE: Hadoop cluster running in cloudstack
Posted by Alex Huang <Al...@citrix.com>.
I thought about this a bit yesterday after Chiradeep talked to me.
The first fix is definitely to allow multiple local storage pools per host. That requires some work on cloudstack, but I don't see it as a big problem.
Then a storage-pool allocator can be written such that it always allocates separate local storage pools to vms on the same host. That should be minimal work and can be taken as a side project.
--Alex
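
As an illustration only, the allocator idea above could look roughly like the sketch below. It assumes a host exposes several local pools and that the allocator knows which pool each existing VM on that host uses. The functions pick_local_pool and place_vms and the pool and VM names are hypothetical, not part of the CloudStack code base.

```python
def pick_local_pool(host_pools, placements):
    """Pick the local pool on this host that currently backs the fewest VMs.

    host_pools: list of local pool names on the host.
    placements: dict mapping VM name -> pool name for VMs already on the host.
    """
    if not host_pools:
        raise ValueError("host has no local storage pools")
    # Count how many existing VMs sit on each pool of this host.
    load = {pool: 0 for pool in host_pools}
    for pool in placements.values():
        if pool in load:
            load[pool] += 1
    # min() returns the first least-loaded pool, so ties resolve in list order.
    return min(host_pools, key=lambda pool: load[pool])


def place_vms(host_pools, vm_names):
    """Place VMs one at a time, spreading them across the host's pools."""
    placements = {}
    for vm in vm_names:
        placements[vm] = pick_local_pool(host_pools, placements)
    return placements


# Hypothetical host with four spindles, each exposed as its own pool.
spindles = ["disk-a", "disk-b", "disk-c", "disk-d"]
layout = place_vms(spindles, ["dn1", "dn2", "dn3", "dn4"])
```

With four pools and four VMs, each VM ends up on its own spindle. A fifth VM would share the least-loaded pool rather than fail, which is one possible policy. A stricter allocator could refuse the placement instead.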