Posted to user@hive.apache.org by Adarsh Sharma <ad...@orkash.com> on 2010/12/09 11:29:25 UTC

Hadoop on Cloud or Not

Hello,

I have Eucalyptus 1.6.2 installed on Ubuntu 10.04 from source, using KVM.
Currently I have ten nodes in my cloud in a single-cluster architecture.
I have also tested Hadoop on VMs and run several jobs.

I am trying to run Hadoop in a cloud environment, so I will launch Hadoop
instances on the cloud. There is a huge amount of data on each Hadoop
node, so for now I am planning to use volumes to store the data of each
instance, i.e. each Hadoop node. But since volumes are stored on the
Storage Controllers, this means there is continuous movement of data
(many GBs) across the cloud network from the SC to the node, and the
response time of work done on the Hadoop instances will suffer because of
the time the data spends traveling over the network.
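A back-of-the-envelope calculation shows why this matters. The figures below (data size, link speed, usable-bandwidth fraction) are illustrative assumptions, not measurements from the cluster described above:

```python
# Rough estimate of the time to move a volume's worth of data from a
# Storage Controller to a node over the cluster network. All numbers
# here are assumptions chosen for illustration.

def transfer_time_seconds(data_gb, link_gbps=1.0, efficiency=0.7):
    """Seconds to move data_gb gigabytes over a link_gbps gigabit/s
    link, assuming only `efficiency` of the bandwidth is usable."""
    bits = data_gb * 8 * 1e9               # gigabytes -> bits
    usable = link_gbps * 1e9 * efficiency  # usable bits per second
    return bits / usable

# e.g. 500 GB over a shared 1 Gbit/s cluster network:
secs = transfer_time_seconds(500)
print(f"{secs / 3600:.1f} hours")  # prints "1.6 hours" under these assumptions
```

Repeating that transfer for every Hadoop instance on every job run is exactly the overhead the question is about.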

So, is it possible to store volumes (or use some other mechanism) on the
nodes themselves, so that the above problem is avoided?

Second case: I could store the data on the hard disks attached to the
nodes, and the Hadoop instances could access that data easily, but for
that I would need to start each instance on the node where its data is
stored. Is there any hack, or any other means, by which I can choose the
node on which an instance is started?
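As far as I know, Eucalyptus 1.6 has no supported flag for pinning an instance to a specific node; the closest supported placement control in euca2ools is the availability zone, which in Eucalyptus maps to a cluster, not an individual node. A sketch of that (image id, key name and zone name are placeholders):

```shell
# Sketch only: zone-level placement with euca2ools. Eucalyptus maps an
# "availability zone" to a cluster, so this does not pin to one node.
euca-describe-availability-zones          # list the zones/clusters
euca-run-instances emi-12345678 \
    -z cluster1 \
    -k mykey \
    -t m1.large
```

Anything finer-grained than the zone would have to be a scheduler-level hack on the Cluster Controller, which is the question being asked here.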

Can anyone with working experience running Hadoop in a cloud environment
give me some pointers?
I would really appreciate any help with this.

Finally, is it worthwhile to do this at all? I previously received a
response like this:

> Is it possible to run Hadoop in VMs on Production Clusters so that we
> have 10000s of nodes on 100s of servers to achieve high performance
> through Cloud Computing.

You don't achieve performance that way. You are better off with one VM per
physical host, and you will need to talk to a persistent filestore for
the data you want to retain. Running more than one VM per physical host
just creates contention for resources like disk, network and CPU that the
guest OS won't be aware of. Also, VM-to-disk performance is pretty bad
right now, though that's improving.


Thanks & Regards

Adarsh Sharma


Re: Hadoop on Cloud or Not

Posted by Kiss Tibor <ki...@gmail.com>.
The data locality of HDFS will mitigate your transfer problem, but you
still need to transfer data from persistent storage into HDFS at least
once, and you should also have a backup plan from HDFS back to persistent
storage.
I think decoupling the persistent storage from the Hadoop cluster is a
good idea, because you use the cloud to scale out or shrink back as
necessary, and that scaling is more dynamic than your data storage.
Storage scaling usually depends on the amount of data and the transfer
rate, while Hadoop cluster scaling depends on the computing power
required. If the storage remains a decoupled system, you can scale out
and shrink back efficiently.
Moreover, in the case of Amazon EC2 I would recommend keeping HDFS on
ephemeral storage (on instance-store type instances) and using S3 or a
NAS to upload your data to HDFS and download it again.
Just don't forget to plan the data transfers, to reduce the amount of
data moved between persistent storage and HDFS and vice versa.
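The bulk copy between S3 (or a NAS) and HDFS can be done with Hadoop's distcp tool; bucket names, paths and the namenode address below are placeholders:

```shell
# Sketch: bulk-load input from S3 into HDFS before the jobs run, and
# copy results back afterwards. s3n:// is Hadoop's native S3 filesystem
# scheme; all names here are placeholders.
hadoop distcp s3n://my-bucket/input hdfs://namenode:9000/user/adarsh/input
# ... run the MapReduce jobs against the HDFS copy ...
hadoop distcp hdfs://namenode:9000/user/adarsh/output s3n://my-bucket/output
```

Because distcp runs as a MapReduce job itself, the copy is parallelized across the cluster rather than funneled through one machine.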

Cheers,
Tibor
