Posted to common-user@hadoop.apache.org by Brad C <br...@gmail.com> on 2008/06/06 16:30:43 UTC

Hadoop Distributed Virtualisation

Hello Everyone,

I've been brainstorming recently, and it's always been in the back of my
mind: Hadoop offers the functionality of clustering commodity systems
together, but how would one go about virtualising them apart again?

Kind Regards

Brad :)

RE: Hadoop Distributed Virtualisation

Posted by je...@orange-ftgroup.com.

I see the VM approach as great for isolation, for customized Hadoop builds
or tools required by the jobs, and for ease of IT management.

The performance hit on CPU and IO is there, but I have never looked at the
numbers.
Has anybody?

Basically for now, on EC2 for instance, if you need to go faster, just
buy 50 more machines for a couple of hours... ;)

Re: Hadoop Distributed Virtualisation

Posted by Colin Freas <co...@gmail.com>.

The MR jobs I'm performing are not CPU intensive, so I've always assumed
that they're more IO bound.  Maybe that's an exceptional situation, but I'm
not really sure.

A good motherboard with a local IO channel per disk, feeding individual
cores, with memory partitioned up between them...  and I've heard good
things about Intel's next tock vis-a-vis internal system throughput.

And yes, this would be a task for a paravirtualization system like Xen.
Again, it's just a thought, but with low-end quad-core processors running about
$300, and the potential to cut the number of machines you need to physically
set up by 75%, I'm not sure I'd say it'd only be good for a proof of
concept.

Also, I just set up a dozen-odd boxes that are two generations behind modern
boxes, and promptly blew a fuse.  The TDP on the Xeon 3.06 GHz chips I'm
using is 89W.  The TDP on an Intel Q6600 is 65W, and it represents 4 cores.

It's a simple experiment, but I don't have the resources on hand to run it.
I'm curious if anyone has seen the performance impact from the different
setups we're talking about.  I also think you could come close to faking it
with Hadoop config changes.
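
One reading of "faking it with Hadoop config changes" is to cap how many
map and reduce tasks each TaskTracker runs at once, instead of carving the
box into VMs. The sketch below is only an illustration under assumptions:
the two property names are the per-TaskTracker slot limits found in Hadoop
releases of this era, while the helper functions and the 75/25 split
between map and reduce slots are hypothetical choices, not a recommendation.

```python
from xml.sax.saxutils import escape


def slot_properties(cores, map_share=0.75):
    """Split a node's cores into map and reduce slots.

    The split policy is a hypothetical example, not a Hadoop default.
    """
    if cores < 2:
        raise ValueError("need at least 2 cores to give each task type a slot")
    # Keep at least one slot for each task type.
    map_slots = max(1, min(cores - 1, int(cores * map_share)))
    reduce_slots = cores - map_slots
    return {
        "mapred.tasktracker.map.tasks.maximum": map_slots,
        "mapred.tasktracker.reduce.tasks.maximum": reduce_slots,
    }


def to_site_xml(props):
    """Render the properties as a Hadoop-style <configuration> fragment."""
    lines = ["<configuration>"]
    for name, value in props.items():
        lines.append("  <property>")
        lines.append("    <name>%s</name>" % escape(name))
        lines.append("    <value>%s</value>" % escape(str(value)))
        lines.append("  </property>")
    lines.append("</configuration>")
    return "\n".join(lines)


if __name__ == "__main__":
    # Example: a quad-core box treated as a single node.
    print(to_site_xml(slot_properties(cores=4)))
```

Running the same job with different slot counts on one physical box would
give a rough baseline to compare against a VM-per-core layout.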

-Colin

On Fri, Jun 6, 2008 at 12:41 PM, Edward Capriolo <ed...@gmail.com>
wrote:

> I once asked a wise man in charge of a rather large multi-datacenter
> service, "Have you ever considered virtualization?" He replied, "All
> the CPUs here are pegged at 100%."
>
> There may be applications for this type of processing. I have thought
> about systems like this from time to time. This thinking goes in
> circles. Hadoop is designed for storing and processing on different
> hardware.  Virtualization lets you split a system into sub-systems.
>
> Virtualization is great for proof of concept.
> For example, I have deployed this: I installed VMware with two Linux
> systems on my Windows host and followed a Hadoop multi-system tutorial
> running on the two VMware nodes. I was able to get the word count
> application working. I also confirmed that blocks were indeed being
> stored on both virtual systems and that processing was being shared
> via MapReduce.
>
> The processing, however, was slow. Of course, this is the fault of
> VMware. VMware has a very high emulation overhead. Xen has less
> overhead. Linux-VServer and OpenVZ use operating-system-level
> virtualization (they have very little, almost no, overhead). Regardless
> of how much overhead, overhead is overhead. Personally, I find that VMware
> falls short of its promises.
>

Re: Hadoop Distributed Virtualisation

Posted by Edward Capriolo <ed...@gmail.com>.

There is a place for virtualization: if you can justify the overhead
for hands-free network management, if your systems really have enough
power to run Hadoop and something else, or if you need to run
multiple truly isolated versions of Hadoop (selling some type of
Hadoop grid services?).

If you are working on the type of application that requires a system
like Hadoop, the implication is that you have a large network of
physical hardware and you need all the power you can get. It is
counterintuitive to start splitting up hosts into virtual machines.
However, every application is different, and it may make sense.

Re: Hadoop Distributed Virtualisation

Posted by Edward Capriolo <ed...@gmail.com>.

I once asked a wise man in charge of a rather large multi-datacenter
service, "Have you ever considered virtualization?" He replied, "All
the CPUs here are pegged at 100%."

There may be applications for this type of processing. I have thought
about systems like this from time to time. This thinking goes in
circles. Hadoop is designed for storing and processing on different
hardware.  Virtualization lets you split a system into sub-systems.

Virtualization is great for proof of concept.
For example, I have deployed this: I installed VMware with two Linux
systems on my Windows host and followed a Hadoop multi-system tutorial
running on the two VMware nodes. I was able to get the word count
application working. I also confirmed that blocks were indeed being
stored on both virtual systems and that processing was being shared
via MapReduce.

The processing, however, was slow. Of course, this is the fault of
VMware. VMware has a very high emulation overhead. Xen has less
overhead. Linux-VServer and OpenVZ use operating-system-level
virtualization (they have very little, almost no, overhead). Regardless
of how much overhead, overhead is overhead. Personally, I find that VMware
falls short of its promises.

Re: Hadoop Distributed Virtualisation

Posted by Brad C <br...@gmail.com>.

Hi Colin,

I think this would work as a lab setup for testing how Hadoop handles
hardware failures, but I was thinking of the opposite on a much larger
scale.

It is thought by many IT people that it's smart to take one high-end
system and split it into multiple machines, though few people think
of taking multiple machines, integrating them into one, and then
virtualising them apart again. This has been done, though the end data
storage has almost always resided on a SAN, which has a cost
implication. I know Hadoop's Distributed Filesystem is geared towards
batch processing and larger data sets with higher latency, but has
anyone ever tried integrating it with a virtual server tech like KVM?
The other possibility is that I'm nuts and have completely lost the plot.

Kind Regards

Brad

On Fri, Jun 6, 2008 at 5:03 PM, Colin Freas <co...@gmail.com> wrote:
> I've wondered about this using single or dual quad-core machines with one
> spindle per core, and partitioning them out into 2, 4, 8, whatever virtual
> machines, possibly marking each physical box as a "rack".
>
> There would be some initial and ongoing sysadmin costs.  But could this
> increase throughput on a small cluster (2 or 3 boxes with 16 or 24 cores)
> running many jobs, by limiting the number of cores each job runs on to, say, 8?
> Has anyone tried such a setup?
>
>
> On Fri, Jun 6, 2008 at 10:30 AM, Brad C <br...@gmail.com> wrote:
>
>> Hello Everyone,
>>
>> I've been brainstorming recently, and it's always been in the back of my
>> mind: Hadoop offers the functionality of clustering commodity systems
>> together, but how would one go about virtualising them apart again?
>>
>> Kind Regards
>>
>> Brad :)
>>
>

Re: Hadoop Distributed Virtualisation

Posted by Steve Loughran <st...@apache.org>.

Colin Freas wrote:
> I've wondered about this using single or dual quad-core machines with one
> spindle per core, and partitioning them out into 2, 4, 8, whatever virtual
> machines, possibly marking each physical box as a "rack".

Or just host VM images with multi-CPU support and make it one 'machine'
with the appropriate number of task trackers.

> 
> There would be some initial and ongoing sysadmin costs.  But could this
> increase throughput on a small cluster (2 or 3 boxes with 16 or 24 cores)
> running many jobs, by limiting the number of cores each job runs on to, say, 8?
> Has anyone tried such a setup?
> 

I can see reasons for virtualisation (better isolation and security) but
don't think performance would improve. More likely, anything
clock-sensitive will get confused if, under load, some VMs get less CPU
time than they expect.

A better approach could be to improve the schedulers in hadoop to put 
work where it is best, maybe even move it if things are taking too long,
etc.
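
As a rough illustration of the "move it if things are taking too long"
idea, the sketch below flags tasks whose progress rate falls well below
the median, so that a scheduler could rerun them elsewhere. This is a
hypothetical toy, not how Hadoop's scheduler is implemented; the function
name, the input format and the 0.5 threshold are all assumptions made for
the example.

```python
from statistics import median


def find_stragglers(progress_rates, threshold=0.5):
    """Return ids of tasks running slower than threshold * median rate.

    progress_rates maps a task id to its progress per unit time
    (a hypothetical metric chosen for this sketch).
    """
    if not progress_rates:
        return []
    cutoff = threshold * median(progress_rates.values())
    return sorted(task for task, rate in progress_rates.items() if rate < cutoff)


if __name__ == "__main__":
    rates = {"task_1": 1.0, "task_2": 0.9, "task_3": 0.2, "task_4": 1.1}
    # Candidates for re-execution on another node.
    print(find_stragglers(rates))
```

Comparing against the median rather than a fixed time limit keeps the
check meaningful when every task is slow for the same reason, such as a
large input.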

Re: Hadoop Distributed Virtualisation

Posted by Colin Freas <co...@gmail.com>.

I've wondered about this using single or dual quad-core machines with one
spindle per core, and partitioning them out into 2, 4, 8, whatever virtual
machines, possibly marking each physical box as a "rack".

There would be some initial and ongoing sysadmin costs.  But could this
increase throughput on a small cluster (2 or 3 boxes with 16 or 24 cores)
running many jobs, by limiting the number of cores each job runs on to, say, 8?
Has anyone tried such a setup?
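
Marking each physical box as a "rack" could be sketched with a rack
topology script, assuming a Hadoop release that supports script-based rack
mapping (configured through the topology.script.file.name property in
releases of this era). Hadoop calls the script with node addresses as
arguments and reads one rack path per argument from standard output. The
VM names and box names below are hypothetical, and Hadoop may pass IP
addresses rather than hostnames, so a real table would need to list
whichever form the cluster uses.

```python
#!/usr/bin/env python
import sys

# Hypothetical mapping from each VM to the physical box that hosts it.
VM_TO_BOX = {
    "vm01": "box-a",
    "vm02": "box-a",
    "vm03": "box-b",
    "vm04": "box-b",
}

# Rack path Hadoop uses for nodes with no known location.
DEFAULT_RACK = "/default-rack"


def rack_for(node):
    """Return the rack path for a node: one 'rack' per physical box."""
    box = VM_TO_BOX.get(node)
    return "/" + box if box else DEFAULT_RACK


def main(argv):
    # One rack path per argument, in the same order, separated by spaces.
    print(" ".join(rack_for(node) for node in argv))


if __name__ == "__main__":
    main(sys.argv[1:])
```

With this mapping, HDFS replica placement would treat VMs on the same box
as sharing a rack, so copies of a block would be spread across physical
machines rather than only across VMs on one machine.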

On Fri, Jun 6, 2008 at 10:30 AM, Brad C <br...@gmail.com> wrote:

> Hello Everyone,
>
> I've been brainstorming recently, and it's always been in the back of my
> mind: Hadoop offers the functionality of clustering commodity systems
> together, but how would one go about virtualising them apart again?
>
> Kind Regards
>
> Brad :)
>