Posted to common-user@hadoop.apache.org by Stephen Watt <sw...@us.ibm.com> on 2010/02/08 20:58:18 UTC

Hadoop on a Virtualized O/S vs. the Real O/S

Hi Folks

I need to be able to certify that Hadoop works on various operating 
systems. I do this by running it through a series of tests. As 
I'm sure you can empathize, obtaining all the machines for each test run 
can sometimes be tricky. It would be easier for me if I could spin up 
several instances of a virtual image of the desired O/S, but to do this, I 
need to know whether I'm running any risks by using that approach.

Is there any reason why Hadoop might work differently on a virtual O/S as 
opposed to running on an actual O/S? Since just about everything is done 
through the JVM and SSH, I don't foresee any issues, and I don't believe 
we're doing anything weird with device drivers or have any kernel module 
dependencies.

Kind regards
Steve Watt

RE: Hadoop on a Virtualized O/S vs. the Real O/S

Posted by Bill Habermaas <bi...@habermaas.us>.
In my shop we also did certification on different operating platforms. This
was done on virtualized machines for all the Linux variants.  We ran the
Apache Hadoop unit tests in each environment and then checked the results.
Overall Hadoop runs well, but some of the more bizarre unit tests
will react strangely. 

You will likely see the same issues as we did...

1. Some networking APIs behave slightly differently between Linux and
Solaris/AIX environments. 
2. Windows will encounter many failed tests under Cygwin and not in a
consistent manner.  Sometimes a test will work and other times it won't.  I
suspect this is because Cygwin is not a perfect emulation and race conditions
cause different reactions - depending on the phase of the moon. Oh well,
Windows is not for production anyway <shrug>

Bill

Re: Hadoop on a Virtualized O/S vs. the Real O/S

Posted by Steve Loughran <st...@apache.org>.
I run Hadoop on VMs. Some observations:

- performance can be below raw IO rates, but that's predictable
- if you bring up a private network then you have DNS/rDNS problems. 
Hadoop is happy if every machine knows its own hostname and DNS agrees. 
Otherwise: edit the hosts tables
- the big enemies on VMs are unexpected swapping out and clock drift, 
which screw up anything that assumes time moves forward at roughly the same 
rate everywhere. ZooKeeper assumes this, as do most distributed 
co-ordination systems. If you keep VM load low, with one virtual CPU per 
physical one, and don't overallocate physical memory, most of these 
problems go away
- set the CPU affinity for the VM so it is always bound to the same CPU, 
using taskset or the equivalent. This minimises cache misses and other 
problems
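
A minimal sketch of the DNS/rDNS check from the list above, assuming Python 3 
is available on each VM. The function name check_dns and its injectable 
lookup parameters are illustrative and not part of Hadoop; the idea is only 
to confirm that a hostname resolves to an address and that the address 
resolves back to the same name:

```python
import socket


def check_dns(hostname, forward=socket.gethostbyname, reverse=socket.gethostbyaddr):
    """Return (ip, reverse_name, consistent) for one hostname.

    consistent is True only when the reverse lookup of the forward
    address gives back the same name, either as the primary name or
    as one of its aliases. The lookup functions are parameters so the
    check can be exercised without a live resolver.
    """
    try:
        ip = forward(hostname)
    except OSError:
        # Forward lookup failed: the name is unknown to DNS/hosts.
        return (None, None, False)
    try:
        primary, aliases, _addresses = reverse(ip)
    except OSError:
        # Reverse lookup failed: no PTR record or hosts entry.
        return (ip, None, False)
    names = {primary.lower()} | {alias.lower() for alias in aliases}
    return (ip, primary, hostname.lower() in names)
```

Calling check_dns(socket.getfqdn()) on each VM should show whether the hosts 
tables need editing: a False in the last position means the forward and 
reverse lookups disagree for that machine.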