Posted to common-user@hadoop.apache.org by Rich Haase <rd...@gmail.com> on 2014/06/26 00:50:09 UTC
MapReduce Streaming on Solaris

Hi all,

I have a 20-node cluster that is running on Solaris x86 (OpenIndiana).  I'm
not really familiar with OpenIndiana having moved from Solaris to Linux
many years ago, but it's the OS of choice for the systems administrator at
my company.

Each worker has 24 x 700 GB drives, 24 cores and 96 GB of memory.

All hadoop daemons are running with 2g of memory except for the jobtracker,
which has 4g.   I am only running 10 map and 8 reduce slots per cluster
node to leave lots of memory free for streaming jobs.

Yesterday I was running some performance tests after noticing that all of
my team's streaming jobs were running appallingly slowly.  My test dataset
is a set of key-value pairs with a heavy skew towards one key.

My test program is word count implemented in each of the test languages.
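
For anyone who wants to reproduce this, a streaming word count looks
roughly like the sketch below. This is a minimal illustration, not the
actual test program; the file name wordcount.py and the function names
are hypothetical, and it assumes tab-separated key/count records with
reducer input already sorted by key.

```python
#!/usr/bin/env python
"""Word count for Hadoop Streaming (illustrative sketch, hypothetical names).

Run as `wordcount.py map` or `wordcount.py reduce`, reading from stdin.
"""
import sys
from itertools import groupby


def map_lines(lines):
    """Emit one 'word<TAB>1' record per whitespace-separated token."""
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word


def reduce_lines(lines):
    """Sum counts per key. Assumes the input is already sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield "%s\t%d" % (key, sum(int(count) for _, count in group))


# Only act when exactly one argument (map or reduce) is given.
if __name__ == "__main__" and len(sys.argv) == 2:
    step = {"map": map_lines, "reduce": reduce_lines}[sys.argv[1]]
    for record in step(sys.stdin):
        print(record)
```

It can be checked locally with
`cat sample.txt | python wordcount.py map | sort | python wordcount.py reduce`
and handed to the streaming jar through its -mapper and -reducer options.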

Here were the results of my test:

Language               OS       Dataset                                            Size (GB)  Runtime
/usr/bin/python        Solaris  41GB Counts of unique ids for a day in raw text    41         0:46:08
/opt/local/bin/python  Solaris  41GB Counts of unique ids for a day in raw text    41         0:39:02
/usr/local/bin/python  Solaris  41GB Counts of unique ids for a day in raw text    41         0:35:31
ruby                   Solaris  41GB Counts of unique ids for a day in raw text    41         0:37:31
perl                   Solaris  41GB Counts of unique ids for a day in raw text    41         0:17:38
pig                    Solaris  41GB Counts of unique ids for a day in raw text    41         0:08:35
java                   Solaris  41GB Counts of unique ids for a day in raw text    41         0:03:44
python                 OS X     4.9 GB Counts of unique ids for a day in raw text  4.9        0:09:07
ruby                   OS X     4.9 GB Counts of unique ids for a day in raw text  4.9        0:08:21
perl                   OS X     4.9 GB Counts of unique ids for a day in raw text  4.9        0:06:41
java                   OS X     4.9 GB Counts of unique ids for a day in raw text  4.9        0:04:03

As you can see, the runtime deltas between streaming and Java M/R were far
worse than is generally quoted in the community, and definitely worse than
I have experienced on similar clusters running on Linux.  The performance
of the pseudo-distributed cluster on my MacBook is much closer to my
expectations for streaming.
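
To put numbers on those deltas, here is a small sketch that converts the
runtimes from the results above into slowdown factors relative to the
Java job on the same platform. The helper names are hypothetical. The two
platforms used different dataset sizes and cluster setups, so only the
ratios within a platform are meant to be compared, and only roughly.

```python
def seconds(hms):
    """Convert an 'h:mm:ss' string to a number of seconds."""
    h, m, s = (int(part) for part in hms.split(":"))
    return h * 3600 + m * 60 + s


def slowdown(runtimes, baseline="java"):
    """Return each runtime as a multiple of the baseline job's runtime."""
    base = seconds(runtimes[baseline])
    return {name: round(seconds(t) / base, 1) for name, t in runtimes.items()}


# Runtimes copied from the results above.
solaris = {
    "/usr/bin/python": "0:46:08",
    "/opt/local/bin/python": "0:39:02",
    "/usr/local/bin/python": "0:35:31",
    "ruby": "0:37:31",
    "perl": "0:17:38",
    "pig": "0:08:35",
    "java": "0:03:44",
}
osx = {
    "python": "0:09:07",
    "ruby": "0:08:21",
    "perl": "0:06:41",
    "java": "0:04:03",
}

for platform, runtimes in (("Solaris", solaris), ("OS X", osx)):
    for name, factor in sorted(slowdown(runtimes).items(), key=lambda kv: -kv[1]):
        print("%-8s %-22s %5.1fx" % (platform, name, factor))
```

On Solaris the Python and Ruby jobs come out at roughly 9.5x to 12.4x the
Java runtime, against roughly 2.1x to 2.3x on OS X.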

My question to all of you is: has anyone run into performance problems with
streaming on Solaris?  If so, is there a remedy, other than running Hadoop
on Linux?

Cheers,

Rich


-- 
*Kernighan's Law*
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are, by
definition, not smart enough to debug it."