Posted to dev@mahout.apache.org by Varun Gupta <th...@gmail.com> on 2009/05/13 14:49:32 UTC

Hadoop should target C++/LLVM not Java

http://www.trendcaller.com/2009/05/hadoop-should-target-cllvm-not-java.html

What's your opinion on this, people?
--
Varun Gupta

Re: Hadoop should target C++/LLVM not Java

Posted by Ted Dunning <te...@gmail.com>.
As an example and reminder for those of us who haven't worked in C++ for a
while, take this bug from Hypertable:

    Defect | Accepted | Medium | ---- | nuggetwheat
    "Rangeserver crashes if system clock is forwarded."

The rangeserver *crashes* if you change the clock?!?!
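
To see why that class of bug is so easy to hit, and why it is cheap to
avoid on a JVM, here is a minimal Java sketch (hypothetical lease code,
invented for this email, not Hypertable's actual implementation). A
deadline computed from the wall clock "expires" the moment an
administrator or NTP steps the clock forward; a deadline computed from
the monotonic clock does not:

    // Minimal sketch of the bug class; not Hypertable's actual code.
    public class LeaseClockDemo {
        public static void main(String[] args) {
            // Fragile: wall-clock deadline. If the system clock is stepped
            // forward an hour, a 10-second lease looks expired immediately,
            // and code that asserts "a live lease cannot be expired" dies.
            long wallDeadline = System.currentTimeMillis() + 10000L;
            boolean expiredByWallClock = System.currentTimeMillis() > wallDeadline;

            // Robust: monotonic clock (System.nanoTime), unaffected by
            // clock steps. Compare via subtraction to survive overflow.
            long monoDeadline = System.nanoTime() + 10000000000L;
            boolean expiredByMonoClock = System.nanoTime() - monoDeadline > 0;

            System.out.println("wall clock says expired: " + expiredByWallClock);
            System.out.println("monotonic clock says expired: " + expiredByMonoClock);
        }
    }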

On Wed, May 13, 2009 at 10:40 AM, Ted Dunning <te...@gmail.com> wrote:

>
> The conclusion as stated was "It is just almost always worthwhile...".  I
> think we both agree that by now the conclusion may be "There still
> exist a few instances where it is worthwhile...".  The question is when.
>
> My take on the issue is that Hadoop would be completely moribund if it had
> been developed in C++ because it would have been non-portable and would now
> be stuck in a morass of segmentation faults.  Not to mention that using C++ would
> have meant that Hadoop would have had to make do without Doug C.  Java's
> virtues in terms of safety are particularly valuable in a community
> project.  Conversely, C++'s defects are particularly egregious and dangerous
> in the same setting.
>
>
> On Wed, May 13, 2009 at 8:49 AM, Sean Owen <sr...@gmail.com> wrote:
>
>> Er, isn't it a right fact, and a conclusion that was really right then
>> and remains a little right now? It holds for the same reason, indeed.
>>
>> On Wed, May 13, 2009 at 4:01 PM, Ted Dunning <te...@gmail.com>
>> wrote:
>> > Right fact (Google based their MapReduce on C++), wrong conclusion.
>> >
>> > A simpler motivating factor was simply the timing of when Google did
>> > it.  In 2001 or so, Java was definitely much less competitive.
>> >
>> > On Wed, May 13, 2009 at 6:18 AM, Sean Owen <sr...@gmail.com> wrote:
>> >
>> >> For reference, of course, Google operates at such a scale that they
>> >> use a C++-based MapReduce framework. It is just almost always
>> >> worthwhile to spend the time to beat Java performance.
>> >>
>> >
>>
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve

Re: Hadoop should target C++/LLVM not Java

Posted by Ted Dunning <te...@gmail.com>.
The conclusion as stated was "It is just almost always worthwhile...".  I
think we both agree that by now the conclusion may be "There still
exist a few instances where it is worthwhile...".  The question is when.

My take on the issue is that Hadoop would be completely moribund if it had
been developed in C++ because it would have been non-portable and would now
be stuck in a morass of segmentation faults.  Not to mention that using C++ would
have meant that Hadoop would have had to make do without Doug C.  Java's
virtues in terms of safety are particularly valuable in a community
project.  Conversely, C++'s defects are particularly egregious and dangerous
in the same setting.

On Wed, May 13, 2009 at 8:49 AM, Sean Owen <sr...@gmail.com> wrote:

> Er, isn't it a right fact, and a conclusion that was really right then
> and remains a little right now? It holds for the same reason, indeed.
>
> On Wed, May 13, 2009 at 4:01 PM, Ted Dunning <te...@gmail.com>
> wrote:
> > Right fact (Google based their MapReduce on C++), wrong conclusion.
> >
> > A simpler motivating factor was simply the timing of when Google did
> > it.  In 2001 or so, Java was definitely much less competitive.
> >
> > On Wed, May 13, 2009 at 6:18 AM, Sean Owen <sr...@gmail.com> wrote:
> >
> >> For reference, of course, Google operates at such a scale that they
> >> use a C++-based MapReduce framework. It is just almost always
> >> worthwhile to spend the time to beat Java performance.
> >>
> >
>



-- 
Ted Dunning, CTO
DeepDyve

111 West Evelyn Ave. Ste. 202
Sunnyvale, CA 94086
www.deepdyve.com
858-414-0013 (m)
408-773-0220 (fax)

Re: Hadoop should target C++/LLVM not Java

Posted by Sean Owen <sr...@gmail.com>.
Er, isn't it a right fact, and a conclusion that was really right then
and remains a little right now? It holds for the same reason, indeed.

On Wed, May 13, 2009 at 4:01 PM, Ted Dunning <te...@gmail.com> wrote:
> Right fact (Google based their MapReduce on C++), wrong conclusion.
>
> A simpler motivating factor was simply the timing of when Google did
> it.  In 2001 or so, Java was definitely much less competitive.
>
> On Wed, May 13, 2009 at 6:18 AM, Sean Owen <sr...@gmail.com> wrote:
>
>> For reference, of course, Google operates at such a scale that they
>> use a C++-based MapReduce framework. It is just almost always
>> worthwhile to spend the time to beat Java performance.
>>
>

Re: Hadoop should target C++/LLVM not Java

Posted by Ted Dunning <te...@gmail.com>.
Right fact (Google based their MapReduce on C++), wrong conclusion.

A simpler motivating factor was simply the timing of when Google did it.
In 2001 or so, Java was definitely much less competitive.

On Wed, May 13, 2009 at 6:18 AM, Sean Owen <sr...@gmail.com> wrote:

> For reference, of course, Google operates at such a scale that they
> use a C++-based MapReduce framework. It is just almost always
> worthwhile to spend the time to beat Java performance.
>

Re: Hadoop should target C++/LLVM not Java

Posted by Sean Owen <sr...@gmail.com>.
The difference in power consumption between a fully loaded machine and
an idle one isn't so large (the figure 50% sticks in my head?), and the
difference between a fully loaded and a half-loaded machine is quite
small. That is, once the hard disk is spinning, the processor is at
full speed, and all the memory is powered, using all of the machine
rather than most of it is not a big deal. Power consumption drops
substantially only when you are really idle.

I don't have numbers at my fingertips to back this up, though my
intuitions are informed by figures I've seen in the past. Real numbers
are what one would need to evaluate this argument, and I have a
different intuition about how much this could matter.

The main argument here seems to be, basically, that Java competes well
on wall-time performance by using more parallelism and more memory.
Maybe; that's an interesting question. Is LLVM going to be more
efficient than Java? Unclear, since both carry some overhead, I
suppose. But again, an interesting question.


But, the topic really does matter. Wasting time means wasting energy,
and when we get to distributed cluster scale, it matters to the
environment. At Google they do a good job of keeping teams really
clear about how much their operations are costing -- it is staggering
sometimes. Developers who might run a big job, oops, see it fail,
start it up again, oops, wrong argument again... might think twice
when they realize how many pounds of CO2 their mistake just pumped into
the atmosphere.

(Mahout folks will now appreciate why I have been messing with the
code all over to try to micro-optimize for performance. I think there
is still not enough attention given to efficiency, but hey, it's at
0.1.)
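
To make the kind of micro-optimization I mean concrete, here is an
illustrative Java sketch (invented for this email, not actual Taste
code): swapping a boxed collection for a primitive array removes an
object allocation and an unboxing from every pass through an inner
loop, which is exactly the kind of saving that compounds at scale.

    // Illustrative only; not actual Mahout/Taste code.
    import java.util.Arrays;
    import java.util.List;

    public class DotProducts {
        // Boxed: every get() unboxes a heap-allocated Double.
        static double dotBoxed(List<Double> x, List<Double> y) {
            double sum = 0.0;
            for (int i = 0; i < x.size(); i++) {
                sum += x.get(i) * y.get(i);
            }
            return sum;
        }

        // Primitive: no allocation or unboxing in the inner loop.
        static double dotPrimitive(double[] x, double[] y) {
            double sum = 0.0;
            for (int i = 0; i < x.length; i++) {
                sum += x[i] * y[i];
            }
            return sum;
        }

        public static void main(String[] args) {
            System.out.println(dotBoxed(Arrays.asList(1.0, 2.0), Arrays.asList(3.0, 4.0)));
            System.out.println(dotPrimitive(new double[] {1.0, 2.0}, new double[] {3.0, 4.0}));
        }
    }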


And, I think I agree with the conclusion of the blog post for a
different reason:

The Java/C++ performance gap for most apps is pretty negligible these
days. Why? I actually think that, given a fixed amount of *developer*
time, one can make a faster Java app than C++ app. Why? I can develop
faster, against a larger and more stable collection of libraries, and
spend less time debugging, which leaves more time to optimize the
result.

But that does hit a certain plateau. Given enough developer time, I
can get native code to run faster than even JITted Java. I myself am
hard-pressed to optimize my code (Mahout - Taste) further in Java
without drastic measures.

It may take a lot of time to actually beat Java performance in C++,
but as the scale of your operations grows, so does the return on that
1% improvement you eke out: 1% of a job that burns 10,000 machine-hours
is 100 machine-hours saved on every run. And of course -- when we talk
about code headed for Hadoop, we are definitely talking about
large-scale operations.

For reference, of course, Google operates at such a scale that they
use a C++-based MapReduce framework. It is just almost always
worthwhile to spend the time to beat Java performance.

This isn't going to be true of all users of distributed computing
frameworks, so it's not inherently wrong that Hadoop is in Java, but
I did find myself saying "hmm, Java?" the first time I heard of
Hadoop.


But isn't this what this whole Hadoop Streaming business is about:
letting you farm out the computation itself to whatever native process
you like, and just using Hadoop for the management? Because that, of
course, is fine.
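
For anyone who hasn't tried it, the canonical invocation looks roughly
like this (the exact path to the streaming jar depends on your Hadoop
install, and the input/output directories are placeholders):

    hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
        -input myInputDir \
        -output myOutputDir \
        -mapper /bin/cat \
        -reducer /usr/bin/wc

The mapper and reducer here are ordinary native executables; Hadoop
just handles the splitting, shuffling, and restarts around them.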


On Wed, May 13, 2009 at 1:49 PM, Varun Gupta <th...@gmail.com> wrote:
> http://www.trendcaller.com/2009/05/hadoop-should-target-cllvm-not-java.html
>
> What's your opinion on this, people?
> --
> Varun Gupta
>

Re: Hadoop should target C++/LLVM not Java

Posted by Robert Burrell Donkin <ro...@gmail.com>.
On Wed, May 13, 2009 at 1:49 PM, Varun Gupta <th...@gmail.com> wrote:
> http://www.trendcaller.com/2009/05/hadoop-should-target-cllvm-not-java.html
>
> What's your opinion on this, people?

letting inexperienced developers loose in C++ is like giving a loaded
gun to an infant: NullPointerExceptions are annoying but at least they
don't crash the server
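
to make that concrete, here's a minimal java sketch (a made-up server
loop, nothing to do with hadoop's actual code): a null dereference
surfaces as a catchable exception, so one bad request doesn't take the
whole process down the way a wild pointer does in C++

    // Minimal sketch, not Hadoop code: a null dereference surfaces as a
    // catchable NullPointerException, so the loop keeps serving instead
    // of dying the way a C++ process would on a segfault.
    public class RobustLoop {
        static String handle(String request) {
            return request.trim(); // throws NullPointerException on null
        }

        public static void main(String[] args) {
            String[] requests = { "good request ", null, "another good one" };
            for (String r : requests) {
                try {
                    System.out.println("handled: " + handle(r));
                } catch (NullPointerException e) {
                    // annoying, but recoverable: log it and keep serving
                    System.out.println("bad request skipped: " + e);
                }
            }
        }
    }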

one of the interesting developments i expect to see in the next few
years is specialist VMs (based on either Harmony or an OpenJDK fork)
which will run Java bytecode but which ship with additional libraries
backed by high-performance native code. this would feel much more like
Python's easy system interfaces.
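
the plumbing for that already exists via JNI, of course. a hypothetical
sketch of the java face of such a native-backed library (the library
name and method are invented for illustration):

    // Hypothetical sketch: the Java side of a native-backed math library.
    // The heavy lifting would live in libfastblas (a made-up native
    // library); Java callers just see an ordinary static method.
    public final class NativeBlas {
        static {
            // loads libfastblas.so / fastblas.dll from java.library.path
            System.loadLibrary("fastblas");
        }

        private NativeBlas() {}

        // implemented in C/C++ on the far side of the JNI boundary
        public static native double dot(double[] x, double[] y);
    }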

- robert

It's Java, Jim,
but not as we know it,
not as we know it,
not as we know it;

(with apologies to The Firm)