Posted to mapreduce-issues@hadoop.apache.org by "Arun C Murthy (JIRA)" <ji...@apache.org> on 2014/06/24 10:31:29 UTC

[jira] [Comment Edited] (MAPREDUCE-2841) Task level native optimization

    [ https://issues.apache.org/jira/browse/MAPREDUCE-2841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14041816#comment-14041816 ] 

Arun C Murthy edited comment on MAPREDUCE-2841 at 6/24/14 8:30 AM:
-------------------------------------------------------------------

Todd,

bq. I agree that building a completely parallel C++ MR runtime is a much larger project that should not be part of Hadoop. 

I'm confused. There is already a large amount of code on GitHub for the full task runtime. Is that abandoned? Are you saying there is no intention to ever contribute it to Hadoop? Why would that be? Would it be a separate project?

With or without a stable ABI, C++ is still a major problem w.r.t. the different compiler versions, the different platforms we support, etc. That is precisely why HADOOP-10388 chose to use pure C only. A similar switch would make me *much* more comfortable, quite apart from the disparity in C++ skills in the Hadoop community.

Furthermore, considerably more security issues open up in C++ land, such as buffer overflows.

----

bq. I think the 75k you're counting may include the auto-generated shell scripts.

From the GitHub repository:

{noformat}
$ find . -name '*.java' | xargs wc -l
   11988 total
$ find . -name '*.h' | xargs wc -l
   27269 total
$ find . -name '*.cc' | xargs wc -l
   26276 total
{noformat}
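As an aside, the unquoted globs in the transcript above only work because no matching files happen to sit in the current directory, and on very long file lists xargs batches its arguments, so wc -l can emit more than one "total" line. A more robust variant of the same count is sketched below, run against a hypothetical throwaway directory (the paths and file contents are purely illustrative, not the actual repository):

```shell
# Build a small throwaway tree so the pipeline below has something to count.
# The directory layout and line counts here are made up for illustration.
demo=$(mktemp -d)
mkdir -p "$demo/src"
printf 'a\nb\nc\n' > "$demo/src/One.java"
printf 'x\ny\n'    > "$demo/src/Two Words.java"   # space in name: an unquoted glob pipeline would miscount this

# Quoted glob plus -print0/-0 handles odd file names; awk sums the per-file
# counts itself, so the result stays correct even if xargs runs wc in batches
# and produces several intermediate "total" lines.
total=$(find "$demo" -name '*.java' -print0 | xargs -0 wc -l \
        | awk '$2 != "total" { n += $1 } END { print n }')
echo "$total"

rm -rf "$demo"
```

The same pattern applies unchanged to the '*.h' and '*.cc' counts.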

Whether it's test or non-test code, we are still importing a *lot* of code - code which the Hadoop community would then need to maintain.

----

bq. So, it's not a tiny import by any means, but for 2x improvement on terasort wallclock, my opinion is that the maintenance burden is worth it.

Todd, as we both know, there are many, many ways to get a 2x improvement on terasort... and it isn't worth much in the real world, outside of benchmarks.

I'm sure we both would take 2x on Pig/Hive any day... *smile*

----

bq. As for importing to Tez, I don't think the community has generally agreed to EOL MapReduce

Regardless of whether or not we pull this into MR, it would be useful to pull it into Tez too - if Sean wants to do that, let's not discourage him.

I'm sure we both agree on this, and want to see real-world workloads improve - and Hive/Pig/Cascading etc. represent those workloads.

IAC, hopefully we can stop this meme that I'm trying to *preclude* you from doing anything, regardless of my beliefs. IAC, we both realize MR is reasonably stable and won't get a lot of investment, and so do our employers:
http://vision.cloudera.com/mapreduce-spark/
http://hortonworks.com/hadoop/tez/

Essentially, you asked for feedback from the MapReduce community, and this is my honest feedback - as someone who has actively helped maintain this codebase for more than 8 years now. So, I'd appreciate it if we don't misinterpret each other's technical opinions and concerns during this discussion. Thanks in advance.

FTR: I'll restate my concerns about C++, the roadmap for the C++ runtime, maintainability, and support for all of Hadoop (new security bugs, future security features, platforms, etc.).

Furthermore, this jira was opened nearly 3 years ago and has seen only sporadic bursts of activity - not a good sign for long-term maintainability.

I've stated my concerns; let's try to get through them by focusing on those aspects.

----

Finally, what concern do you see with starting this as an incubator project and allowing folks to develop a community around it? We can certainly help on our end by making it easy for them to plug in via interfaces etc.

Thanks.


was (Author: acmurthy):
Todd,

bq. I agree that building a completely parallel C++ MR runtime is a much larger project that should not be part of Hadoop. 

I'm confused. There is already a large amount of code on GitHub for the full task runtime. Is that abandoned? Are you saying there is no intention to ever contribute it to Hadoop? Why would that be? Would it be a separate project?

With or without a stable ABI, C++ is still a major problem w.r.t. the different compiler versions, the different platforms we support, etc. That is precisely why HADOOP-10388 chose to use pure C only. A similar switch would make me *much* more comfortable, quite apart from the disparity in C++ skills in the Hadoop community.

Furthermore, considerably more security issues open up in C++ land, such as buffer overflows.

----

bq. I think the 75k you're counting may include the auto-generated shell scripts.

From the GitHub repository:

{noformat}
$ find . -name '*.java' | xargs wc -l
   11988 total
$ find . -name '*.h' | xargs wc -l
   27269 total
$ find . -name '*.cc' | xargs wc -l
   26276 total
{noformat}

Whether it's test or non-test code, we are still importing a *lot* of code - code which the Hadoop community would then need to maintain.

----

bq. So, it's not a tiny import by any means, but for 2x improvement on terasort wallclock, my opinion is that the maintenance burden is worth it.

Todd, as we both know, there are many, many ways to get a 2x improvement on terasort... and it isn't worth much in the real world, outside of benchmarks.

I'm sure we both would take 2x on Pig/Hive any day... *smile*

----

bq. As for importing to Tez, I don't think the community has generally agreed to EOL MapReduce

Regardless of whether or not we pull this into MR, it would be useful to pull it into Tez too - if Sean wants to do that, let's not discourage him.

I'm sure we both agree on this, and want to see real-world workloads improve - and Hive/Pig/Cascading etc. represent those workloads.

IAC, hopefully we can stop this meme that I'm trying to *preclude* you from doing anything, regardless of my religious beliefs. IAC, we both realize MR is reasonably stable and won't get a lot of investment, and so do our employers:
http://vision.cloudera.com/mapreduce-spark/
http://hortonworks.com/hadoop/tez/

So, I'd appreciate it if we don't misinterpret each other's technical opinions and concerns during this discussion. Thanks.

FTR: I'll restate my concerns about C++, the roadmap for the C++ runtime, maintainability, and support for all of Hadoop (security, platforms, etc.).

Furthermore, this jira was opened nearly 3 years ago and has seen only sporadic bursts of activity - not a good sign for long-term maintainability.

I've stated my concerns; let's try to get through them by focusing on those aspects.

----

Finally, what concern do you see with starting this as an incubator project and allowing folks to develop a community around it? We can certainly help on our end by making it easy for them to plug in via interfaces etc.

Thanks.

> Task level native optimization
> ------------------------------
>
>                 Key: MAPREDUCE-2841
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2841
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: task
>         Environment: x86-64 Linux/Unix
>            Reporter: Binglin Chang
>            Assignee: Sean Zhong
>         Attachments: DESIGN.html, MAPREDUCE-2841.v1.patch, MAPREDUCE-2841.v2.patch, dualpivot-0.patch, dualpivotv20-0.patch, fb-shuffle.patch
>
>
> I've recently been working on native optimization for MapTask, based on JNI. 
> The basic idea is to add a NativeMapOutputCollector to handle the k/v pairs emitted by the mapper, so that sort, spill, and IFile serialization can all be done in native code. A preliminary test (on a Xeon E5410, jdk6u24) showed promising results:
> 1. Sort is about 3x-10x as fast as Java (only binary string comparison is supported)
> 2. IFile serialization is about 3x as fast as Java, about 500MB/s; if hardware CRC32C is used, things can get much faster (1G/
> 3. The merge code is not complete yet, so the test uses a large enough io.sort.mb to prevent mid-spill
> This leads to a total speedup of 2x~3x for the whole MapTask if IdentityMapper (a mapper that does nothing) is used.
> There are limitations, of course: currently only Text and BytesWritable are supported, and I have not yet thought through many things, such as how to support map-side combine. I had some discussion with somebody familiar with Hive, and it seems these limitations won't be much of a problem, at least for Hive to benefit from these optimizations. Advice or discussion about improving compatibility is most welcome :) 
> Currently NativeMapOutputCollector has a static method called canEnable(), which checks whether the key/value types, comparator type, and combiner are all compatible; MapTask can then choose to enable NativeMapOutputCollector.
> This is only a preliminary test; more work needs to be done. I expect better final results, and I believe similar optimizations can be adopted for the reduce task and shuffle too.



--
This message was sent by Atlassian JIRA
(v6.2#6252)