Posted to mapreduce-issues@hadoop.apache.org by "Wang Shouyan (JIRA)" <ji...@apache.org> on 2009/12/07 09:10:18 UTC

[jira] Created: (MAPREDUCE-1270) Hadoop C++ Extension

Hadoop C++ Extension
--------------------

                 Key: MAPREDUCE-1270
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1270
             Project: Hadoop Map/Reduce
          Issue Type: Improvement
          Components: task
    Affects Versions: 0.20.1
         Environment:  hadoop linux
            Reporter: Wang Shouyan


  Hadoop C++ Extension is an internal project at Baidu. We started it for these reasons:
   1. To provide a C++ API. We mostly used Streaming before, and we also tried PIPES, but we did not find PIPES to be more efficient than Streaming, so we think a new C++ extension is needed.
   2. Even using PIPES or Streaming, it is hard to control the memory of the Hadoop map/reduce child JVM.
   3. It costs too much to read/write/sort TB/PB of data in Java, and when using PIPES or Streaming, a pipe or socket is not efficient enough to carry such huge data.

   What we want to do:
   1. The map/reduce child JVM does no data processing; it just prepares the environment, starts the C++ mapper, tells the mapper which split to handle, and reads reports from the mapper until it finishes. The mapper reads records, invokes the user-defined map, does the partitioning, writes spills, combines, and merges into file.out. We think these operations can be done in C++ code.
   2. The reducer is similar to the mapper; it is started after the sort finishes, reads from the sorted files, invokes the user-defined reduce, and writes to the user-defined record writer.
   3. We also intend to rewrite the shuffle and sort in C++, for efficiency and memory control.
   First 1 and 2, then 3.

   How this differs from PIPES:
   1. We will reuse most of the PIPES code.
   2. We will do it more completely: nothing changes in scheduling and management, but everything changes in execution.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (MAPREDUCE-1270) Hadoop C++ Extension

Posted by "Dong Yang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891544#action_12891544 ] 

Dong Yang commented on MAPREDUCE-1270:
--------------------------------------

Here is HADOOP-HCE-1.0.0.patch for the mapreduce trunk (revision 963075), which adds the Hadoop C++ Extension (HCE for short) changes to mapreduce-963075.

The steps for using this patch are as follows:
1. Download HADOOP-HCE-1.0.0.patch
2. svn co -r 963075 http://svn.apache.org/repos/asf/hadoop/mapreduce/trunk trunk-963075
3. cd trunk-963075
4. patch -p0 < HADOOP-HCE-1.0.0.patch
5. sh build.sh (needs java, forrest, and ant)

HCE includes Java and C++ code and depends on libhdfs, so build.sh first checks out the hdfs trunk and builds it.




[jira] Commented: (MAPREDUCE-1270) Hadoop C++ Extension

Posted by "Hong Tang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12840797#action_12840797 ] 

Hong Tang commented on MAPREDUCE-1270:
--------------------------------------

bq.     The bad news is that our design document is written in Chinese. My team members and I will put some design details step by step in the next few days.

There are many hadoop devs fluent in Chinese, so it might still be a good idea to share the original design doc.



[jira] Updated: (MAPREDUCE-1270) Hadoop C++ Extension

Posted by "Fusheng Han (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fusheng Han updated MAPREDUCE-1270:
-----------------------------------

    Description: 
  Hadoop C++ Extension is an internal project at Baidu. We started it for these reasons:
   1. To provide a C++ API. We mostly used Streaming before, and we also tried PIPES, but we did not find PIPES to be more efficient than Streaming, so we think a new C++ extension is needed.
   2. Even using PIPES or Streaming, it is hard to control the memory of the Hadoop map/reduce child JVM.
   3. It costs too much to read/write/sort TB/PB of data in Java, and when using PIPES or Streaming, a pipe or socket is not efficient enough to carry such huge data.

   What we want to do:
   1. The map/reduce child JVM does no data processing; it just prepares the environment, starts the C++ mapper, tells the mapper which split to handle, and reads reports from the mapper until it finishes. The mapper reads records, invokes the user-defined map, does the partitioning, writes spills, combines, and merges into file.out. We think these operations can be done in C++ code.
   2. The reducer is similar to the mapper; it is started after the sort finishes, reads from the sorted files, invokes the user-defined reduce, and writes to the user-defined record writer.
   3. We also intend to rewrite the shuffle and sort in C++, for efficiency and memory control.
   First 1 and 2, then 3.

   How this differs from PIPES:
   1. We will reuse most of the PIPES code.
   2. We will do it more completely: nothing changes in scheduling and management, but everything changes in execution.

*UPDATE:*

Now you can get a test version of HCE from this link: http://docs.google.com/leaf?id=0B5xhnqH1558YZjcxZmI0NzEtODczMy00NmZiLWFkNjAtZGM1MjZkMmNkNWFk&hl=zh_CN&pli=1
This is a full package with all the Hadoop source code.
Following the document "HCE InstallMenu.pdf" in the attachments, you can build and deploy it on your cluster.

The attachment "HCE Tutorial.pdf" will lead you through writing your first HCE program and gives other specifications of the interface.

The attachment "HCE Performance Report.pdf" gives a performance report of HCE compared to Java MapRed and Pipes.

Any comments are welcome.



[jira] Updated: (MAPREDUCE-1270) Hadoop C++ Extension

Posted by "Dong Yang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dong Yang updated MAPREDUCE-1270:
---------------------------------

    Attachment: Overall Design of Hadoop C++ Extension.doc

Hadoop C++ Extension (HCE for short) is a framework for making MapReduce more stable and faster.
Here is the overall design of HCE; you are welcome to give your viewpoints on its practical implementation.



[jira] Updated: (MAPREDUCE-1270) Hadoop C++ Extension

Posted by "Fusheng Han (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fusheng Han updated MAPREDUCE-1270:
-----------------------------------

    Description: 
  Hadoop C++ Extension is an internal project at Baidu. We started it for these reasons:
   1. To provide a C++ API. We mostly used Streaming before, and we also tried PIPES, but we did not find PIPES to be more efficient than Streaming, so we think a new C++ extension is needed.
   2. Even using PIPES or Streaming, it is hard to control the memory of the Hadoop map/reduce child JVM.
   3. It costs too much to read/write/sort TB/PB of data in Java, and when using PIPES or Streaming, a pipe or socket is not efficient enough to carry such huge data.

   What we want to do:
   1. The map/reduce child JVM does no data processing; it just prepares the environment, starts the C++ mapper, tells the mapper which split to handle, and reads reports from the mapper until it finishes. The mapper reads records, invokes the user-defined map, does the partitioning, writes spills, combines, and merges into file.out. We think these operations can be done in C++ code.
   2. The reducer is similar to the mapper; it is started after the sort finishes, reads from the sorted files, invokes the user-defined reduce, and writes to the user-defined record writer.
   3. We also intend to rewrite the shuffle and sort in C++, for efficiency and memory control.
   First 1 and 2, then 3.

   How this differs from PIPES:
   1. We will reuse most of the PIPES code.
   2. We will do it more completely: nothing changes in scheduling and management, but everything changes in execution.

*UPDATE:*

Now you can get a test version of HCE from this link: http://docs.google.com/leaf?id=0B5xhnqH1558YZjcxZmI0NzEtODczMy00NmZiLWFkNjAtZGM1MjZkMmNkNWFk&hl=zh_CN&pli=1
This is a full package with all the Hadoop source code.
Following the document "HCE InstallMenu.pdf" in the attachments, you can build and deploy it on your cluster.

The attachment "HCE Tutorial.pdf" will lead you through writing your first HCE program and gives other specifications of the interface.

The attachment "HCE Performance Report.pdf" gives a performance report of HCE compared to Java MapRed and Pipes.



[jira] Commented: (MAPREDUCE-1270) Hadoop C++ Extension

Posted by "Luke Lu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12840722#action_12840722 ] 

Luke Lu commented on MAPREDUCE-1270:
------------------------------------

Fusheng, feel free to attach the design doc if there is nothing confidential in it and Shouyan approves :). There are plenty of people on this thread who understand Chinese. It would also help me explain some details to Arun, now that I work next to him.

On the combiner interface, I think it would be better to add a convenient emitValue method instead of changing the interface, as there are quite a few legitimate uses.



[jira] Commented: (MAPREDUCE-1270) Hadoop C++ Extension

Posted by "Fusheng Han (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12840166#action_12840166 ] 

Fusheng Han commented on MAPREDUCE-1270:
----------------------------------------

This project is ongoing inside Baidu, and the basic functions are complete. We have HCE (Hadoop C++ Extension) running fluently with Text input and without any compression. We achieved about a 20 percent improvement over Streaming, using 40 GB of input on 5 nodes, with a word counter as the MapReduce application.

The interfaces exposed to users are similar to those of PIPES. The Mapper interface is:
class Mapper {
public:
  virtual int64_t setup() {return 0;}
  virtual int64_t cleanup(bool isSuccessful) {return 0;}
  virtual int64_t map(MapInput &input) = 0;

protected:
  virtual void emit(const void* key, const int64_t keyLength,
                    const void* value, const int64_t valueLength) {
    getContext()->emit(key, keyLength, value, valueLength);
  }
  virtual TaskContext* getContext() {
    return context;
  }
  TaskContext* context;  // task context, assumed to be set by the framework
};
Modeled after the new Hadoop MapReduce interface, setup() and cleanup() functions are added here. MapInput is a newly defined type for map input; the key and value can be retrieved from this object. An emit() function is provided, which can be invoked directly in the map() function. Keys and values are raw memory pointers accompanied by their lengths, which is better suited to non-text data.
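As a concrete illustration, the interface above might be used like this for word counting. Note that MapInput, TaskContext, and the wiring here are minimal stand-in stubs written for this sketch; the real HCE support types are not shown in this thread:

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <sstream>
#include <string>

// Stub: collects emitted key/value pairs in memory instead of spilling to disk.
class TaskContext {
public:
  std::map<std::string, int64_t> counts;
  void emit(const void* key, int64_t keyLength,
            const void* value, int64_t valueLength) {
    (void)value; (void)valueLength;  // value is always "1" in this sketch
    counts[std::string(static_cast<const char*>(key), keyLength)] += 1;
  }
};

// Stub: one input record (a line of text).
class MapInput {
public:
  explicit MapInput(std::string line) : value_(std::move(line)) {}
  const std::string& value() const { return value_; }
private:
  std::string value_;
};

class Mapper {
public:
  explicit Mapper(TaskContext* ctx) : context(ctx) {}
  virtual ~Mapper() {}
  virtual int64_t setup() { return 0; }
  virtual int64_t cleanup(bool) { return 0; }
  virtual int64_t map(MapInput& input) = 0;
protected:
  void emit(const void* key, int64_t keyLength,
            const void* value, int64_t valueLength) {
    getContext()->emit(key, keyLength, value, valueLength);
  }
  TaskContext* getContext() { return context; }
  TaskContext* context;
};

// User-defined mapper: emits (word, "1") for every whitespace-separated token.
class WordCountMapper : public Mapper {
public:
  using Mapper::Mapper;
  int64_t map(MapInput& input) override {
    std::istringstream in(input.value());
    std::string word;
    static const char one[] = "1";
    while (in >> word) {
      emit(word.data(), static_cast<int64_t>(word.size()),
           one, static_cast<int64_t>(sizeof(one) - 1));
    }
    return 0;
  }
};
```

In the real framework the context would spill, combine, and merge the emitted pairs; here it just tallies them so the flow is visible.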

The Reducer is the same as the Mapper:
class Reducer {
public:
  virtual int64_t setup() {return 0;}
  virtual int64_t cleanup(bool isSuccessful) {return 0;}
  virtual int64_t reduce(ReduceInput &input) = 0;

protected:
  virtual void emit(const void* key, const int64_t keyLength,
                    const void* value, const int64_t valueLength) {
    getContext()->emit(key, keyLength, value, valueLength);
  }
  virtual TaskContext* getContext() {
    return context;
  }
  TaskContext* context;  // task context, assumed to be set by the framework
};
A slight difference is that ReduceInput can retrieve successive values with a next() function.
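A minimal sketch of that iteration pattern, with ReduceInput replaced by a hypothetical in-memory stub (the real type and its exact next() signature are not shown in this thread):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Stub: one reduce group -- a key plus an iterator over its values.
class ReduceInput {
public:
  ReduceInput(std::string key, std::vector<int64_t> values)
      : key_(std::move(key)), values_(std::move(values)) {}
  const std::string& key() const { return key_; }
  // Returns false once the value iterator is exhausted.
  bool next(int64_t* value) {
    if (pos_ >= values_.size()) return false;
    *value = values_[pos_++];
    return true;
  }
private:
  std::string key_;
  std::vector<int64_t> values_;
  std::size_t pos_ = 0;
};

// User-defined reduce body: drains next() and sums the values for one key.
int64_t sumReduce(ReduceInput& input) {
  int64_t total = 0;
  int64_t v;
  while (input.next(&v)) total += v;
  return total;
}
```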

In Hadoop MapReduce, the Combiner interface is no different from Reduce. Here we make a small change: the Combiner can only emit a value (there is no key parameter in its emit function). The key is omitted from the combine emit because a mistaken key could corrupt the sort order of the map output; the output key of emit() is determined by the input.
class Combiner {
public:
  virtual int64_t setup() {return 0;}
  virtual int64_t cleanup(bool isSuccessful) {return 0;}
  virtual int64_t combine(ReduceInput &input) = 0;

protected:
  virtual void emit(const void* value, const int64_t valueLength) {
    getContext()->emit(getCombineKey(), getCombineKeyLength(), value, valueLength);
  }
  virtual TaskContext* getContext() {
    return context;
  }
  virtual const void* getCombineKey() {
    return combineKey;
  }
  virtual int64_t getCombineKeyLength() {
    return combineKeyLength;
  }
  TaskContext* context;      // assumed to be set by the framework
  const void* combineKey;    // current group key, assumed to be held by the framework
  int64_t combineKeyLength;
};
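To illustrate the point of the value-only emit, here is a hypothetical, self-contained sketch in which the emitted key is always the framework-held combine key, so user code cannot disturb the sorted order (the context type and wiring are stubs invented for this example):

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Stub context recording (key, value) pairs exactly as emit() delivers them.
struct CombineContext {
  std::vector<std::pair<std::string, std::string>> emitted;
  void emit(const void* k, int64_t kl, const void* v, int64_t vl) {
    emitted.emplace_back(std::string(static_cast<const char*>(k), kl),
                         std::string(static_cast<const char*>(v), vl));
  }
};

// Value-only emit: the key is always the combine key held by the framework,
// never chosen by user code, so map-output order cannot be corrupted.
class CountCombiner {
public:
  CountCombiner(CombineContext* ctx, std::string key)
      : context(ctx), combineKey(std::move(key)) {}
  // Combine n occurrences of the current key into a single "n" value.
  void combine(int64_t n) {
    std::string v = std::to_string(n);
    emit(v.data(), static_cast<int64_t>(v.size()));
  }
protected:
  void emit(const void* value, int64_t valueLength) {
    context->emit(combineKey.data(),
                  static_cast<int64_t>(combineKey.size()),
                  value, valueLength);
  }
  CombineContext* context;
  std::string combineKey;
};
```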

The Partitioner also gets setup() and cleanup() functions:
class Partitioner {
public:
  virtual int64_t setup() {return 0;}
  virtual int64_t cleanup() {return 0;}
  virtual int partition(const void* key, const int64_t keyLength, int numOfReduces) = 0;
};
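HCE's default partitioning scheme is not described in this thread, so as an illustration only, a hash-based partitioner against this interface might look like the following (the FNV-1a hash is an arbitrary choice for the sketch):

```cpp
#include <cassert>
#include <cstdint>

class Partitioner {
public:
  virtual ~Partitioner() {}
  virtual int64_t setup() { return 0; }
  virtual int64_t cleanup() { return 0; }
  virtual int partition(const void* key, const int64_t keyLength,
                        int numOfReduces) = 0;
};

// Byte-wise FNV-1a hash of the raw key, reduced modulo the reducer count.
class HashPartitioner : public Partitioner {
public:
  int partition(const void* key, const int64_t keyLength,
                int numOfReduces) override {
    const unsigned char* p = static_cast<const unsigned char*>(key);
    uint64_t h = 1469598103934665603ULL;   // FNV offset basis
    for (int64_t i = 0; i < keyLength; ++i) {
      h ^= p[i];
      h *= 1099511628211ULL;               // FNV prime
    }
    return static_cast<int>(h % static_cast<uint64_t>(numOfReduces));
  }
};
```

Because the key arrives as a raw pointer plus length, the same partitioner works for text and binary keys alike.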

Following PIPES, we add a new entry with the name "HCE" to the hadoop command. Users run a command like "hadoop hce XXX" to invoke HCE MapReduce.

We'd like to hear your comments.




[jira] Commented: (MAPREDUCE-1270) Hadoop C++ Extention

Posted by "Allen Wittenauer (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891596#action_12891596 ] 

Allen Wittenauer commented on MAPREDUCE-1270:
---------------------------------------------

This patch appears to contain code from the C++ Boost library. Someone needs to do the legwork to determine the legality of the patch.

> Hadoop C++ Extention
> --------------------
>
>                 Key: MAPREDUCE-1270
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1270
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: task
>    Affects Versions: 0.20.1
>         Environment:  hadoop linux
>            Reporter: Wang Shouyan
>         Attachments: HADOOP-HCE-1.0.0.patch, HCE InstallMenu.pdf, HCE Performance Report.pdf, HCE Tutorial.pdf, Overall Design of Hadoop C++ Extension.doc
>
>
> *UPDATE:*
> You can now get a test version of HCE from this link: http://docs.google.com/leaf?id=0B5xhnqH1558YZjcxZmI0NzEtODczMy00NmZiLWFkNjAtZGM1MjZkMmNkNWFk&hl=zh_CN&pli=1
> This is a full package including all of the Hadoop source code.
> Following the document "HCE InstallMenu.pdf" in the attachments, you can build and deploy it on your cluster.
> The attachment "HCE Tutorial.pdf" walks you through writing your first HCE program and gives the other specifications of the interface.
> The attachment "HCE Performance Report.pdf" compares the performance of HCE with Java MapReduce and Pipes.
> Any comments are welcome.



[jira] Commented: (MAPREDUCE-1270) Hadoop C++ Extention

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12840795#action_12840795 ] 

Arun C Murthy commented on MAPREDUCE-1270:
------------------------------------------

bq. The bad news is that our design document is written in Chinese. My team members and I will put some design details step by step in the next few days.

Thanks!

bq. For Q3, we indeed change the interface of Combiner, while the semantics for Combiner is the same with Java Map-Reduce. It prevents mistaken use of Combiner.

It's a reasonable argument, but I'd recommend we stay compatible with both Java Map-Reduce and Pipes by keeping the same interface. FYI: both Java and Pipes explicitly disallow changing keys in the combiner as part of the 'contract'. If the user does go ahead and changes the key, the application is not guaranteed to work.
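The combiner contract described above can be illustrated with a small, self-contained sketch: a word-count style combiner that folds the values for a key but leaves the key untouched. The KeyValues/KeyValue types and the combine() function are hypothetical stand-ins, not the actual HCE or Pipes combiner API.

```cpp
#include <cassert>
#include <numeric>
#include <string>
#include <vector>

// Hypothetical stand-ins for one key's group of values and one output pair;
// these types and combine() are illustrative, not the actual HCE/Pipes API.
struct KeyValues {
  std::string key;
  std::vector<long> values;
};

struct KeyValue {
  std::string key;
  long value;
};

// A word-count style combiner: it may fold the values for a key, but per the
// contract it must emit the key unchanged.
KeyValue combine(const KeyValues& in) {
  KeyValue out;
  out.key = in.key;  // contract: the combiner never rewrites the key
  out.value = std::accumulate(in.values.begin(), in.values.end(), 0L);
  return out;
}
```

If combine() instead emitted a different key, records could land in the wrong partition after the map-side sort, which is exactly why both frameworks forbid it.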

----

In terms of APIs, as I previously mentioned, I strongly recommend you start from the Hadoop Pipes APIs and enhance them - this will ensure compatibility between Hadoop Pipes and HCE. Again, please consider moving the sort/shuffle/merge into Hadoop Pipes, as I recommended previously.




[jira] Commented: (MAPREDUCE-1270) Hadoop C++ Extention

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879055#action_12879055 ] 

Owen O'Malley commented on MAPREDUCE-1270:
------------------------------------------

Posting entire tarballs isn't very useful. Can you include your changes as a patch?




[jira] Commented: (MAPREDUCE-1270) Hadoop C++ Extention

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12841114#action_12841114 ] 

Owen O'Malley commented on MAPREDUCE-1270:
------------------------------------------

{quote}
I don't think we need to be completely compatible with the Pipes API
{quote}
I don't think there is enough motivation to have two different C++ APIs, so you should use the same interface. That does *not* mean that you can't change the API to be better. You can and should help make the APIs more usable and extensible.

{quote}
If we do need a C++ API, we should consider usability and extensibility more than compatibility, because I don't believe such a compatibility problem is a problem for most users.
{quote}
There is a requirement to provide backwards compatibility of all of Hadoop's public APIs with the previous version. APIs and interfaces can be deprecated and then removed in a later version, but compatibility is not optional.






[jira] Commented: (MAPREDUCE-1270) Hadoop C++ Extention

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891662#action_12891662 ] 

Doug Cutting commented on MAPREDUCE-1270:
-----------------------------------------

Looks like BSD:

http://www.boost.org/LICENSE_1_0.txt

So we'd just need to append it to LICENSE.txt, noting there which files are under this license.




[jira] Commented: (MAPREDUCE-1270) Hadoop C++ Extention

Posted by "Wang Shouyan (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12841070#action_12841070 ] 

Wang Shouyan commented on MAPREDUCE-1270:
-----------------------------------------

"In terms of APIs, as I previously mentioned, I strongly recommend you start from the Hadoop Pipes APIs and enhance them - this will ensure compatibility between Hadoop Pipes and HCE - again, please consider moving the sort/shuffle/merge to Hadoop Pipes as I recommended previously."

I do not agree with this opinion. If we need to establish a standard C++ API, I don't think we need to be completely compatible with the Pipes API, because I don't think the Pipes API was carefully considered; it may have been kept for compatibility with some other code, but it was never discussed adequately.

If we do need a C++ API, we should consider usability and extensibility more than compatibility, because I don't believe such a compatibility problem is a problem for most users.

For usability and extensibility, any suggestion is welcome.




[jira] Updated: (MAPREDUCE-1270) Hadoop C++ Extention

Posted by "Dong Yang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dong Yang updated MAPREDUCE-1270:
---------------------------------

    Attachment: HADOOP-HCE-1.0.0.patch

HCE-1.0.0.patch for mapreduce trunk (revision 963075)




[jira] Commented: (MAPREDUCE-1270) Hadoop C++ Extention

Posted by "Wang Shouyan (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879707#action_12879707 ] 

Wang Shouyan commented on MAPREDUCE-1270:
-----------------------------------------

Posting the entire tarball is just for trial; we will deploy it in our production environment first, and provide a patch for trunk later.




[jira] Commented: (MAPREDUCE-1270) Hadoop C++ Extention

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12794299#action_12794299 ] 

Zheng Shao commented on MAPREDUCE-1270:
---------------------------------------

Any progress on this?




[jira] Commented: (MAPREDUCE-1270) Hadoop C++ Extention

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12840277#action_12840277 ] 

Arun C Murthy commented on MAPREDUCE-1270:
------------------------------------------

Fusheng, this is interesting.

Could you please put up a design document? There are several pieces I'm interested in understanding better:
# Changes to the framework JobTracker/TaskTracker, e.g. changes to TaskRunner
# Implications for job submission, serialization of the job-conf etc. from a C++ job-client
# I do not understand why you are changing the semantics of the Combiner; this is incompatible with Java Map-Reduce.
# I'd expect one to implement a C++ 'context object' for mappers, reducers etc. I don't see this in your API at all.

I'm sure I'll have more comments once I see more details.




[jira] Commented: (MAPREDUCE-1270) Hadoop C++ Extention

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12840340#action_12840340 ] 

Arun C Murthy commented on MAPREDUCE-1270:
------------------------------------------

Fusheng, thinking about this a bit more I have a suggestion to help push this through the hadoop framework in a more straight-forward manner and help this get committed:

I'd propose you guys take existing Hadoop Pipes, keep _all_ of its APIs, and implement the map-side sort, shuffle, and reduce-side merge within Pipes itself, i.e. enhance Hadoop Pipes to own the entire 'data path'. This way we can mark the 'C++ data path' as experimental and let it co-exist with the current functionality, which will make it far easier to gain experience with this.

Currently pipes allows one to implement a C++ RecordReader for the map and a C++ RecordWriter for the reduce. We can enhance pipes to collect the map-output, sort it in C++ and write out the IFile and index for the map-output. The reduces would do the shuffle, merge & 'reduce' call in C++ and use the existing infrastructure for the C++ recordwriter to write the outputs.
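The reader shape described above - a pull-style next(key, value) call plus a progress fraction - can be sketched with an in-memory stand-in. VectorRecordReader below is illustrative only: it mirrors those two signatures but reads from a vector rather than decoding bytes from a task's input split, and it is not the actual Pipes class.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Illustrative stand-in for the reader shape Pipes exposes: next() pulls the
// next key/value pair and returns false at end of input; getProgress()
// reports a fraction in [0, 1]. A real reader would decode bytes from the
// task's input split instead of iterating a vector.
class VectorRecordReader {
public:
  explicit VectorRecordReader(
      std::vector<std::pair<std::string, std::string> > records)
      : records_(records), pos_(0) {}

  // Copy the next record into key/value; false means the split is exhausted.
  bool next(std::string& key, std::string& value) {
    if (pos_ >= records_.size()) return false;
    key = records_[pos_].first;
    value = records_[pos_].second;
    ++pos_;
    return true;
  }

  // Fraction of records consumed so far; an empty split is "done".
  float getProgress() {
    return records_.empty()
               ? 1.0f
               : static_cast<float>(pos_) / static_cast<float>(records_.size());
  }

private:
  std::vector<std::pair<std::string, std::string> > records_;
  std::size_t pos_;
};
```

The framework would drive such a reader in a loop, feeding each pair to the user's map function until next() returns false.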

A note of caution: you will need to worry about TaskCompletionEvents, i.e. the events which let the reduces know the identity and location of completed maps. Currently the reduces talk to the TaskTracker via TaskUmbilicalProtocol for this information, and this might be a sticky bit. As an intermediate step, one possible way around it is to change ReduceTask.java to relay the TaskCompletionEvents from the Java Child to the C++ reducer.

In terms of development, you could start developing on a svn branch of hadoop pipes.

Thoughts?



[jira] Commented: (MAPREDUCE-1270) Hadoop C++ Extention

Posted by "Todd Lipcon (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786814#action_12786814 ] 

Todd Lipcon commented on MAPREDUCE-1270:
----------------------------------------

This is pretty interesting. How are you implementing TaskUmbilicalProtocol?



[jira] Commented: (MAPREDUCE-1270) Hadoop C++ Extention

Posted by "Dong Yang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786953#action_12786953 ] 

Dong Yang commented on MAPREDUCE-1270:
--------------------------------------

1. The Child JVM process is retained; it sets up the runtime environment, starts the C++ process, and is in charge of communicating with Hadoop, but it contains no data read/write logic.
2. The Child JVM process communicates with the C++ process via stdin, stderr, or stdout.
3. The C++ process only accepts commands, processes data, and reports state; it is not concerned with scheduling or exception handling.
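The command/report exchange outlined above might look roughly like this; the command names (`RUN_SPLIT`, `PING`, `SHUTDOWN`) and the status line format are invented for illustration, not the actual HCE protocol:

```cpp
#include <cassert>
#include <iostream>
#include <sstream>
#include <string>

// Turn one command line from the JVM into a status report line.
std::string handleCommand(const std::string& line) {
    std::istringstream in(line);
    std::string cmd;
    in >> cmd;
    if (cmd == "RUN_SPLIT") {          // e.g. "RUN_SPLIT part-00003"
        std::string split;
        in >> split;
        return "STATUS running " + split;
    }
    if (cmd == "PING") return "STATUS alive";
    if (cmd == "SHUTDOWN") return "STATUS done";
    return "ERROR unknown-command " + cmd;
}

// Main loop of the C++ child: read commands from the JVM (stdin),
// write state reports back (stdout), exit on shutdown.
void runLoop(std::istream& in, std::ostream& out) {
    std::string line;
    while (std::getline(in, line)) {
        std::string report = handleCommand(line);
        out << report << '\n';
        if (report == "STATUS done") break;
    }
}
```

In the real process the loop would run as `runLoop(std::cin, std::cout)`, with the Java child on the other end of the pipe.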




[jira] Commented: (MAPREDUCE-1270) Hadoop C++ Extention

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12841468#action_12841468 ] 

Owen O'Malley commented on MAPREDUCE-1270:
------------------------------------------

By the way, here is an archive of the message that I sent back in Nov 07 comparing the performance of Java, pipes, and streaming.

http://www.mail-archive.com/hadoop-user@lucene.apache.org/msg02961.html

Especially by reimplementing the sort and shuffle, you should be able to go much faster than Java. *smile*





[jira] Commented: (MAPREDUCE-1270) Hadoop C++ Extention

Posted by "zhang.pengfei (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12876988#action_12876988 ] 

zhang.pengfei commented on MAPREDUCE-1270:
------------------------------------------

Woo! Sounds so cool!

So now you want to open-source it?

Come on!



[jira] Updated: (MAPREDUCE-1270) Hadoop C++ Extention

Posted by "Fusheng Han (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fusheng Han updated MAPREDUCE-1270:
-----------------------------------

    Attachment: HCE Performance Report.pdf
                HCE Tutorial.pdf
                HCE InstallMenu.pdf

>         Attachments: HCE InstallMenu.pdf, HCE Performance Report.pdf, HCE Tutorial.pdf, Overall Design of Hadoop C++ Extension.doc


[jira] Commented: (MAPREDUCE-1270) Hadoop C++ Extention

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12805457#action_12805457 ] 

He Yongqiang commented on MAPREDUCE-1270:
-----------------------------------------

Hi Dong / Shouyan,
Are you going to open-source this? If so, can you post an update on the recent work? That would help others understand it better.



[jira] Commented: (MAPREDUCE-1270) Hadoop C++ Extention

Posted by "Fusheng Han (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12840666#action_12840666 ] 

Fusheng Han commented on MAPREDUCE-1270:
----------------------------------------

Arun, I appreciate your comments.

The bad news is that our design document is written in Chinese. My team members and I will post some design details step by step over the next few days.

For Q3, we do change the interface of the Combiner, but its semantics are the same as in Java Map-Reduce; the change prevents mistaken use of the Combiner. Consider the situation where two spills of sorted records are merged into file.out (the output of the map phase). The data flow is:
-> the two spills are read in a merged (sorted) way
-> the Combiner receives sorted <key, value> pairs
-> after processing, the Combiner emits output <key, value> pairs
-> the output is written directly to file.out
If the Combiner could emit unrelated keys, the records in file.out would not be fully sorted. In our interface, the Combiner is not allowed to emit a key; the output key is determined by the input. The ordering of records in file.out is therefore guaranteed.
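The key-preserving Combiner contract described here could be sketched as follows, assuming a hypothetical interface in which the framework, not the user code, fixes the output key (all names are illustrative):

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// User combiner: may aggregate or transform the values for one key,
// but has no way to choose or emit a key of its own.
using Combiner = std::vector<std::string> (*)(const std::vector<std::string>&);

// Framework side: run the combiner on one sorted key group. The output
// key is forced to be the input key, so the sorted order of keys in
// file.out is preserved by construction.
std::vector<std::pair<std::string, std::string>> combineGroup(
        const std::string& key,
        const std::vector<std::string>& values,
        Combiner combine) {
    std::vector<std::pair<std::string, std::string>> out;
    for (const auto& v : combine(values))
        out.emplace_back(key, v);   // key fixed by the framework
    return out;
}

// Example combiner: sum integer counts into a single value.
std::vector<std::string> sumCombiner(const std::vector<std::string>& values) {
    long sum = 0;
    for (const auto& v : values) sum += std::stol(v);
    return {std::to_string(sum)};
}
```

Because `combineGroup` reattaches the group's own key to every emitted value, a misbehaving combiner cannot break the sort order of the merged output.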

to be continued... :)
