Posted to common-dev@hadoop.apache.org by "Runping Qi (JIRA)" <ji...@apache.org> on 2007/03/15 01:09:09 UTC

[jira] Created: (HADOOP-1120) Contribute some code helping implement map/reduce apps for joining data from multiple sources

Contribute some code helping implement map/reduce apps for joining data from multiple sources
---------------------------------------------------------------------------------------------

                 Key: HADOOP-1120
                 URL: https://issues.apache.org/jira/browse/HADOOP-1120
             Project: Hadoop
          Issue Type: New Feature
          Components: contrib/streaming
            Reporter: Runping Qi



With the current Hadoop, it is a bit hard for the user to implement data joining apps.
HADOOP-475/485 attempt to provide some support for data joining jobs, but that support seems to be hard to implement.

This Jira instead calls for application-level support.
The idea is to provide generic map/reduce classes implementing data join jobs,
and to allow the user to extend those classes to add their special logic.

In particular, the user needs to define a mapper class
that extends the DataJoinMapperBase class and implements methods for the
following:

1. Compute the source tag of input values 
2. Compute the map output value object 
3. Compute the map output key object
 
The source tag will be used by the reducer to determine from which source
(which table in SQL terminology) a value comes. Computing the map output
value object amounts to performing projecting/filtering work in a SQL
statement (through the select/where clauses). Computing the map output key
amounts to choosing the join key. This class provides the appropriate plugin
points for the user defined subclasses to implement the appropriate logic.
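
The three mapper steps above can be sketched without any Hadoop dependency.
This is only an illustration under stated assumptions, not the code in
data_join.patch: JoinMapperSketch, CustomerOrderMapperSketch, the method
names, the comma-separated record layout, and the convention that a null
value drops the record are all hypothetical.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

// Hypothetical example subclass: joins "customers" and "orders" records on their first field.
class CustomerOrderMapperSketch extends JoinMapperSketch {
    @Override
    protected String sourceTag(String inputFile) {
        // Assumption: the source can be told apart by the input file name.
        return inputFile.startsWith("customers") ? "customers" : "orders";
    }

    @Override
    protected String outputValue(String record) {
        // Projection: keep only the first two comma-separated fields.
        // Filtering: drop records that have fewer than two fields.
        String[] fields = record.split(",");
        if (fields.length < 2) {
            return null;
        }
        return fields[0] + "," + fields[1];
    }

    @Override
    protected String joinKey(String value) {
        // The join key is the first field of the projected value.
        return value.substring(0, value.indexOf(','));
    }

    public static void main(String[] args) {
        JoinMapperSketch mapper = new CustomerOrderMapperSketch();
        System.out.println(mapper.map("customers.csv", "42,Alice,Paris"));
        System.out.println(mapper.map("orders.csv", "42,book,2007-03-15"));
    }
}

// Hypothetical stand-in for the mapper base class; NOT the class in data_join.patch.
abstract class JoinMapperSketch {
    // 1. Compute the source tag of an input.
    protected abstract String sourceTag(String inputFile);

    // 2. Compute the map output value. Assumption: returning null drops the record.
    protected abstract String outputValue(String record);

    // 3. Compute the map output key (the join key).
    protected abstract String joinKey(String value);

    // Returns (join key, "tag<TAB>value"), or null when the record is filtered out.
    final Map.Entry<String, String> map(String inputFile, String record) {
        String tag = sourceTag(inputFile);
        String value = outputValue(record);
        if (value == null) {
            return null;
        }
        return new SimpleEntry<>(joinKey(value), tag + "\t" + value);
    }
}
```

Carrying the tag alongside the value is what lets a reducer that receives all
values for one join key tell which source each value came from.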

Then the user needs to define a reducer class
that extends the DataJoinReduceBase class to implement the following:

    protected abstract TaggedMapOutput combine(Object[] tags, Object[] values);
 
The above method is expected to produce one output value from an array of
records of different sources. The user code can also perform filtering here.
It can return null if it decides that the records do not meet certain conditions.
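
A rough, Hadoop-free sketch of how such a combine method might be used
follows. It is an illustration only: JoinReducerSketch and OrderJoinSketch
are hypothetical names, combine returns a plain String here instead of
TaggedMapOutput, and calling combine once per combination of one record
from each source is an assumption about the framework, not something this
description states.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical example subclass: joins one customer record with each of its order records.
class OrderJoinSketch extends JoinReducerSketch {
    @Override
    protected String combine(Object[] tags, Object[] values) {
        // Filtering: returning null drops this combination from the output.
        if ("cancelled".equals(values[1])) {
            return null;
        }
        return values[0] + "," + values[1];
    }

    public static void main(String[] args) {
        Object[] tags = {"customers", "orders"};
        List<List<String>> valuesPerTag = Arrays.asList(
                Arrays.asList("Alice"),
                Arrays.asList("book", "cancelled", "pen"));
        for (String row : new OrderJoinSketch().join(tags, valuesPerTag)) {
            System.out.println(row);
        }
    }
}

// Hypothetical stand-in for the reducer base class; NOT the class in data_join.patch.
abstract class JoinReducerSketch {
    // Same parameter shape as the signature quoted above, simplified to return a String.
    protected abstract String combine(Object[] tags, Object[] values);

    // Assumption: for one join key, combine is called for every combination
    // that takes exactly one value from each source; null results are skipped.
    final List<String> join(Object[] tags, List<? extends List<?>> valuesPerTag) {
        List<String> out = new ArrayList<>();
        crossProduct(tags, valuesPerTag, 0, new Object[tags.length], out);
        return out;
    }

    private void crossProduct(Object[] tags, List<? extends List<?>> valuesPerTag,
                              int depth, Object[] current, List<String> out) {
        if (depth == tags.length) {
            // Pass a copy so the subclass cannot disturb the enumeration state.
            String joined = combine(tags, current.clone());
            if (joined != null) {
                out.add(joined);
            }
            return;
        }
        for (Object value : valuesPerTag.get(depth)) {
            current[depth] = value;
            crossProduct(tags, valuesPerTag, depth + 1, current, out);
        }
    }
}
```

Under that assumption, a source with no values for a given key produces no
combinations at all, so the sketch behaves like an inner join.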

That is pretty much all the user needs to do in order to create a map/reduce job to join data
from different sources.




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-1120) Contribute some code helping implement map/reduce apps for joining data from multiple sources

Posted by "Runping Qi (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-1120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Runping Qi updated HADOOP-1120:
-------------------------------

    Attachment: data_join.patch



[jira] Assigned: (HADOOP-1120) Contribute some code helping implement map/reduce apps for joining data from multiple sources

Posted by "Runping Qi (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-1120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Runping Qi reassigned HADOOP-1120:
----------------------------------

    Assignee: Runping Qi



[jira] Updated: (HADOOP-1120) Contribute some code helping implement map/reduce apps for joining data from multiple sources

Posted by "Runping Qi (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-1120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Runping Qi updated HADOOP-1120:
-------------------------------

    Status: Patch Available  (was: Open)



[jira] Updated: (HADOOP-1120) Contribute some code helping implement map/reduce apps for joining data from multiple sources

Posted by "Nigel Daley (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-1120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nigel Daley updated HADOOP-1120:
--------------------------------

       Resolution: Fixed
    Fix Version/s: 0.13.0
           Status: Resolved  (was: Patch Available)

Doug just committed this.  Thanks, Runping!
