Posted to common-dev@hadoop.apache.org by "Venky Iyer (JIRA)" <ji...@apache.org> on 2008/11/05 10:23:44 UTC

[jira] Created: (HADOOP-4590) User-definable handlers for MAP and REDUCE transforms

User-definable handlers for MAP and REDUCE transforms
-----------------------------------------------------

                 Key: HADOOP-4590
                 URL: https://issues.apache.org/jira/browse/HADOOP-4590
             Project: Hadoop Core
          Issue Type: Wish
          Components: contrib/hive
            Reporter: Venky Iyer


Mappers can be specified (as before) like:

.... MAP USING 'uri' .....

URIs are in a format to be decided upon; possibilities include

protocol://resource/param=value,param2=value2

or

protocol: resource_string

For example, shell commands look like

sh://uniq or 
sh: sort | uniq

When no protocol is specified, we assume the default to be sh://.

Another example is pyfunc://foo.bar/baz=2, which points to the bar(baz=2) function in the foo module.
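A minimal sketch of how the protocol://resource/param=value,param2=value2 form might be parsed, falling back to sh when no protocol is given. This is illustrative only; the function name is invented and the real URI format is still to be decided (the "sh: resource_string" colon variant is not handled here):

```python
from urllib.parse import urlparse

def parse_transform_uri(uri, default_protocol="sh"):
    """Split a transform URI such as pyfunc://foo.bar/baz=2 into
    (protocol, resource, params). Hypothetical helper, not Hive code."""
    if "://" not in uri:
        # No protocol given: assume the default handler (sh).
        return default_protocol, uri, {}
    parsed = urlparse(uri)
    params = {}
    # Parameters ride in the path: /param=value,param2=value2
    for pair in parsed.path.lstrip("/").split(","):
        if pair:
            key, _, value = pair.partition("=")
            params[key] = value
    return parsed.scheme, parsed.netloc, params
```

For example, parse_transform_uri("pyfunc://foo.bar/baz=2") yields ("pyfunc", "foo.bar", {"baz": "2"}), while a bare "sort | uniq" resolves to the sh default.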

We can add handlers for these protocols like

add handler sh shell (default)
add handler pyfunc "python pyhive.py"

and replace these handlers using appropriate syntax.

Map and Reduce handlers can be distinct. 
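One way the "add handler" statements above could be modeled is a per-transform protocol registry. The sketch below is hypothetical (the class and method names are invented), assuming a handler is just a command string keyed by protocol:

```python
class HandlerRegistry:
    """Hypothetical registry mapping URI protocols to handler commands."""

    def __init__(self):
        self._handlers = {}
        self._default = None

    def add_handler(self, protocol, command, default=False):
        # Mirrors:  add handler pyfunc "python pyhive.py"
        self._handlers[protocol] = command
        if default:
            self._default = protocol

    def resolve(self, protocol=None):
        # With no protocol specified, fall back to the default (sh).
        return self._handlers[protocol or self._default]

# Keeping separate registries would let MAP and REDUCE handlers differ.
map_handlers = HandlerRegistry()
map_handlers.add_handler("sh", "shell", default=True)
map_handlers.add_handler("pyfunc", "python pyhive.py")
```

Replacing a handler is then just re-registering the same protocol with a new command.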



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4590) User-definable handlers for MAP and REDUCE transforms

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12645352#action_12645352 ] 

Zheng Shao commented on HADOOP-4590:
------------------------------------

There are several alternative ways to do this:

A. Let the caller of the function provide SQL fragments (srctable, columns, where condition, etc.) instead of a full SQL query, so the function can construct the SQL with the additional filtering conditions. The fragments can be at different levels of granularity, depending on the complexity/freedom we want to expose to the user. If the user gives us a full SQL query, we can still do post-filtering by nesting it inside "SELECT * FROM xxx WHERE yyy".
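The post-filtering fallback mentioned in alternative A could be as simple as wrapping the user's query in a subquery. A sketch (the helper name and the alias t are illustrative):

```python
def post_filter(full_sql, condition):
    # Nest the user-supplied query and apply the extra filter on top,
    # i.e. SELECT * FROM (<full_sql>) t WHERE <condition>.
    return "SELECT * FROM (%s) t WHERE %s" % (full_sql, condition)
```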

B. Let the caller of the function provide a SQL fragment with variables inside, and the function does variable substitutions.
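Alternative B could be sketched with ordinary template substitution; the $placeholder names below are invented for illustration:

```python
from string import Template

def fill_fragment(fragment, **values):
    # The caller writes $variables into the SQL fragment; the library
    # substitutes concrete values before running the query.
    return Template(fragment).substitute(**values)
```

For example, fill_fragment("SELECT $columns FROM $srctable WHERE $cond", columns="key, value", srctable="src", cond="key > 10") produces a complete query.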


My main argument against the handler approach is that there are few cases in which the library could change handlers without the user noticing.

1. The handler has to understand the row schema in order to do filtering, but that information is not available if the user gives us a full SQL query.
2. The user could say "SELECT" instead of MAP in the query, and the handlers wouldn't take effect.
3. The user could nest a subquery that contains a MAP working on a totally different schema, when the library's intention is really to change only the outer MAP.

An extreme analogy: we construct a command line like "awk xxx | cut -f 2" by prepending/appending strings, not by setting an environment variable to ask "cut -f 2" to do the filtering.
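The analogy in code: the pipeline grows by string composition, not by reconfiguring the downstream command. A trivial sketch:

```python
def append_stage(pipeline, stage):
    # Build "awk xxx | cut -f 2" by appending to the command string,
    # rather than asking "cut -f 2" to change behavior via the environment.
    return pipeline + " | " + stage
```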


