Posted to common-dev@hadoop.apache.org by "Zheng Shao (JIRA)" <ji...@apache.org> on 2008/11/01 00:23:44 UTC

[jira] Commented: (HADOOP-4569) Hive: new syntax for specifying custom map/reduce scripts

    [ https://issues.apache.org/jira/browse/HADOOP-4569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12644461#action_12644461 ] 

Zheng Shao commented on HADOOP-4569:
------------------------------------

The old syntax for doing that was:

    FROM (
        FROM pv_users 
        SELECT TRANSFORM(pv_users.userid, pv_users.date)
        AS(key, value) 
        USING 'map_script' 
        CLUSTER BY key ) map_output 
    INSERT OVERWRITE TABLE pv_users_reduced
        SELECT TRANSFORM(map_output.key, map_output.value) 
        AS (date, count)
        USING 'reduce_script'; 

We plan to change that to:

    FROM (
        FROM pv_users 
        MAP pv_users.userid, pv_users.date
        USING 'map_script' 
        AS key, value
        CLUSTER BY key
        ) map_output 
    INSERT OVERWRITE TABLE pv_users_reduced
        REDUCE map_output.key, map_output.value
        USING 'reduce_script'
        AS date, count;


The script is expected to read tab-separated fields, and also generate tab-separated fields.
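
As a rough illustration of that contract, here is a minimal Python sketch of what a script like 'map_script' could look like. The names map_row and run are hypothetical, and so is the mapping logic (emit the date as key with a count of 1); the comment above does not say what the real scripts do. It also assumes rows arrive on standard input and leave on standard output, one row per line.

```python
import sys

def map_row(line):
    """Turn one tab-separated input row into one tab-separated output row."""
    # Assumed input columns, in MAP clause order: userid, date.
    userid, date = line.rstrip("\n").split("\t")
    # Hypothetical logic: key = date, value = 1 (a count to be summed later).
    return f"{date}\t1"

def run(lines):
    """Yield one output line (with trailing newline) per non-empty input line."""
    for line in lines:
        if line.strip():
            yield map_row(line) + "\n"

# Saved as an executable script, the last line would be something like:
#     sys.stdout.writelines(run(sys.stdin))
```

The output has exactly two tab-separated fields, which lines up with "AS key, value" in the new syntax.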


The major changes are:
•         Schemaless Mapper/Reducer: if there is no "AS" clause we assume "AS key, value", which puts the bytes before the first tab into key and the rest into value.
•         SELECT TRANSFORM changed to MAP/REDUCE to make it clear what is map and what is reduce.
•         Reordered USING and AS so that the script comes before its output columns.
•         Support different shuffling/sorting keys by using "DISTRIBUTE BY" and "SORT BY" ("CLUSTER BY key" means "DISTRIBUTE BY key SORT BY key ASC").
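
Two of these points can be modelled in a few lines of Python. This is only a sketch of the semantics described above, not Hive's implementation: split_schemaless and cluster_by are hypothetical names, returning None for a line without a tab is an assumption, and Python's built-in hash() merely stands in for whatever partitioner is really used.

```python
def split_schemaless(line):
    """Split a script output line into (key, value) at the first tab."""
    key, sep, value = line.rstrip("\n").partition("\t")
    # Assumption: with no tab at all, the whole line is the key and value is missing.
    return key, (value if sep else None)

def cluster_by(rows, num_reducers):
    """Model CLUSTER BY key as DISTRIBUTE BY key followed by SORT BY key ASC."""
    partitions = [[] for _ in range(num_reducers)]
    for key, value in rows:
        # DISTRIBUTE BY: rows with equal keys land in the same partition.
        partitions[hash(key) % num_reducers].append((key, value))
    # SORT BY: each partition is sorted on its own; there is no global order.
    return [sorted(p, key=lambda kv: kv[0]) for p in partitions]
```

For example, split_schemaless("k\tv1\tv2") keeps the second tab inside the value and returns ("k", "v1\tv2").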


> Hive: new syntax for specifying custom map/reduce scripts
> ---------------------------------------------------------
>
>                 Key: HADOOP-4569
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4569
>             Project: Hadoop Core
>          Issue Type: Improvement
>            Reporter: Zheng Shao
>
> In Hive we not only support SQL but also want to support custom scripts.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.