Posted to dev@hive.apache.org by "Radhika Malik (JIRA)" <ji...@apache.org> on 2012/05/05 09:09:49 UTC

[jira] [Commented] (HIVE-1772) optimize join followed by a groupby

    [ https://issues.apache.org/jira/browse/HIVE-1772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13268913#comment-13268913 ] 

Radhika Malik commented on HIVE-1772:
-------------------------------------

A group of us is trying to do this for a class project. We want to combine the JOIN and the GROUP BY that follows it into a single map-reduce job, as follows:
The map side is unchanged: it consists of two TableScanOperators (plus any FilterOperators) and two ReduceSinkOperators.
The reduce side, while computing the join in the JoinOperator, also groups the results and computes any aggregates. It then pushes the results directly to a FileSinkOperator, with no separate GroupByOperator.
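
To make the idea concrete, here is a rough sketch in Python (not Hive code) of what the single reduce step would do for the query in the issue description. It assumes the group-by key is the same as the join key, and that the shuffle delivers rows sorted by key with the left table's rows first. The names LEFT, RIGHT, map_phase and fused_join_groupby_reducer are made up for illustration and do not correspond to anything in Hive.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical tags marking which input table a shuffled row came from.
LEFT, RIGHT = 0, 1


def map_phase(left_rows, right_rows):
    """Tag each (key, value) row with its source table and sort by (key, tag).

    This stands in for the two scan/reduce-sink pipelines plus the shuffle.
    """
    tagged = [(k, LEFT, v) for k, v in left_rows]
    tagged += [(k, RIGHT, v) for k, v in right_rows]
    return sorted(tagged, key=itemgetter(0, 1))


def fused_join_groupby_reducer(shuffled_rows):
    """Join and count(1) per key in one pass over the sorted reducer input.

    Because all rows for a key arrive together, the aggregate can be updated
    as each joined row is produced, without a second job.
    """
    for key, group in groupby(shuffled_rows, key=itemgetter(0)):
        left_values = []  # left rows arrive first within a key, so buffer them
        count = 0
        for _, tag, _value in group:
            if tag == LEFT:
                left_values.append(_value)
            else:
                # Each right row joins with every buffered left row.
                for _left in left_values:
                    count += 1
        if count:  # inner join: skip keys missing from either side
            yield key, count


src1 = [("a", 1), ("b", 2), ("c", 3)]
src = [("a", 10), ("a", 11), ("b", 20), ("d", 40)]
result = list(fused_join_groupby_reducer(map_phase(src1, src)))
```

If the group-by key differed from the join key, the rows for one group could be spread across reducers and this shortcut would no longer apply.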

Does anyone have suggestions on where to start in the code? From Hive's architecture overview, it seems we need to change the Query Plan Generator in the compiler so that it generates different map-reduce tasks for queries that include a Join followed by a Group By. We are thinking of starting by modifying src/ql/src/java/org/apache/hadoop/hive/ql/QueryPlan.java, but we aren't sure this is the right approach. Any input on how we should approach this would be great!
                
> optimize join followed by a groupby
> -----------------------------------
>
>                 Key: HIVE-1772
>                 URL: https://issues.apache.org/jira/browse/HIVE-1772
>             Project: Hive
>          Issue Type: Improvement
>          Components: Query Processor
>            Reporter: Namit Jain
>            Assignee: Navis
>         Attachments: HIVE-1772.1.patch
>
>
> explain SELECT x.key, count(1) FROM src1 x JOIN src y ON (x.key = y.key) group by x.key;
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-2 depends on stages: Stage-1
>   Stage-0 is a root stage
> The above query issues 2 map-reduce jobs. 
> The first MR job performs the join, whereas the second MR performs the group by.
> Since the data is already sorted, the group by can be performed in the reducer of the join itself.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira