
Posted to dev@hive.apache.org by "Rui Li (JIRA)" <ji...@apache.org> on 2014/08/14 11:14:12 UTC

[jira] [Commented] (HIVE-7659) Unnecessary sort in query plan

    [ https://issues.apache.org/jira/browse/HIVE-7659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14096774#comment-14096774 ] 

Rui Li commented on HIVE-7659:
------------------------------

After some research, I found that the unnecessary sort is mainly introduced when we generate the GBY operator. This patch ignores the sort order in the RS if the partition keys, sorting keys, and grouping keys are the same. Otherwise, e.g. in the case of DISTINCT or data skew, we apply the sort shuffle according to the sort order so that the query produces correct results.
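
The rule described above can be sketched roughly as follows. This is an illustration only, not the code in the attached patch: ReduceSinkInfo, needs_sort_shuffle, and the has_distinct / skew flags are hypothetical names, and the real RS descriptor in Hive is structured differently.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ReduceSinkInfo:
    """Hypothetical, simplified view of an RS (not Hive's real descriptor)."""
    partition_keys: Sequence[str]
    sort_keys: Sequence[str]
    sort_order: str                    # e.g. "+" per sort key, "" if unset
    grouping_keys: Sequence[str] = ()  # keys of the downstream GBY, if any
    has_distinct: bool = False         # assumed flag: query uses DISTINCT
    skew: bool = False                 # assumed flag: skew handling enabled


def needs_sort_shuffle(rs: ReduceSinkInfo) -> bool:
    """Return True when the shuffle must honour the RS sort order."""
    if not rs.sort_order:
        return False  # no order requested at all
    if rs.has_distinct or rs.skew:
        return True   # these cases depend on sorted input
    # A plain group-by: partition, sort and grouping keys all coincide,
    # so the '+' order is incidental and can be ignored.
    same_keys = bool(rs.grouping_keys) and (
        list(rs.partition_keys) == list(rs.sort_keys) == list(rs.grouping_keys)
    )
    return not same_keys


# select key from table group by key
group_by = ReduceSinkInfo(["key"], ["key"], "+", grouping_keys=["key"])
# select key from table order by key (no GBY downstream)
order_by = ReduceSinkInfo([], ["key"], "+")
# select key, count(distinct value) from table group by key
distinct = ReduceSinkInfo(["key"], ["key", "value"], "++",
                          grouping_keys=["key"], has_distinct=True)

print(needs_sort_shuffle(group_by), needs_sort_shuffle(order_by),
      needs_sort_shuffle(distinct))
```

Under these assumptions only the plain group-by skips the sort shuffle; the ORDER BY and DISTINCT cases still get one.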

> Unnecessary sort in query plan
> ------------------------------
>
>                 Key: HIVE-7659
>                 URL: https://issues.apache.org/jira/browse/HIVE-7659
>             Project: Hive
>          Issue Type: Improvement
>          Components: Spark
>            Reporter: Rui Li
>            Assignee: Rui Li
>         Attachments: HIVE-7659-spark.patch
>
>
> For Hive on Spark.
> Currently we rely on the sort order in the RS to decide whether we need a sortByKey transformation. However, a simple group-by query will also have the sort order set to '+'.
> Consider the query: select key from table group by key. The RS in the map work will have its sort order set to '+', thus requiring a sortByKey shuffle.
> To avoid the unnecessary sort, we should either find another way to decide whether a sort shuffle is needed, or set the sort order only when a sort is really needed.
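
To illustrate why the sort is unnecessary for the example query, here is a small plain-Python sketch (no Spark involved; the function names and the two-partition setup are made up for the illustration). Grouping keys by hash partition yields the same set of groups as sorting first and then collapsing adjacent duplicates; only the output order differs, and a bare group by does not promise any order.

```python
from collections import defaultdict
from itertools import groupby


def group_keys_by_hash(rows, num_partitions=2):
    """Group without sorting: route each key to a partition by its hash."""
    partitions = defaultdict(set)
    for key in rows:
        partitions[hash(key) % num_partitions].add(key)
    # Concatenate partitions; the overall order is arbitrary.
    return [key for part in partitions.values() for key in part]


def group_keys_by_sort(rows):
    """Group with a full sort, then collapse runs of equal keys."""
    return [key for key, _ in groupby(sorted(rows))]


rows = ["b", "a", "c", "a", "b", "a"]
unsorted_groups = group_keys_by_hash(rows)
sorted_groups = group_keys_by_sort(rows)

# Same groups either way; only the ordering guarantee differs.
print(set(unsorted_groups) == set(sorted_groups))
```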



--
This message was sent by Atlassian JIRA
(v6.2#6252)