Posted to issues@tajo.apache.org by "Hyunsik Choi (JIRA)" <ji...@apache.org> on 2014/02/18 13:02:20 UTC

[jira] [Commented] (TAJO-601) Improve distinct aggregation query processing

    [ https://issues.apache.org/jira/browse/TAJO-601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13903986#comment-13903986 ] 

Hyunsik Choi commented on TAJO-601:
-----------------------------------

Created a review request against branch master in reviewboard 


> Improve distinct aggregation query processing
> ---------------------------------------------
>
>                 Key: TAJO-601
>                 URL: https://issues.apache.org/jira/browse/TAJO-601
>             Project: Tajo
>          Issue Type: Improvement
>          Components: planner/optimizer
>            Reporter: Hyunsik Choi
>            Assignee: Hyunsik Choi
>             Fix For: 0.8-incubating
>
>         Attachments: TAJO-601.patch
>
>
> Currently, distinct aggregation queries are executed in two stages:
> * the first stage only shuffles tuples by hashing the grouping keys.
> * the second stage sorts them and performs sort-based aggregation.
> This approach executes queries containing distinct aggregation functions in only two stages, but because the first stage shuffles tuples without any pre-aggregation, it produces a large amount of intermediate data during the shuffle phase.
> This kind of query can be rewritten as a nested query with two aggregation steps:
> {code:title=original query}
> SELECT grp1, grp2, count(*) as total, count(distinct grp3) as distinct_col from rel1 group by grp1, grp2;
> {code}
> {code:title=rewritten query}
> SELECT grp1, grp2, sum(cnt) as total, count(grp3) as distinct_col from (
>   SELECT grp1, grp2, grp3, count(*) as cnt from rel1 group by grp1, grp2, grp3
> ) tmp1 group by grp1, grp2;
> {code}
> I expect this rewrite to significantly reduce the intermediate data volume and query response time in most cases.
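
The equivalence of the two query forms above can be checked on a small data set. The following is a minimal sketch, assuming SQLite (through Python's standard sqlite3 module) as a stand-in engine; it does not exercise Tajo's planner or shuffle. The sample rows, the run_queries helper, and the ORDER BY clauses (added only to make the comparison deterministic) are hypothetical and not part of the issue. It also counts the rows produced by the inner GROUP BY, which approximates what would be shuffled after pre-aggregation.

```python
import sqlite3

# Hypothetical sample data; table and column names follow the issue's example.
ROWS = [
    ("a", "x", 1),
    ("a", "x", 1),
    ("a", "x", 2),
    ("a", "y", 3),
    ("b", "x", None),  # NULL is ignored by count(grp3) and count(DISTINCT grp3)
    ("b", "x", 4),
    ("b", "x", 4),
]

ORIGINAL = """
SELECT grp1, grp2, count(*) AS total, count(DISTINCT grp3) AS distinct_col
FROM rel1
GROUP BY grp1, grp2
ORDER BY grp1, grp2
"""

REWRITTEN = """
SELECT grp1, grp2, sum(cnt) AS total, count(grp3) AS distinct_col
FROM (
  SELECT grp1, grp2, grp3, count(*) AS cnt
  FROM rel1
  GROUP BY grp1, grp2, grp3
) tmp1
GROUP BY grp1, grp2
ORDER BY grp1, grp2
"""

# Number of rows left after the inner aggregation step.
INNER_ROWS = """
SELECT count(*) FROM (SELECT 1 FROM rel1 GROUP BY grp1, grp2, grp3)
"""


def run_queries(rows=ROWS):
    """Run both query forms on an in-memory table and return their results."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE rel1 (grp1 TEXT, grp2 TEXT, grp3 INTEGER)")
    conn.executemany("INSERT INTO rel1 VALUES (?, ?, ?)", rows)
    original = conn.execute(ORIGINAL).fetchall()
    rewritten = conn.execute(REWRITTEN).fetchall()
    inner_count = conn.execute(INNER_ROWS).fetchone()[0]
    conn.close()
    return original, rewritten, inner_count


if __name__ == "__main__":
    original, rewritten, inner_count = run_queries()
    print("original :", original)
    print("rewritten:", rewritten)
    print("input rows:", len(ROWS), "rows after inner aggregation:", inner_count)
```

On this sample both forms return the same rows, and the inner aggregation leaves fewer rows than the input. How large the reduction is in practice depends on how often (grp1, grp2, grp3) combinations repeat in the data.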



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)