Posted to issues@tajo.apache.org by "Hyunsik Choi (JIRA)" <ji...@apache.org> on 2014/02/14 13:10:20 UTC
[jira] [Updated] (TAJO-601) Improve distinct aggregation query processing

     [ https://issues.apache.org/jira/browse/TAJO-601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyunsik Choi updated TAJO-601:
------------------------------

    Description: 
Currently, distinct aggregation queries are executed as follows:
* the first stage: it just shuffles tuples by hashing grouping keys.
* the second stage: it sorts them and executes sort aggregation.

This approach executes queries that include distinct aggregation functions in only two stages. However, because tuples are shuffled without any pre-aggregation, it produces a large amount of intermediate data during the shuffle phase.
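
The two stages above can be sketched in plain Python. This is only an illustration under stated assumptions, not Tajo's actual executor: the function names, the three-column tuple layout (grp1, grp2, grp3), and the absence of NULL values are all hypothetical choices made for the example. It shows that stage 1 forwards every input tuple, so the shuffled volume equals the input size.

```python
from itertools import groupby


def shuffle_by_group_key(rows, num_partitions):
    """Stage 1 (hypothetical sketch): route every tuple by hashing (grp1, grp2)."""
    partitions = [[] for _ in range(num_partitions)]
    for grp1, grp2, grp3 in rows:
        target = hash((grp1, grp2)) % num_partitions
        partitions[target].append((grp1, grp2, grp3))
    return partitions


def sort_aggregate(partition):
    """Stage 2 (hypothetical sketch): sort a partition, then aggregate each key run.

    Returns {(grp1, grp2): (count(*), count(distinct grp3))}.
    Assumes grp3 is never NULL/None.
    """
    result = {}
    ordered = sorted(partition)  # sorts by grp1, grp2, then grp3
    for key, run in groupby(ordered, key=lambda r: (r[0], r[1])):
        total = 0
        distinct = 0
        previous = object()  # sentinel that never equals a real value
        for _, _, grp3 in run:
            total += 1
            if grp3 != previous:  # sorted input: a new value starts a new run
                distinct += 1
                previous = grp3
        result[key] = (total, distinct)
    return result


rows = [
    ("a", "x", 1), ("a", "x", 1), ("a", "x", 2),
    ("a", "y", 3),
    ("b", "x", 1), ("b", "x", 1),
]
partitions = shuffle_by_group_key(rows, num_partitions=2)
shuffled_tuples = sum(len(p) for p in partitions)

merged = {}
for p in partitions:
    merged.update(sort_aggregate(p))

print(shuffled_tuples)  # equals len(rows): nothing is reduced before the shuffle
print(merged)
```

Since all tuples with the same (grp1, grp2) land in the same partition, the per-partition results can be merged without conflicts.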

This kind of query can be rewritten as two nested aggregation queries:

{code:title=original query}
SELECT grp1, grp2, count(*) as total, count(distinct grp3) as distinct_col from rel1 group by grp1, grp2;
{code}

{code:title=rewritten query}
SELECT grp1, grp2, sum(cnt) as total, count(grp3) as distinct_col from (
  SELECT grp1, grp2, grp3, count(*) as cnt from rel1 group by grp1, grp2, grp3
) tmp1 group by grp1, grp2;
{code}
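
As a quick sanity check of the rewrite, the two queries can be compared on a small in-memory table. This sketch uses Python's standard sqlite3 module rather than Tajo, so it only checks the SQL semantics, not Tajo's planner; the sample rows, the column types, and the helper name run_both are assumptions made for the example. It also counts the rows produced by the inner query, which is what the second aggregation level would consume.

```python
import sqlite3

ORIGINAL = """
SELECT grp1, grp2, count(*) AS total, count(DISTINCT grp3) AS distinct_col
FROM rel1 GROUP BY grp1, grp2
"""

REWRITTEN = """
SELECT grp1, grp2, sum(cnt) AS total, count(grp3) AS distinct_col FROM (
  SELECT grp1, grp2, grp3, count(*) AS cnt FROM rel1 GROUP BY grp1, grp2, grp3
) tmp1 GROUP BY grp1, grp2
"""

INNER_ROW_COUNT = """
SELECT count(*) FROM (
  SELECT grp1, grp2, grp3 FROM rel1 GROUP BY grp1, grp2, grp3
) tmp1
"""


def run_both(rows):
    """Run both queries on the same data; return (original, rewritten, inner_rows)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE rel1 (grp1 TEXT, grp2 TEXT, grp3 INTEGER)")
    conn.executemany("INSERT INTO rel1 VALUES (?, ?, ?)", rows)
    original = sorted(conn.execute(ORIGINAL).fetchall())
    rewritten = sorted(conn.execute(REWRITTEN).fetchall())
    inner_rows = conn.execute(INNER_ROW_COUNT).fetchone()[0]
    conn.close()
    return original, rewritten, inner_rows


# Hypothetical sample data; one row has a NULL grp3 to exercise that case.
rows = [
    ("a", "x", 1), ("a", "x", 1), ("a", "x", 2), ("a", "x", None),
    ("a", "y", 3),
    ("b", "x", 1), ("b", "x", 1),
]
original, rewritten, inner_rows = run_both(rows)
print(original == rewritten)  # True
print(len(rows), inner_rows)  # 7 input rows, 5 rows after the inner aggregation
```

The NULL row is counted by count(*) and by sum(cnt), but skipped by both count(distinct grp3) and count(grp3), so the two forms agree on it as well.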

I expect this rewrite to significantly reduce both the intermediate data volume and the query response time in most cases.

  was:
Currently, distinct aggregation queries are executed as follows:
* the first stage: it just shuffles tuples by hashing grouping keys.
* the second stage: it sorts them and executes sort aggregation.

This way executes queries including distinct aggregation functions with only two stages. But, it leads to large intermediate data during shuffle phase.

This kind of query can be rewritten as two queries:

{code:title=original query}
SELECT grp1, grp2, count(*) as total, count(distinct grp3) as distinct_col from rel1 group by grp1, grp2;
{code}

{code:title=rewritten query}
SELECT grp1, grp2, sum(cnt) as total, count(grp3) as distinct_col from (
  SELECT grp1, grp2, grp3, count(*) as cnt from rel1 group by grp1, grp2, grp3) tmp1 group by grp1, grp2;
{code}

I'm expecting that this rewrite will significantly reduce the intermediate data volume and query response time in most cases.


> Improve distinct aggregation query processing
> ---------------------------------------------
>
>                 Key: TAJO-601
>                 URL: https://issues.apache.org/jira/browse/TAJO-601
>             Project: Tajo
>          Issue Type: Improvement
>          Components: planner/optimizer
>            Reporter: Hyunsik Choi
>            Assignee: Hyunsik Choi
>             Fix For: 0.8-incubating


