You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@doris.apache.org by 寒江雪 <ya...@gmail.com> on 2019/10/25 02:12:44 UTC
Re: [Proposal] Enhanced Aggregation, Support Grouping Sets, Rollup andCube to extend group by statement

the syntax i have posted is  the same as sqlserver, oracle, snowflake,
presto, and postgresql. The syntax of hive is different from them, but I
think we can be compatible with these two ways.

流金岁月 <50...@qq.com> 于2019年10月24日周四 下午5:00写道：

> useful feature.
>
>
> But I think the grammar of Hive may be a reference, it is a bit different
> from your case:
> SELECT a, b, SUM(c) FROM tab1 GROUP BY a, b GROUPING SETS ( (a,b) )
>
>
> ref:
> https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation%2C+Cube%2C+Grouping+and+Rollup
>
>
>
>
> ------------------ 原始邮件 ------------------
> 发件人: "寒江雪"<ya...@gmail.com>;
> 发送时间: 2019年10月24日(星期四) 中午11:30
> 收件人: "dev"<de...@doris.apache.org>;
>
> 主题: [Proposal] Enhanced Aggregation, Support Grouping Sets, Rollup andCube
> to extend group by statement
>
>
>
> The CUBE, ROLLUP, and GROUPING SETS  extensions to SQL make querying and
> reporting easier and faster. CUBE, ROLLUP, and grouping sets produce a
> single result set that is equivalent to a UNION ALL of differently grouped
> rows.
>
> The three GROUPING functions  make aggregate calculations more efficient,
> thereby enhancing database performance, and scalability.
>
> GROUPING SETS clause
>
> GROUPING SETS syntax lets you define multiple groupings in the same
> query. GROUP
> BY computes all the groupings specified and combines them with UNION ALL.
> For example, consider the following statement:
>
> SELECT a, b, SUM( c ) FROM tab1 GROUP BY GROUPING SETS ( (a, b), (a),
> (b), ( ) );
>
> This statement is equivalent to:
>
> SELECT a, b, SUM( c ) FROM tab1 GROUP BY a, b
> UNION
> SELECT a, null, SUM( c ) FROM tab1 GROUP BY a
> UNION
> SELECT null, b, SUM( c ) FROM tab1 GROUP BY b
> UNION
> SELECT null, null, SUM( c ) FROM tab1
>
> This is an example of real query：
>
> mysql> SELECT * FROM t;
> +------+------+------+
> | k1   | k2   | k3   |
> +------+------+------+
> | a    | A    |    1 |
> | a    | A    |    2 |
> | a    | B    |    1 |
> | a    | B    |    3 |
> | b    | A    |    1 |
> | b    | A    |    4 |
> | b    | B    |    1 |
> | b    | B    |    5 |
> +------+------+------+
> 8 rows in set (0.01 sec)
>
> mysql> SELECT k1, k2, SUM(k3) FROM t GROUP BY GROUPING SETS ( (k1,
> k2), (k2), (k1), ( ) );
> +------+------+-----------+
> | k1   | k2   | sum(`k3`) |
> +------+------+-----------+
> | b    | B    |         6 |
> | a    | B    |         4 |
> | a    | A    |         3 |
> | b    | A    |         5 |
> | NULL | B    |        10 |
> | NULL | A    |         8 |
> | a    | NULL |         7 |
> | b    | NULL |        11 |
> | NULL | NULL |        18 |
> +------+------+-----------+
> 9 rows in set (0.06 sec)
>
> GROUPING SETS Syntax
>
> SELECT ...
> FROM ...
> [ ... ]
> GROUP BY GROUPING SETS ( groupSet [ , groupSet [ , ... ] ] )
> [ ... ]
>
> groupSet ::= { ( expr  [ , expr [ , ... ] ] )}
>
> <expr>
> Expression，column name.
>
>
> *ROLLUP clause*
>
> ROLLUP enables a SELECT statement to calculate multiple levels of subtotals
> across a specified group of dimensions. It also calculates a grand total.
> ROLLUP is a simple extension to the GROUP BY clause, so its syntax is
> extremely easy to use. The ROLLUP extension is highly efficient, adding
> minimal overhead to a query.
>
> ROLLUP appears in the GROUP BY clause in a SELECT statement. Its form is:
>
> ROLLUP ( e1, e2, e3, ... )
>
> This statement is equivalent to GROUPING SETS as followed：
>
> GROUPING SETS (
> ( e1, e2, e3, ... ),
> ...
> ( e1, e2 ),
> ( e1 ),
> ( )
> )
>
>  ROLLUP Syntax
>
> SELECT ...
> FROM ...
> [ ... ]
> GROUP BY ROLLUP ( expr  [ , expr [ , ... ] ] )
> [ ... ]
>
> <expr>
> Expression，column name.
>
> *CUBE clause*
>
>
> Like ROLLUP CUBE generates all the subtotals that could be calculated for a
> data cube with the specified dimensions.
>
> CUBE ( e1, e2, e3, ... )
>
> e.g. CUBE ( a, b, c ) is equivalent to GROUPING SETS as followed:
>
> GROUPING SETS (
> ( a, b, c ),
> ( a, b ),
> ( a,    c ),
> ( a       ),
> (    b, c ),
> (    b    ),
> (       c ),
> (         )
> )
>
> CUBE Syntax
>
> SELECT ...
> FROM ...
> [ ... ]
> GROUP BY CUBE ( expr  [ , expr [ , ... ] ] )
> [ ... ]
>
> <expr>
> Expression，column name.
>
> GROUPING_ID() Function
>
> GROUPING_ID describes which of a list of expressions are grouped in a row
> produced by a GROUP BY query. GROUPING_ID () returns an integer bitmap
> whose lowest N bits may be lit. A lit bit indicates the corresponding
> argument is not a grouping column for the given output row. The
> lowest-order bit corresponds to argument N, and the N-1th lowest-order
> bit corresponds
> to argument 1. If the column is a grouping column the bit is 0 else is 1.
>
> For example：
>
> mysql> select * from t;
> +------+------+------+
> | k1   | k2   | k3   |
> +------+------+------+
> | a    | A    |    1 |
> | a    | A    |    2 |
> | a    | B    |    1 |
> | a    | B    |    3 |
> | b    | A    |    1 |
> | b    | A    |    4 |
> | b    | B    |    1 |
> | b    | B    |    5 |
> +------+------+------+
>
> grouping sets result：
>
> mysql> SELECT k1, k2, GROUPING_ID(), SUM(k3) FROM t GROUP BY GROUPING
> SETS ( (k1, k2), (k2), (k1), ( ) );
> +------+------+---------------+-----------+
> | k1   | k2   | grouping_id() | sum(`k3`) |
> +------+------+---------------+-----------+
> | a    | A    |             0 |         3 |
> | a    | B    |             0 |         4 |
> | a    | NULL |             1 |         7 |
> | b    | A    |             0 |         5 |
> | b    | B    |             0 |         6 |
> | b    | NULL |             1 |        11 |
> | NULL | A    |             2 |         8 |
> | NULL | B    |             2 |        10 |
> | NULL | NULL |             3 |        18 |
> +------+------+---------------+-----------+
> 9 rows in set (0.02 sec)
>
>
> *Solution*
>
> For GROUPING SET is equivalent to the UNION of GROUP BY . So we can expand
> input rows, and run an GROUP BY on these rows。
>
> For example：
>
> SELECT a, b FROM src GROUP BY a, b GROUPING SETS ((a, b), (a), (b), ());
>
> Data in table src :
>
> 1, 2
> 3, 4
>
> Base on GROUPING SETS , we can expend the input to：
>
> 1, 2       (GROUPING_ID: a, b -> 00 -> 0)
> 1, null    (GROUPING_ID: a, null -> 01 -> 1)
> null, 2    (GROUPING_ID: null, b -> 10 -> 2)
> null, null (GROUPING_ID: null, null -> 11 -> 3)
>
> 3, 4       (GROUPING_ID: a, b -> 00 -> 0)
> 3, null    (GROUPING_ID: a, null -> 01 -> 1)
> null, 4    (GROUPING_ID: null, b -> 10 -> 2)
> null, null (GROUPING_ID: null, null -> 11 -> 3)
>
> And then use those row as input, then GROUP BY a, b, GROUPING_ID