You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@doris.apache.org by 寒江雪 <ya...@gmail.com> on 2019/10/24 03:30:52 UTC

[Proposal] Enhanced Aggregation, Support Grouping Sets, Rollup and Cube to extend group by statement

The CUBE, ROLLUP, and GROUPING SETS  extensions to SQL make querying and
reporting easier and faster. CUBE, ROLLUP, and grouping sets produce a
single result set that is equivalent to a UNION ALL of differently grouped
rows.

The three GROUPING functions  make aggregate calculations more efficient,
thereby enhancing database performance, and scalability.

GROUPING SETS clause

GROUPING SETS syntax lets you define multiple groupings in the same
query. GROUP
BY computes all the groupings specified and combines them with UNION ALL.
For example, consider the following statement:

SELECT a, b, SUM( c ) FROM tab1 GROUP BY GROUPING SETS ( (a, b), (a),
(b), ( ) );

This statement is equivalent to:

SELECT a, b, SUM( c ) FROM tab1 GROUP BY a, b
UNION
SELECT a, null, SUM( c ) FROM tab1 GROUP BY a
UNION
SELECT null, b, SUM( c ) FROM tab1 GROUP BY b
UNION
SELECT null, null, SUM( c ) FROM tab1

This is an example of real query：

mysql> SELECT * FROM t;
+------+------+------+
| k1   | k2   | k3   |
+------+------+------+
| a    | A    |    1 |
| a    | A    |    2 |
| a    | B    |    1 |
| a    | B    |    3 |
| b    | A    |    1 |
| b    | A    |    4 |
| b    | B    |    1 |
| b    | B    |    5 |
+------+------+------+
8 rows in set (0.01 sec)

mysql> SELECT k1, k2, SUM(k3) FROM t GROUP BY GROUPING SETS ( (k1,
k2), (k2), (k1), ( ) );
+------+------+-----------+
| k1   | k2   | sum(`k3`) |
+------+------+-----------+
| b    | B    |         6 |
| a    | B    |         4 |
| a    | A    |         3 |
| b    | A    |         5 |
| NULL | B    |        10 |
| NULL | A    |         8 |
| a    | NULL |         7 |
| b    | NULL |        11 |
| NULL | NULL |        18 |
+------+------+-----------+
9 rows in set (0.06 sec)

GROUPING SETS Syntax

SELECT ...
FROM ...
[ ... ]
GROUP BY GROUPING SETS ( groupSet [ , groupSet [ , ... ] ] )
[ ... ]

groupSet ::= { ( expr  [ , expr [ , ... ] ] )}

<expr>
Expression，column name.


*ROLLUP clause*

ROLLUP enables a SELECT statement to calculate multiple levels of subtotals
across a specified group of dimensions. It also calculates a grand total.
ROLLUP is a simple extension to the GROUP BY clause, so its syntax is
extremely easy to use. The ROLLUP extension is highly efficient, adding
minimal overhead to a query.

ROLLUP appears in the GROUP BY clause in a SELECT statement. Its form is:

ROLLUP ( e1, e2, e3, ... )

This statement is equivalent to GROUPING SETS as followed：

GROUPING SETS (
( e1, e2, e3, ... ),
...
( e1, e2 ),
( e1 ),
( )
)

 ROLLUP Syntax

SELECT ...
FROM ...
[ ... ]
GROUP BY ROLLUP ( expr  [ , expr [ , ... ] ] )
[ ... ]

<expr>
Expression，column name.

*CUBE clause*


Like ROLLUP CUBE generates all the subtotals that could be calculated for a
data cube with the specified dimensions.

CUBE ( e1, e2, e3, ... )

e.g. CUBE ( a, b, c ) is equivalent to GROUPING SETS as followed:

GROUPING SETS (
( a, b, c ),
( a, b ),
( a,    c ),
( a       ),
(    b, c ),
(    b    ),
(       c ),
(         )
)

CUBE Syntax

SELECT ...
FROM ...
[ ... ]
GROUP BY CUBE ( expr  [ , expr [ , ... ] ] )
[ ... ]

<expr>
Expression，column name.

GROUPING_ID() Function

GROUPING_ID describes which of a list of expressions are grouped in a row
produced by a GROUP BY query. GROUPING_ID () returns an integer bitmap
whose lowest N bits may be lit. A lit bit indicates the corresponding
argument is not a grouping column for the given output row. The
lowest-order bit corresponds to argument N, and the N-1th lowest-order
bit corresponds
to argument 1. If the column is a grouping column the bit is 0 else is 1.

For example：

mysql> select * from t;
+------+------+------+
| k1   | k2   | k3   |
+------+------+------+
| a    | A    |    1 |
| a    | A    |    2 |
| a    | B    |    1 |
| a    | B    |    3 |
| b    | A    |    1 |
| b    | A    |    4 |
| b    | B    |    1 |
| b    | B    |    5 |
+------+------+------+

grouping sets result：

mysql> SELECT k1, k2, GROUPING_ID(), SUM(k3) FROM t GROUP BY GROUPING
SETS ( (k1, k2), (k2), (k1), ( ) );
+------+------+---------------+-----------+
| k1   | k2   | grouping_id() | sum(`k3`) |
+------+------+---------------+-----------+
| a    | A    |             0 |         3 |
| a    | B    |             0 |         4 |
| a    | NULL |             1 |         7 |
| b    | A    |             0 |         5 |
| b    | B    |             0 |         6 |
| b    | NULL |             1 |        11 |
| NULL | A    |             2 |         8 |
| NULL | B    |             2 |        10 |
| NULL | NULL |             3 |        18 |
+------+------+---------------+-----------+
9 rows in set (0.02 sec)


*Solution*

For GROUPING SET is equivalent to the UNION of GROUP BY . So we can expand
input rows, and run an GROUP BY on these rows。

For example：

SELECT a, b FROM src GROUP BY a, b GROUPING SETS ((a, b), (a), (b), ());

Data in table src :

1, 2
3, 4

Base on GROUPING SETS , we can expend the input to：

1, 2       (GROUPING_ID: a, b -> 00 -> 0)
1, null    (GROUPING_ID: a, null -> 01 -> 1)
null, 2    (GROUPING_ID: null, b -> 10 -> 2)
null, null (GROUPING_ID: null, null -> 11 -> 3)

3, 4       (GROUPING_ID: a, b -> 00 -> 0)
3, null    (GROUPING_ID: a, null -> 01 -> 1)
null, 4    (GROUPING_ID: null, b -> 10 -> 2)
null, null (GROUPING_ID: null, null -> 11 -> 3)

And then use those row as input, then GROUP BY a, b, GROUPING_ID

Re: [Proposal] Enhanced Aggregation, Support Grouping Sets, Rollup andCube to extend group by statement

Posted by 寒江雪 <ya...@gmail.com>.

the syntax i have posted is  the same as sqlserver, oracle, snowflake,
presto, and postgresql. The syntax of hive is different from them, but I
think we can be compatible with these two ways.

流金岁月 <50...@qq.com> 于2019年10月24日周四 下午5:00写道：

> useful feature.
>
>
> But I think the grammar of Hive may be a reference, it is a bit different
> from your case:
> SELECT a, b, SUM(c) FROM tab1 GROUP BY a, b GROUPING SETS ( (a,b) )
>
>
> ref:
> https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation%2C+Cube%2C+Grouping+and+Rollup
>
>
>
>
> ------------------ 原始邮件 ------------------
> 发件人: "寒江雪"<ya...@gmail.com>;
> 发送时间: 2019年10月24日(星期四) 中午11:30
> 收件人: "dev"<de...@doris.apache.org>;
>
> 主题: [Proposal] Enhanced Aggregation, Support Grouping Sets, Rollup andCube
> to extend group by statement
>
>
>
> The CUBE, ROLLUP, and GROUPING SETS  extensions to SQL make querying and
> reporting easier and faster. CUBE, ROLLUP, and grouping sets produce a
> single result set that is equivalent to a UNION ALL of differently grouped
> rows.
>
> The three GROUPING functions  make aggregate calculations more efficient,
> thereby enhancing database performance, and scalability.
>
> GROUPING SETS clause
>
> GROUPING SETS syntax lets you define multiple groupings in the same
> query. GROUP
> BY computes all the groupings specified and combines them with UNION ALL.
> For example, consider the following statement:
>
> SELECT a, b, SUM( c ) FROM tab1 GROUP BY GROUPING SETS ( (a, b), (a),
> (b), ( ) );
>
> This statement is equivalent to:
>
> SELECT a, b, SUM( c ) FROM tab1 GROUP BY a, b
> UNION
> SELECT a, null, SUM( c ) FROM tab1 GROUP BY a
> UNION
> SELECT null, b, SUM( c ) FROM tab1 GROUP BY b
> UNION
> SELECT null, null, SUM( c ) FROM tab1
>
> This is an example of real query：
>
> mysql> SELECT * FROM t;
> +------+------+------+
> | k1   | k2   | k3   |
> +------+------+------+
> | a    | A    |    1 |
> | a    | A    |    2 |
> | a    | B    |    1 |
> | a    | B    |    3 |
> | b    | A    |    1 |
> | b    | A    |    4 |
> | b    | B    |    1 |
> | b    | B    |    5 |
> +------+------+------+
> 8 rows in set (0.01 sec)
>
> mysql> SELECT k1, k2, SUM(k3) FROM t GROUP BY GROUPING SETS ( (k1,
> k2), (k2), (k1), ( ) );
> +------+------+-----------+
> | k1   | k2   | sum(`k3`) |
> +------+------+-----------+
> | b    | B    |         6 |
> | a    | B    |         4 |
> | a    | A    |         3 |
> | b    | A    |         5 |
> | NULL | B    |        10 |
> | NULL | A    |         8 |
> | a    | NULL |         7 |
> | b    | NULL |        11 |
> | NULL | NULL |        18 |
> +------+------+-----------+
> 9 rows in set (0.06 sec)
>
> GROUPING SETS Syntax
>
> SELECT ...
> FROM ...
> [ ... ]
> GROUP BY GROUPING SETS ( groupSet [ , groupSet [ , ... ] ] )
> [ ... ]
>
> groupSet ::= { ( expr  [ , expr [ , ... ] ] )}
>
> <expr>
> Expression，column name.
>
>
> *ROLLUP clause*
>
> ROLLUP enables a SELECT statement to calculate multiple levels of subtotals
> across a specified group of dimensions. It also calculates a grand total.
> ROLLUP is a simple extension to the GROUP BY clause, so its syntax is
> extremely easy to use. The ROLLUP extension is highly efficient, adding
> minimal overhead to a query.
>
> ROLLUP appears in the GROUP BY clause in a SELECT statement. Its form is:
>
> ROLLUP ( e1, e2, e3, ... )
>
> This statement is equivalent to GROUPING SETS as followed：
>
> GROUPING SETS (
> ( e1, e2, e3, ... ),
> ...
> ( e1, e2 ),
> ( e1 ),
> ( )
> )
>
>  ROLLUP Syntax
>
> SELECT ...
> FROM ...
> [ ... ]
> GROUP BY ROLLUP ( expr  [ , expr [ , ... ] ] )
> [ ... ]
>
> <expr>
> Expression，column name.
>
> *CUBE clause*
>
>
> Like ROLLUP CUBE generates all the subtotals that could be calculated for a
> data cube with the specified dimensions.
>
> CUBE ( e1, e2, e3, ... )
>
> e.g. CUBE ( a, b, c ) is equivalent to GROUPING SETS as followed:
>
> GROUPING SETS (
> ( a, b, c ),
> ( a, b ),
> ( a,    c ),
> ( a       ),
> (    b, c ),
> (    b    ),
> (       c ),
> (         )
> )
>
> CUBE Syntax
>
> SELECT ...
> FROM ...
> [ ... ]
> GROUP BY CUBE ( expr  [ , expr [ , ... ] ] )
> [ ... ]
>
> <expr>
> Expression，column name.
>
> GROUPING_ID() Function
>
> GROUPING_ID describes which of a list of expressions are grouped in a row
> produced by a GROUP BY query. GROUPING_ID () returns an integer bitmap
> whose lowest N bits may be lit. A lit bit indicates the corresponding
> argument is not a grouping column for the given output row. The
> lowest-order bit corresponds to argument N, and the N-1th lowest-order
> bit corresponds
> to argument 1. If the column is a grouping column the bit is 0 else is 1.
>
> For example：
>
> mysql> select * from t;
> +------+------+------+
> | k1   | k2   | k3   |
> +------+------+------+
> | a    | A    |    1 |
> | a    | A    |    2 |
> | a    | B    |    1 |
> | a    | B    |    3 |
> | b    | A    |    1 |
> | b    | A    |    4 |
> | b    | B    |    1 |
> | b    | B    |    5 |
> +------+------+------+
>
> grouping sets result：
>
> mysql> SELECT k1, k2, GROUPING_ID(), SUM(k3) FROM t GROUP BY GROUPING
> SETS ( (k1, k2), (k2), (k1), ( ) );
> +------+------+---------------+-----------+
> | k1   | k2   | grouping_id() | sum(`k3`) |
> +------+------+---------------+-----------+
> | a    | A    |             0 |         3 |
> | a    | B    |             0 |         4 |
> | a    | NULL |             1 |         7 |
> | b    | A    |             0 |         5 |
> | b    | B    |             0 |         6 |
> | b    | NULL |             1 |        11 |
> | NULL | A    |             2 |         8 |
> | NULL | B    |             2 |        10 |
> | NULL | NULL |             3 |        18 |
> +------+------+---------------+-----------+
> 9 rows in set (0.02 sec)
>
>
> *Solution*
>
> For GROUPING SET is equivalent to the UNION of GROUP BY . So we can expand
> input rows, and run an GROUP BY on these rows。
>
> For example：
>
> SELECT a, b FROM src GROUP BY a, b GROUPING SETS ((a, b), (a), (b), ());
>
> Data in table src :
>
> 1, 2
> 3, 4
>
> Base on GROUPING SETS , we can expend the input to：
>
> 1, 2       (GROUPING_ID: a, b -> 00 -> 0)
> 1, null    (GROUPING_ID: a, null -> 01 -> 1)
> null, 2    (GROUPING_ID: null, b -> 10 -> 2)
> null, null (GROUPING_ID: null, null -> 11 -> 3)
>
> 3, 4       (GROUPING_ID: a, b -> 00 -> 0)
> 3, null    (GROUPING_ID: a, null -> 01 -> 1)
> null, 4    (GROUPING_ID: null, b -> 10 -> 2)
> null, null (GROUPING_ID: null, null -> 11 -> 3)
>
> And then use those row as input, then GROUP BY a, b, GROUPING_ID

回复：[Proposal] Enhanced Aggregation, Support Grouping Sets, Rollup andCube to extend group by statement

Posted by 流金岁月 <50...@qq.com>.

useful feature.


But I think the grammar of Hive may be a reference, it is a bit different from your case:
SELECT a, b, SUM(c) FROM tab1 GROUP BY a, b GROUPING SETS ( (a,b) )


ref: https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation%2C+Cube%2C+Grouping+and+Rollup




------------------ 原始邮件 ------------------
发件人: "寒江雪"<ya...@gmail.com>;
发送时间: 2019年10月24日(星期四) 中午11:30
收件人: "dev"<de...@doris.apache.org>;

主题: [Proposal] Enhanced Aggregation, Support Grouping Sets, Rollup andCube to extend group by statement



The CUBE, ROLLUP, and GROUPING SETS  extensions to SQL make querying and
reporting easier and faster. CUBE, ROLLUP, and grouping sets produce a
single result set that is equivalent to a UNION ALL of differently grouped
rows.

The three GROUPING functions  make aggregate calculations more efficient,
thereby enhancing database performance, and scalability.

GROUPING SETS clause

GROUPING SETS syntax lets you define multiple groupings in the same
query. GROUP
BY computes all the groupings specified and combines them with UNION ALL.
For example, consider the following statement:

SELECT a, b, SUM( c ) FROM tab1 GROUP BY GROUPING SETS ( (a, b), (a),
(b), ( ) );

This statement is equivalent to:

SELECT a, b, SUM( c ) FROM tab1 GROUP BY a, b
UNION
SELECT a, null, SUM( c ) FROM tab1 GROUP BY a
UNION
SELECT null, b, SUM( c ) FROM tab1 GROUP BY b
UNION
SELECT null, null, SUM( c ) FROM tab1

This is an example of real query：

mysql> SELECT * FROM t;
+------+------+------+
| k1   | k2   | k3   |
+------+------+------+
| a    | A    |    1 |
| a    | A    |    2 |
| a    | B    |    1 |
| a    | B    |    3 |
| b    | A    |    1 |
| b    | A    |    4 |
| b    | B    |    1 |
| b    | B    |    5 |
+------+------+------+
8 rows in set (0.01 sec)

mysql> SELECT k1, k2, SUM(k3) FROM t GROUP BY GROUPING SETS ( (k1,
k2), (k2), (k1), ( ) );
+------+------+-----------+
| k1   | k2   | sum(`k3`) |
+------+------+-----------+
| b    | B    |         6 |
| a    | B    |         4 |
| a    | A    |         3 |
| b    | A    |         5 |
| NULL | B    |        10 |
| NULL | A    |         8 |
| a    | NULL |         7 |
| b    | NULL |        11 |
| NULL | NULL |        18 |
+------+------+-----------+
9 rows in set (0.06 sec)

GROUPING SETS Syntax

SELECT ...
FROM ...
[ ... ]
GROUP BY GROUPING SETS ( groupSet [ , groupSet [ , ... ] ] )
[ ... ]

groupSet ::= { ( expr  [ , expr [ , ... ] ] )}

<expr>
Expression，column name.


*ROLLUP clause*

ROLLUP enables a SELECT statement to calculate multiple levels of subtotals
across a specified group of dimensions. It also calculates a grand total.
ROLLUP is a simple extension to the GROUP BY clause, so its syntax is
extremely easy to use. The ROLLUP extension is highly efficient, adding
minimal overhead to a query.

ROLLUP appears in the GROUP BY clause in a SELECT statement. Its form is:

ROLLUP ( e1, e2, e3, ... )

This statement is equivalent to GROUPING SETS as followed：

GROUPING SETS (
( e1, e2, e3, ... ),
...
( e1, e2 ),
( e1 ),
( )
)

 ROLLUP Syntax

SELECT ...
FROM ...
[ ... ]
GROUP BY ROLLUP ( expr  [ , expr [ , ... ] ] )
[ ... ]

<expr>
Expression，column name.

*CUBE clause*


Like ROLLUP CUBE generates all the subtotals that could be calculated for a
data cube with the specified dimensions.

CUBE ( e1, e2, e3, ... )

e.g. CUBE ( a, b, c ) is equivalent to GROUPING SETS as followed:

GROUPING SETS (
( a, b, c ),
( a, b ),
( a,    c ),
( a       ),
(    b, c ),
(    b    ),
(       c ),
(         )
)

CUBE Syntax

SELECT ...
FROM ...
[ ... ]
GROUP BY CUBE ( expr  [ , expr [ , ... ] ] )
[ ... ]

<expr>
Expression，column name.

GROUPING_ID() Function

GROUPING_ID describes which of a list of expressions are grouped in a row
produced by a GROUP BY query. GROUPING_ID () returns an integer bitmap
whose lowest N bits may be lit. A lit bit indicates the corresponding
argument is not a grouping column for the given output row. The
lowest-order bit corresponds to argument N, and the N-1th lowest-order
bit corresponds
to argument 1. If the column is a grouping column the bit is 0 else is 1.

For example：

mysql> select * from t;
+------+------+------+
| k1   | k2   | k3   |
+------+------+------+
| a    | A    |    1 |
| a    | A    |    2 |
| a    | B    |    1 |
| a    | B    |    3 |
| b    | A    |    1 |
| b    | A    |    4 |
| b    | B    |    1 |
| b    | B    |    5 |
+------+------+------+

grouping sets result：

mysql> SELECT k1, k2, GROUPING_ID(), SUM(k3) FROM t GROUP BY GROUPING
SETS ( (k1, k2), (k2), (k1), ( ) );
+------+------+---------------+-----------+
| k1   | k2   | grouping_id() | sum(`k3`) |
+------+------+---------------+-----------+
| a    | A    |             0 |         3 |
| a    | B    |             0 |         4 |
| a    | NULL |             1 |         7 |
| b    | A    |             0 |         5 |
| b    | B    |             0 |         6 |
| b    | NULL |             1 |        11 |
| NULL | A    |             2 |         8 |
| NULL | B    |             2 |        10 |
| NULL | NULL |             3 |        18 |
+------+------+---------------+-----------+
9 rows in set (0.02 sec)


*Solution*

For GROUPING SET is equivalent to the UNION of GROUP BY . So we can expand
input rows, and run an GROUP BY on these rows。

For example：

SELECT a, b FROM src GROUP BY a, b GROUPING SETS ((a, b), (a), (b), ());

Data in table src :

1, 2
3, 4

Base on GROUPING SETS , we can expend the input to：

1, 2       (GROUPING_ID: a, b -> 00 -> 0)
1, null    (GROUPING_ID: a, null -> 01 -> 1)
null, 2    (GROUPING_ID: null, b -> 10 -> 2)
null, null (GROUPING_ID: null, null -> 11 -> 3)

3, 4       (GROUPING_ID: a, b -> 00 -> 0)
3, null    (GROUPING_ID: a, null -> 01 -> 1)
null, 4    (GROUPING_ID: null, b -> 10 -> 2)
null, null (GROUPING_ID: null, null -> 11 -> 3)

And then use those row as input, then GROUP BY a, b, GROUPING_ID