You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@doris.apache.org by GitBox <gi...@apache.org> on 2019/10/22 11:22:05 UTC

[GitHub] [incubator-doris] yangzhg opened a new issue #2039: Support Grouping Sets, Rollup and Cube to extend group by statement

yangzhg opened a new issue #2039: Support Grouping Sets, Rollup and Cube to extend group by statement
URL: https://github.com/apache/incubator-doris/issues/2039
 
 
   ## 1. GROUPING SETS Background
   
   The `CUBE`, `ROLLUP`, and `GROUPING` `SETS` extensions to SQL make querying and reporting easier and faster. `CUBE`, `ROLLUP`, and grouping sets produce a single result set that is equivalent to a `UNION` `ALL` of differently grouped rows. `ROLLUP` calculates aggregations such as `SUM`, `COUNT`, `MAX`, `MIN`, and `AVG` at increasing levels of aggregation, from the most detailed up to a grand total. `CUBE` is an extension similar to `ROLLUP`, enabling a single statement to calculate all possible combinations of aggregations. The `CUBE`, `ROLLUP`, and the `GROUPING` `SETS` extension lets you specify just the groupings needed in the `GROUP` `BY` clause. This allows efficient analysis across multiple dimensions without performing a `CUBE` operation. Computing a `CUBE` creates a heavy processing load, so replacing cubes with grouping sets can significantly increase performance.
   To enhance performance, `CUBE`, `ROLLUP`, and `GROUPING SETS` can be parallelized: multiple processes can simultaneously execute all of these statements. These capabilities make aggregate calculations more efficient, thereby enhancing database performance, and scalability.
   
   The three `GROUPING` functions help you identify the group each row belongs to and enable sorting subtotal rows and filtering results.
   
   ### 1.1 GROUPING SETS Syntax
   
   `GROUPING SETS` syntax lets you define multiple groupings in the same query. `GROUP BY` computes all the groupings specified and combines them with `UNION ALL`. For example, consider the following statement:
   
   ```
   SELECT a, b, SUM( c ) FROM tab1 GROUP BY GROUPING SETS ( (a, b), (a), (b), ( ) );
   ```
   
   This statement is equivalent to:
   
   ```
   SELECT a, b, SUM( c ) FROM tab1 GROUP BY a, b
   UNION
   SELECT a, null, SUM( c ) FROM tab1 GROUP BY a
   UNION
   SELECT null, b, SUM( c ) FROM tab1 GROUP BY b
   UNION
   SELECT null, null, SUM( c ) FROM tab1
   ```
   
   This is an example of real query:
   
   ```
   mysql> SELECT * FROM t;
   +------+------+------+
   | k1   | k2   | k3   |
   +------+------+------+
   | a    | A    |    1 |
   | a    | A    |    2 |
   | a    | B    |    1 |
   | a    | B    |    3 |
   | b    | A    |    1 |
   | b    | A    |    4 |
   | b    | B    |    1 |
   | b    | B    |    5 |
   +------+------+------+
   8 rows in set (0.01 sec)
   
   mysql> SELECT k1, k2, SUM(k3) FROM t GROUP BY GROUPING SETS ( (k1, k2), (k2), (k1), ( ) );
   +------+------+-----------+
   | k1   | k2   | sum(`k3`) |
   +------+------+-----------+
   | b    | B    |         6 |
   | a    | B    |         4 |
   | a    | A    |         3 |
   | b    | A    |         5 |
   | NULL | B    |        10 |
   | NULL | A    |         8 |
   | a    | NULL |         7 |
   | b    | NULL |        11 |
   | NULL | NULL |        18 |
   +------+------+-----------+
   9 rows in set (0.06 sec)
   ```
   
   ### 1.2 ROLLUP Syntax
   
   `ROLLUP` enables a `SELECT` statement to calculate multiple levels of subtotals across a specified group of dimensions. It also calculates a grand total. `ROLLUP` is a simple extension to the `GROUP` `BY` clause, so its syntax is extremely easy to use. The `ROLLUP` extension is highly efficient, adding minimal overhead to a query.
   
   `ROLLUP` appears in the `GROUP` `BY` clause in a `SELECT` statement. Its form is:
   
   ```
   ROLLUP ( e1, e2, e3, ... )
   ```
   
   This statement is equivalent to GROUPING SETS as followed:
   
   ```
   GROUPING SETS (
   ( e1, e2, e3, ... ),
   ...
   ( e1, e2 ),
   ( e1 ),
   ( )
   )
   ```
   
   ### 1.3 CUBE Syntax
   
   Like `ROLLUP`   `CUBE` generates all the subtotals that could be calculated for a data cube with the specified dimensions.
   
   ```
   CUBE ( e1, e2, e3, ... )
   ```
   
   e.g.  CUBE ( a, b, c )  is equivalent to GROUPING SETS as followed:
   
   ```
   GROUPING SETS (
   ( a, b, c ),
   ( a, b ),
   ( a,    c ),
   ( a       ),
   (    b, c ),
   (    b    ),
   (       c ),
   (         )
   )
   ```
   
   ### 1.4 GROUPING_ID() Function
   
   `GROUPING_ID` describes which of a list of expressions are grouped in a row produced by a `GROUP BY` query. The `GROUPING_ID` function simply returns the decimal equivalent of the binary value formed as a result of the concatenation of the values returned by the `GROUPING` functions.
   
   Each `GROUPING_ID` argument must be an element of the `GROUP BY` list. `GROUPING_ID ()` returns an **integer** bitmap whose lowest N bits may be lit. A lit **bit** indicates the corresponding argument is not a grouping column for the given output row. The lowest-order **bit** corresponds to argument N, and the N-1th lowest-order **bit** corresponds to argument 1. If the column is a grouping column the bit is 0 else is 1.
   
   For example:
   
   ```
   mysql> select * from t;
   +------+------+------+
   | k1   | k2   | k3   |
   +------+------+------+
   | a    | A    |    1 |
   | a    | A    |    2 |
   | a    | B    |    1 |
   | a    | B    |    3 |
   | b    | A    |    1 |
   | b    | A    |    4 |
   | b    | B    |    1 |
   | b    | B    |    5 |
   +------+------+------+
   ```
   
   grouping sets result:
   
   ```
   mysql> SELECT k1, k2, GROUPING_ID(), SUM(k3) FROM t GROUP BY GROUPING SETS ( (k1, k2), (k2), (k1), ( ) );
   +------+------+---------------+-----------+
   | k1   | k2   | grouping_id() | sum(`k3`) |
   +------+------+---------------+-----------+
   | a    | A    |             0 |         3 |
   | a    | B    |             0 |         4 |
   | a    | NULL |             1 |         7 |
   | b    | A    |             0 |         5 |
   | b    | B    |             0 |         6 |
   | b    | NULL |             1 |        11 |
   | NULL | A    |             2 |         8 |
   | NULL | B    |             2 |        10 |
   | NULL | NULL |             3 |        18 |
   +------+------+---------------+-----------+
   9 rows in set (0.02 sec)
   
   ```
   
   ### 1.5 GROUPING Function
   
   Indicates whether a specified column expression in a `GROUP BY` list is aggregated or not. `GROUPING `returns 1 for aggregated or 0 for not aggregated in the result set. `GROUPING` can be used only in the `SELECT` list, `HAVING`, and `ORDER BY` clauses when `GROUP BY` is specified.
   
   ## 2. Object
   
   Support `GROUPING SETS`, `ROLLUP` and `CUBE ` syntax,impliments 1.1, 1.2, 1.3 1.4, 1.5
   
   ### 2.1 GROUPING SETS Syntax
   
   ```
   SELECT ...
   FROM ...
   [ ... ]
   GROUP BY GROUPING SETS ( groupSet [ , groupSet [ , ... ] ] )
   [ ... ]
   
   groupSet ::= { ( expr  [ , expr [ , ... ] ] )}
   
   <expr>
   Expression,column name.
   ```
   
   ### 2.2 ROLLUP Syntax
   
   ```
   SELECT ...
   FROM ...
   [ ... ]
   GROUP BY ROLLUP ( expr  [ , expr [ , ... ] ] )
   [ ... ]
   
   <expr>
   Expression,column name.
   ```
   
   ### 2.3 CUBE Syntax
   
   ```
   SELECT ...
   FROM ...
   [ ... ]
   GROUP BY CUBE ( expr  [ , expr [ , ... ] ] )
   [ ... ]
   
   <expr>
   Expression,column name.
   ```
   
   ## 3. Implementation
   
   ### 3.1 Overall Design Approaches
   
   For `GROUPING SET`  is equivalent to the `UNION` of  `GROUP BY` .  So we can expand input rows, and run an  GROUP BY on these rows。
   
   For example:
   
   ```
   SELECT a, b FROM src GROUP BY a, b GROUPING SETS ((a, b), (a), (b), ());
   ```
   
   Data in  table src :
   
   ```
   1, 2
   3, 4
   ```
   
   Base on  GROUPING SETS , we can expend the input to:
   
   ```
   1, 2       (GROUPING_ID: a, b -> 00 -> 0)
   1, null    (GROUPING_ID: a, null -> 01 -> 1)
   null, 2    (GROUPING_ID: null, b -> 10 -> 2)
   null, null (GROUPING_ID: null, null -> 11 -> 3)
   
   3, 4       (GROUPING_ID: a, b -> 00 -> 0)
   3, null    (GROUPING_ID: a, null -> 01 -> 1)
   null, 4    (GROUPING_ID: null, b -> 10 -> 2)
   null, null (GROUPING_ID: null, null -> 11 -> 3)
   ```
   
   And then use those row as input, then GROUP BY  a, b, GROUPING_ID
   
   ### 3.2 Example
   
   Table t:
   
   ```
   mysql> select * from t;
   +------+------+------+
   | k1   | k2   | k3   |
   +------+------+------+
   | a    | A    |    1 |
   | a    | A    |    2 |
   | a    | B    |    1 |
   | a    | B    |    3 |
   | b    | A    |    1 |
   | b    | A    |    4 |
   | b    | B    |    1 |
   | b    | B    |    5 |
   +------+------+------+
   8 rows in set (0.01 sec)
   ```
   
   for the query:
   
   ```
   SELECT k1, k2, GROUPING_ID(), SUM(k3) FROM t GROUP BY GROUPING SETS ((k1, k2), (k1), (k2), ());
   ```
   
   First,expand the input,every row expand into 4 rows ( the size of GROUPING SETS), and insert GROUPING_ID column
   
   e.g.  a, A, 1 expanded to:
   
   ```
   +------+------+------+---------------+
   | k1   | k2   | k3   | GROUPING_ID() |
   +------+------+------+---------------+
   | a    | A    |    1 |             0 |
   | a    | NULL |    1 |             1 |
   | NULL | A    |    1 |             2 |
   | NULL | NULL |    1 |             3 |
   +------+------+------+---------------+
   ```
   
   Finally, all rows expended as follows (32 rows):
   
   ```
   +------+------+------+---------------+
   | k1   | k2   | k3   | GROUPING_ID() |
   +------+------+------+---------------+
   | a    | A    |    1 |             0 |
   | a    | A    |    2 |             0 |
   | a    | B    |    1 |             0 |
   | a    | B    |    3 |             0 |
   | b    | A    |    1 |             0 |
   | b    | A    |    4 |             0 |
   | b    | B    |    1 |             0 |
   | b    | B    |    5 |             0 |
   | a    | NULL |    1 |             1 |
   | a    | NULL |    1 |             1 |
   | a    | NULL |    2 |             1 |
   | a    | NULL |    3 |             1 |
   | b    | NULL |    1 |             1 |
   | b    | NULL |    1 |             1 |
   | b    | NULL |    4 |             1 |
   | b    | NULL |    5 |             1 |
   | NULL | A    |    1 |             2 |
   | NULL | A    |    1 |             2 |
   | NULL | A    |    2 |             2 |
   | NULL | A    |    4 |             2 |
   | NULL | B    |    1 |             2 |
   | NULL | B    |    1 |             2 |
   | NULL | B    |    3 |             2 |
   | NULL | B    |    5 |             2 |
   | NULL | NULL |    1 |             3 |
   | NULL | NULL |    1 |             3 |
   | NULL | NULL |    1 |             3 |
   | NULL | NULL |    1 |             3 |
   | NULL | NULL |    2 |             3 |
   | NULL | NULL |    3 |             3 |
   | NULL | NULL |    4 |             3 |
   | NULL | NULL |    5 |             3 |
   +------+------+------+---------------+
   32 rows in set.
   ```
   
   now GROUP BY k1, k2, GROUPING_ID():
   
   ```
   +------+------+---------------+-----------+
   | k1   | k2   | grouping_id() | sum(`k3`) |
   +------+------+---------------+-----------+
   | a    | A    |             0 |         3 |
   | a    | B    |             0 |         4 |
   | a    | NULL |             1 |         7 |
   | b    | A    |             0 |         5 |
   | b    | B    |             0 |         6 |
   | b    | NULL |             1 |        11 |
   | NULL | A    |             2 |         8 |
   | NULL | B    |             2 |        10 |
   | NULL | NULL |             3 |        18 |
   +------+------+---------------+-----------+
   9 rows in set (0.02 sec)
   ```
   
   The result is equivalent to the UNION ALL
   
   ```
   select k1, k2, sum(k3) from t group by k1, k2
   UNION ALL
   select NULL, k2, sum(k3) from t group by k2
   UNION ALL
   select k1, NULL, sum(k3) from t group by k1
   UNION ALL
   select NULL, NULL, sum(k3) from t;
   
   +------+------+-----------+
   | k1   | k2   | sum(`k3`) |
   +------+------+-----------+
   | b    | B    |         6 |
   | b    | A    |         5 |
   | a    | A    |         3 |
   | a    | B    |         4 |
   | a    | NULL |         7 |
   | b    | NULL |        11 |
   | NULL | B    |        10 |
   | NULL | A    |         8 |
   | NULL | NULL |        18 |
   +------+------+-----------+
   9 rows in set (0.06 sec)
   ```
   
   ### 3.3 FE 
   
   #### 3.3.1 Tasks
   
   1. Add GroupByClause, repalce groupingExprs.
   2. Add Grouping Sets, Cube and RollUp syntax.
   3. Add GroupByClause in SelectStmt.
   4. add virtual column GROUPING\_\_ID,and insert into groupingExprs,
   5. Add a PlanNode, name as RepeatNode. For GroupingSets aggregation insert RepeatNode to the plan.
   
   #### 3.3.2 Tuple
   
   In order to add GROUPING_ID to groupingExprs in GroupByClause, need to create virtual SlotRef, also, need tot create a tuple for this slot, named GROUPING\_\_ID Tuple.
   
   For the plannode RepeatNode, it's input is all the  tuple of it's children, It's output tuple is the repeat data and GROUPING_ID.
   
   
   #### 3.3.3 Expression and Function Substitution
   
   expr -> if(bitand(pos, grouping_id)=0, expr, null) for expr in extension grouping clause
   grouping_id() -> grouping_id(grouping_id) for grouping_id function
   
   ### 3.4 BE
   
   #### 3.4.1 Tasks
   
   1. Add RepeatNode executor, expend the input data and append GROUPING_ID to every row
   2. Implements grouping_id() function.
   
   
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org