You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/06/10 12:20:00 UTC

[GitHub] [arrow-datafusion] thinkharderdev opened a new pull request, #2716: Support for GROUPING SETS/CUBE/ROLLUP

thinkharderdev opened a new pull request, #2716:
URL: https://github.com/apache/arrow-datafusion/pull/2716

   # Which issue does this PR close?
   
   <!--
   We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes #123` indicates that this PR will close issue #123.
   -->
   
   Closes #1327
   
   TODO
   
   - [ ] Implement CUBE expansion
   - [ ] Implement ROLLUP expansion
   - [ ] Add SQL tests for CUBE/ROLLUP queries
   - [ ] Pass partitioning expressions directly to `AggregateExec`
   
   Note that currently the sql parser doesn't seem to handle `GROUP BY GROUPING SETS ...` so we need to address that to test that explicitly. 
   
    # Rationale for this change
   <!--
    Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
    Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.  
   -->
   
   This PR adds support for GROUPING SETS (and special cases CUBE/ROLLUP) in the physical planner and execution plan.
   
   # What changes are included in this PR?
   <!--
   There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
   -->
   
   There are three primary changes:
   
   1. `AggregateExec` now takes a `Vec<Vec<(Arc<dyn PhysicalExpr>,String)>>` to represent grouping sets. A normal `GROUP BY` is just a special case. We expect the grouping sets to be "aligned". For example, for a SQL clause like `GROUP BY GROUPING SETS ((a),(b),(a,b))`, `AggregateExec` assumes that the planner will expand that to the grouping set `((a,NULL),(NULL,b),(a,b))`. We can't handle this in the execution plan because we don't have `ParialEq` for `PhysicalExpr`.
   2. In `DefaultPhysicalPlanner` handle expanding and aligning grouping sets. This includes expanding CUBE/ROLLUP expressions and merging and aligning GROUPING SET expressions. 
   3. Handle grouping sets correctly in optimizers. 
   
   Also we include serialization for grouping set expression in `datafusion-proto`  
   
   # Are there any user-facing changes?
   <!--
   If there are user-facing changes then we may require documentation to be updated before approving the PR.
   -->
   
   SQL statements with CUBE/ROLLUP should now be supported. GROUPING SETS should also be supported but it seems like the sql parser is not handling them correctly. 
   
   <!--
   If there are any breaking changes to public APIs, please add the `api change` label.
   -->
   
   I don't think so. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #2716: Support for GROUPING SETS/CUBE/ROLLUP

Posted by GitBox <gi...@apache.org>.
alamb commented on PR #2716:
URL: https://github.com/apache/arrow-datafusion/pull/2716#issuecomment-1152635501

   Thanks @thinkharderdev  -- I'll try and find some time to review this over the weekend. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] thinkharderdev commented on pull request #2716: Support for GROUPING SETS/CUBE/ROLLUP

Posted by GitBox <gi...@apache.org>.
thinkharderdev commented on PR #2716:
URL: https://github.com/apache/arrow-datafusion/pull/2716#issuecomment-1152302881

   cc @alamb @tustvold @Jimexist @yjshen @andygrove 
   
   The part of this that I am least confident about is that I didn't break anything in any of the optimizers :). So if someone familiar with that code can review that part I would be very grateful.  


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb merged pull request #2716: Support for GROUPING SETS/CUBE/ROLLUP

Posted by GitBox <gi...@apache.org>.
alamb merged PR #2716:
URL: https://github.com/apache/arrow-datafusion/pull/2716


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] thinkharderdev commented on a diff in pull request #2716: Support for GROUPING SETS/CUBE/ROLLUP

Posted by GitBox <gi...@apache.org>.
thinkharderdev commented on code in PR #2716:
URL: https://github.com/apache/arrow-datafusion/pull/2716#discussion_r895161026


##########
datafusion/expr/src/expr.rs:
##########
@@ -270,6 +270,27 @@ pub enum GroupingSet {
     GroupingSets(Vec<Vec<Expr>>),
 }
 
+impl GroupingSet {
+    /// Return all distinct exprs in the grouping set. For `CUBE` and `ROLLUP` this
+    /// is just the underlying list of exprs. For `GROUPING SET` we need to deduplicate
+    /// the exprs in the underlying sets.
+    pub fn distinct_expr(&self) -> Vec<Expr> {
+        match self {
+            GroupingSet::Rollup(exprs) => exprs.clone(),
+            GroupingSet::Cube(exprs) => exprs.clone(),
+            GroupingSet::GroupingSets(groups) => {
+                let mut exprs: Vec<Expr> = vec![];
+                for exp in groups.iter().flatten() {
+                    if !exprs.contains(exp) {

Review Comment:
   Yeah, this is unfortunate 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #2716: Support for GROUPING SETS/CUBE/ROLLUP

Posted by GitBox <gi...@apache.org>.
alamb commented on code in PR #2716:
URL: https://github.com/apache/arrow-datafusion/pull/2716#discussion_r895576855


##########
datafusion/optimizer/src/single_distinct_to_groupby.rs:
##########
@@ -250,6 +280,29 @@ mod tests {
         Ok(())
     }
 
+    #[test]
+    fn single_distinct_and_grouping_set() -> Result<()> {

Review Comment:
   I think disabling the optimization for grouping sets is a wise idea.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #2716: Support for GROUPING SETS/CUBE/ROLLUP

Posted by GitBox <gi...@apache.org>.
alamb commented on code in PR #2716:
URL: https://github.com/apache/arrow-datafusion/pull/2716#discussion_r895151112


##########
datafusion/expr/src/expr_fn.rs:
##########
@@ -226,6 +227,21 @@ pub fn scalar_subquery(subquery: Arc<LogicalPlan>) -> Expr {
     Expr::ScalarSubquery(Subquery { subquery })
 }
 
+/// Create a grouping set
+pub fn grouping_set(exprs: Vec<Vec<Expr>>) -> Expr {
+    Expr::GroupingSet(GroupingSet::GroupingSets(exprs))
+}
+
+/// Create a grouping set

Review Comment:
   ```suggestion
   /// Create a grouping set for all combination of `exprs`
   ```



##########
datafusion/core/tests/sql/aggregates.rs:
##########
@@ -476,6 +476,205 @@ async fn csv_query_approx_percentile_cont() -> Result<()> {
     Ok(())
 }
 
+#[tokio::test]
+async fn csv_query_cube_avg() -> Result<()> {
+    let ctx = SessionContext::new();
+    register_aggregate_csv_by_sql(&ctx).await;
+
+    let sql = "SELECT c1, c2, AVG(c3) FROM aggregate_test_100 GROUP BY CUBE (c1, c2) ORDER BY c1, c2";
+    let actual = execute_to_batches(&ctx, sql).await;
+    let expected = vec![
+        "+----+----+----------------------------+",
+        "| c1 | c2 | AVG(aggregate_test_100.c3) |",
+        "+----+----+----------------------------+",
+        "| a  | 1  | -17.6                      |",
+        "| a  | 2  | -15.333333333333334        |",
+        "| a  | 3  | -4.5                       |",
+        "| a  | 4  | -32                        |",
+        "| a  | 5  | -32                        |",
+        "| a  |    | -18.333333333333332        |",
+        "| b  | 1  | 31.666666666666668         |",
+        "| b  | 2  | 25.5                       |",
+        "| b  | 3  | -42                        |",
+        "| b  | 4  | -44.6                      |",
+        "| b  | 5  | -0.2                       |",
+        "| b  |    | -5.842105263157895         |",
+        "| c  | 1  | 47.5                       |",
+        "| c  | 2  | -55.57142857142857         |",
+        "| c  | 3  | 47.5                       |",
+        "| c  | 4  | -10.75                     |",
+        "| c  | 5  | 12                         |",
+        "| c  |    | -1.3333333333333333        |",
+        "| d  | 1  | -8.142857142857142         |",
+        "| d  | 2  | 109.33333333333333         |",
+        "| d  | 3  | 41.333333333333336         |",
+        "| d  | 4  | 54                         |",
+        "| d  | 5  | -49.5                      |",
+        "| d  |    | 25.444444444444443         |",
+        "| e  | 1  | 75.66666666666667          |",
+        "| e  | 2  | 37.8                       |",
+        "| e  | 3  | 48                         |",
+        "| e  | 4  | 37.285714285714285         |",
+        "| e  | 5  | -11                        |",
+        "| e  |    | 40.333333333333336         |",
+        "|    | 1  | 16.681818181818183         |",
+        "|    | 2  | 8.363636363636363          |",
+        "|    | 3  | 20.789473684210527         |",
+        "|    | 4  | 1.2608695652173914         |",
+        "|    | 5  | -13.857142857142858        |",
+        "|    |    | 7.81                       |",
+        "+----+----+----------------------------+",
+    ];
+    assert_batches_eq!(expected, &actual);
+    Ok(())
+}
+
+#[tokio::test]
+async fn csv_query_rollup_avg() -> Result<()> {
+    let ctx = SessionContext::new();
+    register_aggregate_csv_by_sql(&ctx).await;
+
+    let sql = "SELECT c1, c2, c3, AVG(c4) FROM aggregate_test_100 GROUP BY ROLLUP (c1, c2, c3) ORDER BY c1, c2, c3";
+    let actual = execute_to_batches(&ctx, sql).await;
+    let expected = vec![
+        "+----+----+------+----------------------------+",
+        "| c1 | c2 | c3   | AVG(aggregate_test_100.c4) |",
+        "+----+----+------+----------------------------+",
+        "| a  | 1  | -85  | -15154                     |",
+        "| a  | 1  | -56  | 8692                       |",
+        "| a  | 1  | -25  | 15295                      |",
+        "| a  | 1  | -5   | 12636                      |",
+        "| a  | 1  | 83   | -14704                     |",
+        "| a  | 1  |      | 1353                       |",
+        "| a  | 2  | -48  | -18025                     |",
+        "| a  | 2  | -43  | 13080                      |",
+        "| a  | 2  | 45   | 15673                      |",
+        "| a  | 2  |      | 3576                       |",
+        "| a  | 3  | -72  | -11122                     |",
+        "| a  | 3  | -12  | -9168                      |",
+        "| a  | 3  | 13   | 22338.5                    |",
+        "| a  | 3  | 14   | 28162                      |",
+        "| a  | 3  | 17   | -22796                     |",
+        "| a  | 3  |      | 4958.833333333333          |",
+        "| a  | 4  | -101 | 11640                      |",
+        "| a  | 4  | -54  | -2376                      |",
+        "| a  | 4  | -38  | 20744                      |",
+        "| a  | 4  | 65   | -28462                     |",
+        "| a  | 4  |      | 386.5                      |",
+        "| a  | 5  | -101 | -12484                     |",
+        "| a  | 5  | -31  | -12907                     |",
+        "| a  | 5  | 36   | -16974                     |",
+        "| a  | 5  |      | -14121.666666666666        |",
+        "| a  |    |      | 306.04761904761904         |",
+        "| b  | 1  | 12   | 7652                       |",
+        "| b  | 1  | 29   | -18218                     |",
+        "| b  | 1  | 54   | -18410                     |",
+        "| b  | 1  |      | -9658.666666666666         |",
+        "| b  | 2  | -60  | -21739                     |",
+        "| b  | 2  | 31   | 23127                      |",
+        "| b  | 2  | 63   | 21456                      |",
+        "| b  | 2  | 68   | 15874                      |",
+        "| b  | 2  |      | 9679.5                     |",
+        "| b  | 3  | -101 | -13217                     |",
+        "| b  | 3  | 17   | 14457                      |",
+        "| b  | 3  |      | 620                        |",
+        "| b  | 4  | -117 | 19316                      |",
+        "| b  | 4  | -111 | -1967                      |",
+        "| b  | 4  | -59  | 25286                      |",
+        "| b  | 4  | 17   | -28070                     |",
+        "| b  | 4  | 47   | 20690                      |",
+        "| b  | 4  |      | 7051                       |",
+        "| b  | 5  | -82  | 22080                      |",
+        "| b  | 5  | -44  | 15788                      |",
+        "| b  | 5  | -5   | 24896                      |",
+        "| b  | 5  | 62   | 16337                      |",
+        "| b  | 5  | 68   | 21576                      |",
+        "| b  | 5  |      | 20135.4                    |",
+        "| b  |    |      | 7732.315789473684          |",
+        "| c  | 1  | -24  | -24085                     |",
+        "| c  | 1  | 41   | -4667                      |",
+        "| c  | 1  | 70   | 27752                      |",
+        "| c  | 1  | 103  | -22186                     |",
+        "| c  | 1  |      | -5796.5                    |",
+        "| c  | 2  | -117 | -30187                     |",
+        "| c  | 2  | -107 | -2904                      |",
+        "| c  | 2  | -106 | -1114                      |",
+        "| c  | 2  | -60  | -16312                     |",
+        "| c  | 2  | -29  | 25305                      |",
+        "| c  | 2  | 1    | 18109                      |",
+        "| c  | 2  | 29   | -3855                      |",
+        "| c  | 2  |      | -1565.4285714285713        |",
+        "| c  | 3  | -2   | -18655                     |",
+        "| c  | 3  | 22   | 13741                      |",
+        "| c  | 3  | 73   | -9565                      |",
+        "| c  | 3  | 97   | 29106                      |",
+        "| c  | 3  |      | 3656.75                    |",
+        "| c  | 4  | -90  | -2935                      |",
+        "| c  | 4  | -79  | 5281                       |",
+        "| c  | 4  | 3    | -30508                     |",
+        "| c  | 4  | 123  | 16620                      |",
+        "| c  | 4  |      | -2885.5                    |",
+        "| c  | 5  | -94  | -15880                     |",
+        "| c  | 5  | 118  | 19208                      |",
+        "| c  | 5  |      | 1664                       |",
+        "| c  |    |      | -1320.5238095238096        |",
+        "| d  | 1  | -99  | 5613                       |",
+        "| d  | 1  | -98  | 13630                      |",
+        "| d  | 1  | -72  | 25590                      |",
+        "| d  | 1  | -8   | 27138                      |",
+        "| d  | 1  | 38   | 18384                      |",
+        "| d  | 1  | 57   | 28781                      |",
+        "| d  | 1  | 125  | 31106                      |",
+        "| d  | 1  |      | 21463.14285714286          |",
+        "| d  | 2  | 93   | -12642                     |",
+        "| d  | 2  | 113  | 3917                       |",
+        "| d  | 2  | 122  | 10130                      |",
+        "| d  | 2  |      | 468.3333333333333          |",
+        "| d  | 3  | -76  | 8809                       |",
+        "| d  | 3  | 77   | 15091                      |",
+        "| d  | 3  | 123  | 29533                      |",
+        "| d  | 3  |      | 17811                      |",
+        "| d  | 4  | 5    | -7688                      |",
+        "| d  | 4  | 55   | -1471                      |",
+        "| d  | 4  | 102  | -24558                     |",
+        "| d  | 4  |      | -11239                     |",
+        "| d  | 5  | -59  | 2045                       |",
+        "| d  | 5  | -40  | 22614                      |",
+        "| d  | 5  |      | 12329.5                    |",
+        "| d  |    |      | 10890.111111111111         |",
+        "| e  | 1  | 36   | -21481                     |",
+        "| e  | 1  | 71   | -5479                      |",
+        "| e  | 1  | 120  | 10837                      |",
+        "| e  | 1  |      | -5374.333333333333         |",
+        "| e  | 2  | -61  | -2888                      |",
+        "| e  | 2  | 49   | 24495                      |",
+        "| e  | 2  | 52   | 5666                       |",
+        "| e  | 2  | 97   | 18167                      |",
+        "| e  | 2  |      | 10221.2                    |",
+        "| e  | 3  | -95  | 13611                      |",
+        "| e  | 3  | 71   | 194                        |",
+        "| e  | 3  | 104  | -25136                     |",
+        "| e  | 3  | 112  | -6823                      |",
+        "| e  | 3  |      | -4538.5                    |",
+        "| e  | 4  | -56  | -31500                     |",
+        "| e  | 4  | -53  | 13788                      |",
+        "| e  | 4  | 30   | -16110                     |",
+        "| e  | 4  | 73   | -22501                     |",
+        "| e  | 4  | 74   | -12612                     |",
+        "| e  | 4  | 96   | -30336                     |",
+        "| e  | 4  | 97   | -13181                     |",
+        "| e  | 4  |      | -16064.57142857143         |",
+        "| e  | 5  | -86  | 32514                      |",
+        "| e  | 5  | 64   | -26526                     |",
+        "| e  | 5  |      | 2994                       |",

Review Comment:
   πŸ‘ 



##########
datafusion/core/src/lib.rs:
##########
@@ -204,6 +204,7 @@
 /// DataFusion crate version
 pub const DATAFUSION_VERSION: &str = env!("CARGO_PKG_VERSION");
 
+extern crate core;

Review Comment:
   Why is this necessary?



##########
datafusion/expr/src/expr_fn.rs:
##########
@@ -226,6 +227,21 @@ pub fn scalar_subquery(subquery: Arc<LogicalPlan>) -> Expr {
     Expr::ScalarSubquery(Subquery { subquery })
 }
 
+/// Create a grouping set
+pub fn grouping_set(exprs: Vec<Vec<Expr>>) -> Expr {
+    Expr::GroupingSet(GroupingSet::GroupingSets(exprs))
+}
+
+/// Create a grouping set
+pub fn cube(exprs: Vec<Expr>) -> Expr {
+    Expr::GroupingSet(GroupingSet::Cube(exprs))
+}
+
+/// Create a grouping set

Review Comment:
   ```suggestion
   /// Create a grouping set for rollup 
   ```



##########
datafusion/expr/src/expr.rs:
##########
@@ -270,6 +270,27 @@ pub enum GroupingSet {
     GroupingSets(Vec<Vec<Expr>>),
 }
 
+impl GroupingSet {
+    /// Return all distinct exprs in the grouping set. For `CUBE` and `ROLLUP` this
+    /// is just the underlying list of exprs. For `GROUPING SET` we need to deduplicate
+    /// the exprs in the underlying sets.
+    pub fn distinct_expr(&self) -> Vec<Expr> {
+        match self {
+            GroupingSet::Rollup(exprs) => exprs.clone(),
+            GroupingSet::Cube(exprs) => exprs.clone(),
+            GroupingSet::GroupingSets(groups) => {
+                let mut exprs: Vec<Expr> = vec![];
+                for exp in groups.iter().flatten() {
+                    if !exprs.contains(exp) {

Review Comment:
   This is N^2 in the number of grouping sets -- probably not an issue, I just figured I would point it out



##########
datafusion/core/src/physical_plan/aggregates/mod.rs:
##########
@@ -90,18 +121,24 @@ impl AggregateExec {
     /// Create a new hash aggregate execution plan
     pub fn try_new(
         mode: AggregateMode,
-        group_expr: Vec<(Arc<dyn PhysicalExpr>, String)>,
+        grouping_set: PhysicalGroupingSet,
         aggr_expr: Vec<Arc<dyn AggregateExpr>>,
         input: Arc<dyn ExecutionPlan>,
         input_schema: SchemaRef,
     ) -> Result<Self> {
-        let schema = create_schema(&input.schema(), &group_expr, &aggr_expr, mode)?;
+        let schema = create_schema(
+            &input.schema(),
+            &grouping_set.expr,
+            &aggr_expr,
+            grouping_set.groups.iter().flatten().any(|is_null| *is_null),

Review Comment:
   I wonder if extracting this code to a function such as  `GroupingSets::contains_null()` might make the code easier to read. The same comment applies to other places where `GroupingSets::groups` is referenced as well. 
   
   Given the size of this PR already, definitely could be done as a follow on



##########
datafusion/expr/src/utils.rs:
##########
@@ -45,6 +45,22 @@ pub fn exprlist_to_columns(expr: &[Expr], accum: &mut HashSet<Column>) -> Result
     Ok(())
 }
 
+/// Find all distinct exprs in a list of group by expressions. We assume that if the
+/// first element is a `GroupingSet` expression then that is the only expr.

Review Comment:
   ```suggestion
   /// Find all distinct exprs in a list of group by expressions. If the
   /// first element is a `GroupingSet` expression then it must be the only expr.
   ```



##########
datafusion/core/src/physical_plan/aggregates/row_hash.rs:
##########
@@ -110,12 +111,15 @@ impl GroupedHashAggregateStreamV2 {
         // The expressions to evaluate the batch, one vec of expressions per aggregation.
         // Assume create_schema() always put group columns in front of aggr columns, we set
         // col_idx_base to group expression count.
-        let aggregate_expressions =
-            aggregates::aggregate_expressions(&aggr_expr, &mode, group_expr.len())?;
+        let aggregate_expressions = aggregates::aggregate_expressions(

Review Comment:
   FYI @yjshen  -- it would be really nice to try and consolidate `row_hash` and `hash` -- filed https://github.com/apache/arrow-datafusion/issues/2723 to track πŸ‘ 



##########
datafusion/optimizer/src/single_distinct_to_groupby.rs:
##########
@@ -250,6 +280,29 @@ mod tests {
         Ok(())
     }
 
+    #[test]
+    fn single_distinct_and_grouping_set() -> Result<()> {

Review Comment:
   Given there is special handling for CUBE and ROLLUP in this pass, I suggest test coverage for those cases too



##########
datafusion/core/src/physical_plan/aggregates/mod.rs:
##########
@@ -65,13 +66,43 @@ pub enum AggregateMode {
     FinalPartitioned,
 }
 
+/// Represents a GROUPING SET in the physical plan

Review Comment:
   πŸ‘ 
   Thank you -- this is well commented and clear to understand
   
   ```suggestion
   /// Represents `GROUP BY` clause in the plan plan (including the more general GROUPING SET)
   ```
   
   I was thinking that many people encountering SQL will not be familiar with GROUPING SET but would be very familiar with GROUP BY -- what would you think about calling this structure `PhysicalGroupBy` or something to make the connection to `GROUPING` clearer?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] thinkharderdev commented on a diff in pull request #2716: Support for GROUPING SETS/CUBE/ROLLUP

Posted by GitBox <gi...@apache.org>.
thinkharderdev commented on code in PR #2716:
URL: https://github.com/apache/arrow-datafusion/pull/2716#discussion_r895160864


##########
datafusion/core/src/lib.rs:
##########
@@ -204,6 +204,7 @@
 /// DataFusion crate version
 pub const DATAFUSION_VERSION: &str = env!("CARGO_PKG_VERSION");
 
+extern crate core;

Review Comment:
   It's not :) Not sure where it came from but removed now. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] thinkharderdev commented on a diff in pull request #2716: Support for GROUPING SETS/CUBE/ROLLUP

Posted by GitBox <gi...@apache.org>.
thinkharderdev commented on code in PR #2716:
URL: https://github.com/apache/arrow-datafusion/pull/2716#discussion_r895632227


##########
datafusion/core/src/physical_plan/aggregates/mod.rs:
##########
@@ -117,14 +171,16 @@ impl AggregateExec {
 
     /// Grouping expressions
     pub fn group_expr(&self) -> &[(Arc<dyn PhysicalExpr>, String)] {
-        &self.group_expr
+        // TODO Is this right?

Review Comment:
   In this case it would still be correct right? The aggregate stats are only used if there is no group by which this would still represent correctly. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] thinkharderdev commented on a diff in pull request #2716: Support for GROUPING SETS/CUBE/ROLLUP

Posted by GitBox <gi...@apache.org>.
thinkharderdev commented on code in PR #2716:
URL: https://github.com/apache/arrow-datafusion/pull/2716#discussion_r895630565


##########
datafusion/core/src/physical_plan/planner.rs:
##########
@@ -574,11 +569,12 @@ impl DefaultPhysicalPlanner {
 
                     // TODO: dictionary type not yet supported in Hash Repartition
                     let contains_dict = groups
+                        .expr

Review Comment:
   Yeah, I think that is a good idea. Fixed. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] thinkharderdev commented on a diff in pull request #2716: Support for GROUPING SETS/CUBE/ROLLUP

Posted by GitBox <gi...@apache.org>.
thinkharderdev commented on code in PR #2716:
URL: https://github.com/apache/arrow-datafusion/pull/2716#discussion_r895630565


##########
datafusion/core/src/physical_plan/planner.rs:
##########
@@ -574,11 +569,12 @@ impl DefaultPhysicalPlanner {
 
                     // TODO: dictionary type not yet supported in Hash Repartition
                     let contains_dict = groups
+                        .expr

Review Comment:
   Yeah, I think that is a food idea. Fixed. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] codecov-commenter commented on pull request #2716: Support for GROUPING SETS/CUBE/ROLLUP

Posted by GitBox <gi...@apache.org>.
codecov-commenter commented on PR #2716:
URL: https://github.com/apache/arrow-datafusion/pull/2716#issuecomment-1152921401

   # [Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/2716?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#2716](https://codecov.io/gh/apache/arrow-datafusion/pull/2716?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (5eb1881) into [master](https://codecov.io/gh/apache/arrow-datafusion/commit/080c32400ddfa2d45b5bebb820184eac8fd5a03a?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (080c324) will **increase** coverage by `0.15%`.
   > The diff coverage is `94.82%`.
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #2716      +/-   ##
   ==========================================
   + Coverage   84.72%   84.88%   +0.15%     
   ==========================================
     Files         270      270              
     Lines       47254    47675     +421     
   ==========================================
   + Hits        40036    40468     +432     
   + Misses       7218     7207      -11     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow-datafusion/pull/2716?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Ξ” | |
   |---|---|---|
   | [datafusion/expr/src/expr\_fn.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/2716/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9leHByL3NyYy9leHByX2ZuLnJz) | `90.58% <66.66%> (-0.88%)` | :arrow_down: |
   | [datafusion/expr/src/utils.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/2716/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9leHByL3NyYy91dGlscy5ycw==) | `90.80% <71.42%> (-0.39%)` | :arrow_down: |
   | [datafusion/proto/src/from\_proto.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/2716/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9wcm90by9zcmMvZnJvbV9wcm90by5ycw==) | `34.64% <81.25%> (+0.85%)` | :arrow_up: |
   | [...fusion/optimizer/src/single\_distinct\_to\_groupby.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/2716/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9vcHRpbWl6ZXIvc3JjL3NpbmdsZV9kaXN0aW5jdF90b19ncm91cGJ5LnJz) | `96.25% <88.88%> (-2.28%)` | :arrow_down: |
   | [...atafusion/core/src/physical\_plan/aggregates/mod.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/2716/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9jb3JlL3NyYy9waHlzaWNhbF9wbGFuL2FnZ3JlZ2F0ZXMvbW9kLnJz) | `93.13% <90.35%> (-1.46%)` | :arrow_down: |
   | [datafusion/core/src/physical\_plan/planner.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/2716/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9jb3JlL3NyYy9waHlzaWNhbF9wbGFuL3BsYW5uZXIucnM=) | `80.72% <96.00%> (+2.50%)` | :arrow_up: |
   | [datafusion/core/tests/dataframe.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/2716/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9jb3JlL3Rlc3RzL2RhdGFmcmFtZS5ycw==) | `98.62% <97.18%> (-1.38%)` | :arrow_down: |
   | [datafusion/core/tests/sql/aggregates.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/2716/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9jb3JlL3Rlc3RzL3NxbC9hZ2dyZWdhdGVzLnJz) | `99.25% <97.67%> (-0.11%)` | :arrow_down: |
   | [...tafusion/core/src/physical\_plan/aggregates/hash.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/2716/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9jb3JlL3NyYy9waHlzaWNhbF9wbGFuL2FnZ3JlZ2F0ZXMvaGFzaC5ycw==) | `92.95% <100.00%> (+0.04%)` | :arrow_up: |
   | [...sion/core/src/physical\_plan/aggregates/row\_hash.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/2716/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9jb3JlL3NyYy9waHlzaWNhbF9wbGFuL2FnZ3JlZ2F0ZXMvcm93X2hhc2gucnM=) | `95.83% <100.00%> (+0.07%)` | :arrow_up: |
   | ... and [12 more](https://codecov.io/gh/apache/arrow-datafusion/pull/2716/diff?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/2716?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Ξ” = absolute <relative> (impact)`, `ΓΈ = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/2716?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [080c324...5eb1881](https://codecov.io/gh/apache/arrow-datafusion/pull/2716?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #2716: Support for GROUPING SETS/CUBE/ROLLUP

Posted by GitBox <gi...@apache.org>.
alamb commented on code in PR #2716:
URL: https://github.com/apache/arrow-datafusion/pull/2716#discussion_r895667430


##########
datafusion/core/src/physical_plan/aggregates/mod.rs:
##########
@@ -117,14 +171,16 @@ impl AggregateExec {
 
     /// Grouping expressions
     pub fn group_expr(&self) -> &[(Arc<dyn PhysicalExpr>, String)] {
-        &self.group_expr
+        // TODO Is this right?

Review Comment:
   Returning `&PhysicalGroupBy` sounds like a good future proof idea



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] thinkharderdev commented on a diff in pull request #2716: Support for GROUPING SETS/CUBE/ROLLUP

Posted by GitBox <gi...@apache.org>.
thinkharderdev commented on code in PR #2716:
URL: https://github.com/apache/arrow-datafusion/pull/2716#discussion_r895633371


##########
datafusion/core/src/physical_plan/aggregates/mod.rs:
##########
@@ -117,14 +171,16 @@ impl AggregateExec {
 
     /// Grouping expressions
     pub fn group_expr(&self) -> &[(Arc<dyn PhysicalExpr>, String)] {
-        &self.group_expr
+        // TODO Is this right?

Review Comment:
   Or maybe this should just return `&PhysicalGroupBy` instead? I could see how this could lead to issues elsewhere if it is used for optimizations. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #2716: Support for GROUPING SETS/CUBE/ROLLUP

Posted by GitBox <gi...@apache.org>.
alamb commented on code in PR #2716:
URL: https://github.com/apache/arrow-datafusion/pull/2716#discussion_r895577319


##########
datafusion/core/src/physical_optimizer/aggregate_statistics.rs:
##########
@@ -265,7 +265,7 @@ mod tests {
 
     use crate::error::Result;
     use crate::logical_plan::Operator;
-    use crate::physical_plan::aggregates::AggregateExec;
+    use crate::physical_plan::aggregates::{AggregateExec, PhysicalGroupBy};

Review Comment:
   Love the new name `PhysicalGroupBy`



##########
datafusion/core/src/physical_plan/aggregates/mod.rs:
##########
@@ -65,13 +66,60 @@ pub enum AggregateMode {
     FinalPartitioned,
 }
 
+/// Represents `GROUP BY` clause in the plan (including the more general GROUPING SET)

Review Comment:
   Thank you -- this is super helpful



##########
datafusion/optimizer/src/single_distinct_to_groupby.rs:
##########
@@ -212,6 +224,9 @@ mod tests {
         let optimized_plan = rule
             .optimize(plan, &OptimizerConfig::new())
             .expect("failed to optimize plan");
+
+        println!("{:?}", optimized_plan);

Review Comment:
   left over?



##########
datafusion/core/src/physical_plan/planner.rs:
##########
@@ -1346,6 +1616,74 @@ mod tests {
         Ok(())
     }
 
+    #[tokio::test]
+    async fn test_create_cube_expr() -> Result<()> {
+        let logical_plan = test_csv_scan()
+            .await?
+            .project(vec![col("c1"), col("c2"), col("c3")])?
+            .aggregate(
+                vec![cube(vec![col("c1"), col("c2"), col("c3")])],
+                vec![sum(col("c2"))],
+            )?

Review Comment:
   I don't understand the need to creating the `aggregate` on the logical plan (as then new cube expressions are planned below). Can you simply use the output of the `project` plan?
   
   The same question applies to the other plans below



##########
datafusion/core/src/physical_plan/aggregates/mod.rs:
##########
@@ -117,14 +171,16 @@ impl AggregateExec {
 
     /// Grouping expressions
     pub fn group_expr(&self) -> &[(Arc<dyn PhysicalExpr>, String)] {
-        &self.group_expr
+        // TODO Is this right?

Review Comment:
   I don't think so -- this seems to be used by the "use statistics instead of aggregates" optimization
   
   ```
   /Users/alamb/Software/arrow-datafusion/datafusion/core/src/physical_optimizer/aggregate_statistics.rs
   113:             && final_agg_exec.group_expr().is_empty()
   121:                         && partial_agg_exec.group_expr().is_empty()
   /Users/alamb/Software/arrow-datafusion/datafusion/core/src/physical_plan/aggregates/mod.rs
   728:         let groups = partial_aggregate.group_expr().to_vec();
   ```
   
   In general, it might make sense to disable / skip all such optimizations in the cases of grouping sets / cube / rollup -- that would be the conservative approach and avoid potential subtle wrong answer bugs. As the feature is used more and people have a need to optimize it more, we can revisit the optimizations and make sure they are relevant to grouping sets
   



##########
datafusion/optimizer/src/single_distinct_to_groupby.rs:
##########
@@ -62,9 +63,11 @@ fn optimize(plan: &LogicalPlan) -> Result<LogicalPlan> {
             schema,
             group_expr,
         }) => {
-            if is_single_distinct_agg(plan) {
+            if is_single_distinct_agg(plan) && !contains_grouping_set(group_expr) {

Review Comment:
   I think this is a good idea -- to skip grouping sets in optimizations



##########
datafusion/core/src/physical_plan/planner.rs:
##########
@@ -574,11 +569,12 @@ impl DefaultPhysicalPlanner {
 
                     // TODO: dictionary type not yet supported in Hash Repartition
                     let contains_dict = groups
+                        .expr

Review Comment:
   I think it is a minor thing, but one might imagine keeping the fields of `PhysicalGroupBy` private and adding functions like `fn expr()` and `fn is_empty()` mostly as a way of additional documentation



##########
datafusion/optimizer/src/single_distinct_to_groupby.rs:
##########
@@ -160,6 +166,7 @@ fn optimize_children(plan: &LogicalPlan) -> Result<LogicalPlan> {
 }
 
 fn is_single_distinct_agg(plan: &LogicalPlan) -> bool {
+    // false

Review Comment:
   Left over?



##########
datafusion/core/tests/sql/aggregates.rs:
##########
@@ -1223,6 +1669,79 @@ async fn count_aggregated() -> Result<()> {
     Ok(())
 }
 
+#[tokio::test]
+async fn count_aggregated_cube() -> Result<()> {
+    let results = execute_with_partition(
+        "SELECT c1, c2, COUNT(c3) FROM test GROUP BY CUBE (c1, c2) ORDER BY c1, c2",
+        4,
+    )
+    .await?;
+
+    let expected = vec![
+        "+----+----+----------------+",
+        "| c1 | c2 | COUNT(test.c3) |",
+        "+----+----+----------------+",
+        "|    |    | 40             |",
+        "|    | 1  | 4              |",
+        "|    | 10 | 4              |",
+        "|    | 2  | 4              |",
+        "|    | 3  | 4              |",
+        "|    | 4  | 4              |",
+        "|    | 5  | 4              |",
+        "|    | 6  | 4              |",
+        "|    | 7  | 4              |",
+        "|    | 8  | 4              |",
+        "|    | 9  | 4              |",
+        "| 0  |    | 10             |",
+        "| 0  | 1  | 1              |",
+        "| 0  | 10 | 1              |",
+        "| 0  | 2  | 1              |",
+        "| 0  | 3  | 1              |",
+        "| 0  | 4  | 1              |",
+        "| 0  | 5  | 1              |",
+        "| 0  | 6  | 1              |",
+        "| 0  | 7  | 1              |",
+        "| 0  | 8  | 1              |",
+        "| 0  | 9  | 1              |",
+        "| 1  |    | 10             |",
+        "| 1  | 1  | 1              |",
+        "| 1  | 10 | 1              |",
+        "| 1  | 2  | 1              |",
+        "| 1  | 3  | 1              |",
+        "| 1  | 4  | 1              |",
+        "| 1  | 5  | 1              |",
+        "| 1  | 6  | 1              |",
+        "| 1  | 7  | 1              |",
+        "| 1  | 8  | 1              |",
+        "| 1  | 9  | 1              |",
+        "| 2  |    | 10             |",
+        "| 2  | 1  | 1              |",
+        "| 2  | 10 | 1              |",
+        "| 2  | 2  | 1              |",
+        "| 2  | 3  | 1              |",
+        "| 2  | 4  | 1              |",
+        "| 2  | 5  | 1              |",
+        "| 2  | 6  | 1              |",
+        "| 2  | 7  | 1              |",
+        "| 2  | 8  | 1              |",
+        "| 2  | 9  | 1              |",
+        "| 3  |    | 10             |",
+        "| 3  | 1  | 1              |",
+        "| 3  | 10 | 1              |",
+        "| 3  | 2  | 1              |",
+        "| 3  | 3  | 1              |",
+        "| 3  | 4  | 1              |",
+        "| 3  | 5  | 1              |",
+        "| 3  | 6  | 1              |",
+        "| 3  | 7  | 1              |",
+        "| 3  | 8  | 1              |",
+        "| 3  | 9  | 1              |",
+        "+----+----+----------------+",
+    ];
+    assert_batches_sorted_eq!(expected, &results);
+    Ok(())
+}
+

Review Comment:
   I think the test coverage is quite good. Thank you



##########
datafusion/core/src/physical_plan/planner.rs:
##########
@@ -1001,6 +1001,275 @@ impl DefaultPhysicalPlanner {
             exec_plan
         }.boxed()
     }
+
+    fn create_grouping_physical_expr(
+        &self,
+        group_expr: &[Expr],
+        input_dfschema: &DFSchema,
+        input_schema: &Schema,
+        session_state: &SessionState,
+    ) -> Result<PhysicalGroupBy> {
+        if group_expr.len() == 1 {
+            match &group_expr[0] {
+                Expr::GroupingSet(GroupingSet::GroupingSets(grouping_sets)) => {
+                    merge_grouping_set_physical_expr(
+                        grouping_sets,
+                        input_dfschema,
+                        input_schema,
+                        session_state,
+                    )
+                }
+                Expr::GroupingSet(GroupingSet::Cube(exprs)) => create_cube_physical_expr(
+                    exprs,
+                    input_dfschema,
+                    input_schema,
+                    session_state,
+                ),
+                Expr::GroupingSet(GroupingSet::Rollup(exprs)) => {
+                    create_rollup_physical_expr(
+                        exprs,
+                        input_dfschema,
+                        input_schema,
+                        session_state,
+                    )
+                }
+                expr => Ok(PhysicalGroupBy::new_single(vec![tuple_err((
+                    self.create_physical_expr(
+                        expr,
+                        input_dfschema,
+                        input_schema,
+                        session_state,
+                    ),
+                    physical_name(expr),
+                ))?])),
+            }
+        } else {
+            Ok(PhysicalGroupBy::new_single(
+                group_expr
+                    .iter()
+                    .map(|e| {
+                        tuple_err((
+                            self.create_physical_expr(
+                                e,
+                                input_dfschema,
+                                input_schema,
+                                session_state,
+                            ),
+                            physical_name(e),
+                        ))
+                    })
+                    .collect::<Result<Vec<_>>>()?,
+            ))
+        }
+    }
+}
+
+/// Expand and align  a GROUPING SET expression.
+/// (see https://www.postgresql.org/docs/current/queries-table-expressions.html#QUERIES-GROUPING-SETS)
+///
+/// This will take a list of grouping sets and ensure that each group is
+/// properly aligned for the physical execution plan. We do this by
+/// identifying all unique expression in each group and conforming each
+/// group to the same set of expression types and ordering.
+/// For example, if we have something like `GROUPING SETS ((a,b,c),(a),(b),(b,c))`
+/// we would expand this to `GROUPING SETS ((a,b,c),(a,NULL,NULL),(NULL,b,NULL),(NULL,b,c))
+/// (see https://www.postgresql.org/docs/current/queries-table-expressions.html#QUERIES-GROUPING-SETS)
+fn merge_grouping_set_physical_expr(
+    grouping_sets: &[Vec<Expr>],
+    input_dfschema: &DFSchema,
+    input_schema: &Schema,
+    session_state: &SessionState,
+) -> Result<PhysicalGroupBy> {
+    let num_groups = grouping_sets.len();
+    let mut all_exprs: Vec<Expr> = vec![];
+    let mut grouping_set_expr: Vec<(Arc<dyn PhysicalExpr>, String)> = vec![];
+    let mut null_exprs: Vec<(Arc<dyn PhysicalExpr>, String)> = vec![];
+
+    for expr in grouping_sets.iter().flatten() {
+        if !all_exprs.contains(expr) {
+            all_exprs.push(expr.clone());
+
+            grouping_set_expr.push(get_physical_expr_pair(
+                expr,
+                input_dfschema,
+                input_schema,
+                session_state,
+            )?);
+
+            null_exprs.push(get_null_physical_expr_pair(
+                expr,
+                input_dfschema,
+                input_schema,
+                session_state,
+            )?);
+        }
+    }
+
+    let mut merged_sets: Vec<Vec<bool>> = Vec::with_capacity(num_groups);
+
+    let expr_count = all_exprs.len();
+
+    for expr_group in grouping_sets.iter() {
+        let mut group: Vec<bool> = Vec::with_capacity(expr_count);
+        for expr in all_exprs.iter() {
+            if expr_group.contains(expr) {
+                group.push(false);
+            } else {
+                group.push(true)
+            }
+        }

Review Comment:
   I don't think it matters, but you can probably express this in a functional style like:
   
   ```suggestion
           let group: Vec<bool> = all_exprs.iter()
             .map(expr_group.contains(expr))
             .collect();
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #2716: Support for GROUPING SETS/CUBE/ROLLUP

Posted by GitBox <gi...@apache.org>.
alamb commented on PR #2716:
URL: https://github.com/apache/arrow-datafusion/pull/2716#issuecomment-1154368725

   Looks great to me -- thanks again for all the work @thinkharderdev πŸŽ‰ 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] thinkharderdev commented on a diff in pull request #2716: Support for GROUPING SETS/CUBE/ROLLUP

Posted by GitBox <gi...@apache.org>.
thinkharderdev commented on code in PR #2716:
URL: https://github.com/apache/arrow-datafusion/pull/2716#discussion_r895168895


##########
datafusion/optimizer/src/single_distinct_to_groupby.rs:
##########
@@ -250,6 +280,29 @@ mod tests {
         Ok(())
     }
 
+    #[test]
+    fn single_distinct_and_grouping_set() -> Result<()> {

Review Comment:
   Yeah, I think there is actually a bug in this. I'll work on a fix. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] thinkharderdev commented on a diff in pull request #2716: Support for GROUPING SETS/CUBE/ROLLUP

Posted by GitBox <gi...@apache.org>.
thinkharderdev commented on code in PR #2716:
URL: https://github.com/apache/arrow-datafusion/pull/2716#discussion_r895175300


##########
datafusion/optimizer/src/single_distinct_to_groupby.rs:
##########
@@ -250,6 +280,29 @@ mod tests {
         Ok(())
     }
 
+    #[test]
+    fn single_distinct_and_grouping_set() -> Result<()> {

Review Comment:
   Ok, this optimization is a bit more complicated for grouping sets. We need to create a separate alias for each group. For the moment I have just disabled the optimization for this case. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org