You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/12/20 04:54:54 UTC

[GitHub] [spark] rxin opened a new pull request, #39134: [WIP] Implement group by star (aka group by all)

rxin opened a new pull request, #39134:
URL: https://github.com/apache/spark/pull/39134

### What changes were proposed in this pull request?
This patch implements group by star. This is similar to the "group by all" implemented in DuckDB. Note that I'm not done yet. We need to decide if the appropriate syntax is star, all, or both. We also need to decide if we want to throw a nicer error message when we cannot infer the right columns. Right now it just says invalid group by expression.

### Why are the changes needed?
It's nice convenience syntactic sugar for interactive sql to avoid repeating the grouping columns/expressions in group by, when they can be implicitly inferred. It actually brings SQL a little bit closer to the DataFrame API in terms of usability for aggregations.

### Does this PR introduce _any_ user-facing change?
Yes. This is a user facing syntactic sugar.

### How was this patch tested?
Added test cases homegrown, as well as test cases from DuckDB and Mosha.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] gengliangwang commented on a diff in pull request #39134: [WIP] Implement group by star (aka group by all)

Posted by GitBox <gi...@apache.org>.

gengliangwang commented on code in PR #39134:
URL: https://github.com/apache/spark/pull/39134#discussion_r1053728030


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveGroupByStar.scala:
##########
@@ -0,0 +1,93 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import org.apache.spark.sql.catalyst.expressions.{Attribute, Expression}
+import org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression
+import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, LogicalPlan}
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.catalyst.trees.TreePattern.{AGGREGATE, AGGREGATE_EXPRESSION, UNRESOLVED_STAR}
+
+/**
+ * Resolve the star in the group by statement in the following SQL pattern:
+ *  `select col1, col2, agg_expr(...) from table group by *`.
+ *
+ * The star is expanded to include all non-aggregate columns in the select clause.
+ */
+object ResolveGroupByStar extends Rule[LogicalPlan] {
+  override def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperatorsUpWithPruning(
+    _.containsAllPatterns(UNRESOLVED_STAR, AGGREGATE), ruleId) {
+    // Match a group by with a single unresolved star
+    case a: Aggregate if a.groupingExpressions == UnresolvedStar(None) :: Nil =>
+      // Only makes sense to do the rewrite once all the aggregate expressions have been resolved.
+      // Otherwise, we might incorrectly pull an actual aggregate expression over to the grouping
+      // expression list (because we don't know they would be aggregate expressions until resolved).
+      if (a.aggregateExpressions.forall(_.resolved)) {
+        val groupingExprs =
+          a.aggregateExpressions.filter(!_.containsPattern(AGGREGATE_EXPRESSION))

Review Comment:
   Created https://github.com/apache/spark/pull/39142 for this. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] gengliangwang commented on a diff in pull request #39134: [SPARK-41635][SQL] GROUP BY ALL

Posted by GitBox <gi...@apache.org>.

gengliangwang commented on code in PR #39134:
URL: https://github.com/apache/spark/pull/39134#discussion_r1054899920


##########
sql/core/src/test/resources/sql-tests/inputs/group-by-all.sql:
##########
@@ -0,0 +1,71 @@
+-- group by all
+-- see https://www.linkedin.com/posts/mosha_duckdb-firebolt-snowflake-activity-7009615821006131200-VQ0o
+
+create temporary view data as select * from values
+  ("USA", "San Francisco", "Reynold", 1, 11.0),
+  ("USA", "San Francisco", "Matei", 2, 12.0),
+  ("USA", "Berkeley", "Xiao", 3, 13.0),
+  ("China", "Hangzhou", "Wenchen", 4, 14.0),
+  ("China", "Shanghai", "Shanghaiese", 5, 15.0),
+  ("Korea", "Seoul", "Hyukjin", 6, 16.0),
+  ("UK", "London", "Sean", 7, 17.0)
+  as data(country, city, name, id, power);
+
+-- basic
+select country, count(*) from data group by ALL;
+
+-- different case
+select country, count(*) from data group by aLl;
+
+-- a column named "all" would still work
+select all, city, count(*) from (select country as all, city, id from data) group by all, city;
+
+-- a column named "all" should take precedence over the normal group by all expansion
+-- if all refers to the column, then the following should return 3 rows.
+-- if all refers to the global aggregate, then 1 row.
+SELECT count(1) FROM VALUES(1), (2), (3) AS T(all) GROUP BY all;

Review Comment:
   Shall we merge the rule `ResolveGroupByAll` into `ResolveReference` to avoid potential issues? 
   @cloud-fan is working on this for existing rules: https://github.com/apache/spark/pull/38888



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan commented on a diff in pull request #39134: [SPARK-41635][SQL] GROUP BY ALL

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on code in PR #39134:
URL: https://github.com/apache/spark/pull/39134#discussion_r1055051734


##########
sql/core/src/test/resources/sql-tests/inputs/group-by-all.sql:
##########
@@ -0,0 +1,71 @@
+-- group by all
+-- see https://www.linkedin.com/posts/mosha_duckdb-firebolt-snowflake-activity-7009615821006131200-VQ0o
+
+create temporary view data as select * from values
+  ("USA", "San Francisco", "Reynold", 1, 11.0),
+  ("USA", "San Francisco", "Matei", 2, 12.0),
+  ("USA", "Berkeley", "Xiao", 3, 13.0),
+  ("China", "Hangzhou", "Wenchen", 4, 14.0),
+  ("China", "Shanghai", "Shanghaiese", 5, 15.0),
+  ("Korea", "Seoul", "Hyukjin", 6, 16.0),
+  ("UK", "London", "Sean", 7, 17.0)
+  as data(country, city, name, id, power);
+
+-- basic
+select country, count(*) from data group by ALL;
+
+-- different case
+select country, count(*) from data group by aLl;
+
+-- a column named "all" would still work
+select all, city, count(*) from (select country as all, city, id from data) group by all, city;
+
+-- a column named "all" should take precedence over the normal group by all expansion
+-- if all refers to the column, then the following should return 3 rows.
+-- if all refers to the global aggregate, then 1 row.
+SELECT count(1) FROM VALUES(1), (2), (3) AS T(all) GROUP BY all;

Review Comment:
   OK, to make `ResolveGroupByAll` a bit more robust, we can do
   ```
   object ResolveGroupByAll extends Rule[LogicalPlan] {
     def apply(plan: LogicalPlan) = ResolveReference(plan).transform ...
   }
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] rxin commented on a diff in pull request #39134: [WIP] Implement group by star (aka group by all)

Posted by GitBox <gi...@apache.org>.

rxin commented on code in PR #39134:
URL: https://github.com/apache/spark/pull/39134#discussion_r1052993548


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveGroupByStar.scala:
##########
@@ -0,0 +1,93 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import org.apache.spark.sql.catalyst.expressions.{Attribute, Expression}
+import org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression
+import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, LogicalPlan}
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.catalyst.trees.TreePattern.{AGGREGATE, AGGREGATE_EXPRESSION, UNRESOLVED_STAR}
+
+/**
+ * Resolve the star in the group by statement in the following SQL pattern:
+ *  `select col1, col2, agg_expr(...) from table group by *`.
+ *
+ * The star is expanded to include all non-aggregate columns in the select clause.
+ */
+object ResolveGroupByStar extends Rule[LogicalPlan] {
+  override def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperatorsUpWithPruning(
+    _.containsAllPatterns(UNRESOLVED_STAR, AGGREGATE), ruleId) {
+    // Match a group by with a single unresolved star
+    case a: Aggregate if a.groupingExpressions == UnresolvedStar(None) :: Nil =>
+      // Only makes sense to do the rewrite once all the aggregate expressions have been resolved.
+      // Otherwise, we might incorrectly pull an actual aggregate expression over to the grouping
+      // expression list (because we don't know they would be aggregate expressions until resolved).
+      if (a.aggregateExpressions.forall(_.resolved)) {
+        val groupingExprs =
+          a.aggregateExpressions.filter(!_.containsPattern(AGGREGATE_EXPRESSION))

Review Comment:
   @gengliangwang i learned that this doesn't handle PythonUDF aggregate properly. I think we should fix that issue instead of changing it here though.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-41635][SQL] GROUP BY ALL [spark]

Posted by "cuixiaozong (via GitHub)" <gi...@apache.org>.

cuixiaozong commented on PR #39134:
URL: https://github.com/apache/spark/pull/39134#issuecomment-1794430313

   I noticed that we treat 'ALL' in 'GROUP BY ALL' as an Identifier in parsing, and this will cause problem when I turn on ansi keyword behavior
   **set spark.sql.ansi.enabled=true;
   set spark.sql.ansi.enforceReservedKeywords=true;**
   
   spark-sql (default)> select a,b,c, count(*) from values(1,2,3)t(a,b,c) group by all;
   
   [PARSE_SYNTAX_ERROR] Syntax error at or near 'all'.(line 1, pos 59)
   
   == SQL ==
   select a,b,c, count(*) from values(1,2,3)t(a,b,c) group by all
   -----------------------------------------------------------^^^
   
   can we allow this reserved keyword in ansi mode ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] rxin commented on a diff in pull request #39134: [WIP] Implement group by star (aka group by all)

Posted by GitBox <gi...@apache.org>.

rxin commented on code in PR #39134:
URL: https://github.com/apache/spark/pull/39134#discussion_r1052926462


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveGroupByStar.scala:
##########
@@ -0,0 +1,47 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression
+import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, LogicalPlan}
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.catalyst.trees.TreePattern.UNRESOLVED_STAR
+
+/**
+ * Resolve the star in the group by statement in the following SQL pattern:
+ *  `select col1, col2, agg_expr(...) from table group by *`.
+ *
+ * The star is expanded to include all non-aggregate columns in the select clause.
+ */
+object ResolveGroupByStar extends Rule[LogicalPlan] {
+  override def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperatorsUpWithPruning(
+    _.containsPattern(UNRESOLVED_STAR), ruleId) {

Review Comment:
   this is cool!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] gengliangwang commented on a diff in pull request #39134: [WIP] Implement group by star (aka group by all)

Posted by GitBox <gi...@apache.org>.

gengliangwang commented on code in PR #39134:
URL: https://github.com/apache/spark/pull/39134#discussion_r1052919912


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveGroupByStar.scala:
##########
@@ -0,0 +1,47 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression
+import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, LogicalPlan}
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.catalyst.trees.TreePattern.UNRESOLVED_STAR
+
+/**
+ * Resolve the star in the group by statement in the following SQL pattern:
+ *  `select col1, col2, agg_expr(...) from table group by *`.
+ *
+ * The star is expanded to include all non-aggregate columns in the select clause.
+ */
+object ResolveGroupByStar extends Rule[LogicalPlan] {
+  override def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperatorsUpWithPruning(
+    _.containsPattern(UNRESOLVED_STAR), ruleId) {
+    // Match a group by with a single unresolved star
+    case a: Aggregate if a.groupingExpressions == UnresolvedStar(None) :: Nil =>
+      // Only makes sense to do the rewrite once all the aggregate expressions have been resolved.
+      // Otherwise, we might incorrectly pull an actual aggregate expression over to the grouping
+      // expression list (because we don't know they would be aggregate expressions until resolved).
+      if (a.aggregateExpressions.forall(_.resolved)) {
+        val groupingExprs =
+          a.aggregateExpressions.filter(!_.exists(_.isInstanceOf[AggregateExpression]))

Review Comment:
   ```suggestion
             a.aggregateExpressions.filter(!_.containsPattern(AGGREGATE_EXPRESSION))
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan commented on a diff in pull request #39134: [SPARK-41635][SQL] GROUP BY ALL

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on code in PR #39134:
URL: https://github.com/apache/spark/pull/39134#discussion_r1055033212


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveGroupByAll.scala:
##########
@@ -0,0 +1,110 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import org.apache.spark.sql.catalyst.expressions.{Attribute, Expression}
+import org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression
+import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, LogicalPlan}
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.catalyst.trees.TreePattern.{AGGREGATE, UNRESOLVED_ATTRIBUTE}
+
+/**
+ * Resolve "group by all" in the following SQL pattern:
+ *  `select col1, col2, agg_expr(...) from table group by all`.
+ *
+ * The all is expanded to include all non-aggregate columns in the select clause.
+ */
+object ResolveGroupByAll extends Rule[LogicalPlan] {
+
+  val ALL = "ALL"
+
+  /**
+   * Returns true iff this is a GROUP BY ALL aggregate. i.e. an Aggregate expression that has
+   * a single unresolved all grouping expression.
+   */
+  private def matchToken(a: Aggregate): Boolean = {
+    if (a.groupingExpressions.size != 1) {
+      return false
+    }
+    a.groupingExpressions.head match {
+      case a: UnresolvedAttribute => a.name.toUpperCase() == ALL
+      case _ => false
+    }
+  }
+
+  override def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperatorsUpWithPruning(
+    _.containsAllPatterns(UNRESOLVED_ATTRIBUTE, AGGREGATE), ruleId) {
+    // Match a group by with a single unresolved star

Review Comment:
   the comment needs update



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] rxin commented on a diff in pull request #39134: [WIP] Implement group by star (aka group by all)

Posted by GitBox <gi...@apache.org>.

rxin commented on code in PR #39134:
URL: https://github.com/apache/spark/pull/39134#discussion_r1052910941


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveGroupByStar.scala:
##########
@@ -0,0 +1,47 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression
+import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, LogicalPlan}
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.catalyst.trees.TreePattern.UNRESOLVED_STAR
+
+/**
+ * Resolve the star in the group by statement in the following SQL pattern:
+ *  `select col1, col2, agg_expr(...) from table group by *`.
+ *
+ * The star is expanded to include all non-aggregate columns in the select clause.
+ */
+object ResolveGroupByStar extends Rule[LogicalPlan] {
+  override def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperatorsUpWithPruning(
+    _.containsPattern(UNRESOLVED_STAR), ruleId) {
+    // Match a group by with a single unresolved star
+    case a: Aggregate if a.groupingExpressions == UnresolvedStar(None) :: Nil =>
+      // Only makes sense to do the rewrite once all the aggregate expressions have been resolved.
+      // Otherwise, we might incorrectly pull an actual aggregate expression over to the grouping
+      // expression list (because we don't know they would be aggregate expressions until resolved).
+      if (a.aggregateExpressions.forall(_.resolved)) {
+        val groupingExprs =
+          a.aggregateExpressions.filter(!_.exists(_.isInstanceOf[AggregateExpression]))
+        a.copy(groupingExpressions = groupingExprs)

Review Comment:
   there's a usability issue here: the error message is kind of terrible when this rule cannot infer the columns and simply return Nil. the error msg will become an empty group by which is confusing. once we figure out the syntax, we should throw a better error in checkAnalysis.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] rxin commented on a diff in pull request #39134: [WIP] Implement group by star (aka group by all)

Posted by GitBox <gi...@apache.org>.

rxin commented on code in PR #39134:
URL: https://github.com/apache/spark/pull/39134#discussion_r1052909245


##########
sql/core/src/test/resources/sql-tests/inputs/group-by-star.sql:
##########
@@ -0,0 +1,45 @@
+-- group by all

Review Comment:
   do we need a test case for window functions?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan commented on a diff in pull request #39134: [SPARK-41635][SQL] GROUP BY ALL

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on code in PR #39134:
URL: https://github.com/apache/spark/pull/39134#discussion_r1055044808


##########
sql/core/src/test/resources/sql-tests/inputs/group-by-all.sql:
##########
@@ -0,0 +1,71 @@
+-- group by all
+-- see https://www.linkedin.com/posts/mosha_duckdb-firebolt-snowflake-activity-7009615821006131200-VQ0o
+
+create temporary view data as select * from values
+  ("USA", "San Francisco", "Reynold", 1, 11.0),
+  ("USA", "San Francisco", "Matei", 2, 12.0),
+  ("USA", "Berkeley", "Xiao", 3, 13.0),
+  ("China", "Hangzhou", "Wenchen", 4, 14.0),
+  ("China", "Shanghai", "Shanghaiese", 5, 15.0),
+  ("Korea", "Seoul", "Hyukjin", 6, 16.0),
+  ("UK", "London", "Sean", 7, 17.0)
+  as data(country, city, name, id, power);
+
+-- basic
+select country, count(*) from data group by ALL;
+
+-- different case
+select country, count(*) from data group by aLl;
+
+-- a column named "all" would still work
+select all, city, count(*) from (select country as all, city, id from data) group by all, city;
+
+-- a column named "all" should take precedence over the normal group by all expansion
+-- if all refers to the column, then the following should return 3 rows.
+-- if all refers to the global aggregate, then 1 row.
+SELECT count(1) FROM VALUES(1), (2), (3) AS T(all) GROUP BY all;

Review Comment:
   I think it's better to put column resolution in one rule. Relying on the rule order is not robust. Let's look at an example: let's say the plan tree is `A -> B -> C`, and only the leaf node `A` is resolved for now.
   1. `ResolveReference` resolves columns on `B`, but `B` is not fully resolved as it contains an unresolved function
   2. `ResolveFunctions` resolves functions in `B`, now `B` is fully resolved
   3. `ResolveGroupBy` resolves column `ALL` in `C`. This breaks the expected resolution order.
   
   Ideally, we want to run different column resolution rules in order for each plan node, not the entire plan tree. Merging them into a single rule is the most straightforward approach, or we need to design a complicated framework to achieve this.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan commented on pull request #39134: [SPARK-41635][SQL] GROUP BY ALL

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on PR #39134:
URL: https://github.com/apache/spark/pull/39134#issuecomment-1362755021

   thanks, merging to master!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan commented on a diff in pull request #39134: [WIP] Implement group by star (aka group by all)

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on code in PR #39134:
URL: https://github.com/apache/spark/pull/39134#discussion_r1052911008


##########
sql/core/src/test/resources/sql-tests/results/group-by-star-mosha.sql.out:
##########
@@ -0,0 +1,141 @@
+-- Automatically generated by SQLQueryTestSuite
+-- !query
+create temporary view stuff as select * from values
+  (42, 9.75, 'hello world', '1970-08-07', '13.37', array(1,20,300)),
+  (1337, 1.2345, 'oh no', '2000-01-01', '42.0', array(4000,50000,600000)),
+  (42, 13.37, 'test', '1970-08-07', '1234567890', array(7000000,80000000,900000000))
+  as stuff(i, f, s, t, d, a)
+-- !query schema
+struct<>
+-- !query output
+
+
+
+-- !query
+SELECT 100 * SUM(i) + SUM(f) / COUNT(s) AS f1, i AS f2 FROM stuff GROUP BY * ORDER BY f2
+-- !query schema
+struct<f1:decimal(38,17),f2:int>
+-- !query output
+8411.56000000000000000	42
+133701.23450000000000000	1337
+
+
+-- !query
+SELECT i + 1 AS i1, COUNT(i - 2) ci, f / i AS fi, SUM(i + f) sif FROM stuff GROUP BY * ORDER BY 1, 3
+-- !query schema
+struct<i1:int,ci:bigint,fi:decimal(17,15),sif:decimal(25,4)>
+-- !query output
+43	1	0.232142857142857	51.7500
+43	1	0.318333333333333	55.3700
+1338	1	0.000923335826477	1338.2345
+
+
+-- !query
+SELECT i AS i, COUNT(i) ci, f AS f, SUM(i + f) sif FROM stuff GROUP BY * ORDER BY 1, i, 2, ci, 3, f, 4, sif
+-- !query schema
+struct<i:int,ci:bigint,f:decimal(6,4),sif:decimal(25,4)>
+-- !query output
+42	1	9.7500	51.7500
+42	1	13.3700	55.3700
+1337	1	1.2345	1338.2345
+
+
+-- !query
+SELECT i + 1, f / i, substring(s, 2, 3), extract(year from t), d / 2, size(a) FROM stuff
+GROUP BY * ORDER BY 1, 3, 4, 5, 6, 2
+-- !query schema
+struct<(i + 1):int,(f / i):decimal(17,15),substring(s, 2, 3):string,extract(year FROM t):int,(d / 2):double,size(a):int>
+-- !query output
+43	0.232142857142857	ell	1970	6.685	3
+43	0.318333333333333	est	1970	6.17283945E8	3
+1338	0.000923335826477	h n	2000	21.0	3
+
+
+-- !query
+SELECT i + SUM(f) FROM stuff GROUP BY *
+-- !query schema
+struct<>
+-- !query output
+org.apache.spark.sql.AnalysisException
+{
+  "errorClass" : "MISSING_GROUP_BY",

Review Comment:
   The error message is
   ```
     "MISSING_GROUP_BY" : {
       "message" : [
         "The query does not include a GROUP BY clause. Add GROUP BY or turn it into the window functions using OVER clauses."
       ]
     },
   ```
   It's a bit misleading as users do specify the `GROUP BY *` clause.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] gengliangwang commented on a diff in pull request #39134: [SPARK-41635][SQL] GROUP BY ALL

Posted by GitBox <gi...@apache.org>.

gengliangwang commented on code in PR #39134:
URL: https://github.com/apache/spark/pull/39134#discussion_r1054902352


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveGroupByAll.scala:
##########
@@ -0,0 +1,110 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import org.apache.spark.sql.catalyst.expressions.{Attribute, Expression}
+import org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression
+import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, LogicalPlan}
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.catalyst.trees.TreePattern.{AGGREGATE, AGGREGATE_EXPRESSION, UNRESOLVED_ATTRIBUTE}
+
+/**
+ * Resolve "group by all" in the following SQL pattern:
+ *  `select col1, col2, agg_expr(...) from table group by all`.
+ *
+ * The all is expanded to include all non-aggregate columns in the select clause.
+ */
+object ResolveGroupByAll extends Rule[LogicalPlan] {
+
+  val ALL = "ALL"
+
+  /**
+   * Returns true iff this is a GROUP BY ALL aggregate. i.e. an Aggregate expression that has
+   * a single unresolved all grouping expression.
+   */
+  private def matchToken(a: Aggregate): Boolean = {
+    if (a.groupingExpressions.size != 1) {
+      return false
+    }
+    a.groupingExpressions.head match {
+      case a: UnresolvedAttribute => a.name.toUpperCase() == ALL
+      case _ => false
+    }
+  }
+
+  override def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperatorsUpWithPruning(
+    _.containsAllPatterns(UNRESOLVED_ATTRIBUTE, AGGREGATE), ruleId) {
+    // Match a group by with a single unresolved star
+    case a: Aggregate if matchToken(a) =>
+      // Only makes sense to do the rewrite once all the aggregate expressions have been resolved.
+      // Otherwise, we might incorrectly pull an actual aggregate expression over to the grouping
+      // expression list (because we don't know they would be aggregate expressions until resolved).
+      if (a.aggregateExpressions.forall(_.resolved)) {
+        val groupingExprs =
+          a.aggregateExpressions.filter(!_.containsPattern(AGGREGATE_EXPRESSION))

Review Comment:
   Sorry for the bad suggestion last time. How about using `AggregateExpression.isAggregate` here ?
   ```
   a.aggregateExpressions.filter(!_.exists(AggregateExpression.isAggregate))
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] gengliangwang commented on a diff in pull request #39134: [WIP] Implement group by star (aka group by all)

Posted by GitBox <gi...@apache.org>.

gengliangwang commented on code in PR #39134:
URL: https://github.com/apache/spark/pull/39134#discussion_r1052919759


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveGroupByStar.scala:
##########
@@ -0,0 +1,47 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression
+import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, LogicalPlan}
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.catalyst.trees.TreePattern.UNRESOLVED_STAR
+
+/**
+ * Resolve the star in the group by statement in the following SQL pattern:
+ *  `select col1, col2, agg_expr(...) from table group by *`.
+ *
+ * The star is expanded to include all non-aggregate columns in the select clause.
+ */
+object ResolveGroupByStar extends Rule[LogicalPlan] {
+  override def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperatorsUpWithPruning(
+    _.containsPattern(UNRESOLVED_STAR), ruleId) {

Review Comment:
   ```suggestion
       _.containsAllPatterns(UNRESOLVED_STAR, AGGREGATE), ruleId) {
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] rxin commented on a diff in pull request #39134: [SPARK-41635][SQL] GROUP BY ALL

Posted by GitBox <gi...@apache.org>.

rxin commented on code in PR #39134:
URL: https://github.com/apache/spark/pull/39134#discussion_r1055047472


##########
sql/core/src/test/resources/sql-tests/inputs/group-by-all.sql:
##########
@@ -0,0 +1,71 @@
+-- group by all
+-- see https://www.linkedin.com/posts/mosha_duckdb-firebolt-snowflake-activity-7009615821006131200-VQ0o
+
+create temporary view data as select * from values
+  ("USA", "San Francisco", "Reynold", 1, 11.0),
+  ("USA", "San Francisco", "Matei", 2, 12.0),
+  ("USA", "Berkeley", "Xiao", 3, 13.0),
+  ("China", "Hangzhou", "Wenchen", 4, 14.0),
+  ("China", "Shanghai", "Shanghaiese", 5, 15.0),
+  ("Korea", "Seoul", "Hyukjin", 6, 16.0),
+  ("UK", "London", "Sean", 7, 17.0)
+  as data(country, city, name, id, power);
+
+-- basic
+select country, count(*) from data group by ALL;
+
+-- different case
+select country, count(*) from data group by aLl;
+
+-- a column named "all" would still work
+select all, city, count(*) from (select country as all, city, id from data) group by all, city;
+
+-- a column named "all" should take precedence over the normal group by all expansion
+-- if all refers to the column, then the following should return 3 rows.
+-- if all refers to the global aggregate, then 1 row.
+SELECT count(1) FROM VALUES(1), (2), (3) AS T(all) GROUP BY all;

Review Comment:
   Can we take that discussion elsewhere? This is about group by all, not fixing the analyzer tech debt.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] rxin commented on a diff in pull request #39134: [SPARK-41635][SQL] GROUP BY ALL

Posted by GitBox <gi...@apache.org>.

rxin commented on code in PR #39134:
URL: https://github.com/apache/spark/pull/39134#discussion_r1055083094


##########
sql/core/src/test/resources/sql-tests/inputs/group-by-all.sql:
##########
@@ -0,0 +1,71 @@
+-- group by all
+-- see https://www.linkedin.com/posts/mosha_duckdb-firebolt-snowflake-activity-7009615821006131200-VQ0o
+
+create temporary view data as select * from values
+  ("USA", "San Francisco", "Reynold", 1, 11.0),
+  ("USA", "San Francisco", "Matei", 2, 12.0),
+  ("USA", "Berkeley", "Xiao", 3, 13.0),
+  ("China", "Hangzhou", "Wenchen", 4, 14.0),
+  ("China", "Shanghai", "Shanghaiese", 5, 15.0),
+  ("Korea", "Seoul", "Hyukjin", 6, 16.0),
+  ("UK", "London", "Sean", 7, 17.0)
+  as data(country, city, name, id, power);
+
+-- basic
+select country, count(*) from data group by ALL;
+
+-- different case
+select country, count(*) from data group by aLl;
+
+-- a column named "all" would still work
+select all, city, count(*) from (select country as all, city, id from data) group by all, city;
+
+-- a column named "all" should take precedence over the normal group by all expansion
+-- if all refers to the column, then the following should return 3 rows.
+-- if all refers to the global aggregate, then 1 row.
+SELECT count(1) FROM VALUES(1), (2), (3) AS T(all) GROUP BY all;

Review Comment:
   addressed as discussed offline, by adding another resolved check on child.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] rxin commented on a diff in pull request #39134: [SPARK-41635][SQL] GROUP BY ALL

Posted by GitBox <gi...@apache.org>.

rxin commented on code in PR #39134:
URL: https://github.com/apache/spark/pull/39134#discussion_r1055047472


##########
sql/core/src/test/resources/sql-tests/inputs/group-by-all.sql:
##########
@@ -0,0 +1,71 @@
+-- group by all
+-- see https://www.linkedin.com/posts/mosha_duckdb-firebolt-snowflake-activity-7009615821006131200-VQ0o
+
+create temporary view data as select * from values
+  ("USA", "San Francisco", "Reynold", 1, 11.0),
+  ("USA", "San Francisco", "Matei", 2, 12.0),
+  ("USA", "Berkeley", "Xiao", 3, 13.0),
+  ("China", "Hangzhou", "Wenchen", 4, 14.0),
+  ("China", "Shanghai", "Shanghaiese", 5, 15.0),
+  ("Korea", "Seoul", "Hyukjin", 6, 16.0),
+  ("UK", "London", "Sean", 7, 17.0)
+  as data(country, city, name, id, power);
+
+-- basic
+select country, count(*) from data group by ALL;
+
+-- different case
+select country, count(*) from data group by aLl;
+
+-- a column named "all" would still work
+select all, city, count(*) from (select country as all, city, id from data) group by all, city;
+
+-- a column named "all" should take precedence over the normal group by all expansion
+-- if all refers to the column, then the following should return 3 rows.
+-- if all refers to the global aggregate, then 1 row.
+SELECT count(1) FROM VALUES(1), (2), (3) AS T(all) GROUP BY all;

Review Comment:
   Can we take that discussion elsewhere? This is about group by all, not fixing the analyzer tech debt. And I 100% agree that we should fix the tech debt, just not here.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan commented on a diff in pull request #39134: [SPARK-41635][SQL] GROUP BY ALL

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on code in PR #39134:
URL: https://github.com/apache/spark/pull/39134#discussion_r1055037350


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveGroupByAll.scala:
##########
@@ -0,0 +1,110 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import org.apache.spark.sql.catalyst.expressions.{Attribute, Expression}
+import org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression
+import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, LogicalPlan}
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.catalyst.trees.TreePattern.{AGGREGATE, AGGREGATE_EXPRESSION, UNRESOLVED_ATTRIBUTE}
+
+/**
+ * Resolve "group by all" in the following SQL pattern:
+ *  `select col1, col2, agg_expr(...) from table group by all`.
+ *
+ * The all is expanded to include all non-aggregate columns in the select clause.
+ */
+object ResolveGroupByAll extends Rule[LogicalPlan] {
+
+  val ALL = "ALL"
+
+  /**
+   * Returns true iff this is a GROUP BY ALL aggregate. i.e. an Aggregate expression that has
+   * a single unresolved all grouping expression.
+   */
+  private def matchToken(a: Aggregate): Boolean = {
+    if (a.groupingExpressions.size != 1) {
+      return false
+    }
+    a.groupingExpressions.head match {
+      case a: UnresolvedAttribute => a.name.toUpperCase() == ALL
+      case _ => false
+    }
+  }
+
+  override def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperatorsUpWithPruning(
+    _.containsAllPatterns(UNRESOLVED_ATTRIBUTE, AGGREGATE), ruleId) {
+    // Match a group by with a single unresolved star
+    case a: Aggregate if matchToken(a) =>
+      // Only makes sense to do the rewrite once all the aggregate expressions have been resolved.
+      // Otherwise, we might incorrectly pull an actual aggregate expression over to the grouping
+      // expression list (because we don't know they would be aggregate expressions until resolved).
+      if (a.aggregateExpressions.forall(_.resolved)) {
+        val groupingExprs =
+          a.aggregateExpressions.filter(!_.containsPattern(AGGREGATE_EXPRESSION))
+
+        // If the grouping exprs are empty, this could either be (1) a valid global aggregate, or
+        // (2) we simply fail to infer the grouping columns. As an example, in "i + sum(j)", we will
+        // not automatically infer the grouping column to be "i".
+        if (groupingExprs.isEmpty && a.aggregateExpressions.exists(containsAttribute)) {
+          // Case (2): don't replace the ALL. We will eventually tell the user in checkAnalysis
+          // that we cannot resolve the all in group by.
+          a
+        } else {
+          // Case (1): this is a valid global aggregate.
+          a.copy(groupingExpressions = groupingExprs)
+        }
+      } else {
+        a
+      }
+  }
+
+  /**
+   * Returns true if the expression includes an Attribute outside the aggregate expression part.
+   * For example:
+   *  "i" -> true
+   *  "i + 2" -> true
+   *  "i + sum(j)" -> true
+   *  "sum(j)" -> false
+   *  "sum(j) / 2" -> false
+   */
+  private def containsAttribute(expr: Expression): Boolean = expr match {
+    case _ if AggregateExpression.isAggregate(expr) =>
+      // Don't recurse into AggregateExpressions
+      false
+    case _: Attribute =>
+      true
+      // TODO: do we need to worry about ScalarSubquery here?

Review Comment:
   Good question. I think the idea is, we should only allow non-correlated subquery expression here. The check can be
   ```
   case sub: SubqueryExpression => sub.outerAttrs.nonEmpty
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] gengliangwang commented on a diff in pull request #39134: [WIP] Implement group by star (aka group by all)

Posted by GitBox <gi...@apache.org>.

gengliangwang commented on code in PR #39134:
URL: https://github.com/apache/spark/pull/39134#discussion_r1052920262


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveGroupByStar.scala:
##########
@@ -0,0 +1,47 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression
+import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, LogicalPlan}
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.catalyst.trees.TreePattern.UNRESOLVED_STAR

Review Comment:
   ```suggestion
   import org.apache.spark.sql.catalyst.trees.TreePattern.{AGGREGATE, AGGREGATE_EXPRESSION, UNRESOLVED_STAR}
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] rxin commented on a diff in pull request #39134: [WIP] Implement group by star (aka group by all)

Posted by GitBox <gi...@apache.org>.

rxin commented on code in PR #39134:
URL: https://github.com/apache/spark/pull/39134#discussion_r1053746714


##########
sql/core/src/test/resources/sql-tests/inputs/group-by-all.sql:
##########
@@ -0,0 +1,72 @@
+-- group by all
+-- see https://www.linkedin.com/posts/mosha_duckdb-firebolt-snowflake-activity-7009615821006131200-VQ0o
+
+create temporary view data as select * from values
+  ("USA", "San Francisco", "Reynold", 1, 11.0),
+  ("USA", "San Francisco", "Matei", 2, 12.0),
+  ("USA", "Berkeley", "Xiao", 3, 13.0),
+  ("China", "Hangzhou", "Wenchen", 4, 14.0),
+  ("China", "Shanghai", "Shanghaiese", 5, 15.0),
+  ("Korea", "Seoul", "Hyukjin", 6, 16.0),
+  ("UK", "London", "Sean", 7, 17.0)
+  as data(country, city, name, id, power);
+
+-- basic
+select country, count(*) from data group by ALL;
+
+-- different case
+select country, count(*) from data group by aLl;
+
+-- a column named "all" would still work
+select all, city, count(*) from (select country as all, city, id from data) group by all, city;
+
+-- a column named "all" should take precedence over the normal group by all expansion
+-- if the "group by all" is expanded to refer to the substr, then we'd have only 3 rows in output,
+-- because "USA" and "UK" both start with "U". if the "group by all" is referring to the all
+-- column, the output would have 4 rows.
+select substr(all, 0, 1), count(*) from (select country as all, id from data) group by all;

Review Comment:
   @cloud-fan i think this case happens to work today because ResolveGroupByAll runs after ResolveReferences. But since they are in the same batch, and ResolveReference doesn't run until fix point, is it possible that this won't work for more complex cases? 
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan commented on a diff in pull request #39134: [WIP] Implement group by star (aka group by all)

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on code in PR #39134:
URL: https://github.com/apache/spark/pull/39134#discussion_r1052911386


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveGroupByStar.scala:
##########
@@ -0,0 +1,47 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression
+import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, LogicalPlan}
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.catalyst.trees.TreePattern.UNRESOLVED_STAR
+
+/**
+ * Resolve the star in the group by statement in the following SQL pattern:
+ *  `select col1, col2, agg_expr(...) from table group by *`.
+ *
+ * The star is expanded to include all non-aggregate columns in the select clause.
+ */
+object ResolveGroupByStar extends Rule[LogicalPlan] {
+  override def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperatorsUpWithPruning(
+    _.containsPattern(UNRESOLVED_STAR), ruleId) {
+    // Match a group by with a single unresolved star
+    case a: Aggregate if a.groupingExpressions == UnresolvedStar(None) :: Nil =>
+      // Only makes sense to do the rewrite once all the aggregate expressions have been resolved.
+      // Otherwise, we might incorrectly pull an actual aggregate expression over to the grouping
+      // expression list (because we don't know they would be aggregate expressions until resolved).
+      if (a.aggregateExpressions.forall(_.resolved)) {
+        val groupingExprs =
+          a.aggregateExpressions.filter(!_.exists(_.isInstanceOf[AggregateExpression]))
+        a.copy(groupingExpressions = groupingExprs)

Review Comment:
   if `groupingExprs` is empty, shall we return the original plan, then we can throw a better error message later for `GROUP BY *` that doesn't match anything.



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveGroupByStar.scala:
##########
@@ -0,0 +1,47 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression
+import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, LogicalPlan}
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.catalyst.trees.TreePattern.UNRESOLVED_STAR
+
+/**
+ * Resolve the star in the group by statement in the following SQL pattern:
+ *  `select col1, col2, agg_expr(...) from table group by *`.
+ *
+ * The star is expanded to include all non-aggregate columns in the select clause.
+ */
+object ResolveGroupByStar extends Rule[LogicalPlan] {
+  override def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperatorsUpWithPruning(
+    _.containsPattern(UNRESOLVED_STAR), ruleId) {
+    // Match a group by with a single unresolved star
+    case a: Aggregate if a.groupingExpressions == UnresolvedStar(None) :: Nil =>
+      // Only makes sense to do the rewrite once all the aggregate expressions have been resolved.
+      // Otherwise, we might incorrectly pull an actual aggregate expression over to the grouping
+      // expression list (because we don't know they would be aggregate expressions until resolved).
+      if (a.aggregateExpressions.forall(_.resolved)) {
+        val groupingExprs =
+          a.aggregateExpressions.filter(!_.exists(_.isInstanceOf[AggregateExpression]))
+        a.copy(groupingExpressions = groupingExprs)

Review Comment:
   if `groupingExprs` is empty, shall we return the original plan? then we can throw a better error message later for `GROUP BY *` that doesn't match anything.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] rxin commented on a diff in pull request #39134: [WIP] Implement group by star (aka group by all)

Posted by GitBox <gi...@apache.org>.

rxin commented on code in PR #39134:
URL: https://github.com/apache/spark/pull/39134#discussion_r1052995619


##########
sql/core/src/test/resources/sql-tests/results/group-by-star-mosha.sql.out:
##########
@@ -0,0 +1,141 @@
+-- Automatically generated by SQLQueryTestSuite
+-- !query
+create temporary view stuff as select * from values
+  (42, 9.75, 'hello world', '1970-08-07', '13.37', array(1,20,300)),
+  (1337, 1.2345, 'oh no', '2000-01-01', '42.0', array(4000,50000,600000)),
+  (42, 13.37, 'test', '1970-08-07', '1234567890', array(7000000,80000000,900000000))
+  as stuff(i, f, s, t, d, a)
+-- !query schema
+struct<>
+-- !query output
+
+
+
+-- !query
+SELECT 100 * SUM(i) + SUM(f) / COUNT(s) AS f1, i AS f2 FROM stuff GROUP BY * ORDER BY f2
+-- !query schema
+struct<f1:decimal(38,17),f2:int>
+-- !query output
+8411.56000000000000000	42
+133701.23450000000000000	1337
+
+
+-- !query
+SELECT i + 1 AS i1, COUNT(i - 2) ci, f / i AS fi, SUM(i + f) sif FROM stuff GROUP BY * ORDER BY 1, 3
+-- !query schema
+struct<i1:int,ci:bigint,fi:decimal(17,15),sif:decimal(25,4)>
+-- !query output
+43	1	0.232142857142857	51.7500
+43	1	0.318333333333333	55.3700
+1338	1	0.000923335826477	1338.2345
+
+
+-- !query
+SELECT i AS i, COUNT(i) ci, f AS f, SUM(i + f) sif FROM stuff GROUP BY * ORDER BY 1, i, 2, ci, 3, f, 4, sif
+-- !query schema
+struct<i:int,ci:bigint,f:decimal(6,4),sif:decimal(25,4)>
+-- !query output
+42	1	9.7500	51.7500
+42	1	13.3700	55.3700
+1337	1	1.2345	1338.2345
+
+
+-- !query
+SELECT i + 1, f / i, substring(s, 2, 3), extract(year from t), d / 2, size(a) FROM stuff
+GROUP BY * ORDER BY 1, 3, 4, 5, 6, 2
+-- !query schema
+struct<(i + 1):int,(f / i):decimal(17,15),substring(s, 2, 3):string,extract(year FROM t):int,(d / 2):double,size(a):int>
+-- !query output
+43	0.232142857142857	ell	1970	6.685	3
+43	0.318333333333333	est	1970	6.17283945E8	3
+1338	0.000923335826477	h n	2000	21.0	3
+
+
+-- !query
+SELECT i + SUM(f) FROM stuff GROUP BY *
+-- !query schema
+struct<>
+-- !query output
+org.apache.spark.sql.AnalysisException
+{
+  "errorClass" : "MISSING_GROUP_BY",

Review Comment:
   this is now fixed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan commented on a diff in pull request #39134: [SPARK-41635][SQL] GROUP BY ALL

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on code in PR #39134:
URL: https://github.com/apache/spark/pull/39134#discussion_r1055032840


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveGroupByAll.scala:
##########
@@ -0,0 +1,110 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import org.apache.spark.sql.catalyst.expressions.{Attribute, Expression}
+import org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression
+import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, LogicalPlan}
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.catalyst.trees.TreePattern.{AGGREGATE, UNRESOLVED_ATTRIBUTE}
+
+/**
+ * Resolve "group by all" in the following SQL pattern:
+ *  `select col1, col2, agg_expr(...) from table group by all`.
+ *
+ * The all is expanded to include all non-aggregate columns in the select clause.
+ */
+object ResolveGroupByAll extends Rule[LogicalPlan] {
+
+  val ALL = "ALL"
+
+  /**
+   * Returns true iff this is a GROUP BY ALL aggregate. i.e. an Aggregate expression that has
+   * a single unresolved all grouping expression.
+   */
+  private def matchToken(a: Aggregate): Boolean = {
+    if (a.groupingExpressions.size != 1) {
+      return false
+    }
+    a.groupingExpressions.head match {
+      case a: UnresolvedAttribute => a.name.toUpperCase() == ALL

Review Comment:
   nit: `case a: UnresolvedAttribute if a.nameParts.length == 1 => a.nameParts.head.equalsIgnoreCase(ALL)`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-41635][SQL] GROUP BY ALL [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on PR #39134:
URL: https://github.com/apache/spark/pull/39134#issuecomment-1794484593

   @cuixiaozong good catch! Yea I think we can special case GROUP BY ALL in the ANSI parser. Would you like to open a PR to fix it? Thanks in advance!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] rxin commented on a diff in pull request #39134: [WIP] Implement group by star (aka group by all)

Posted by GitBox <gi...@apache.org>.

rxin commented on code in PR #39134:
URL: https://github.com/apache/spark/pull/39134#discussion_r1052933954


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveGroupByStar.scala:
##########
@@ -0,0 +1,47 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression
+import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, LogicalPlan}
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.catalyst.trees.TreePattern.UNRESOLVED_STAR
+
+/**
+ * Resolve the star in the group by statement in the following SQL pattern:
+ *  `select col1, col2, agg_expr(...) from table group by *`.
+ *
+ * The star is expanded to include all non-aggregate columns in the select clause.
+ */
+object ResolveGroupByStar extends Rule[LogicalPlan] {
+  override def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperatorsUpWithPruning(
+    _.containsPattern(UNRESOLVED_STAR), ruleId) {
+    // Match a group by with a single unresolved star
+    case a: Aggregate if a.groupingExpressions == UnresolvedStar(None) :: Nil =>
+      // Only makes sense to do the rewrite once all the aggregate expressions have been resolved.
+      // Otherwise, we might incorrectly pull an actual aggregate expression over to the grouping
+      // expression list (because we don't know they would be aggregate expressions until resolved).
+      if (a.aggregateExpressions.forall(_.resolved)) {
+        val groupingExprs =
+          a.aggregateExpressions.filter(!_.exists(_.isInstanceOf[AggregateExpression]))
+        a.copy(groupingExpressions = groupingExprs)

Review Comment:
   btw we cannot just return the original plan because then it wouldn't work for a legitimate case of global aggregates.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] rxin commented on a diff in pull request #39134: [SPARK-41635][SQL] GROUP BY ALL

Posted by GitBox <gi...@apache.org>.

rxin commented on code in PR #39134:
URL: https://github.com/apache/spark/pull/39134#discussion_r1053846808


##########
sql/core/src/test/resources/sql-tests/inputs/group-by-all.sql:
##########
@@ -0,0 +1,71 @@
+-- group by all
+-- see https://www.linkedin.com/posts/mosha_duckdb-firebolt-snowflake-activity-7009615821006131200-VQ0o
+
+create temporary view data as select * from values
+  ("USA", "San Francisco", "Reynold", 1, 11.0),
+  ("USA", "San Francisco", "Matei", 2, 12.0),
+  ("USA", "Berkeley", "Xiao", 3, 13.0),
+  ("China", "Hangzhou", "Wenchen", 4, 14.0),
+  ("China", "Shanghai", "Shanghaiese", 5, 15.0),
+  ("Korea", "Seoul", "Hyukjin", 6, 16.0),
+  ("UK", "London", "Sean", 7, 17.0)
+  as data(country, city, name, id, power);
+
+-- basic
+select country, count(*) from data group by ALL;
+
+-- different case
+select country, count(*) from data group by aLl;
+
+-- a column named "all" would still work
+select all, city, count(*) from (select country as all, city, id from data) group by all, city;
+
+-- a column named "all" should take precedence over the normal group by all expansion
+-- if all refers to the column, then the following should return 3 rows.
+-- if all refers to the global aggregate, then 1 row.
+SELECT count(1) FROM VALUES(1), (2), (3) AS T(all) GROUP BY all;

Review Comment:
   @cloud-fan i think this case happens to work today because ResolveGroupByAll runs after ResolveReferences. But since they are in the same batch, and ResolveReference doesn't run until fix point, is it possible that this won't work for more complex cases?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan commented on a diff in pull request #39134: [SPARK-41635][SQL] GROUP BY ALL

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on code in PR #39134:
URL: https://github.com/apache/spark/pull/39134#discussion_r1055033105


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveGroupByAll.scala:
##########
@@ -0,0 +1,110 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import org.apache.spark.sql.catalyst.expressions.{Attribute, Expression}
+import org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression
+import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, LogicalPlan}
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.catalyst.trees.TreePattern.{AGGREGATE, UNRESOLVED_ATTRIBUTE}
+
+/**
+ * Resolve "group by all" in the following SQL pattern:
+ *  `select col1, col2, agg_expr(...) from table group by all`.
+ *
+ * The all is expanded to include all non-aggregate columns in the select clause.
+ */
+object ResolveGroupByAll extends Rule[LogicalPlan] {
+
+  val ALL = "ALL"
+
+  /**
+   * Returns true iff this is a GROUP BY ALL aggregate. i.e. an Aggregate expression that has
+   * a single unresolved all grouping expression.
+   */
+  private def matchToken(a: Aggregate): Boolean = {
+    if (a.groupingExpressions.size != 1) {
+      return false
+    }
+    a.groupingExpressions.head match {
+      case a: UnresolvedAttribute => a.name.toUpperCase() == ALL
+      case _ => false
+    }
+  }
+
+  override def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperatorsUpWithPruning(
+    _.containsAllPatterns(UNRESOLVED_ATTRIBUTE, AGGREGATE), ruleId) {
+    // Match a group by with a single unresolved star
+    case a: Aggregate if matchToken(a) =>
+      // Only makes sense to do the rewrite once all the aggregate expressions have been resolved.
+      // Otherwise, we might incorrectly pull an actual aggregate expression over to the grouping
+      // expression list (because we don't know they would be aggregate expressions until resolved).
+      if (a.aggregateExpressions.forall(_.resolved)) {

Review Comment:
   nit: we can move this check to the pattern match
   ```
   case a: Aggregate if a.aggregateExpressions.forall(_.resolved) && matchToken(a) =>
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan commented on a diff in pull request #39134: [SPARK-41635][SQL] GROUP BY ALL

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on code in PR #39134:
URL: https://github.com/apache/spark/pull/39134#discussion_r1055051310


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveGroupByAll.scala:
##########
@@ -0,0 +1,110 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import org.apache.spark.sql.catalyst.expressions.{Attribute, Expression}
+import org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression
+import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, LogicalPlan}
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.catalyst.trees.TreePattern.{AGGREGATE, UNRESOLVED_ATTRIBUTE}
+
+/**
+ * Resolve "group by all" in the following SQL pattern:
+ *  `select col1, col2, agg_expr(...) from table group by all`.
+ *
+ * The all is expanded to include all non-aggregate columns in the select clause.
+ */
+object ResolveGroupByAll extends Rule[LogicalPlan] {
+
+  val ALL = "ALL"
+
+  /**
+   * Returns true iff this is a GROUP BY ALL aggregate. i.e. an Aggregate expression that has
+   * a single unresolved all grouping expression.
+   */
+  private def matchToken(a: Aggregate): Boolean = {
+    if (a.groupingExpressions.size != 1) {
+      return false
+    }
+    a.groupingExpressions.head match {
+      case a: UnresolvedAttribute => a.name.toUpperCase() == ALL

Review Comment:
   `.name` builds a string, which is unnecessary if `nameParts.length > 1`. This doesn't affect correctness though.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan commented on a diff in pull request #39134: [SPARK-41635][SQL] GROUP BY ALL

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on code in PR #39134:
URL: https://github.com/apache/spark/pull/39134#discussion_r1055044808


##########
sql/core/src/test/resources/sql-tests/inputs/group-by-all.sql:
##########
@@ -0,0 +1,71 @@
+-- group by all
+-- see https://www.linkedin.com/posts/mosha_duckdb-firebolt-snowflake-activity-7009615821006131200-VQ0o
+
+create temporary view data as select * from values
+  ("USA", "San Francisco", "Reynold", 1, 11.0),
+  ("USA", "San Francisco", "Matei", 2, 12.0),
+  ("USA", "Berkeley", "Xiao", 3, 13.0),
+  ("China", "Hangzhou", "Wenchen", 4, 14.0),
+  ("China", "Shanghai", "Shanghaiese", 5, 15.0),
+  ("Korea", "Seoul", "Hyukjin", 6, 16.0),
+  ("UK", "London", "Sean", 7, 17.0)
+  as data(country, city, name, id, power);
+
+-- basic
+select country, count(*) from data group by ALL;
+
+-- different case
+select country, count(*) from data group by aLl;
+
+-- a column named "all" would still work
+select all, city, count(*) from (select country as all, city, id from data) group by all, city;
+
+-- a column named "all" should take precedence over the normal group by all expansion
+-- if all refers to the column, then the following should return 3 rows.
+-- if all refers to the global aggregate, then 1 row.
+SELECT count(1) FROM VALUES(1), (2), (3) AS T(all) GROUP BY all;

Review Comment:
   I think it's better to put column resolution in one rule. Relying on the rule order is not robust. Let's look at an example: let's say the plan tree is `A -> B -> C`, and only the leaf node `A` is resolved for now.
   1. `ResolveReference` resolves columns on `B`, but `B` is not fully resolved as it contains an unresolved function
   2. `ResolveFunctions` resolves functions in `B`, now `B` is fully resolved
   3. `ResolveGroupBy` resolves column `ALL` in `C`. This breaks the expected resolution order.
   
   Ideally, we want to run different column resolution rules in order for each plan node, not the entire plan tree. Merging them into a single rule is the most straightforward approach (and also efficient), or we need to design a complicated framework to achieve this.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-41635][SQL] GROUP BY ALL [spark]

Posted by "cuixiaozong (via GitHub)" <gi...@apache.org>.

cuixiaozong commented on PR #39134:
URL: https://github.com/apache/spark/pull/39134#issuecomment-1794824606

   @cloud-fan sure. I've created https://issues.apache.org/jira/browse/SPARK-45806


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] rxin commented on a diff in pull request #39134: [SPARK-41635][SQL] GROUP BY ALL

Posted by GitBox <gi...@apache.org>.

rxin commented on code in PR #39134:
URL: https://github.com/apache/spark/pull/39134#discussion_r1054923707


##########
sql/core/src/test/resources/sql-tests/inputs/group-by-all.sql:
##########
@@ -0,0 +1,71 @@
+-- group by all
+-- see https://www.linkedin.com/posts/mosha_duckdb-firebolt-snowflake-activity-7009615821006131200-VQ0o
+
+create temporary view data as select * from values
+  ("USA", "San Francisco", "Reynold", 1, 11.0),
+  ("USA", "San Francisco", "Matei", 2, 12.0),
+  ("USA", "Berkeley", "Xiao", 3, 13.0),
+  ("China", "Hangzhou", "Wenchen", 4, 14.0),
+  ("China", "Shanghai", "Shanghaiese", 5, 15.0),
+  ("Korea", "Seoul", "Hyukjin", 6, 16.0),
+  ("UK", "London", "Sean", 7, 17.0)
+  as data(country, city, name, id, power);
+
+-- basic
+select country, count(*) from data group by ALL;
+
+-- different case
+select country, count(*) from data group by aLl;
+
+-- a column named "all" would still work
+select all, city, count(*) from (select country as all, city, id from data) group by all, city;
+
+-- a column named "all" should take precedence over the normal group by all expansion
+-- if all refers to the column, then the following should return 3 rows.
+-- if all refers to the global aggregate, then 1 row.
+SELECT count(1) FROM VALUES(1), (2), (3) AS T(all) GROUP BY all;

Review Comment:
   I don't think we should do an one off merging of some resolution logic into ResolveReference. If we want to do it, we should do it together in one shot. (It's also not clear to me if we should actually merge all resolution together.)
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] rxin commented on a diff in pull request #39134: [SPARK-41635][SQL] GROUP BY ALL

Posted by GitBox <gi...@apache.org>.

rxin commented on code in PR #39134:
URL: https://github.com/apache/spark/pull/39134#discussion_r1055047781


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveGroupByAll.scala:
##########
@@ -0,0 +1,110 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import org.apache.spark.sql.catalyst.expressions.{Attribute, Expression}
+import org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression
+import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, LogicalPlan}
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.catalyst.trees.TreePattern.{AGGREGATE, UNRESOLVED_ATTRIBUTE}
+
+/**
+ * Resolve "group by all" in the following SQL pattern:
+ *  `select col1, col2, agg_expr(...) from table group by all`.
+ *
+ * The all is expanded to include all non-aggregate columns in the select clause.
+ */
+object ResolveGroupByAll extends Rule[LogicalPlan] {
+
+  val ALL = "ALL"
+
+  /**
+   * Returns true iff this is a GROUP BY ALL aggregate. i.e. an Aggregate expression that has
+   * a single unresolved all grouping expression.
+   */
+  private def matchToken(a: Aggregate): Boolean = {
+    if (a.groupingExpressions.size != 1) {
+      return false
+    }
+    a.groupingExpressions.head match {
+      case a: UnresolvedAttribute => a.name.toUpperCase() == ALL

Review Comment:
   what's the actual difference between the two? does .name not include the previous parts?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan commented on a diff in pull request #39134: [SPARK-41635][SQL] GROUP BY ALL

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on code in PR #39134:
URL: https://github.com/apache/spark/pull/39134#discussion_r1055051734


##########
sql/core/src/test/resources/sql-tests/inputs/group-by-all.sql:
##########
@@ -0,0 +1,71 @@
+-- group by all
+-- see https://www.linkedin.com/posts/mosha_duckdb-firebolt-snowflake-activity-7009615821006131200-VQ0o
+
+create temporary view data as select * from values
+  ("USA", "San Francisco", "Reynold", 1, 11.0),
+  ("USA", "San Francisco", "Matei", 2, 12.0),
+  ("USA", "Berkeley", "Xiao", 3, 13.0),
+  ("China", "Hangzhou", "Wenchen", 4, 14.0),
+  ("China", "Shanghai", "Shanghaiese", 5, 15.0),
+  ("Korea", "Seoul", "Hyukjin", 6, 16.0),
+  ("UK", "London", "Sean", 7, 17.0)
+  as data(country, city, name, id, power);
+
+-- basic
+select country, count(*) from data group by ALL;
+
+-- different case
+select country, count(*) from data group by aLl;
+
+-- a column named "all" would still work
+select all, city, count(*) from (select country as all, city, id from data) group by all, city;
+
+-- a column named "all" should take precedence over the normal group by all expansion
+-- if all refers to the column, then the following should return 3 rows.
+-- if all refers to the global aggregate, then 1 row.
+SELECT count(1) FROM VALUES(1), (2), (3) AS T(all) GROUP BY all;

Review Comment:
   OK, for this PR, to make `ResolveGroupByAll` a bit more robust, we can do
   ```
   object ResolveGroupByAll extends Rule[LogicalPlan] {
     def apply(plan: LogicalPlan) = ResolveReference(plan).transform ...
   }
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan closed pull request #39134: [SPARK-41635][SQL] GROUP BY ALL

Posted by GitBox <gi...@apache.org>.

cloud-fan closed pull request #39134: [SPARK-41635][SQL] GROUP BY ALL
URL: https://github.com/apache/spark/pull/39134


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] rxin commented on a diff in pull request #39134: [WIP] Implement group by star (aka group by all)

Posted by GitBox <gi...@apache.org>.

rxin commented on code in PR #39134:
URL: https://github.com/apache/spark/pull/39134#discussion_r1052911704


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveGroupByStar.scala:
##########
@@ -0,0 +1,47 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression
+import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, LogicalPlan}
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.catalyst.trees.TreePattern.UNRESOLVED_STAR
+
+/**
+ * Resolve the star in the group by statement in the following SQL pattern:
+ *  `select col1, col2, agg_expr(...) from table group by *`.
+ *
+ * The star is expanded to include all non-aggregate columns in the select clause.
+ */
+object ResolveGroupByStar extends Rule[LogicalPlan] {
+  override def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperatorsUpWithPruning(
+    _.containsPattern(UNRESOLVED_STAR), ruleId) {
+    // Match a group by with a single unresolved star
+    case a: Aggregate if a.groupingExpressions == UnresolvedStar(None) :: Nil =>
+      // Only makes sense to do the rewrite once all the aggregate expressions have been resolved.
+      // Otherwise, we might incorrectly pull an actual aggregate expression over to the grouping
+      // expression list (because we don't know they would be aggregate expressions until resolved).
+      if (a.aggregateExpressions.forall(_.resolved)) {
+        val groupingExprs =
+          a.aggregateExpressions.filter(!_.exists(_.isInstanceOf[AggregateExpression]))
+        a.copy(groupingExpressions = groupingExprs)

Review Comment:
   race condition. see my comment above.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] rxin commented on a diff in pull request #39134: [WIP] Implement group by star (aka group by all)

Posted by GitBox <gi...@apache.org>.

rxin commented on code in PR #39134:
URL: https://github.com/apache/spark/pull/39134#discussion_r1052993302


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveGroupByStar.scala:
##########
@@ -0,0 +1,93 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import org.apache.spark.sql.catalyst.expressions.{Attribute, Expression}
+import org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression
+import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, LogicalPlan}
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.catalyst.trees.TreePattern.{AGGREGATE, AGGREGATE_EXPRESSION, UNRESOLVED_STAR}
+
+/**
+ * Resolve the star in the group by statement in the following SQL pattern:
+ *  `select col1, col2, agg_expr(...) from table group by *`.
+ *
+ * The star is expanded to include all non-aggregate columns in the select clause.
+ */
+object ResolveGroupByStar extends Rule[LogicalPlan] {
+  override def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperatorsUpWithPruning(
+    _.containsAllPatterns(UNRESOLVED_STAR, AGGREGATE), ruleId) {
+    // Match a group by with a single unresolved star
+    case a: Aggregate if a.groupingExpressions == UnresolvedStar(None) :: Nil =>
+      // Only makes sense to do the rewrite once all the aggregate expressions have been resolved.
+      // Otherwise, we might incorrectly pull an actual aggregate expression over to the grouping
+      // expression list (because we don't know they would be aggregate expressions until resolved).
+      if (a.aggregateExpressions.forall(_.resolved)) {
+        val groupingExprs =
+          a.aggregateExpressions.filter(!_.containsPattern(AGGREGATE_EXPRESSION))
+
+        // If the grouping exprs are empty, this could either be (1) a valid global aggregate, or
+        // (2) we simply fail to infer the grouping columns. As an example, in "i + sum(j)", we will
+        // not automatically infer the grouping column to be "i".
+        if (groupingExprs.isEmpty && a.aggregateExpressions.exists(containsAttribute)) {
+          // Case (2): don't replace the star. We will eventually tell the user in checkAnalysis
+          // that we cannot resolve the star in group by.
+          a
+        } else {
+          // Case (1): this is a valid global aggregate.
+          a.copy(groupingExpressions = groupingExprs)
+        }
+      } else {
+        a
+      }
+  }
+
+  /**
+   * Returns true if the expression includes an Attribute outside the aggregate expression part.
+   * For example:
+   *  "i" -> true
+   *  "i + 2" -> true
+   *  "i + sum(j)" -> true
+   *  "sum(j)" -> false
+   *  "sum(j) / 2" -> false
+   */
+  private def containsAttribute(expr: Expression): Boolean = expr match {
+    case _ if AggregateExpression.isAggregate(expr) =>
+      // Don't recurse into AggregateExpressions
+      false
+    case _: Attribute =>
+      true
+      // TODO: do we need to worry about ScalarSubquery here?

Review Comment:
   cc @cloud-fan do we need to worry about ScalarSubquery here?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] gengliangwang commented on a diff in pull request #39134: [WIP] Implement group by star (aka group by all)

Posted by GitBox <gi...@apache.org>.

gengliangwang commented on code in PR #39134:
URL: https://github.com/apache/spark/pull/39134#discussion_r1053027317


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveGroupByStar.scala:
##########
@@ -0,0 +1,93 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import org.apache.spark.sql.catalyst.expressions.{Attribute, Expression}
+import org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression
+import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, LogicalPlan}
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.catalyst.trees.TreePattern.{AGGREGATE, AGGREGATE_EXPRESSION, UNRESOLVED_STAR}
+
+/**
+ * Resolve the star in the group by statement in the following SQL pattern:
+ *  `select col1, col2, agg_expr(...) from table group by *`.
+ *
+ * The star is expanded to include all non-aggregate columns in the select clause.
+ */
+object ResolveGroupByStar extends Rule[LogicalPlan] {
+  override def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperatorsUpWithPruning(
+    _.containsAllPatterns(UNRESOLVED_STAR, AGGREGATE), ruleId) {
+    // Match a group by with a single unresolved star
+    case a: Aggregate if a.groupingExpressions == UnresolvedStar(None) :: Nil =>
+      // Only makes sense to do the rewrite once all the aggregate expressions have been resolved.
+      // Otherwise, we might incorrectly pull an actual aggregate expression over to the grouping
+      // expression list (because we don't know they would be aggregate expressions until resolved).
+      if (a.aggregateExpressions.forall(_.resolved)) {
+        val groupingExprs =
+          a.aggregateExpressions.filter(!_.containsPattern(AGGREGATE_EXPRESSION))

Review Comment:
   You are right. I will fix it tomorrow.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] gengliangwang commented on a diff in pull request #39134: [WIP] Implement group by star (aka group by all)

Posted by GitBox <gi...@apache.org>.

gengliangwang commented on code in PR #39134:
URL: https://github.com/apache/spark/pull/39134#discussion_r1052920528


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveGroupByStar.scala:
##########
@@ -0,0 +1,47 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression
+import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, LogicalPlan}
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.catalyst.trees.TreePattern.UNRESOLVED_STAR
+
+/**
+ * Resolve the star in the group by statement in the following SQL pattern:
+ *  `select col1, col2, agg_expr(...) from table group by *`.
+ *
+ * The star is expanded to include all non-aggregate columns in the select clause.
+ */
+object ResolveGroupByStar extends Rule[LogicalPlan] {
+  override def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperatorsUpWithPruning(
+    _.containsPattern(UNRESOLVED_STAR), ruleId) {
+    // Match a group by with a single unresolved star
+    case a: Aggregate if a.groupingExpressions == UnresolvedStar(None) :: Nil =>
+      // Only makes sense to do the rewrite once all the aggregate expressions have been resolved.
+      // Otherwise, we might incorrectly pull an actual aggregate expression over to the grouping
+      // expression list (because we don't know they would be aggregate expressions until resolved).
+      if (a.aggregateExpressions.forall(_.resolved)) {
+        val groupingExprs =
+          a.aggregateExpressions.filter(!_.exists(_.isInstanceOf[AggregateExpression]))

Review Comment:
   Nit pick to make the analyzer rule more efficient. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] rxin commented on a diff in pull request #39134: [SPARK-41635][SQL] GROUP BY ALL

Posted by GitBox <gi...@apache.org>.

rxin commented on code in PR #39134:
URL: https://github.com/apache/spark/pull/39134#discussion_r1053838400


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveGroupByAll.scala:
##########
@@ -0,0 +1,110 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import org.apache.spark.sql.catalyst.expressions.{Attribute, Expression}
+import org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression
+import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, LogicalPlan}
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.catalyst.trees.TreePattern.{AGGREGATE, AGGREGATE_EXPRESSION, UNRESOLVED_ATTRIBUTE}
+
+/**
+ * Resolve "group by all" in the following SQL pattern:
+ *  `select col1, col2, agg_expr(...) from table group by all`.
+ *
+ * The all is expanded to include all non-aggregate columns in the select clause.
+ */
+object ResolveGroupByAll extends Rule[LogicalPlan] {
+
+  val ALL = "ALL"
+
+  /**
+   * Returns true iff this is a GROUP BY ALL aggregate. i.e. an Aggregate expression that has
+   * a single unresolved all grouping expression.
+   */
+  private def matchToken(a: Aggregate): Boolean = {
+    if (a.groupingExpressions.size != 1) {
+      return false
+    }
+    a.groupingExpressions.head match {
+      case a: UnresolvedAttribute => a.name.toUpperCase() == ALL
+      case _ => false
+    }
+  }
+
+  override def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperatorsUpWithPruning(
+    _.containsAllPatterns(UNRESOLVED_ATTRIBUTE, AGGREGATE), ruleId) {
+    // Match a group by with a single unresolved star
+    case a: Aggregate if matchToken(a) =>
+      // Only makes sense to do the rewrite once all the aggregate expressions have been resolved.
+      // Otherwise, we might incorrectly pull an actual aggregate expression over to the grouping
+      // expression list (because we don't know they would be aggregate expressions until resolved).
+      if (a.aggregateExpressions.forall(_.resolved)) {
+        val groupingExprs =
+          a.aggregateExpressions.filter(!_.containsPattern(AGGREGATE_EXPRESSION))
+
+        // If the grouping exprs are empty, this could either be (1) a valid global aggregate, or
+        // (2) we simply fail to infer the grouping columns. As an example, in "i + sum(j)", we will
+        // not automatically infer the grouping column to be "i".
+        if (groupingExprs.isEmpty && a.aggregateExpressions.exists(containsAttribute)) {
+          // Case (2): don't replace the ALL. We will eventually tell the user in checkAnalysis
+          // that we cannot resolve the all in group by.
+          a
+        } else {
+          // Case (1): this is a valid global aggregate.
+          a.copy(groupingExpressions = groupingExprs)
+        }
+      } else {
+        a
+      }
+  }
+
+  /**
+   * Returns true if the expression includes an Attribute outside the aggregate expression part.
+   * For example:
+   *  "i" -> true
+   *  "i + 2" -> true
+   *  "i + sum(j)" -> true
+   *  "sum(j)" -> false
+   *  "sum(j) / 2" -> false
+   */
+  private def containsAttribute(expr: Expression): Boolean = expr match {
+    case _ if AggregateExpression.isAggregate(expr) =>
+      // Don't recurse into AggregateExpressions
+      false
+    case _: Attribute =>
+      true
+      // TODO: do we need to worry about ScalarSubquery here?

Review Comment:
   cc @cloud-fan here's a question for you too



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org