You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/03/22 00:55:26 UTC

[GitHub] [spark] maropu commented on a change in pull request #31913: [SPARK-34581][SQL] Don't optimize out grouping expressions from aggregate expressions without aggregate function

maropu commented on a change in pull request #31913:
URL: https://github.com/apache/spark/pull/31913#discussion_r598368983



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -870,8 +870,19 @@ object CollapseProject extends Rule[LogicalPlan] with AliasHelper {
       if (haveCommonNonDeterministicOutput(p.projectList, agg.aggregateExpressions)) {
         p
       } else {
-        agg.copy(aggregateExpressions = buildCleanedProjectList(
-          p.projectList, agg.aggregateExpressions))
+        val complexGroupingExpressions =
+          ExpressionSet(agg.groupingExpressions.filter(_.children.nonEmpty))
+
+        def wrapGroupingExpression(e: Expression): Expression = e match {
+          case _: AggregateExpression => e
+          case _ if complexGroupingExpressions.contains(e) => GroupingExpression(e)
+          case _ => e.mapChildren(wrapGroupingExpression)
+        }
+
+        val wrappedAggregateExpressions =
+          agg.aggregateExpressions.map(wrapGroupingExpression(_).asInstanceOf[NamedExpression])
+        agg.copy(aggregateExpressions =
+          buildCleanedProjectList(p.projectList, wrappedAggregateExpressions))

Review comment:
       I think this issue is not related to `CollapseProject`. For example, we can reproduce it like this;
   ```
   $ ./bin/spark-shell --conf spark.sql.optimizer.excludedRules=org.apache.spark.sql.catalyst.optimizer.CollapseProject
   
   Seq(Some(1), None).toDF("id").createOrReplaceTempView("t")
   val df = sql("""
     SELECT NOT(t.id IS NULL) AS X, count(*) AS c
     FROM t
     GROUP BY t.id IS NULL
   """)
   
   df.show()
   ```
   
   The query fails because `BooleanSimplification` applys illegal expr transformation to break the group-by constraint (that is, `group-by columns must appear in aggregate exprs`). In the ohter DBMS-like systems (e.g., PostgreSQL), the transformed query fails like this;
   ```
   postgres=# SELECT t.id IS NOT NULL AS X, count(*) AS c FROM t GROUP BY t.id IS NULL;
   ERROR:  column "t.id" must appear in the GROUP BY clause or be used in an aggregate function
   LINE 1: SELECT t.id IS NOT NULL AS X, count(*) AS c FROM t GROUP BY ...
   ```
   I'm currently not sure that this issue can happen only in `BooleanSimplification` though, I think we need a general solution to fix this kind of the illegal transformation instead of band-aid fixes.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org