You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/04/05 15:41:03 UTC

[GitHub] [spark] cloud-fan commented on a change in pull request #31913: [SPARK-34581][SQL] Don't optimize out grouping expressions from aggregate expressions without aggregate function

cloud-fan commented on a change in pull request #31913:
URL: https://github.com/apache/spark/pull/31913#discussion_r607159190



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
##########
@@ -622,31 +624,47 @@ case class Range(
 /**
  * This is a Group by operator with the aggregate functions and projections.
  *
- * @param groupingExpressions expressions for grouping keys
- * @param aggregateExpressions expressions for a project list, which could contain
- *                             [[AggregateExpression]]s.
+ * @param groupingExpressions Expressions for grouping keys.
+ * @param aggrExprWithGroupingRefs Expressions for a project list, which could contain
+ *                                 [[AggregateExpression]]s and [[GroupingExprRef]]s.
+ * @param child The child of the aggregate node.
+ * @param enforceGroupingReferences If [[aggrExprWithGroupingRefs]] should contain
+ *                                  [[GroupingExprRef]]s.
+ *
+ * [[aggrExprWithGroupingRefs]] without aggregate functions can contain [[GroupingExprRef]]
+ * expressions to refer to complex grouping expressions in [[groupingExpressions]]. These references
+ * ensure that optimization rules don't change the aggregate expressions to invalid ones that no
+ * longer refer to any grouping expressions and also simplify the expression transformations on the
+ * node (need to transform the expression only once).
  *
- * Note: Currently, aggregateExpressions is the project list of this Group by operator. Before
- * separating projection from grouping and aggregate, we should avoid expression-level optimization
- * on aggregateExpressions, which could reference an expression in groupingExpressions.
- * For example, see the rule [[org.apache.spark.sql.catalyst.optimizer.SimplifyExtractValueOps]]
+ * For example, in the following query Spark shouldn't optimize the aggregate expression
+ * `Not(IsNull(c))` to `IsNotNull(c)` as the grouping expression is `IsNull(c)`:
+ * SELECT not(c IS NULL)
+ * FROM t
+ * GROUP BY c IS NULL
+ * Instead, the aggregate expression should contain `Not(GroupingExprRef(0))`.
  */
-case class Aggregate(
+case class AggregateWithGroupingReferences(

Review comment:
       This change is much bigger than I expect. What's the reason here? I was expecting something like
   1. A rule that runs at the beginning of optimizer, to create the grouping refs.
   2. Restore grouping refs to the original expressions when converting logical aggregate to phyiscal versions.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org