You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by "peter-toth (via GitHub)" <gi...@apache.org> on 2023/03/20 12:32:15 UTC

[GitHub] [spark] peter-toth opened a new pull request, #40488: [SPARK-42851][SQL] Replace EquivalentExpressions with mutable map in PhysicalAggregation

peter-toth opened a new pull request, #40488:
URL: https://github.com/apache/spark/pull/40488

   ### What changes were proposed in this pull request?
   This PR proposes to replace `EquivalentExpressions` to a simple mutable map in `PhysicalAggregation`, the only place where `EquivalentExpressions.addExpr()` is used. `EquivalentExpressions` is useful for common subexpression elimination but in `PhysicalAggregation` it is used only to deduplicate whole expressions which can be easily done with a simple map.
   
   ### Why are the changes needed?
   `EquivalentExpressions.addExpr()` is not guarded by `supportedExpression()` which can cause inconsistent results when used together with `EquivalentExpressions.getExprState()`, but this PR also proposes replacing `.addExpr()` with other alternatives as its boolean result is a bit counter-intuitive to other collections' `.add()` methods. It returns `false` if the expression was missing and either adds the expression or not depending on if the expression is deterministic.
   After this PR we no longer use `EquivalentExpressions.addExpr()` so it can be deprecated or even removed...
   
   ### Does this PR introduce _any_ user-facing change?
   No.
   
   ### How was this patch tested?
   Added new UTs from @rednaxelafx's PR: https://github.com/apache/spark/pull/40473. Please note that new UTs actually pass after https://github.com/apache/spark/pull/40475, but they are added here to make sure there will be no regression in the future.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] peter-toth commented on a diff in pull request #40488: [SPARK-42851][SQL] Replace EquivalentExpressions with mutable map in PhysicalAggregation

Posted by "peter-toth (via GitHub)" <gi...@apache.org>.
peter-toth commented on code in PR #40488:
URL: https://github.com/apache/spark/pull/40488#discussion_r1142961725


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala:
##########
@@ -296,12 +298,17 @@ object PhysicalAggregation {
       // build a set of semantically distinct aggregate expressions and re-write expressions so
       // that they reference the single copy of the aggregate function which actually gets computed.
       // Non-deterministic aggregate expressions are not deduplicated.
-      val equivalentAggregateExpressions = new EquivalentExpressions
+      val equivalentAggregateExpressions = mutable.Map.empty[Expression, Expression]
       val aggregateExpressions = resultExpressions.flatMap { expr =>
         expr.collect {
-          // addExpr() always returns false for non-deterministic expressions and do not add them.
           case a
-            if AggregateExpression.isAggregate(a) && !equivalentAggregateExpressions.addExpr(a) =>

Review Comment:
   Besides the above, although `.addExpr()` fits here well and does the job, isn't it a bit weird that an add-like method of a collection-like object doesn't return `true` when an item added, but actually it flips the meaning of the return value? If it was used at multiple places then I would keep it, but we use it only here. Maybe I'm just nitpicking...
   Anyways, I'm ok with https://github.com/apache/spark/pull/40473 too.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] rednaxelafx commented on pull request #40488: [SPARK-42851][SQL] Replace EquivalentExpressions with mutable map in PhysicalAggregation

Posted by "rednaxelafx (via GitHub)" <gi...@apache.org>.
rednaxelafx commented on PR #40488:
URL: https://github.com/apache/spark/pull/40488#issuecomment-1477131544

   Before the recent rounds of changes to EquivalentExpressions, the old `addExprTree` used to call `addExpr` in its core:
   https://github.com/apache/spark/blob/branch-2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/EquivalentExpressions.scala#L90
   That was still the case when `PhysicalAggregation` started using this mechanism to deduplicate expressions. I guess it started becoming "detached" from the main path when the recent refactoring happened that allows updating a separate equivalence map instead of the "main" one.
   
   Your proposed PR here further orphans that function from any actual use. Which is okay for keeping binary compatibility as much as possible.
   The inlined dedup logic in `PhysicalAggregation` looks rather ugly though. I don't have a strong opinion about that as long as other reviewers are okay. I'd prefer still retaining some sort of encapsulated collection for the dedup usage.
   
   BTW I updated my PR's test case because it makes more sense to check the return value from `addExpr` on the second invocation rather than on the first (both "not supported expression" and actual new expression cases would have gotten a `false` return value if we add that guard to the `addExpr()` function).
   https://github.com/apache/spark/pull/40473/commits/28d101ee6765c5453189fa62d6b8ade1568d99d2


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a diff in pull request #40488: [SPARK-42851][SQL] Replace EquivalentExpressions with mutable map in PhysicalAggregation

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #40488:
URL: https://github.com/apache/spark/pull/40488#discussion_r1142837488


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala:
##########
@@ -296,12 +298,17 @@ object PhysicalAggregation {
       // build a set of semantically distinct aggregate expressions and re-write expressions so
       // that they reference the single copy of the aggregate function which actually gets computed.
       // Non-deterministic aggregate expressions are not deduplicated.
-      val equivalentAggregateExpressions = new EquivalentExpressions
+      val equivalentAggregateExpressions = mutable.Map.empty[Expression, Expression]
       val aggregateExpressions = resultExpressions.flatMap { expr =>
         expr.collect {
-          // addExpr() always returns false for non-deterministic expressions and do not add them.
           case a
-            if AggregateExpression.isAggregate(a) && !equivalentAggregateExpressions.addExpr(a) =>

Review Comment:
   what's wrong with `addExpr` here? It does simplify the code IMO.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] peter-toth commented on a diff in pull request #40488: [SPARK-42851][SQL] Replace EquivalentExpressions with mutable map in PhysicalAggregation

Posted by "peter-toth (via GitHub)" <gi...@apache.org>.
peter-toth commented on code in PR #40488:
URL: https://github.com/apache/spark/pull/40488#discussion_r1142961725


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala:
##########
@@ -296,12 +298,17 @@ object PhysicalAggregation {
       // build a set of semantically distinct aggregate expressions and re-write expressions so
       // that they reference the single copy of the aggregate function which actually gets computed.
       // Non-deterministic aggregate expressions are not deduplicated.
-      val equivalentAggregateExpressions = new EquivalentExpressions
+      val equivalentAggregateExpressions = mutable.Map.empty[Expression, Expression]
       val aggregateExpressions = resultExpressions.flatMap { expr =>
         expr.collect {
-          // addExpr() always returns false for non-deterministic expressions and do not add them.
           case a
-            if AggregateExpression.isAggregate(a) && !equivalentAggregateExpressions.addExpr(a) =>

Review Comment:
   Besides the above, although `.addExpr()` fits here well and does the job, isn't it a bit weird that an add-like method of a collection-like object doesn't return `true` when an item was added, but actually it flips the meaning of the return value? If it was used at multiple places then I would keep it, but we use it only here. Maybe I'm just nitpicking...
   Anyways, I'm ok with https://github.com/apache/spark/pull/40473 too.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] rednaxelafx commented on a diff in pull request #40488: [SPARK-42851][SQL] Replace EquivalentExpressions with mutable map in PhysicalAggregation

Posted by "rednaxelafx (via GitHub)" <gi...@apache.org>.
rednaxelafx commented on code in PR #40488:
URL: https://github.com/apache/spark/pull/40488#discussion_r1142905889


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala:
##########
@@ -296,12 +298,17 @@ object PhysicalAggregation {
       // build a set of semantically distinct aggregate expressions and re-write expressions so
       // that they reference the single copy of the aggregate function which actually gets computed.
       // Non-deterministic aggregate expressions are not deduplicated.
-      val equivalentAggregateExpressions = new EquivalentExpressions
+      val equivalentAggregateExpressions = mutable.Map.empty[Expression, Expression]
       val aggregateExpressions = resultExpressions.flatMap { expr =>
         expr.collect {
-          // addExpr() always returns false for non-deterministic expressions and do not add them.
           case a
-            if AggregateExpression.isAggregate(a) && !equivalentAggregateExpressions.addExpr(a) =>

Review Comment:
   The line of thought would be: adding the `supportedExpression` guard to `addExpr()` would cause performance regression, so let's just close our eyes and make the only remaining use of `addExpr` break away and do its own deduplication in the old logic without taking things like `NamedLambdaVariable` into account -- which is the way it's been for quite a few releases. This PR essentially inlines the `addExpr` path of the old `EquivalentExpressions` into `PhysicalAggregation` to recover what it used to do.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] peter-toth commented on pull request #40488: [SPARK-42851][SQL] Replace EquivalentExpressions with mutable map in PhysicalAggregation

Posted by "peter-toth (via GitHub)" <gi...@apache.org>.
peter-toth commented on PR #40488:
URL: https://github.com/apache/spark/pull/40488#issuecomment-1476454588

   @rednaxelafx, @cloud-fan let me know it this PR is a viable alternative to https://github.com/apache/spark/pull/40473. Or maybe if I should do a little cleanup like https://github.com/peter-toth/spark/commit/90421cb74dd0d0cfe2693cab39a25cd7892c0d45 in this or in a follow-up PR...


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] peter-toth commented on a diff in pull request #40488: [SPARK-42851][SQL] Replace EquivalentExpressions with mutable map in PhysicalAggregation

Posted by "peter-toth (via GitHub)" <gi...@apache.org>.
peter-toth commented on code in PR #40488:
URL: https://github.com/apache/spark/pull/40488#discussion_r1142961725


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala:
##########
@@ -296,12 +298,17 @@ object PhysicalAggregation {
       // build a set of semantically distinct aggregate expressions and re-write expressions so
       // that they reference the single copy of the aggregate function which actually gets computed.
       // Non-deterministic aggregate expressions are not deduplicated.
-      val equivalentAggregateExpressions = new EquivalentExpressions
+      val equivalentAggregateExpressions = mutable.Map.empty[Expression, Expression]
       val aggregateExpressions = resultExpressions.flatMap { expr =>
         expr.collect {
-          // addExpr() always returns false for non-deterministic expressions and do not add them.
           case a
-            if AggregateExpression.isAggregate(a) && !equivalentAggregateExpressions.addExpr(a) =>

Review Comment:
   Besides the above, although `.addExpr()` fits here well and does the job, isn't it a bit weird that an add-like method of a collection-like object doesn't return `true` when a new item was added, but actually it flips the meaning of the return value? If it was used at multiple places then I would keep it, but we use it only here. But maybe I'm just nitpicking...
   Anyways, I'm ok with https://github.com/apache/spark/pull/40473 too.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] peter-toth closed pull request #40488: [SPARK-42851][SQL] Replace EquivalentExpressions with mutable map in PhysicalAggregation

Posted by "peter-toth (via GitHub)" <gi...@apache.org>.
peter-toth closed pull request #40488: [SPARK-42851][SQL] Replace EquivalentExpressions with mutable map in PhysicalAggregation
URL: https://github.com/apache/spark/pull/40488


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org