Posted to reviews@spark.apache.org by "kelvinjian-db (via GitHub)" <gi...@apache.org> on 2023/10/18 04:27:15 UTC

[PR] [SPARK-45586][SQL] Reduce compiler latency for plans with large expression trees [spark]

kelvinjian-db opened a new pull request, #43420:
URL: https://github.com/apache/spark/pull/43420

### What changes were proposed in this pull request?

* Added rule ID pruning when traversing the expression trees in `TypeCoercionRule`, which avoids re-traversing the same expression tree in later iterations of the rule
* Improved `EquivalentExpressions`:
  * Since `supportedExpression()` only checks whether a pattern exists in the tree, it now checks the `TreePatternBits` instead of recursing with `.exists()`
  * When creating an `ExpressionEquals` object, calculating the height requires recursing through all of its children, which is O(n^2) in total when done for each expression in the expression tree. The height is now cached in the `TreeNode`, so computing it for each expression in the tree is O(n) in total
* More targeted `TreePatternBits` pruning in `ResolveTimeZone` and `ConstantPropagation`
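
The two `EquivalentExpressions` points above can be illustrated with a minimal, self-contained sketch. It does not use Spark's classes: `Node`, `LiteralBit`, `LambdaBit`, `naiveHeight`, and `chain` are hypothetical stand-ins, and the bit mask only approximates the idea behind `TreePatternBits`. It is meant to show why caching the height per node and checking a precomputed bit set avoid repeated recursion, not how Catalyst implements either.

```scala
// Toy stand-in for a Catalyst-style tree; every name here is hypothetical.
object LargeTreeSketch {
  val LiteralBit: Long = 1L << 0
  val LambdaBit: Long = 1L << 1

  // Counts how many times a cached height is actually computed.
  var heightCalls = 0

  final class Node(val name: String, val ownBits: Long, val children: Seq[Node]) {
    // Union of this node's bits and all descendants' bits, computed once per node.
    lazy val patternBits: Long = children.foldLeft(ownBits)(_ | _.patternBits)

    // Cached height: computed once per node, reusing the children's cached values.
    lazy val height: Int = {
      heightCalls += 1
      children.map(_.height).reduceOption(_ max _).getOrElse(0) + 1
    }

    // A single mask test on the cached bits instead of a recursive search.
    def containsPattern(bit: Long): Boolean = (patternBits & bit) != 0L

    // Pre-order traversal over this node and all descendants.
    def foreach(f: Node => Unit): Unit = {
      f(this)
      children.foreach(_.foreach(f))
    }
  }

  // Uncached variant, shaped like the removed getHeight helper.
  // `counter(0)` records how many nodes are visited across all calls.
  def naiveHeight(n: Node, counter: Array[Int]): Int = {
    counter(0) += 1
    n.children.map(naiveHeight(_, counter)).reduceOption(_ max _).getOrElse(0) + 1
  }

  // A degenerate chain of `depth` nodes with a lambda-like leaf at the bottom.
  def chain(depth: Int): Node =
    (1 until depth).foldLeft(new Node("leaf", LambdaBit, Nil)) { (child, i) =>
      new Node(s"n$i", LiteralBit, Seq(child))
    }

  def main(args: Array[String]): Unit = {
    val root = chain(100)
    val naive = Array(0)
    root.foreach(n => naiveHeight(n, naive)) // recomputes every subtree from scratch
    root.foreach(n => n.height)              // each node is computed at most once
    println(s"naive visits: ${naive(0)}, cached computations: $heightCalls")
    println(s"contains lambda-like pattern: ${root.containsPattern(LambdaBit)}")
  }
}
```

On the 100-node chain, asking every node for its height with the uncached helper visits 1 + 2 + ... + 100 = 5050 nodes in total, while the cached version performs 100 computations. The chain is the worst case; bushier trees show a smaller gap.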

### Why are the changes needed?

This PR improves some analyzer and optimizer rules to address inefficiencies when handling extremely large expression trees.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

There should be no plan changes, so no unit tests were modified.

### Was this patch authored or co-authored using generative AI tooling?

No.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-45586][SQL] Reduce compiler latency for plans with large expression trees [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on code in PR #43420:
URL: https://github.com/apache/spark/pull/43420#discussion_r1363348250


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/EquivalentExpressions.scala:
##########
@@ -244,13 +239,9 @@ class EquivalentExpressions(
  * Wrapper around an Expression that provides semantic equality.
  */
 case class ExpressionEquals(e: Expression) {
-  private def getHeight(tree: Expression): Int = {
-    tree.children.map(getHeight).reduceOption(_ max _).getOrElse(0) + 1
-  }
-
   // This is used to do a fast pre-check for child-parent relationship. For example, expr1 can
   // only be a parent of expr2 if expr1.height is larger than expr2.height.
-  lazy val height = getHeight(e)
+  lazy val height: Int = e.height

Review Comment:
   ```suggestion
     def height: Int = e.height
   ```
   Should this be a simple `def` now, since `e.height` is already a lazy val?
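
   A minimal sketch of the trade-off raised in this comment, using hypothetical stand-ins (`Expr`, `WrapperLazy`, `WrapperDef`) rather than Spark's `Expression` and `ExpressionEquals`. It assumes the wrapped value is itself a lazy val, as the comment states for `e.height`. In that case a forwarding `def` returns the same cached result without storing a second copy in the wrapper.

```scala
// Hypothetical stand-ins; these are not Spark classes.
object DelegationSketch {
  final class Expr(val children: Seq[Expr]) {
    // Counts how often this instance's height is actually computed.
    var computations = 0

    // Computed at most once per Expr instance, then stored in that instance.
    lazy val height: Int = {
      computations += 1
      children.map(_.height).reduceOption(_ max _).getOrElse(0) + 1
    }
  }

  // Second lazy val: correct, but each wrapper keeps its own copy of the value
  // plus the initialization state that a lazy val needs.
  final case class WrapperLazy(e: Expr) {
    lazy val height: Int = e.height
  }

  // Plain def: forwards to the value already cached inside `e`.
  final case class WrapperDef(e: Expr) {
    def height: Int = e.height
  }

  def main(args: Array[String]): Unit = {
    val e = new Expr(Seq(new Expr(Nil)))
    val viaDef = WrapperDef(e)
    val viaLazy = WrapperLazy(e)
    // Repeated calls through either wrapper reuse the single cached computation.
    println(s"def: ${viaDef.height}, ${viaDef.height}; lazy val: ${viaLazy.height}")
    println(s"height computed ${e.computations} time(s) on the wrapped Expr")
  }
}
```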




Re: [PR] [SPARK-45586][SQL] Reduce compiler latency for plans with large expression trees [spark]

Posted by "kelvinjian-db (via GitHub)" <gi...@apache.org>.

kelvinjian-db commented on code in PR #43420:
URL: https://github.com/apache/spark/pull/43420#discussion_r1364489772


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/EquivalentExpressions.scala:
##########
@@ -244,13 +239,9 @@ class EquivalentExpressions(
  * Wrapper around an Expression that provides semantic equality.
  */
 case class ExpressionEquals(e: Expression) {
-  private def getHeight(tree: Expression): Int = {
-    tree.children.map(getHeight).reduceOption(_ max _).getOrElse(0) + 1
-  }
-
   // This is used to do a fast pre-check for child-parent relationship. For example, expr1 can
   // only be a parent of expr2 if expr1.height is larger than expr2.height.
-  lazy val height = getHeight(e)
+  lazy val height: Int = e.height

Review Comment:
   done!




Re: [PR] [SPARK-45586][SQL] Reduce compiler latency for plans with large expression trees [spark]

Posted by "kelvinjian-db (via GitHub)" <gi...@apache.org>.

kelvinjian-db commented on code in PR #43420:
URL: https://github.com/apache/spark/pull/43420#discussion_r1363258110


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/EquivalentExpressions.scala:
##########
@@ -244,13 +239,9 @@ class EquivalentExpressions(
  * Wrapper around an Expression that provides semantic equality.
  */
 case class ExpressionEquals(e: Expression) {
-  private def getHeight(tree: Expression): Int = {
-    tree.children.map(getHeight).reduceOption(_ max _).getOrElse(0) + 1
-  }
-
   // This is used to do a fast pre-check for child-parent relationship. For example, expr1 can
   // only be a parent of expr2 if expr1.height is larger than expr2.height.
-  lazy val height = getHeight(e)
+  lazy val height = e.height

Review Comment:
   done!




Re: [PR] [SPARK-45586][SQL] Reduce compiler latency for plans with large expression trees [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on PR #43420:
URL: https://github.com/apache/spark/pull/43420#issuecomment-1769787870

   The failed pyspark tests are unrelated. I'm merging it to master, thanks!



Re: [PR] [SPARK-45586][SQL] Reduce compiler latency for plans with large expression trees [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan closed pull request #43420: [SPARK-45586][SQL] Reduce compiler latency for plans with large expression trees
URL: https://github.com/apache/spark/pull/43420



Re: [PR] [SPARK-45586][SQL] Reduce compiler latency for plans with large expression trees [spark]

Posted by "LuciferYang (via GitHub)" <gi...@apache.org>.

LuciferYang commented on code in PR #43420:
URL: https://github.com/apache/spark/pull/43420#discussion_r1363195122


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/EquivalentExpressions.scala:
##########
@@ -244,13 +239,9 @@ class EquivalentExpressions(
  * Wrapper around an Expression that provides semantic equality.
  */
 case class ExpressionEquals(e: Expression) {
-  private def getHeight(tree: Expression): Int = {
-    tree.children.map(getHeight).reduceOption(_ max _).getOrElse(0) + 1
-  }
-
   // This is used to do a fast pre-check for child-parent relationship. For example, expr1 can
   // only be a parent of expr2 if expr1.height is larger than expr2.height.
-  lazy val height = getHeight(e)
+  lazy val height = e.height

Review Comment:
   nit: If possible, please change it to `lazy val height: Int = e.height`; public members should have type declarations.


