Posted to commits@spark.apache.org by we...@apache.org on 2021/11/29 08:47:34 UTC
[spark] branch master updated: [SPARK-37447][SQL] Cache LogicalPlan.isStreaming() result in a lazy val
This is an automated email from the ASF dual-hosted git repository.
wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
new 5d09828 [SPARK-37447][SQL] Cache LogicalPlan.isStreaming() result in a lazy val
5d09828 is described below
commit 5d098286a003eac26d1fe6c026d8123f3cbe9e9c
Author: Josh Rosen <jo...@databricks.com>
AuthorDate: Mon Nov 29 16:46:28 2021 +0800
[SPARK-37447][SQL] Cache LogicalPlan.isStreaming() result in a lazy val
### What changes were proposed in this pull request?
This PR adds caching to `LogicalPlan.isStreaming`: the default implementation's result is now cached in a `private[this] lazy val`.
### Why are the changes needed?
This improves the performance of the `DeduplicateRelations` analyzer rule.
The default implementation of `isStreaming` recursively visits every node in the subtree. `DeduplicateRelations.renewDuplicatedRelations` is itself invoked recursively on every node in the tree, and each invocation calls `isStreaming`. Nodes deep in the plan are therefore revisited once per ancestor, which leads to `O(n^2)` invocations of `isStreaming` in the worst case.
Caching `isStreaming` avoids this performance problem.
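To make the quadratic behavior concrete, here is a minimal, self-contained sketch that does not use Spark at all. `Node`, `Uncached`, `Cached`, `visitAll`, `chain`, and the `calls` counter are hypothetical names made up for this illustration; `visitAll` only mimics a rule that calls `isStreaming` at every node and is not the actual `DeduplicateRelations` code.

```scala
// Toy model of a plan tree. All names here are illustrative, not Spark APIs.
object IsStreamingSketch {
  // Counts how many times the default `isStreaming` body is evaluated.
  var calls = 0L

  abstract class Node(val children: Seq[Node]) {
    def isStreaming: Boolean
  }

  // Uncached variant: every call re-walks the whole subtree.
  class Uncached(children: Seq[Node]) extends Node(children) {
    def isStreaming: Boolean = { calls += 1; children.exists(_.isStreaming) }
  }

  // Cached variant, following the shape of this patch: the subtree is
  // walked at most once per node, then the result is reused.
  class Cached(children: Seq[Node]) extends Node(children) {
    def isStreaming: Boolean = _isStreaming
    private[this] lazy val _isStreaming = { calls += 1; children.exists(_.isStreaming) }
  }

  // Stand-in for a rule that calls `isStreaming` at every node in the tree.
  def visitAll(node: Node): Unit = {
    node.isStreaming
    node.children.foreach(visitAll)
  }

  // Builds a chain-shaped tree with `depth` nodes (the worst case).
  def chain(depth: Int, mk: Seq[Node] => Node): Node =
    (1 until depth).foldLeft(mk(Nil))((child, _) => mk(Seq(child)))

  // Resets the counter, runs the traversal, and reports the evaluations.
  def countCalls(root: Node): Long = { calls = 0; visitAll(root); calls }
}
```

In this toy model, a 100-node chain with no streaming source costs `100 * 101 / 2 = 5050` evaluations with `Uncached` and 100 with `Cached`, e.g. `IsStreamingSketch.countCalls(IsStreamingSketch.chain(100, new IsStreamingSketch.Cached(_)))`. Real plans are rarely pure chains, so the actual savings depend on the plan's shape.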
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Correctness should be covered by existing tests.
This significantly improved `DeduplicateRelations` performance in local microbenchmarking with large query plans (~20% reduction in that rule's runtime in one of my tests).
Closes #34691 from JoshRosen/cache-LogicalPlan.isStreaming.
Authored-by: Josh Rosen <jo...@databricks.com>
Signed-off-by: Wenchen Fan <we...@databricks.com>
---
.../org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala
index 7c31a00..4aa7bf1 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala
@@ -41,7 +41,8 @@ abstract class LogicalPlan
def metadataOutput: Seq[Attribute] = children.flatMap(_.metadataOutput)
/** Returns true if this subtree has data from a streaming data source. */
- def isStreaming: Boolean = children.exists(_.isStreaming)
+ def isStreaming: Boolean = _isStreaming
+ private[this] lazy val _isStreaming = children.exists(_.isStreaming)
override def verboseStringWithSuffix(maxFields: Int): String = {
super.verboseString(maxFields) + statsCache.map(", " + _.toString).getOrElse("")