You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@spark.apache.org by we...@apache.org on 2021/11/29 08:47:34 UTC

[spark] branch master updated: [SPARK-37447][SQL] Cache LogicalPlan.isStreaming() result in a lazy val

This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
     new 5d09828  [SPARK-37447][SQL] Cache LogicalPlan.isStreaming() result in a lazy val
5d09828 is described below

commit 5d098286a003eac26d1fe6c026d8123f3cbe9e9c
Author: Josh Rosen <jo...@databricks.com>
AuthorDate: Mon Nov 29 16:46:28 2021 +0800

    [SPARK-37447][SQL] Cache LogicalPlan.isStreaming() result in a lazy val
    
    ### What changes were proposed in this pull request?
    
    This PR adds caching to `LogicalPlan.isStreaming()`: the default implementation's result will now be cached in a `private lazy val`.
    
    ### Why are the changes needed?
    
    This improves the performance of the `DeduplicateRelations` analyzer rule.
    
    The default implementation of `isStreaming` recursively visits every node in the tree. `DeduplicateRelations.renewDuplicatedRelations` is recursively invoked on every node in the tree and each invocation calls `isStreaming`. This leads to `O(n^2)` invocations of `isStreaming` on leaf nodes.
    
    Caching `isStreaming` avoids this performance problem.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Correctness should be covered by existing tests.
    
    This significantly improved `DeduplicateRelations` performance in local microbenchmarking with large query plans (~20% reduction in that rule's runtime in one of my tests).
    
    Closes #34691 from JoshRosen/cache-LogicalPlan.isStreaming.
    
    Authored-by: Josh Rosen <jo...@databricks.com>
    Signed-off-by: Wenchen Fan <we...@databricks.com>
---
 .../org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala      | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala
index 7c31a00..4aa7bf1 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala
@@ -41,7 +41,8 @@ abstract class LogicalPlan
   def metadataOutput: Seq[Attribute] = children.flatMap(_.metadataOutput)
 
   /** Returns true if this subtree has data from a streaming data source. */
-  def isStreaming: Boolean = children.exists(_.isStreaming)
+  def isStreaming: Boolean = _isStreaming
+  private[this] lazy val _isStreaming = children.exists(_.isStreaming)
 
   override def verboseStringWithSuffix(maxFields: Int): String = {
     super.verboseString(maxFields) + statsCache.map(", " + _.toString).getOrElse("")

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@spark.apache.org
For additional commands, e-mail: commits-help@spark.apache.org