You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/10/06 02:50:13 UTC

[GitHub] [spark] viirya opened a new pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

viirya opened a new pull request #29950:
URL: https://github.com/apache/spark/pull/29950


   <!--
   Thanks for sending a pull request!  Here are some tips for you:
     1. If this is your first time, please read our contributor guidelines: https://spark.apache.org/contributing.html
     2. Ensure you have added or run the appropriate tests for your PR: https://spark.apache.org/developer-tools.html
     3. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP][SPARK-XXXX] Your PR title ...'.
     4. Be sure to keep the PR description updated to reflect all changes.
     5. Please write your PR title to summarize what this PR proposes.
     6. If possible, provide a concise example to reproduce the issue for a faster review.
     7. If you want to add a new configuration, please read the guideline first for naming configurations in
        'core/src/main/scala/org/apache/spark/internal/config/ConfigEntry.scala'.
   -->
   
   ### What changes were proposed in this pull request?
   <!--
   Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. 
   If possible, please consider writing useful notes for better and faster reviews in your PR. See the examples below.
     1. If you refactor some codes with changing classes, showing the class hierarchy will help reviewers.
     2. If you fix some SQL features, you can provide some references of other DBMSes.
     3. If there is design documentation, please add the link.
     4. If there is a discussion in the mailing list, please add the link.
   -->
   
   This patch proposes to avoid collapsing adjacent `Project` in query optimizer if the combined `Project` will duplicate too many common expressions. One SQL config `spark.sql.optimizer.maxCommonExprsInCollapseProject` is added to set up the maximum allowed number of common expressions.
   
   ### Why are the changes needed?
   <!--
   Please clarify why the changes are needed. For instance,
     1. If you propose a new API, clarify the use case for a new API.
     2. If you fix a bug, you can clarify why it is a bug.
   -->
   
   In some edge cases, collapsing adjacent `Project` hurts performance, instead of improving it. We observed such behavior in our customer Spark jobs where one expensive expression was repeatedly duplicated many times. It is hard to have a optimizer rule that could decide whether to collapse two `Project`s because we don't know the cost of each expression. Currently we can provide a SQL config so users can set it up to change optimizer's behavior regarding collapsing adjacent `Project`s.
   
   ### Does this PR introduce _any_ user-facing change?
   <!--
   Note that it means *any* user-facing change including all aspects such as the documentation fix.
   If yes, please clarify the previous behavior and the change this PR proposes - provide the console output, description and/or an example to show the behavior difference if possible.
   If possible, please also clarify if this is a user-facing change compared to the released Spark versions or within the unreleased branches such as master.
   If no, write 'No'.
   -->
   
   Yes. Users can change optimizer's behavior regarding collapsing `Project`s by setting SQL config.
   
   ### How was this patch tested?
   <!--
   If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible.
   If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
   If tests were not added, please describe why they were not added and/or why it was difficult to add.
   -->
   
   Unit test.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-706452537


   **[Test build #129607 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129607/testReport)** for PR 29950 at commit [`4bf4dc2`](https://github.com/apache/spark/commit/4bf4dc252c663afce30848481ec3887a70f7b611).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-706388940


   **[Test build #129597 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129597/testReport)** for PR 29950 at commit [`43eb50d`](https://github.com/apache/spark/commit/43eb50d57eb334395c9cbe7b193de28644c62e68).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-704661313


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34092/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #29950:
URL: https://github.com/apache/spark/pull/29950#discussion_r502313752



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala
##########
@@ -124,14 +128,34 @@ object ScanOperation extends OperationHelper with PredicateHelper {
     }.exists(!_.deterministic))
   }
 
+  def moreThanMaxAllowedCommonOutput(
+       expr: Seq[NamedExpression],

Review comment:
       indentation?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-725898127






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-725911884


   **[Test build #130972 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130972/testReport)** for PR 29950 at commit [`58e71d8`](https://github.com/apache/spark/commit/58e71d85fb0752a2963b9e343d321cf1c6a8e634).
    * This patch **fails due to an unknown error code, -9**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-704660279


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34091/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #29950:
URL: https://github.com/apache/spark/pull/29950#discussion_r502311982



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##########
@@ -1926,6 +1926,19 @@ object SQLConf {
     .booleanConf
     .createWithDefault(true)
 
+  val MAX_COMMON_EXPRS_IN_COLLAPSE_PROJECT =
+    buildConf("spark.sql.optimizer.maxCommonExprsInCollapseProject")
+      .doc("An integer number indicates the maximum allowed number of a common expression " +
+        "can be collapsed into upper Project from lower Project by optimizer rule " +
+        "`CollapseProject`. Normally `CollapseProject` will collapse adjacent Project " +
+        "and merge expressions. But in some edge cases, expensive expressions might be " +
+        "duplicated many times in merged Project by this optimization. This config sets " +
+        "a maximum number. Once an expression is duplicated more than this number " +
+        "if merging two Project, Spark SQL will skip the merging.")
+      .version("3.1.0")
+      .intConf
+      .createWithDefault(20)

Review comment:
       Just a question. Is there a reason to choose `20`?

##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -732,10 +732,12 @@ object ColumnPruning extends Rule[LogicalPlan] {
  *    `GlobalLimit(LocalLimit)` pattern is also considered.

Review comment:
       Could you update the optimizer description according to the new conf?

##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -732,10 +732,12 @@ object ColumnPruning extends Rule[LogicalPlan] {
  *    `GlobalLimit(LocalLimit)` pattern is also considered.
  */
 object CollapseProject extends Rule[LogicalPlan] {
-
   def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
     case p1 @ Project(_, p2: Project) =>
-      if (haveCommonNonDeterministicOutput(p1.projectList, p2.projectList)) {
+      val maxCommonExprs = SQLConf.get.maxCommonExprsInCollapseProject
+
+      if (haveCommonNonDeterministicOutput(p1.projectList, p2.projectList) ||
+        getLargestNumOfCommonOutput(p1.projectList, p2.projectList) > maxCommonExprs) {

Review comment:
       indentation?

##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala
##########
@@ -124,14 +128,34 @@ object ScanOperation extends OperationHelper with PredicateHelper {
     }.exists(!_.deterministic))
   }
 
+  def moreThanMaxAllowedCommonOutput(
+       expr: Seq[NamedExpression],

Review comment:
       indentation?

##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala
##########
@@ -124,14 +128,34 @@ object ScanOperation extends OperationHelper with PredicateHelper {
     }.exists(!_.deterministic))
   }
 
+  def moreThanMaxAllowedCommonOutput(
+       expr: Seq[NamedExpression],

Review comment:
       indentation? It seems that there is one more space here.

##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala
##########
@@ -124,14 +128,34 @@ object ScanOperation extends OperationHelper with PredicateHelper {
     }.exists(!_.deterministic))
   }
 
+  def moreThanMaxAllowedCommonOutput(
+       expr: Seq[NamedExpression],
+       aliases: AttributeMap[Expression]): Boolean = {
+    val exprMap = mutable.HashMap.empty[Attribute, Int]
+
+    expr.foreach(_.collect {
+      case a: Attribute if aliases.contains(a) => exprMap.update(a, exprMap.getOrElse(a, 0) + 1)
+    })
+
+    val commonOutputs = if (exprMap.size > 0) {
+      exprMap.maxBy(_._2)._2
+    } else {
+      0
+    }
+
+    commonOutputs > maxCommonExprs
+  }
+
   private def collectProjectsAndFilters(plan: LogicalPlan): ScanReturnType = {
     plan match {
       case Project(fields, child) =>
         collectProjectsAndFilters(child) match {
           case Some((_, filters, other, aliases)) =>
             // Follow CollapseProject and only keep going if the collected Projects
-            // do not have common non-deterministic expressions.
-            if (!hasCommonNonDeterministic(fields, aliases)) {
+            // do not have common non-deterministic expressions, or do not have equal to/more than
+            // maximum allowed common outputs.
+            if (!hasCommonNonDeterministic(fields, aliases)
+                || !moreThanMaxAllowedCommonOutput(fields, aliases)) {

Review comment:
       nit, you may want to move `||` into line 157.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-706486885






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
viirya commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-704038432


   > Related to #29094 ?
   
   No, after did a quick scan of that PR. That PR targets driver OOM caused by too many leaf expressions in collapsed Project. Here this diff cares about duplicated common expressions in collapsed Project. Different problems, I think.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-708704286


   **[Test build #129773 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129773/testReport)** for PR 29950 at commit [`c2c01e4`](https://github.com/apache/spark/commit/c2c01e4a1c3d4e5e68058e1739df9323e74acf3e).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] maropu commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
maropu commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-733718994


   retest this please


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] maropu commented on a change in pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
maropu commented on a change in pull request #29950:
URL: https://github.com/apache/spark/pull/29950#discussion_r500685517



##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/CollapseProjectSuite.scala
##########
@@ -170,4 +171,34 @@ class CollapseProjectSuite extends PlanTest {
     val expected = Sample(0.0, 0.6, false, 11L, relation.select('a as 'c)).analyze
     comparePlans(optimized, expected)
   }
+
+  test("SPARK-32945: avoid collapsing projects if reaching max allowed common exprs") {

Review comment:
       Could you add end-2-end tests having physical planning? I'm worried that this might have the same issue I described in https://github.com/apache/spark/pull/29094#discussion_r458201704




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-708706315


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34366/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #29950:
URL: https://github.com/apache/spark/pull/29950#discussion_r502312406



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -732,10 +732,12 @@ object ColumnPruning extends Rule[LogicalPlan] {
  *    `GlobalLimit(LocalLimit)` pattern is also considered.

Review comment:
       Could you update the optimizer description according to the new conf?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-704665653






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-704665653






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-704664916






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-712402369


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34620/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-733444096


   Retest this please.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-706393689






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-726521998






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #29950:
URL: https://github.com/apache/spark/pull/29950#discussion_r522669719



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -758,6 +756,42 @@ object CollapseProject extends Rule[LogicalPlan] with AliasHelper {
       s.copy(child = p2.copy(projectList = buildCleanedProjectList(l1, p2.projectList)))
   }
 
+  private def collapseProjects(plan: LogicalPlan): LogicalPlan = plan match {
+    case p1 @ Project(_, p2: Project) =>
+      if (haveCommonNonDeterministicOutput(p1.projectList, p2.projectList) ||
+          moreThanMaxAllowedCommonOutput(p1.projectList, p2.projectList)) {
+        p1
+      } else {
+        collapseProjects(
+          p2.copy(projectList = buildCleanedProjectList(p1.projectList, p2.projectList)))
+      }
+    case _ => plan
+  }
+
+  private def collectAliases(projectList: Seq[NamedExpression]): AttributeMap[Alias] = {
+    AttributeMap(projectList.collect {
+      case a: Alias => a.toAttribute -> a
+    })
+  }
+
+  // Whether the largest times common outputs from lower operator used in upper operators is

Review comment:
       `upper operators` -> `upper operator`?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-712508933


   **[Test build #130013 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130013/testReport)** for PR 29950 at commit [`c2c01e4`](https://github.com/apache/spark/commit/c2c01e4a1c3d4e5e68058e1739df9323e74acf3e).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-704076658






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-707488433


   **[Test build #129727 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129727/testReport)** for PR 29950 at commit [`9bfafc7`](https://github.com/apache/spark/commit/9bfafc75d66987fd550052e6b3d53efe703b55ec).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #29950:
URL: https://github.com/apache/spark/pull/29950#discussion_r500693735



##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/CollapseProjectSuite.scala
##########
@@ -170,4 +171,34 @@ class CollapseProjectSuite extends PlanTest {
     val expected = Sample(0.0, 0.6, false, 11L, relation.select('a as 'c)).analyze
     comparePlans(optimized, expected)
   }
+
+  test("SPARK-32945: avoid collapsing projects if reaching max allowed common exprs") {

Review comment:
       I added similar check into `ScanOperation` rule. I will look at adding tests for physical plan later.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-706383993


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/129595/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #29950:
URL: https://github.com/apache/spark/pull/29950#discussion_r502808049



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -216,7 +216,7 @@ abstract class Optimizer(catalogManager: CatalogManager)
     // The following batch should be executed after batch "Join Reorder" and "LocalRelation".
     Batch("Check Cartesian Products", Once,
       CheckCartesianProducts) :+
-    Batch("RewriteSubquery", Once,
+    Batch("RewriteSubquery", fixedPoint,
       RewritePredicateSubquery,
       ColumnPruning,
       CollapseProject,

Review comment:
       `ColumnPruning` cannot finish all collapsing in one shot. `Once` will break idempotence.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-712381928


   **[Test build #130013 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130013/testReport)** for PR 29950 at commit [`c2c01e4`](https://github.com/apache/spark/commit/c2c01e4a1c3d4e5e68058e1739df9323e74acf3e).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-707511553






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] maropu commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
maropu commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-725928192


   retest this please


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #29950:
URL: https://github.com/apache/spark/pull/29950#discussion_r509035787



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##########
@@ -1926,6 +1926,23 @@ object SQLConf {
     .booleanConf
     .createWithDefault(true)
 
+  val MAX_COMMON_EXPRS_IN_COLLAPSE_PROJECT =
+    buildConf("spark.sql.optimizer.maxCommonExprsInCollapseProject")
+      .doc("An integer number indicates the maximum allowed number of common input expression " +
+        "from lower Project when being collapsed into upper Project by optimizer rule " +
+        "`CollapseProject`. Normally `CollapseProject` will collapse adjacent Project " +

Review comment:
       Yeah, but currently if we exclude `CollapseProject`, `ScanOperation` will work and collapse projections. Maybe update this doc?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-704662009


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/129485/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-740863252


   Thank you for your decision, @viirya !


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-733567816






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] maropu commented on a change in pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
maropu commented on a change in pull request #29950:
URL: https://github.com/apache/spark/pull/29950#discussion_r511905797



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala
##########
@@ -124,14 +126,32 @@ object ScanOperation extends OperationHelper with PredicateHelper {
     }.exists(!_.deterministic))
   }
 
+  def moreThanMaxAllowedCommonOutput(

Review comment:
       nit: private?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-704076667


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/129432/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-712509801






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-708681448


   **[Test build #129760 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129760/testReport)** for PR 29950 at commit [`c2c01e4`](https://github.com/apache/spark/commit/c2c01e4a1c3d4e5e68058e1739df9323e74acf3e).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-725964555


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35589/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-725982304


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-714699733






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-704075824


   **[Test build #129432 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129432/testReport)** for PR 29950 at commit [`f418714`](https://github.com/apache/spark/commit/f41871490da4df5d7cb5d352f5ef3795e8a6f625).
    * This patch **fails due to an unknown error code, -9**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-733728541


   **[Test build #131780 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131780/testReport)** for PR 29950 at commit [`bbaae3e`](https://github.com/apache/spark/commit/bbaae3ed9b96ca517ec5435838ac37feb578c959).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #29950:
URL: https://github.com/apache/spark/pull/29950#discussion_r500687998



##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/CollapseProjectSuite.scala
##########
@@ -170,4 +171,34 @@ class CollapseProjectSuite extends PlanTest {
     val expected = Sample(0.0, 0.6, false, 11L, relation.select('a as 'c)).analyze
     comparePlans(optimized, expected)
   }
+
+  test("SPARK-32945: avoid collapsing projects if reaching max allowed common exprs") {

Review comment:
       Oh, good point. We still change projections even in physical plan...




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-733897168






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #29950:
URL: https://github.com/apache/spark/pull/29950#discussion_r522674374



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -724,20 +725,17 @@ object ColumnPruning extends Rule[LogicalPlan] {
 /**
  * Combines two [[Project]] operators into one and perform alias substitution,
  * merging the expressions into one single expression for the following cases.
- * 1. When two [[Project]] operators are adjacent.
+ * 1. When two [[Project]] operators are adjacent, if the number of common expressions in the
+ *    combined [[Project]] is not more than `spark.sql.optimizer.maxCommonExprsInCollapseProject`.
  * 2. When two [[Project]] operators have LocalLimit/Sample/Repartition operator between them
  *    and the upper project consists of the same number of columns which is equal or aliasing.
  *    `GlobalLimit(LocalLimit)` pattern is also considered.
  */
 object CollapseProject extends Rule[LogicalPlan] with AliasHelper {
 
-  def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
-    case p1 @ Project(_, p2: Project) =>
-      if (haveCommonNonDeterministicOutput(p1.projectList, p2.projectList)) {
-        p1
-      } else {
-        p2.copy(projectList = buildCleanedProjectList(p1.projectList, p2.projectList))
-      }
+  def apply(plan: LogicalPlan): LogicalPlan = plan transformDown {

Review comment:
       I found the previous comment about supporting `withColumn`. If this is designed for that, shall we add a test case for that?
   - https://github.com/apache/spark/pull/29950#discussion_r501976989




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-708839963


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34379/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-704680294






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #29950:
URL: https://github.com/apache/spark/pull/29950#discussion_r502200210



##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/CollapseProjectSuite.scala
##########
@@ -170,4 +171,34 @@ class CollapseProjectSuite extends PlanTest {
     val expected = Sample(0.0, 0.6, false, 11L, relation.select('a as 'c)).analyze
     comparePlans(optimized, expected)
   }
+
+  test("SPARK-32945: avoid collapsing projects if reaching max allowed common exprs") {
+    val options = Map.empty[String, String]
+    val schema = StructType.fromDDL("a int, b int, c string, d long")
+
+    Seq("1", "2", "3", "4").foreach { maxCommonExprs =>
+      withSQLConf(SQLConf.MAX_COMMON_EXPRS_IN_COLLAPSE_PROJECT.key -> maxCommonExprs) {
+        // If we collapse two Projects, `JsonToStructs` will be repeated three times.
+        val relation = LocalRelation('json.string)
+        val query = relation.select(
+          JsonToStructs(schema, options, 'json).as("struct"))
+          .select(
+            GetStructField('struct, 0).as("a"),
+            GetStructField('struct, 1).as("b"),
+            GetStructField('struct, 2).as("c")).analyze

Review comment:
       This seems can be fixed by using `transformDown` instead? Seems to me `CollapseProject` is not necessarily to use `transformUp` if I don't miss anything. cc @cloud-fan @maropu 

##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/CollapseProjectSuite.scala
##########
@@ -170,4 +171,34 @@ class CollapseProjectSuite extends PlanTest {
     val expected = Sample(0.0, 0.6, false, 11L, relation.select('a as 'c)).analyze
     comparePlans(optimized, expected)
   }
+
+  test("SPARK-32945: avoid collapsing projects if reaching max allowed common exprs") {
+    val options = Map.empty[String, String]
+    val schema = StructType.fromDDL("a int, b int, c string, d long")
+
+    Seq("1", "2", "3", "4").foreach { maxCommonExprs =>
+      withSQLConf(SQLConf.MAX_COMMON_EXPRS_IN_COLLAPSE_PROJECT.key -> maxCommonExprs) {
+        // If we collapse two Projects, `JsonToStructs` will be repeated three times.
+        val relation = LocalRelation('json.string)
+        val query = relation.select(
+          JsonToStructs(schema, options, 'json).as("struct"))
+          .select(
+            GetStructField('struct, 0).as("a"),
+            GetStructField('struct, 1).as("b"),
+            GetStructField('struct, 2).as("c")).analyze

Review comment:
       Seems like we need to change to `transformDown` and take a recursive approach like `RemoveRedundantProjects` and `recursiveRemoveSort` for collapsing `Project`.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-712413174


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34620/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-733865440






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-725886039


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35578/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya edited a comment on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
viirya edited a comment on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-704561415


   > Perhaps the max number of common expressions is not the best metric here?
   > 
   > Lets compare two cases:
   > 
   >     1. On the lower project you have a `JsonToStructs` and on upper Project you get 3 fields from that struct. This would mean 2 redundant computations and the "metric" you are looking at is 3.
   > 
   >     2. On the lower project you have two `JsonToStructs` and on upper Project you get 2 fields from both stucts. This would also mean 2 redundant computations and the "metric" you are looking at is 2.
   > 
   > 
   > Adding more `JsonToStructs` to the lower level would increase the number redundant computations without increasing the max value.
   > So as an alternative I would propose "the number of redundant computations" (sum of values in the `exprMap` minus its size) as a metric to use.
   > 
   > Although I must admit, that in that case we might cache more values for the number of extra computations we save.
   > So both of them have their benefits.
   
   Yes, in the case you add the number of redundant computations each time you add one more `JsonToStructs`. But overall it should not cause noticeable performance issue because you simply three times the running cost of `JsonToStructs` (assume each `JsonToStructs` has 2 redundant computations). In our case we notice performance issue when one `JsonToStructs` is repeated more than 50 times.
   
   The number of redundant computations is misleading. If we have 100 `JsonToStructs` in lower project, each of them has 2 redundant computations in upper project, it doesn't mean we 100 times the running cost of `JsonToStructs`. In other words, it is hard to tell the performance difference between 10 and 20 redundant computations. If the redundant computations come from the same expression, then we have 10 times v.s. 20 times running cost, but if they are from 10 expressions? We might just have 2 to 3 times running cost.
   
   
   
   
   
   
   
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #29950:
URL: https://github.com/apache/spark/pull/29950#discussion_r502657789



##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/CollapseProjectSuite.scala
##########
@@ -170,4 +171,34 @@ class CollapseProjectSuite extends PlanTest {
     val expected = Sample(0.0, 0.6, false, 11L, relation.select('a as 'c)).analyze
     comparePlans(optimized, expected)
   }
+
+  test("SPARK-32945: avoid collapsing projects if reaching max allowed common exprs") {

Review comment:
       Added end-to-end test.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
viirya commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-712378966


   @maropu @dongjoon-hyun Any more comments?
   
   cc @cloud-fan 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
viirya commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-707851438


   retest this please


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-726087712


   **[Test build #130983 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130983/testReport)** for PR 29950 at commit [`58e71d8`](https://github.com/apache/spark/commit/58e71d85fb0752a2963b9e343d321cf1c6a8e634).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #29950:
URL: https://github.com/apache/spark/pull/29950#discussion_r503665670



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -760,12 +756,43 @@ object CollapseProject extends Rule[LogicalPlan] {
       s.copy(child = p2.copy(projectList = buildCleanedProjectList(l1, p2.projectList)))
   }
 
+  private def collapseProjects(plan: LogicalPlan): LogicalPlan = plan match {
+    case p1 @ Project(_, p2: Project) =>
+      val maxCommonExprs = SQLConf.get.maxCommonExprsInCollapseProject
+
+      if (haveCommonNonDeterministicOutput(p1.projectList, p2.projectList) ||
+          getLargestNumOfCommonOutput(p1.projectList, p2.projectList) > maxCommonExprs) {
+        p1
+      } else {
+        collapseProjects(
+          p2.copy(projectList = buildCleanedProjectList(p1.projectList, p2.projectList)))
+      }
+    case _ => plan
+  }
+
   private def collectAliases(projectList: Seq[NamedExpression]): AttributeMap[Alias] = {
     AttributeMap(projectList.collect {
       case a: Alias => a.toAttribute -> a
     })
   }
 
+  // Counts for the largest times common outputs from lower operator are used in upper operators.
+  private def getLargestNumOfCommonOutput(

Review comment:
       Two places looks similar however the parameters are slightly different. We can make them share same code, but the code lines are just few and refactoring needs more change, so seems not worth to me.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-733865440






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
viirya commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-704566080


   cc @cloud-fan @dongjoon-hyun too


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-706459003


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34210/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-706383760


   **[Test build #129595 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129595/testReport)** for PR 29950 at commit [`76509b3`](https://github.com/apache/spark/commit/76509b36cf2afef38a2794422cf5684f86bd1cf6).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-706368133


   **[Test build #129595 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129595/testReport)** for PR 29950 at commit [`76509b3`](https://github.com/apache/spark/commit/76509b36cf2afef38a2794422cf5684f86bd1cf6).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya edited a comment on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
viirya edited a comment on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-704561415


   > Perhaps the max number of common expressions is not the best metric here?
   > 
   > Lets compare two cases:
   > 
   >     1. On the lower project you have a `JsonToStructs` and on upper Project you get 3 fields from that struct. This would mean 2 redundant computations and the "metric" you are looking at is 3.
   > 
   >     2. On the lower project you have two `JsonToStructs` and on upper Project you get 2 fields from both stucts. This would also mean 2 redundant computations and the "metric" you are looking at is 2.
   > 
   > 
   > Adding more `JsonToStructs` to the lower level would increase the number redundant computations without increasing the max value.
   > So as an alternative I would propose "the number of redundant computations" (sum of values in the `exprMap` minus its size) as a metric to use.
   > 
   > Although I must admit, that in that case we might cache more values for the number of extra computations we save.
   > So both of them have their benefits.
   
   Yes, in the case you add the number of redundant computations each time you add one more `JsonToStructs`. But overall it should not cause noticeable performance issue because you simply three times the running cost of `JsonToStructs` (assume each `JsonToStructs` has 2 redundant computations).
   
   The number of redundant computations is misleading. If we have 100 `JsonToStructs` in lower project, each of them has 2 redundant computations in upper project, it doesn't mean we 100 times the running cost of `JsonToStructs`. In other words, it is hard to tell the performance difference between 10 and 20 redundant computations. If the redundant computations come from the same expression, then we have 10 times v.s. 20 times running cost, but if they are from 10 expressions? We might just have 2 to 3 times running cost.
   
   
   
   
   
   
   
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #29950:
URL: https://github.com/apache/spark/pull/29950#discussion_r503662637



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -216,7 +216,7 @@ abstract class Optimizer(catalogManager: CatalogManager)
     // The following batch should be executed after batch "Join Reorder" and "LocalRelation".
     Batch("Check Cartesian Products", Once,
       CheckCartesianProducts) :+
-    Batch("RewriteSubquery", Once,
+    Batch("RewriteSubquery", fixedPoint,

Review comment:
       Isn't `FixedPoint(1)` also to run the batch once?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-726445288






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-706383984


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-726531144


   The GitHub Action's flakiness at `sql - slow tests` is fixed via https://github.com/apache/spark/pull/30365 .


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-708875057


   **[Test build #129773 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129773/testReport)** for PR 29950 at commit [`c2c01e4`](https://github.com/apache/spark/commit/c2c01e4a1c3d4e5e68058e1739df9323e74acf3e).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-707502613


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34333/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] tanelk commented on a change in pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
tanelk commented on a change in pull request #29950:
URL: https://github.com/apache/spark/pull/29950#discussion_r500047652



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -732,10 +732,12 @@ object ColumnPruning extends Rule[LogicalPlan] {
  *    `GlobalLimit(LocalLimit)` pattern is also considered.
  */
 object CollapseProject extends Rule[LogicalPlan] {
-
   def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
     case p1 @ Project(_, p2: Project) =>
-      if (haveCommonNonDeterministicOutput(p1.projectList, p2.projectList)) {
+      val maxCommonExprs = SQLConf.get.maxCommonExprsInCollapseProject
+
+      if (haveCommonNonDeterministicOutput(p1.projectList, p2.projectList) ||
+        getLargestNumOfCommonOutput(p1.projectList, p2.projectList) >= maxCommonExprs) {

Review comment:
       Perhaps this comparison should be `>` instead of `>=`, because currently the actual max value is `maxCommonExprs - 1`. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-708681878


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-733450335


   **[Test build #131729 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131729/testReport)** for PR 29950 at commit [`bbaae3e`](https://github.com/apache/spark/commit/bbaae3ed9b96ca517ec5435838ac37feb578c959).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-726521054


   **[Test build #131024 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131024/testReport)** for PR 29950 at commit [`bbaae3e`](https://github.com/apache/spark/commit/bbaae3ed9b96ca517ec5435838ac37feb578c959).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #29950:
URL: https://github.com/apache/spark/pull/29950#discussion_r502314576



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala
##########
@@ -124,14 +128,34 @@ object ScanOperation extends OperationHelper with PredicateHelper {
     }.exists(!_.deterministic))
   }
 
+  def moreThanMaxAllowedCommonOutput(
+       expr: Seq[NamedExpression],
+       aliases: AttributeMap[Expression]): Boolean = {
+    val exprMap = mutable.HashMap.empty[Attribute, Int]
+
+    expr.foreach(_.collect {
+      case a: Attribute if aliases.contains(a) => exprMap.update(a, exprMap.getOrElse(a, 0) + 1)
+    })
+
+    val commonOutputs = if (exprMap.size > 0) {
+      exprMap.maxBy(_._2)._2
+    } else {
+      0
+    }
+
+    commonOutputs > maxCommonExprs
+  }
+
   private def collectProjectsAndFilters(plan: LogicalPlan): ScanReturnType = {
     plan match {
       case Project(fields, child) =>
         collectProjectsAndFilters(child) match {
           case Some((_, filters, other, aliases)) =>
             // Follow CollapseProject and only keep going if the collected Projects
-            // do not have common non-deterministic expressions.
-            if (!hasCommonNonDeterministic(fields, aliases)) {
+            // do not have common non-deterministic expressions, or do not have equal to/more than
+            // maximum allowed common outputs.
+            if (!hasCommonNonDeterministic(fields, aliases)
+                || !moreThanMaxAllowedCommonOutput(fields, aliases)) {

Review comment:
       nit, you may want to move `||` into line 157.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #29950:
URL: https://github.com/apache/spark/pull/29950#discussion_r502211597



##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/CollapseProjectSuite.scala
##########
@@ -170,4 +171,34 @@ class CollapseProjectSuite extends PlanTest {
     val expected = Sample(0.0, 0.6, false, 11L, relation.select('a as 'c)).analyze
     comparePlans(optimized, expected)
   }
+
+  test("SPARK-32945: avoid collapsing projects if reaching max allowed common exprs") {
+    val options = Map.empty[String, String]
+    val schema = StructType.fromDDL("a int, b int, c string, d long")
+
+    Seq("1", "2", "3", "4").foreach { maxCommonExprs =>
+      withSQLConf(SQLConf.MAX_COMMON_EXPRS_IN_COLLAPSE_PROJECT.key -> maxCommonExprs) {
+        // If we collapse two Projects, `JsonToStructs` will be repeated three times.
+        val relation = LocalRelation('json.string)
+        val query = relation.select(
+          JsonToStructs(schema, options, 'json).as("struct"))
+          .select(
+            GetStructField('struct, 0).as("a"),
+            GetStructField('struct, 1).as("b"),
+            GetStructField('struct, 2).as("c")).analyze

Review comment:
       Seems like we need to change to `transformDown` and take a recursive approach like `RemoveRedundantProjects` and `recursiveRemoveSort` for collapsing `Project`.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
viirya commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-712378514


   retest this please


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-708846661






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-707537801


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] maropu commented on a change in pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
maropu commented on a change in pull request #29950:
URL: https://github.com/apache/spark/pull/29950#discussion_r508936383



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -760,12 +757,43 @@ object CollapseProject extends Rule[LogicalPlan] {
       s.copy(child = p2.copy(projectList = buildCleanedProjectList(l1, p2.projectList)))
   }
 
+  private def collapseProjects(plan: LogicalPlan): LogicalPlan = plan match {
+    case p1 @ Project(_, p2: Project) =>
+      val maxCommonExprs = SQLConf.get.maxCommonExprsInCollapseProject
+
+      if (haveCommonNonDeterministicOutput(p1.projectList, p2.projectList) ||
+          getLargestNumOfCommonOutput(p1.projectList, p2.projectList) > maxCommonExprs) {

Review comment:
       How about following the `ScanOperation` code structure like this?
   ```
       case p1 @ Project(_, p2: Project) =>
         if (haveCommonNonDeterministicOutput(p1.projectList, p2.projectList) ||
             moreThanMaxAllowedCommonOutput(p1.projectList, p2.projectList)) {
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-706412910


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34200/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #29950:
URL: https://github.com/apache/spark/pull/29950#discussion_r502200210



##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/CollapseProjectSuite.scala
##########
@@ -170,4 +171,34 @@ class CollapseProjectSuite extends PlanTest {
     val expected = Sample(0.0, 0.6, false, 11L, relation.select('a as 'c)).analyze
     comparePlans(optimized, expected)
   }
+
+  test("SPARK-32945: avoid collapsing projects if reaching max allowed common exprs") {
+    val options = Map.empty[String, String]
+    val schema = StructType.fromDDL("a int, b int, c string, d long")
+
+    Seq("1", "2", "3", "4").foreach { maxCommonExprs =>
+      withSQLConf(SQLConf.MAX_COMMON_EXPRS_IN_COLLAPSE_PROJECT.key -> maxCommonExprs) {
+        // If we collapse two Projects, `JsonToStructs` will be repeated three times.
+        val relation = LocalRelation('json.string)
+        val query = relation.select(
+          JsonToStructs(schema, options, 'json).as("struct"))
+          .select(
+            GetStructField('struct, 0).as("a"),
+            GetStructField('struct, 1).as("b"),
+            GetStructField('struct, 2).as("c")).analyze

Review comment:
       This seems can be fixed by using `transformDown` instead? Seems to me `CollapseProject` is not necessarily to use `transformUp` if I don't miss anything. cc @cloud-fan @maropu 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] tanelk edited a comment on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
tanelk edited a comment on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-704073167


   Perhaps the max number of common expressions is not the best metric here?
   
   Lets compare two cases:
   1) On the lower project you have a `JsonToStructs` and on upper Project you get 3 fields from that struct. This would mean 2 redundant computations and the "metric" you are looking at is 3.
   
   2) On the lower project you have two `JsonToStructs` and on upper Project you get 2 fields from both stucts. This would also mean 2 redundant computations and the "metric" you are looking at is 2.
   
   Adding more `JsonToStructs` to the lower level would increase the number redundant computations without increasing the max value.
   So as an alternative I would propose "the number of redundant computations" (sum of values in the `exprMap` minus its size) as a metric to use.
   
   Although I must admit, that in that case we might cache more values for the number of extra computations we save.
   So both of them have their benefits.
   
   
   
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-704664898


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34091/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-714805870


   **[Test build #130170 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130170/testReport)** for PR 29950 at commit [`4990375`](https://github.com/apache/spark/commit/4990375fd1ed3acbd56e00817e7fd167a63031d6).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-706412931


   Merged build finished. Test PASSed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] maropu commented on a change in pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
maropu commented on a change in pull request #29950:
URL: https://github.com/apache/spark/pull/29950#discussion_r508932535



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##########
@@ -1926,6 +1926,23 @@ object SQLConf {
     .booleanConf
     .createWithDefault(true)
 
+  val MAX_COMMON_EXPRS_IN_COLLAPSE_PROJECT =
+    buildConf("spark.sql.optimizer.maxCommonExprsInCollapseProject")
+      .doc("An integer number indicates the maximum allowed number of common input expression " +
+        "from lower Project when being collapsed into upper Project by optimizer rule " +
+        "`CollapseProject`. Normally `CollapseProject` will collapse adjacent Project " +

Review comment:
       (Just a comment) Even if we set `spark.sql.optimizer.excludedRules` to `CollapseProject`, it seems like Spark still respects this value in `ScanOperation`? That behaviour might be okay, but it looks a bit weird to me.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #29950:
URL: https://github.com/apache/spark/pull/29950#discussion_r522667998



##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/CollapseProjectSuite.scala
##########
@@ -170,4 +171,59 @@ class CollapseProjectSuite extends PlanTest {
     val expected = Sample(0.0, 0.6, false, 11L, relation.select('a as 'c)).analyze
     comparePlans(optimized, expected)
   }
+
+  test("SPARK-32945: avoid collapsing projects if reaching max allowed common exprs") {
+    val options = Map.empty[String, String]
+    val schema = StructType.fromDDL("a int, b int, c string, d long")
+
+    Seq("1", "2", "3", "4").foreach { maxCommonExprs =>
+      withSQLConf(SQLConf.MAX_COMMON_EXPRS_IN_COLLAPSE_PROJECT.key -> maxCommonExprs) {
+        // If we collapse two Projects, `JsonToStructs` will be repeated three times.
+        val relation = LocalRelation('json.string)
+        val query1 = relation.select(
+          JsonToStructs(schema, options, 'json).as("struct"))

Review comment:
       indentation? Maybe, the following is better?
   ```scala
   - val query1 = relation.select(
   -   JsonToStructs(schema, options, 'json).as("struct"))
   -   .select(
   + val query1 = relation.select(JsonToStructs(schema, options, 'json).as("struct"))
   +   .select(
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-708699404


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34366/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-714806533






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-726420465


   **[Test build #131024 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131024/testReport)** for PR 29950 at commit [`bbaae3e`](https://github.com/apache/spark/commit/bbaae3ed9b96ca517ec5435838ac37feb578c959).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-733897186






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-733450335


   **[Test build #131729 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131729/testReport)** for PR 29950 at commit [`bbaae3e`](https://github.com/apache/spark/commit/bbaae3ed9b96ca517ec5435838ac37feb578c959).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-725982325


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/35589/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #29950:
URL: https://github.com/apache/spark/pull/29950#discussion_r510351431



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##########
@@ -1926,6 +1926,23 @@ object SQLConf {
     .booleanConf
     .createWithDefault(true)
 
+  val MAX_COMMON_EXPRS_IN_COLLAPSE_PROJECT =
+    buildConf("spark.sql.optimizer.maxCommonExprsInCollapseProject")

Review comment:
       I guess not. We might have at lease few common expressions in collapsed projection. If set to 1, any duplicated expression is not allowed.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-725931070


   **[Test build #130983 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130983/testReport)** for PR 29950 at commit [`58e71d8`](https://github.com/apache/spark/commit/58e71d85fb0752a2963b9e343d321cf1c6a8e634).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-708681878






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-714683506


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34777/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-725866882


   **[Test build #130972 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130972/testReport)** for PR 29950 at commit [`58e71d8`](https://github.com/apache/spark/commit/58e71d85fb0752a2963b9e343d321cf1c6a8e634).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-703996949


   **[Test build #129432 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129432/testReport)** for PR 29950 at commit [`f418714`](https://github.com/apache/spark/commit/f41871490da4df5d7cb5d352f5ef3795e8a6f625).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-706421966


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-714699733






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-726089718






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-704651539


   **[Test build #129486 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129486/testReport)** for PR 29950 at commit [`1b567e7`](https://github.com/apache/spark/commit/1b567e7a46dbaf926e708fdd9f3228530fd2c5d9).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-712413195






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya closed pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
viirya closed pull request #29950:
URL: https://github.com/apache/spark/pull/29950


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-725898127






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-733543105






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-725931070


   **[Test build #130983 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130983/testReport)** for PR 29950 at commit [`58e71d8`](https://github.com/apache/spark/commit/58e71d85fb0752a2963b9e343d321cf1c6a8e634).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #29950:
URL: https://github.com/apache/spark/pull/29950#discussion_r522669924



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -758,6 +756,42 @@ object CollapseProject extends Rule[LogicalPlan] with AliasHelper {
       s.copy(child = p2.copy(projectList = buildCleanedProjectList(l1, p2.projectList)))
   }
 
+  private def collapseProjects(plan: LogicalPlan): LogicalPlan = plan match {
+    case p1 @ Project(_, p2: Project) =>
+      if (haveCommonNonDeterministicOutput(p1.projectList, p2.projectList) ||
+          moreThanMaxAllowedCommonOutput(p1.projectList, p2.projectList)) {
+        p1
+      } else {
+        collapseProjects(
+          p2.copy(projectList = buildCleanedProjectList(p1.projectList, p2.projectList)))
+      }
+    case _ => plan
+  }
+
+  private def collectAliases(projectList: Seq[NamedExpression]): AttributeMap[Alias] = {
+    AttributeMap(projectList.collect {
+      case a: Alias => a.toAttribute -> a
+    })
+  }
+
+  // Whether the largest times common outputs from lower operator used in upper operators is
+  // larger than allowed.

Review comment:
       `than allowed` -> `than the maximum`?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-725866882


   **[Test build #130972 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130972/testReport)** for PR 29950 at commit [`58e71d8`](https://github.com/apache/spark/commit/58e71d85fb0752a2963b9e343d321cf1c6a8e634).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-733542143


   **[Test build #131729 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131729/testReport)** for PR 29950 at commit [`bbaae3e`](https://github.com/apache/spark/commit/bbaae3ed9b96ca517ec5435838ac37feb578c959).
    * This patch **fails PySpark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] maropu commented on a change in pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
maropu commented on a change in pull request #29950:
URL: https://github.com/apache/spark/pull/29950#discussion_r502990724



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -216,7 +216,7 @@ abstract class Optimizer(catalogManager: CatalogManager)
     // The following batch should be executed after batch "Join Reorder" and "LocalRelation".
     Batch("Check Cartesian Products", Once,
       CheckCartesianProducts) :+
-    Batch("RewriteSubquery", Once,
+    Batch("RewriteSubquery", fixedPoint,

Review comment:
       `FixedPoint(1)` instead? Also, could you leave comments about why we use it instead of `Once`?

##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -760,12 +756,43 @@ object CollapseProject extends Rule[LogicalPlan] {
       s.copy(child = p2.copy(projectList = buildCleanedProjectList(l1, p2.projectList)))
   }
 
+  private def collapseProjects(plan: LogicalPlan): LogicalPlan = plan match {
+    case p1 @ Project(_, p2: Project) =>
+      val maxCommonExprs = SQLConf.get.maxCommonExprsInCollapseProject
+
+      if (haveCommonNonDeterministicOutput(p1.projectList, p2.projectList) ||
+          getLargestNumOfCommonOutput(p1.projectList, p2.projectList) > maxCommonExprs) {
+        p1
+      } else {
+        collapseProjects(
+          p2.copy(projectList = buildCleanedProjectList(p1.projectList, p2.projectList)))
+      }
+    case _ => plan
+  }
+
   private def collectAliases(projectList: Seq[NamedExpression]): AttributeMap[Alias] = {
     AttributeMap(projectList.collect {
       case a: Alias => a.toAttribute -> a
     })
   }
 
+  // Counts for the largest times common outputs from lower operator are used in upper operators.
+  private def getLargestNumOfCommonOutput(

Review comment:
       we cannot share the code between this and `moreThanMaxAllowedCommonOutput`?

##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##########
@@ -1926,6 +1926,19 @@ object SQLConf {
     .booleanConf
     .createWithDefault(true)
 
+  val MAX_COMMON_EXPRS_IN_COLLAPSE_PROJECT =
+    buildConf("spark.sql.optimizer.maxCommonExprsInCollapseProject")
+      .doc("An integer number indicates the maximum allowed number of a common expression " +

Review comment:
       `a common expression` -> `common input attributes`?

##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##########
@@ -1926,6 +1926,19 @@ object SQLConf {
     .booleanConf
     .createWithDefault(true)
 
+  val MAX_COMMON_EXPRS_IN_COLLAPSE_PROJECT =
+    buildConf("spark.sql.optimizer.maxCommonExprsInCollapseProject")
+      .doc("An integer number indicates the maximum allowed number of a common expression " +
+        "can be collapsed into upper Project from lower Project by optimizer rule " +
+        "`CollapseProject`. Normally `CollapseProject` will collapse adjacent Project " +
+        "and merge expressions. But in some edge cases, expensive expressions might be " +
+        "duplicated many times in merged Project by this optimization. This config sets " +
+        "a maximum number. Once an expression is duplicated more than this number " +
+        "if merging two Project, Spark SQL will skip the merging.")
+      .version("3.1.0")
+      .intConf

Review comment:
       Could you add `checkValue`?

##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##########
@@ -1926,6 +1926,19 @@ object SQLConf {
     .booleanConf
     .createWithDefault(true)
 
+  val MAX_COMMON_EXPRS_IN_COLLAPSE_PROJECT =
+    buildConf("spark.sql.optimizer.maxCommonExprsInCollapseProject")
+      .doc("An integer number indicates the maximum allowed number of a common expression " +

Review comment:
       nit: `the maximum allowed number of a common expression can be collapsed into upper Project from lower Project ...` => `the maximum allowed number of common input attributes when collapsing adjacent Projects ...`?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
viirya commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-704561415


   > Perhaps the max number of common expressions is not the best metric here?
   > 
   > Lets compare two cases:
   > 
   >     1. On the lower project you have a `JsonToStructs` and on upper Project you get 3 fields from that struct. This would mean 2 redundant computations and the "metric" you are looking at is 3.
   > 
   >     2. On the lower project you have two `JsonToStructs` and on upper Project you get 2 fields from both stucts. This would also mean 2 redundant computations and the "metric" you are looking at is 2.
   > 
   > 
   > Adding more `JsonToStructs` to the lower level would increase the number redundant computations without increasing the max value.
   > So as an alternative I would propose "the number of redundant computations" (sum of values in the `exprMap` minus its size) as a metric to use.
   > 
   > Although I must admit, that in that case we might cache more values for the number of extra computations we save.
   > So both of them have their benefits.
   
   Yes, in the case you add the number of redundant computations each time you add one more `JsonToStructs`. But overall it should not cause noticeable performance issue because you simply three times the running cost of `JsonToStructs` (assume each `JsonToStructs` has 2 redundant computations).
   
   The number of redundant computations is misleading. If we have 100 `JsonToStructs` in lower project, each of them has 2 redundant computations in upper project, it doesn't mean we 100 times the running cost of `JsonToStructs`. In other words, it is hard to tell the performance difference between 10 and 20 redundant computations. If the redundant computations come from the same expression, then we have 10 times v.s. 20 times running cost, but if they are from 10 expressions? We might just have 2 to 3 times running cost.
   
   
   
   
   
   
   
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-726445280


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35630/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-725982285


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35589/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #29950:
URL: https://github.com/apache/spark/pull/29950#discussion_r509596491



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala
##########
@@ -124,14 +126,34 @@ object ScanOperation extends OperationHelper with PredicateHelper {
     }.exists(!_.deterministic))
   }
 
+  def moreThanMaxAllowedCommonOutput(
+      expr: Seq[NamedExpression],
+      aliases: AttributeMap[Expression]): Boolean = {
+    val exprMap = mutable.HashMap.empty[Attribute, Int]
+
+    expr.foreach(_.collect {
+      case a: Attribute if aliases.contains(a) => exprMap.update(a, exprMap.getOrElse(a, 0) + 1)
+    })
+
+    val commonOutputs = if (exprMap.size > 0) {
+      exprMap.maxBy(_._2)._2
+    } else {
+      0
+    }
+
+    commonOutputs > SQLConf.get.maxCommonExprsInCollapseProject
+  }
+
   private def collectProjectsAndFilters(plan: LogicalPlan): ScanReturnType = {
     plan match {
       case Project(fields, child) =>
         collectProjectsAndFilters(child) match {
           case Some((_, filters, other, aliases)) =>
             // Follow CollapseProject and only keep going if the collected Projects
-            // do not have common non-deterministic expressions.
-            if (!hasCommonNonDeterministic(fields, aliases)) {
+            // do not have common non-deterministic expressions, and do not have more than
+            // maximum allowed common outputs.
+            if (!hasCommonNonDeterministic(fields, aliases) &&
+                !moreThanMaxAllowedCommonOutput(fields, aliases)) {

Review comment:
       So before we have clearer idea about `PhysicalOperation` and `ScanOperation`, I will leave `PhysicalOperation` as it is. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-725865698


   Oops. Sorry for being late, @viirya .


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #29950:
URL: https://github.com/apache/spark/pull/29950#discussion_r503666979



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##########
@@ -1926,6 +1926,19 @@ object SQLConf {
     .booleanConf
     .createWithDefault(true)
 
+  val MAX_COMMON_EXPRS_IN_COLLAPSE_PROJECT =
+    buildConf("spark.sql.optimizer.maxCommonExprsInCollapseProject")
+      .doc("An integer number indicates the maximum allowed number of a common expression " +

Review comment:
       Not exactly the same, but I revised the doc.

##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##########
@@ -1926,6 +1926,19 @@ object SQLConf {
     .booleanConf
     .createWithDefault(true)
 
+  val MAX_COMMON_EXPRS_IN_COLLAPSE_PROJECT =
+    buildConf("spark.sql.optimizer.maxCommonExprsInCollapseProject")
+      .doc("An integer number indicates the maximum allowed number of a common expression " +
+        "can be collapsed into upper Project from lower Project by optimizer rule " +
+        "`CollapseProject`. Normally `CollapseProject` will collapse adjacent Project " +
+        "and merge expressions. But in some edge cases, expensive expressions might be " +
+        "duplicated many times in merged Project by this optimization. This config sets " +
+        "a maximum number. Once an expression is duplicated more than this number " +
+        "if merging two Project, Spark SQL will skip the merging.")
+      .version("3.1.0")
+      .intConf

Review comment:
       Added, thanks.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-704662008


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-708706324






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #29950:
URL: https://github.com/apache/spark/pull/29950#discussion_r510350405



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -760,12 +757,43 @@ object CollapseProject extends Rule[LogicalPlan] {
       s.copy(child = p2.copy(projectList = buildCleanedProjectList(l1, p2.projectList)))
   }
 
+  private def collapseProjects(plan: LogicalPlan): LogicalPlan = plan match {
+    case p1 @ Project(_, p2: Project) =>
+      val maxCommonExprs = SQLConf.get.maxCommonExprsInCollapseProject
+
+      if (haveCommonNonDeterministicOutput(p1.projectList, p2.projectList) ||
+          getLargestNumOfCommonOutput(p1.projectList, p2.projectList) > maxCommonExprs) {

Review comment:
       Good. Updated.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-706452537


   **[Test build #129607 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129607/testReport)** for PR 29950 at commit [`4bf4dc2`](https://github.com/apache/spark/commit/4bf4dc252c663afce30848481ec3887a70f7b611).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-706393667


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34198/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-733543105






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-704662008






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-704651539


   **[Test build #129486 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129486/testReport)** for PR 29950 at commit [`1b567e7`](https://github.com/apache/spark/commit/1b567e7a46dbaf926e708fdd9f3228530fd2c5d9).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-704664916






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
viirya commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-739698601


   I recently generalized subexpression elimination feature to interpreted project and predicate. So now both whole-stage codegen and interpreted execution support subexpression elimination that could avoid the performance issue caused by embedding common expressions from collapsing projects.
   
   That's said, I think this patch is less useful for now. I'm closing it now.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-725912414


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/130972/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] maropu commented on a change in pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
maropu commented on a change in pull request #29950:
URL: https://github.com/apache/spark/pull/29950#discussion_r509093034



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##########
@@ -1926,6 +1926,23 @@ object SQLConf {
     .booleanConf
     .createWithDefault(true)
 
+  val MAX_COMMON_EXPRS_IN_COLLAPSE_PROJECT =
+    buildConf("spark.sql.optimizer.maxCommonExprsInCollapseProject")
+      .doc("An integer number indicates the maximum allowed number of common input expression " +
+        "from lower Project when being collapsed into upper Project by optimizer rule " +
+        "`CollapseProject`. Normally `CollapseProject` will collapse adjacent Project " +

Review comment:
       hm I see. Yea, updating the doc sounds nice to me.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-706462627


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34210/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-725865830


   Retest this please.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] maropu commented on a change in pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
maropu commented on a change in pull request #29950:
URL: https://github.com/apache/spark/pull/29950#discussion_r522519283



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##########
@@ -1963,6 +1963,27 @@ object SQLConf {
     .booleanConf
     .createWithDefault(true)
 
+  val MAX_COMMON_EXPRS_IN_COLLAPSE_PROJECT =
+    buildConf("spark.sql.optimizer.maxCommonExprsInCollapseProject")
+      .doc("An integer number indicates the maximum allowed number of common input expression " +
+        "from lower Project when being collapsed into upper Project by optimizer rule " +
+        "`CollapseProject`. Normally `CollapseProject` will collapse adjacent Project " +
+        "and merge expressions. But in some edge cases, expensive expressions might be " +
+        "duplicated many times in merged Project by this optimization. This config sets " +
+        "a maximum number. Once an expression is duplicated more than this number " +
+        "if merging two Project, Spark SQL will skip the merging. Note that normally " +
+        "in whole-stage codegen Project operator will de-duplicate expressions internally, " +
+        "but in edge cases Spark cannot do whole-stage codegen and fallback to interpreted " +
+        "mode. In such cases, users can use this config to avoid duplicate expressions. " +
+        "Note that even users exclude `CollapseProject` rule using " +
+        "`spark.sql.optimizer.excludedRules`, at physical planning phase Spark will still " +
+        "collapse projections. This config is also effective on collapsing projections in " +
+        "the physical planning.")
+      .version("3.1.0")
+      .intConf
+      .checkValue(_ > 0, "The value of maxCommonExprsInCollapseProject must be larger than zero.")
+      .createWithDefault(20)

Review comment:
       +1




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #29950:
URL: https://github.com/apache/spark/pull/29950#discussion_r522667998



##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/CollapseProjectSuite.scala
##########
@@ -170,4 +171,59 @@ class CollapseProjectSuite extends PlanTest {
     val expected = Sample(0.0, 0.6, false, 11L, relation.select('a as 'c)).analyze
     comparePlans(optimized, expected)
   }
+
+  test("SPARK-32945: avoid collapsing projects if reaching max allowed common exprs") {
+    val options = Map.empty[String, String]
+    val schema = StructType.fromDDL("a int, b int, c string, d long")
+
+    Seq("1", "2", "3", "4").foreach { maxCommonExprs =>
+      withSQLConf(SQLConf.MAX_COMMON_EXPRS_IN_COLLAPSE_PROJECT.key -> maxCommonExprs) {
+        // If we collapse two Projects, `JsonToStructs` will be repeated three times.
+        val relation = LocalRelation('json.string)
+        val query1 = relation.select(
+          JsonToStructs(schema, options, 'json).as("struct"))

Review comment:
       indentation? Maybe, the following is better?
   ```scala
   - val query1 = relation.select(
   -   JsonToStructs(schema, options, 'json).as("struct"))
   -   .select(
   + val query1 = relation.select(JsonToStructs(schema, options, 'json).as("struct"))
   +  .select(
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-708637957


   **[Test build #129760 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129760/testReport)** for PR 29950 at commit [`c2c01e4`](https://github.com/apache/spark/commit/c2c01e4a1c3d4e5e68058e1739df9323e74acf3e).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-708846651


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34379/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-708706324


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-706386926


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34198/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-708846661






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-706486885






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #29950:
URL: https://github.com/apache/spark/pull/29950#discussion_r502641830



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala
##########
@@ -124,14 +128,34 @@ object ScanOperation extends OperationHelper with PredicateHelper {
     }.exists(!_.deterministic))
   }
 
+  def moreThanMaxAllowedCommonOutput(
+       expr: Seq[NamedExpression],
+       aliases: AttributeMap[Expression]): Boolean = {
+    val exprMap = mutable.HashMap.empty[Attribute, Int]
+
+    expr.foreach(_.collect {
+      case a: Attribute if aliases.contains(a) => exprMap.update(a, exprMap.getOrElse(a, 0) + 1)
+    })
+
+    val commonOutputs = if (exprMap.size > 0) {
+      exprMap.maxBy(_._2)._2
+    } else {
+      0
+    }
+
+    commonOutputs > maxCommonExprs
+  }
+
   private def collectProjectsAndFilters(plan: LogicalPlan): ScanReturnType = {
     plan match {
       case Project(fields, child) =>
         collectProjectsAndFilters(child) match {
           case Some((_, filters, other, aliases)) =>
             // Follow CollapseProject and only keep going if the collected Projects
-            // do not have common non-deterministic expressions.
-            if (!hasCommonNonDeterministic(fields, aliases)) {
+            // do not have common non-deterministic expressions, or do not have equal to/more than
+            // maximum allowed common outputs.
+            if (!hasCommonNonDeterministic(fields, aliases)
+                || !moreThanMaxAllowedCommonOutput(fields, aliases)) {

Review comment:
       sure. thanks.

##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -732,10 +732,12 @@ object ColumnPruning extends Rule[LogicalPlan] {
  *    `GlobalLimit(LocalLimit)` pattern is also considered.

Review comment:
       Yea, updated. thanks.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] maropu commented on a change in pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
maropu commented on a change in pull request #29950:
URL: https://github.com/apache/spark/pull/29950#discussion_r502206637



##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/CollapseProjectSuite.scala
##########
@@ -170,4 +171,34 @@ class CollapseProjectSuite extends PlanTest {
     val expected = Sample(0.0, 0.6, false, 11L, relation.select('a as 'c)).analyze
     comparePlans(optimized, expected)
   }
+
+  test("SPARK-32945: avoid collapsing projects if reaching max allowed common exprs") {
+    val options = Map.empty[String, String]
+    val schema = StructType.fromDDL("a int, b int, c string, d long")
+
+    Seq("1", "2", "3", "4").foreach { maxCommonExprs =>
+      withSQLConf(SQLConf.MAX_COMMON_EXPRS_IN_COLLAPSE_PROJECT.key -> maxCommonExprs) {
+        // If we collapse two Projects, `JsonToStructs` will be repeated three times.
+        val relation = LocalRelation('json.string)
+        val query = relation.select(
+          JsonToStructs(schema, options, 'json).as("struct"))
+          .select(
+            GetStructField('struct, 0).as("a"),
+            GetStructField('struct, 1).as("b"),
+            GetStructField('struct, 2).as("c")).analyze

Review comment:
       hm, it sounds fine, too. Rather, it seems a top-down transformation can collapse projects in one shot just like `RemoveRedundantProjects`?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-712415058


   Thanks for pinging me again, @viirya . I'll review this PR once more today.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-704013622






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] maropu commented on a change in pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
maropu commented on a change in pull request #29950:
URL: https://github.com/apache/spark/pull/29950#discussion_r508936383



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -760,12 +757,43 @@ object CollapseProject extends Rule[LogicalPlan] {
       s.copy(child = p2.copy(projectList = buildCleanedProjectList(l1, p2.projectList)))
   }
 
+  private def collapseProjects(plan: LogicalPlan): LogicalPlan = plan match {
+    case p1 @ Project(_, p2: Project) =>
+      val maxCommonExprs = SQLConf.get.maxCommonExprsInCollapseProject
+
+      if (haveCommonNonDeterministicOutput(p1.projectList, p2.projectList) ||
+          getLargestNumOfCommonOutput(p1.projectList, p2.projectList) > maxCommonExprs) {

Review comment:
       How about writing it as follows, following the ScanOperation way?
   ```
       case p1 @ Project(_, p2: Project) =>
         if (haveCommonNonDeterministicOutput(p1.projectList, p2.projectList) ||
             moreThanMaxAllowedCommonOutput(p1.projectList, p2.projectList)) {
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-726521998






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-704665641


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34092/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
viirya commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-707855552


   cc @cloud-fan 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] tanelk commented on a change in pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
tanelk commented on a change in pull request #29950:
URL: https://github.com/apache/spark/pull/29950#discussion_r502204585



##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/CollapseProjectSuite.scala
##########
@@ -170,4 +171,34 @@ class CollapseProjectSuite extends PlanTest {
     val expected = Sample(0.0, 0.6, false, 11L, relation.select('a as 'c)).analyze
     comparePlans(optimized, expected)
   }
+
+  test("SPARK-32945: avoid collapsing projects if reaching max allowed common exprs") {
+    val options = Map.empty[String, String]
+    val schema = StructType.fromDDL("a int, b int, c string, d long")
+
+    Seq("1", "2", "3", "4").foreach { maxCommonExprs =>
+      withSQLConf(SQLConf.MAX_COMMON_EXPRS_IN_COLLAPSE_PROJECT.key -> maxCommonExprs) {
+        // If we collapse two Projects, `JsonToStructs` will be repeated three times.
+        val relation = LocalRelation('json.string)
+        val query = relation.select(
+          JsonToStructs(schema, options, 'json).as("struct"))
+          .select(
+            GetStructField('struct, 0).as("a"),
+            GetStructField('struct, 1).as("b"),
+            GetStructField('struct, 2).as("c")).analyze

Review comment:
       If there is a choin of projects: `P1(P2(P3(P4(...))))`, then using `transformDown` will firstly merge `P1` and `P2` into `P12` and then it will go to its child `P3` and merge it with `P4` into `P34`. Only on the second iteration it will merge all 4 of these.
   
   In this case we want to merge `P123` and then see, that we can't merge with `P4` because we would exceed `maxCommonExprsInCollapseProject`.
   
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] maropu commented on a change in pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
maropu commented on a change in pull request #29950:
URL: https://github.com/apache/spark/pull/29950#discussion_r508931332



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala
##########
@@ -124,14 +126,34 @@ object ScanOperation extends OperationHelper with PredicateHelper {
     }.exists(!_.deterministic))
   }
 
+  def moreThanMaxAllowedCommonOutput(
+      expr: Seq[NamedExpression],
+      aliases: AttributeMap[Expression]): Boolean = {
+    val exprMap = mutable.HashMap.empty[Attribute, Int]
+
+    expr.foreach(_.collect {
+      case a: Attribute if aliases.contains(a) => exprMap.update(a, exprMap.getOrElse(a, 0) + 1)
+    })
+
+    val commonOutputs = if (exprMap.size > 0) {
+      exprMap.maxBy(_._2)._2
+    } else {
+      0
+    }
+
+    commonOutputs > SQLConf.get.maxCommonExprsInCollapseProject
+  }
+
   private def collectProjectsAndFilters(plan: LogicalPlan): ScanReturnType = {
     plan match {
       case Project(fields, child) =>
         collectProjectsAndFilters(child) match {
           case Some((_, filters, other, aliases)) =>
             // Follow CollapseProject and only keep going if the collected Projects
-            // do not have common non-deterministic expressions.
-            if (!hasCommonNonDeterministic(fields, aliases)) {
+            // do not have common non-deterministic expressions, and do not have more than
+            // maximum allowed common outputs.
+            if (!hasCommonNonDeterministic(fields, aliases) &&
+                !moreThanMaxAllowedCommonOutput(fields, aliases)) {

Review comment:
       `PhysicalOperation` does not have the same issue?

##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -760,12 +757,43 @@ object CollapseProject extends Rule[LogicalPlan] {
       s.copy(child = p2.copy(projectList = buildCleanedProjectList(l1, p2.projectList)))
   }
 
+  private def collapseProjects(plan: LogicalPlan): LogicalPlan = plan match {
+    case p1 @ Project(_, p2: Project) =>
+      val maxCommonExprs = SQLConf.get.maxCommonExprsInCollapseProject
+
+      if (haveCommonNonDeterministicOutput(p1.projectList, p2.projectList) ||
+          getLargestNumOfCommonOutput(p1.projectList, p2.projectList) > maxCommonExprs) {
+        p1
+      } else {
+        collapseProjects(
+          p2.copy(projectList = buildCleanedProjectList(p1.projectList, p2.projectList)))
+      }
+    case _ => plan
+  }
+
   private def collectAliases(projectList: Seq[NamedExpression]): AttributeMap[Alias] = {
     AttributeMap(projectList.collect {
       case a: Alias => a.toAttribute -> a
     })
   }
 
+  // Counts for the largest times common outputs from lower operator are used in upper operators.
+  private def getLargestNumOfCommonOutput(
+      upper: Seq[NamedExpression], lower: Seq[NamedExpression]): Int = {
+    val aliases = collectAliases(lower)
+    val exprMap = mutable.HashMap.empty[Attribute, Int]
+
+    upper.foreach(_.collect {
+      case a: Attribute if aliases.contains(a) => exprMap.update(a, exprMap.getOrElse(a, 0) + 1)
+    })
+
+    if (exprMap.size > 0) {

Review comment:
       `nonEmpty` instead?

##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##########
@@ -1926,6 +1926,23 @@ object SQLConf {
     .booleanConf
     .createWithDefault(true)
 
+  val MAX_COMMON_EXPRS_IN_COLLAPSE_PROJECT =
+    buildConf("spark.sql.optimizer.maxCommonExprsInCollapseProject")

Review comment:
       If we set this value to 1, all the existing tests can pass?

##########
File path: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala
##########
@@ -2567,6 +2568,51 @@ class DataFrameSuite extends QueryTest
     val df = l.join(r, $"col2" === $"col4", "LeftOuter")
     checkAnswer(df, Row("2", "2"))
   }
+
+  test("SPARK-32945: Avoid collapsing projects if reaching max allowed common exprs") {
+    val options = Map.empty[String, String]
+    val schema = StructType.fromDDL("a int, b int, c long, d string")
+
+    withTable("test_table") {
+      val jsonDF = Seq("""{"a":1, "b":2, "c": 123, "d": "test"}""").toDF("json")

Review comment:
       super nit: `DF` -> `Df`?

##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##########
@@ -1926,6 +1926,23 @@ object SQLConf {
     .booleanConf
     .createWithDefault(true)
 
+  val MAX_COMMON_EXPRS_IN_COLLAPSE_PROJECT =
+    buildConf("spark.sql.optimizer.maxCommonExprsInCollapseProject")
+      .doc("An integer number indicates the maximum allowed number of common input expression " +
+        "from lower Project when being collapsed into upper Project by optimizer rule " +
+        "`CollapseProject`. Normally `CollapseProject` will collapse adjacent Project " +

Review comment:
       (Just a comment) If we set `CollapseProject` in `spark.sql.optimizer.excludedRules`, it seems like Spark still respects this value in `ScanOperation`? That behaviour might be okay, but it looks a bit weird to me.

##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##########
@@ -1926,6 +1926,23 @@ object SQLConf {
     .booleanConf
     .createWithDefault(true)
 
+  val MAX_COMMON_EXPRS_IN_COLLAPSE_PROJECT =
+    buildConf("spark.sql.optimizer.maxCommonExprsInCollapseProject")
+      .doc("An integer number indicates the maximum allowed number of common input expression " +
+        "from lower Project when being collapsed into upper Project by optimizer rule " +
+        "`CollapseProject`. Normally `CollapseProject` will collapse adjacent Project " +
+        "and merge expressions. But in some edge cases, expensive expressions might be " +
+        "duplicated many times in merged Project by this optimization. This config sets " +
+        "a maximum number. Once an expression is duplicated more than this number " +
+        "if merging two Project, Spark SQL will skip the merging. Note that normally " +
+        "in whole-stage codegen Project operator will de-duplicate expressions internally, " +
+        "but in edge cases Spark cannot do whole-stage codegen and fallback to interpreted " +
+        "mode. In such cases, users can use this config to avoid duplicate expressions")
+      .version("3.1.0")
+      .intConf
+      .checkValue(m => m > 0, "maxCommonExprsInCollapseProject must be larger than zero.")

Review comment:
       nit: `m => m > 0` -> `_ > 0`

##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -760,12 +757,43 @@ object CollapseProject extends Rule[LogicalPlan] {
       s.copy(child = p2.copy(projectList = buildCleanedProjectList(l1, p2.projectList)))
   }
 
+  private def collapseProjects(plan: LogicalPlan): LogicalPlan = plan match {
+    case p1 @ Project(_, p2: Project) =>
+      val maxCommonExprs = SQLConf.get.maxCommonExprsInCollapseProject
+
+      if (haveCommonNonDeterministicOutput(p1.projectList, p2.projectList) ||
+          getLargestNumOfCommonOutput(p1.projectList, p2.projectList) > maxCommonExprs) {

Review comment:
       How about following the `ScanOperation` condition like this?
   ```
       case p1 @ Project(_, p2: Project) =>
         val maxCommonExprs = SQLConf.get.maxCommonExprsInCollapseProject
   
         if (haveCommonNonDeterministicOutput(p1.projectList, p2.projectList) ||
             moreThanMaxAllowedCommonOutput(p1.projectList, p2.projectList)) {
   ```

##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala
##########
@@ -124,14 +126,34 @@ object ScanOperation extends OperationHelper with PredicateHelper {
     }.exists(!_.deterministic))
   }
 
+  def moreThanMaxAllowedCommonOutput(
+      expr: Seq[NamedExpression],
+      aliases: AttributeMap[Expression]): Boolean = {
+    val exprMap = mutable.HashMap.empty[Attribute, Int]
+
+    expr.foreach(_.collect {
+      case a: Attribute if aliases.contains(a) => exprMap.update(a, exprMap.getOrElse(a, 0) + 1)
+    })
+
+    val commonOutputs = if (exprMap.size > 0) {

Review comment:
       ```
   
       if (exprMap.nonEmpty) {
         exprMap.maxBy(_._2)._2 > SQLConf.get.maxCommonExprsInCollapseProject
       } else {
         false
       }
   ```

##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##########
@@ -1926,6 +1926,23 @@ object SQLConf {
     .booleanConf
     .createWithDefault(true)
 
+  val MAX_COMMON_EXPRS_IN_COLLAPSE_PROJECT =
+    buildConf("spark.sql.optimizer.maxCommonExprsInCollapseProject")
+      .doc("An integer number indicates the maximum allowed number of common input expression " +
+        "from lower Project when being collapsed into upper Project by optimizer rule " +
+        "`CollapseProject`. Normally `CollapseProject` will collapse adjacent Project " +
+        "and merge expressions. But in some edge cases, expensive expressions might be " +
+        "duplicated many times in merged Project by this optimization. This config sets " +
+        "a maximum number. Once an expression is duplicated more than this number " +
+        "if merging two Project, Spark SQL will skip the merging. Note that normally " +
+        "in whole-stage codegen Project operator will de-duplicate expressions internally, " +
+        "but in edge cases Spark cannot do whole-stage codegen and fallback to interpreted " +
+        "mode. In such cases, users can use this config to avoid duplicate expressions")
+      .version("3.1.0")
+      .intConf
+      .checkValue(m => m > 0, "maxCommonExprsInCollapseProject must be larger than zero.")

Review comment:
       nit: `maxCommonExprsInCollapseProject` -> `The value of maxCommonExprsInCollapseProject `




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-708637957


   **[Test build #129760 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129760/testReport)** for PR 29950 at commit [`c2c01e4`](https://github.com/apache/spark/commit/c2c01e4a1c3d4e5e68058e1739df9323e74acf3e).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #29950:
URL: https://github.com/apache/spark/pull/29950#discussion_r522511754



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##########
@@ -1963,6 +1963,27 @@ object SQLConf {
     .booleanConf
     .createWithDefault(true)
 
+  val MAX_COMMON_EXPRS_IN_COLLAPSE_PROJECT =
+    buildConf("spark.sql.optimizer.maxCommonExprsInCollapseProject")
+      .doc("An integer number indicates the maximum allowed number of common input expression " +
+        "from lower Project when being collapsed into upper Project by optimizer rule " +
+        "`CollapseProject`. Normally `CollapseProject` will collapse adjacent Project " +
+        "and merge expressions. But in some edge cases, expensive expressions might be " +
+        "duplicated many times in merged Project by this optimization. This config sets " +
+        "a maximum number. Once an expression is duplicated more than this number " +
+        "if merging two Project, Spark SQL will skip the merging. Note that normally " +
+        "in whole-stage codegen Project operator will de-duplicate expressions internally, " +
+        "but in edge cases Spark cannot do whole-stage codegen and fallback to interpreted " +
+        "mode. In such cases, users can use this config to avoid duplicate expressions. " +
+        "Note that even users exclude `CollapseProject` rule using " +
+        "`spark.sql.optimizer.excludedRules`, at physical planning phase Spark will still " +
+        "collapse projections. This config is also effective on collapsing projections in " +
+        "the physical planning.")
+      .version("3.1.0")
+      .intConf
+      .checkValue(_ > 0, "The value of maxCommonExprsInCollapseProject must be larger than zero.")
+      .createWithDefault(20)

Review comment:
       If possible, can we introduce this configuration with `Int.MaxValue` in `3.1.0` first? We can reduce it later.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-707537451


   **[Test build #129727 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129727/testReport)** for PR 29950 at commit [`9bfafc7`](https://github.com/apache/spark/commit/9bfafc75d66987fd550052e6b3d53efe703b55ec).
    * This patch **fails due to an unknown error code, -9**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] tanelk commented on a change in pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
tanelk commented on a change in pull request #29950:
URL: https://github.com/apache/spark/pull/29950#discussion_r502206361



##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/CollapseProjectSuite.scala
##########
@@ -170,4 +171,34 @@ class CollapseProjectSuite extends PlanTest {
     val expected = Sample(0.0, 0.6, false, 11L, relation.select('a as 'c)).analyze
     comparePlans(optimized, expected)
   }
+
+  test("SPARK-32945: avoid collapsing projects if reaching max allowed common exprs") {
+    val options = Map.empty[String, String]
+    val schema = StructType.fromDDL("a int, b int, c string, d long")
+
+    Seq("1", "2", "3", "4").foreach { maxCommonExprs =>
+      withSQLConf(SQLConf.MAX_COMMON_EXPRS_IN_COLLAPSE_PROJECT.key -> maxCommonExprs) {
+        // If we collapse two Projects, `JsonToStructs` will be repeated three times.
+        val relation = LocalRelation('json.string)
+        val query = relation.select(
+          JsonToStructs(schema, options, 'json).as("struct"))
+          .select(
+            GetStructField('struct, 0).as("a"),
+            GetStructField('struct, 1).as("b"),
+            GetStructField('struct, 2).as("c")).analyze

Review comment:
       I think, that correct way would be using `transformDown` in a similar manner to `recursiveRemoveSort` in #21072. 
   So basically when you hit the first `Project`, then you collect all consecutive `Projects` until you hit the `maxCommonExprsInCollapseProject` limit and merge them.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-708875711






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-733890894


   **[Test build #131780 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131780/testReport)** for PR 29950 at commit [`bbaae3e`](https://github.com/apache/spark/commit/bbaae3ed9b96ca517ec5435838ac37feb578c959).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-712509801






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] tanelk commented on a change in pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
tanelk commented on a change in pull request #29950:
URL: https://github.com/apache/spark/pull/29950#discussion_r502204585



##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/CollapseProjectSuite.scala
##########
@@ -170,4 +171,34 @@ class CollapseProjectSuite extends PlanTest {
     val expected = Sample(0.0, 0.6, false, 11L, relation.select('a as 'c)).analyze
     comparePlans(optimized, expected)
   }
+
+  test("SPARK-32945: avoid collapsing projects if reaching max allowed common exprs") {
+    val options = Map.empty[String, String]
+    val schema = StructType.fromDDL("a int, b int, c string, d long")
+
+    Seq("1", "2", "3", "4").foreach { maxCommonExprs =>
+      withSQLConf(SQLConf.MAX_COMMON_EXPRS_IN_COLLAPSE_PROJECT.key -> maxCommonExprs) {
+        // If we collapse two Projects, `JsonToStructs` will be repeated three times.
+        val relation = LocalRelation('json.string)
+        val query = relation.select(
+          JsonToStructs(schema, options, 'json).as("struct"))
+          .select(
+            GetStructField('struct, 0).as("a"),
+            GetStructField('struct, 1).as("b"),
+            GetStructField('struct, 2).as("c")).analyze

Review comment:
       If there is a chain of projects: `P1(P2(P3(P4(...))))`, then using `transformDown` will firstly merge `P1` and `P2` into `P12` and then it will go to its child `P3` and merge it with `P4` into `P34`. Only on the second iteration it will merge all 4 of these.
   
   In this case we want to merge `P123` and then see, that we can't merge with `P4` because we would exceed `maxCommonExprsInCollapseProject`.
   
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-708875711






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-706421978


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/129597/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-708704286


   **[Test build #129773 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129773/testReport)** for PR 29950 at commit [`c2c01e4`](https://github.com/apache/spark/commit/c2c01e4a1c3d4e5e68058e1739df9323e74acf3e).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-706462631






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-733728541


   **[Test build #131780 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131780/testReport)** for PR 29950 at commit [`bbaae3e`](https://github.com/apache/spark/commit/bbaae3ed9b96ca517ec5435838ac37feb578c959).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] tanelk commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
tanelk commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-704073167


   Perhaps the max number of common expressions is not the best metric here?
   
   Lets compare two cases:
   1) On the lower project you have a `JsonToStructs` and on upper Project you get 3 fields from that struct. This would mean 2 redundant computations and the "metric" you are looking at is 3.
   
   2) On the lower project you have two `JsonToStructs` and on upper Project you get 2 fields from both stuct. This would also mean 2 redundant computations and the "metric" you are looking at is 2.
   
   Adding more `JsonToStructs` to the lower level would increase the number redundant computations without increasing the max value.
   So as an alternative I would propose "the number of redundant computations" (sum of values in the `exprMap` minus its size) as a metric to use.
   
   Although I must admit, that in that case we might cache more values for the number of extra computations we save.
   So both of them have their benefits.
   
   
   
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-704661942


   **[Test build #129485 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129485/testReport)** for PR 29950 at commit [`98843dd`](https://github.com/apache/spark/commit/98843dd0d86c47fce7871c56d1b94e67976d473e).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-704649309


   **[Test build #129485 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129485/testReport)** for PR 29950 at commit [`98843dd`](https://github.com/apache/spark/commit/98843dd0d86c47fce7871c56d1b94e67976d473e).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-725982304






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-714659652


   **[Test build #130170 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130170/testReport)** for PR 29950 at commit [`4990375`](https://github.com/apache/spark/commit/4990375fd1ed3acbd56e00817e7fd167a63031d6).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-704649309


   **[Test build #129485 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129485/testReport)** for PR 29950 at commit [`98843dd`](https://github.com/apache/spark/commit/98843dd0d86c47fce7871c56d1b94e67976d473e).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-706404974


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34200/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-706421860


   **[Test build #129597 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129597/testReport)** for PR 29950 at commit [`43eb50d`](https://github.com/apache/spark/commit/43eb50d57eb334395c9cbe7b193de28644c62e68).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-708706335


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/34366/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-726089718






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-726445294


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/35630/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-707488433


   **[Test build #129727 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129727/testReport)** for PR 29950 at commit [`9bfafc7`](https://github.com/apache/spark/commit/9bfafc75d66987fd550052e6b3d53efe703b55ec).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-706388940


   **[Test build #129597 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129597/testReport)** for PR 29950 at commit [`43eb50d`](https://github.com/apache/spark/commit/43eb50d57eb334395c9cbe7b193de28644c62e68).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-707537801






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-726434977


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35630/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #29950:
URL: https://github.com/apache/spark/pull/29950#discussion_r522519012



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##########
@@ -1963,6 +1963,27 @@ object SQLConf {
     .booleanConf
     .createWithDefault(true)
 
+  val MAX_COMMON_EXPRS_IN_COLLAPSE_PROJECT =
+    buildConf("spark.sql.optimizer.maxCommonExprsInCollapseProject")
+      .doc("An integer number indicates the maximum allowed number of common input expression " +
+        "from lower Project when being collapsed into upper Project by optimizer rule " +
+        "`CollapseProject`. Normally `CollapseProject` will collapse adjacent Project " +
+        "and merge expressions. But in some edge cases, expensive expressions might be " +
+        "duplicated many times in merged Project by this optimization. This config sets " +
+        "a maximum number. Once an expression is duplicated more than this number " +
+        "if merging two Project, Spark SQL will skip the merging. Note that normally " +
+        "in whole-stage codegen Project operator will de-duplicate expressions internally, " +
+        "but in edge cases Spark cannot do whole-stage codegen and fallback to interpreted " +
+        "mode. In such cases, users can use this config to avoid duplicate expressions. " +
+        "Note that even users exclude `CollapseProject` rule using " +
+        "`spark.sql.optimizer.excludedRules`, at physical planning phase Spark will still " +
+        "collapse projections. This config is also effective on collapsing projections in " +
+        "the physical planning.")
+      .version("3.1.0")
+      .intConf
+      .checkValue(_ > 0, "The value of maxCommonExprsInCollapseProject must be larger than zero.")
+      .createWithDefault(20)

Review comment:
       Sure. It is safer.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-707537810


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/129727/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-712413195






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #29950:
URL: https://github.com/apache/spark/pull/29950#discussion_r503666896



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -216,7 +216,7 @@ abstract class Optimizer(catalogManager: CatalogManager)
     // The following batch should be executed after batch "Join Reorder" and "LocalRelation".
     Batch("Check Cartesian Products", Once,
       CheckCartesianProducts) :+
-    Batch("RewriteSubquery", Once,
+    Batch("RewriteSubquery", fixedPoint,

Review comment:
       Added one comment.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-707511535


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34333/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #29950:
URL: https://github.com/apache/spark/pull/29950#discussion_r499976977



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -766,6 +768,23 @@ object CollapseProject extends Rule[LogicalPlan] {
     })
   }

Review comment:
       We could extend to other cases like `case p @ Project(_, agg: Aggregate)`, but leave it untouched for now.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-706486697


   **[Test build #129607 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129607/testReport)** for PR 29950 at commit [`4bf4dc2`](https://github.com/apache/spark/commit/4bf4dc252c663afce30848481ec3887a70f7b611).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] tanelk commented on a change in pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
tanelk commented on a change in pull request #29950:
URL: https://github.com/apache/spark/pull/29950#discussion_r501976989



##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/CollapseProjectSuite.scala
##########
@@ -170,4 +171,34 @@ class CollapseProjectSuite extends PlanTest {
     val expected = Sample(0.0, 0.6, false, 11L, relation.select('a as 'c)).analyze
     comparePlans(optimized, expected)
   }
+
+  test("SPARK-32945: avoid collapsing projects if reaching max allowed common exprs") {
+    val options = Map.empty[String, String]
+    val schema = StructType.fromDDL("a int, b int, c string, d long")
+
+    Seq("1", "2", "3", "4").foreach { maxCommonExprs =>
+      withSQLConf(SQLConf.MAX_COMMON_EXPRS_IN_COLLAPSE_PROJECT.key -> maxCommonExprs) {
+        // If we collapse two Projects, `JsonToStructs` will be repeated three times.
+        val relation = LocalRelation('json.string)
+        val query = relation.select(
+          JsonToStructs(schema, options, 'json).as("struct"))
+          .select(
+            GetStructField('struct, 0).as("a"),
+            GetStructField('struct, 1).as("b"),
+            GetStructField('struct, 2).as("c")).analyze

Review comment:
       When using the dataset API, then it would be very common to chain `withColumn` calls:
   ```
   dataset
       .withColumn("json", ...)
       .withColumn("a", col("json").getField("a"))
       .withColumn("b", col("json").getField("b"))
       .withColumn("c", col("json").getField("c"))
   ```
   
   In that case the query should look more like this:
   ```
           val query = relation
             .select('json, JsonToStructs(schema, options, 'json).as("struct"))
             .select('json, 'struct, GetStructField('struct, 0).as("a"))
             .select('json, 'struct, 'a, GetStructField('struct, 1).as("b"))
             .select('json, 'struct, 'a, 'b, GetStructField('struct, 2).as("c"))
             .analyze
   ```
   
   The `CollapseProject` rule uses `transformUp`. It seems that in that case we do not get the expected results from this optimization.

##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/CollapseProjectSuite.scala
##########
@@ -170,4 +171,34 @@ class CollapseProjectSuite extends PlanTest {
     val expected = Sample(0.0, 0.6, false, 11L, relation.select('a as 'c)).analyze
     comparePlans(optimized, expected)
   }
+
+  test("SPARK-32945: avoid collapsing projects if reaching max allowed common exprs") {
+    val options = Map.empty[String, String]
+    val schema = StructType.fromDDL("a int, b int, c string, d long")
+
+    Seq("1", "2", "3", "4").foreach { maxCommonExprs =>
+      withSQLConf(SQLConf.MAX_COMMON_EXPRS_IN_COLLAPSE_PROJECT.key -> maxCommonExprs) {
+        // If we collapse two Projects, `JsonToStructs` will be repeated three times.
+        val relation = LocalRelation('json.string)
+        val query = relation.select(
+          JsonToStructs(schema, options, 'json).as("struct"))
+          .select(
+            GetStructField('struct, 0).as("a"),
+            GetStructField('struct, 1).as("b"),
+            GetStructField('struct, 2).as("c")).analyze

Review comment:
       If there is a choin of projects: `P1(P2(P3(P4(...))))`, then using `transformDown` will firstly merge `P1` and `P2` into `P12` and then it will go to its child `P3` and merge it with `P4` into `P34`. Only on the second iteration it will merge all 4 of these.
   
   In this case we want to merge `P123` and then see, that we can't merge with `P4` because we would exceed `maxCommonExprsInCollapseProject`.
   
   

##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/CollapseProjectSuite.scala
##########
@@ -170,4 +171,34 @@ class CollapseProjectSuite extends PlanTest {
     val expected = Sample(0.0, 0.6, false, 11L, relation.select('a as 'c)).analyze
     comparePlans(optimized, expected)
   }
+
+  test("SPARK-32945: avoid collapsing projects if reaching max allowed common exprs") {
+    val options = Map.empty[String, String]
+    val schema = StructType.fromDDL("a int, b int, c string, d long")
+
+    Seq("1", "2", "3", "4").foreach { maxCommonExprs =>
+      withSQLConf(SQLConf.MAX_COMMON_EXPRS_IN_COLLAPSE_PROJECT.key -> maxCommonExprs) {
+        // If we collapse two Projects, `JsonToStructs` will be repeated three times.
+        val relation = LocalRelation('json.string)
+        val query = relation.select(
+          JsonToStructs(schema, options, 'json).as("struct"))
+          .select(
+            GetStructField('struct, 0).as("a"),
+            GetStructField('struct, 1).as("b"),
+            GetStructField('struct, 2).as("c")).analyze

Review comment:
       I think, that correct way would be using `transformDown` in a similar manner to `recursiveRemoveSort` in #21072. 
   So basically when you hit the first `Project`, then you collect all consecutive `Projects` until you hit the `maxCommonExprsInCollapseProject` limit and merge them.

##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/CollapseProjectSuite.scala
##########
@@ -170,4 +171,34 @@ class CollapseProjectSuite extends PlanTest {
     val expected = Sample(0.0, 0.6, false, 11L, relation.select('a as 'c)).analyze
     comparePlans(optimized, expected)
   }
+
+  test("SPARK-32945: avoid collapsing projects if reaching max allowed common exprs") {
+    val options = Map.empty[String, String]
+    val schema = StructType.fromDDL("a int, b int, c string, d long")
+
+    Seq("1", "2", "3", "4").foreach { maxCommonExprs =>
+      withSQLConf(SQLConf.MAX_COMMON_EXPRS_IN_COLLAPSE_PROJECT.key -> maxCommonExprs) {
+        // If we collapse two Projects, `JsonToStructs` will be repeated three times.
+        val relation = LocalRelation('json.string)
+        val query = relation.select(
+          JsonToStructs(schema, options, 'json).as("struct"))
+          .select(
+            GetStructField('struct, 0).as("a"),
+            GetStructField('struct, 1).as("b"),
+            GetStructField('struct, 2).as("c")).analyze

Review comment:
       If there is a chain of projects: `P1(P2(P3(P4(...))))`, then using `transformDown` will firstly merge `P1` and `P2` into `P12` and then it will go to its child `P3` and merge it with `P4` into `P34`. Only on the second iteration it will merge all 4 of these.
   
   In this case we want to merge `P123` and then see, that we can't merge with `P4` because we would exceed `maxCommonExprsInCollapseProject`.
   
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-741660327


   Thanks @viirya 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-708681888


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/129760/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] maropu commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
maropu commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-716501947


   No more comment now. cc: @dongjoon-hyun @cloud-fan 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-707511553






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] maropu commented on a change in pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
maropu commented on a change in pull request #29950:
URL: https://github.com/apache/spark/pull/29950#discussion_r502206637



##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/CollapseProjectSuite.scala
##########
@@ -170,4 +171,34 @@ class CollapseProjectSuite extends PlanTest {
     val expected = Sample(0.0, 0.6, false, 11L, relation.select('a as 'c)).analyze
     comparePlans(optimized, expected)
   }
+
+  test("SPARK-32945: avoid collapsing projects if reaching max allowed common exprs") {
+    val options = Map.empty[String, String]
+    val schema = StructType.fromDDL("a int, b int, c string, d long")
+
+    Seq("1", "2", "3", "4").foreach { maxCommonExprs =>
+      withSQLConf(SQLConf.MAX_COMMON_EXPRS_IN_COLLAPSE_PROJECT.key -> maxCommonExprs) {
+        // If we collapse two Projects, `JsonToStructs` will be repeated three times.
+        val relation = LocalRelation('json.string)
+        val query = relation.select(
+          JsonToStructs(schema, options, 'json).as("struct"))
+          .select(
+            GetStructField('struct, 0).as("a"),
+            GetStructField('struct, 1).as("b"),
+            GetStructField('struct, 2).as("c")).analyze

Review comment:
       hm, it sounds fine, too. Rather, it seems a top-down transformation can collapse projects in one shot just like `RemoveRedundantProjects`?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-706462631






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-706412943


   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/34200/
   Test PASSed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #29950:
URL: https://github.com/apache/spark/pull/29950#discussion_r502642551



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##########
@@ -1926,6 +1926,19 @@ object SQLConf {
     .booleanConf
     .createWithDefault(true)
 
+  val MAX_COMMON_EXPRS_IN_COLLAPSE_PROJECT =
+    buildConf("spark.sql.optimizer.maxCommonExprsInCollapseProject")
+      .doc("An integer number indicates the maximum allowed number of a common expression " +
+        "can be collapsed into upper Project from lower Project by optimizer rule " +
+        "`CollapseProject`. Normally `CollapseProject` will collapse adjacent Project " +
+        "and merge expressions. But in some edge cases, expensive expressions might be " +
+        "duplicated many times in merged Project by this optimization. This config sets " +
+        "a maximum number. Once an expression is duplicated more than this number " +
+        "if merging two Project, Spark SQL will skip the merging.")
+      .version("3.1.0")
+      .intConf
+      .createWithDefault(20)

Review comment:
       No, just decide a number that seems bad for repeating an expression.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-703996949


   **[Test build #129432 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129432/testReport)** for PR 29950 at commit [`f418714`](https://github.com/apache/spark/commit/f41871490da4df5d7cb5d352f5ef3795e8a6f625).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-726420465


   **[Test build #131024 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131024/testReport)** for PR 29950 at commit [`bbaae3e`](https://github.com/apache/spark/commit/bbaae3ed9b96ca517ec5435838ac37feb578c959).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-725912399






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-714806533






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-725912399


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-704680301


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/129486/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #29950:
URL: https://github.com/apache/spark/pull/29950#discussion_r522668629



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##########
@@ -1963,6 +1963,27 @@ object SQLConf {
     .booleanConf
     .createWithDefault(true)
 
+  val MAX_COMMON_EXPRS_IN_COLLAPSE_PROJECT =
+    buildConf("spark.sql.optimizer.maxCommonExprsInCollapseProject")
+      .doc("An integer number indicates the maximum allowed number of common input expression " +
+        "from lower Project when being collapsed into upper Project by optimizer rule " +
+        "`CollapseProject`. Normally `CollapseProject` will collapse adjacent Project " +
+        "and merge expressions. But in some edge cases, expensive expressions might be " +
+        "duplicated many times in merged Project by this optimization. This config sets " +
+        "a maximum number. Once an expression is duplicated more than this number " +
+        "if merging two Project, Spark SQL will skip the merging. Note that normally " +
+        "in whole-stage codegen Project operator will de-duplicate expressions internally, " +
+        "but in edge cases Spark cannot do whole-stage codegen and fallback to interpreted " +
+        "mode. In such cases, users can use this config to avoid duplicate expressions. " +
+        "Note that even users exclude `CollapseProject` rule using " +
+        "`spark.sql.optimizer.excludedRules`, at physical planning phase Spark will still " +
+        "collapse projections. This config is also effective on collapsing projections in " +
+        "the physical planning.")
+      .version("3.1.0")
+      .intConf
+      .checkValue(_ > 0, "The value of maxCommonExprsInCollapseProject must be larger than zero.")

Review comment:
       `larger than zero` -> `positive`.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
viirya commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-708683946


   retest this please


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #29950:
URL: https://github.com/apache/spark/pull/29950#discussion_r522665920



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -724,20 +725,17 @@ object ColumnPruning extends Rule[LogicalPlan] {
 /**
  * Combines two [[Project]] operators into one and perform alias substitution,
  * merging the expressions into one single expression for the following cases.
- * 1. When two [[Project]] operators are adjacent.
+ * 1. When two [[Project]] operators are adjacent, if the number of common expressions in the
+ *    combined [[Project]] is not more than `spark.sql.optimizer.maxCommonExprsInCollapseProject`.
  * 2. When two [[Project]] operators have LocalLimit/Sample/Repartition operator between them
  *    and the upper project consists of the same number of columns which is equal or aliasing.
  *    `GlobalLimit(LocalLimit)` pattern is also considered.
  */
 object CollapseProject extends Rule[LogicalPlan] with AliasHelper {
 
-  def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
-    case p1 @ Project(_, p2: Project) =>
-      if (haveCommonNonDeterministicOutput(p1.projectList, p2.projectList)) {
-        p1
-      } else {
-        p2.copy(projectList = buildCleanedProjectList(p1.projectList, p2.projectList))
-      }
+  def apply(plan: LogicalPlan): LogicalPlan = plan transformDown {

Review comment:
       Is there a reason to change from `transformUp` to `transformDown`? If the all test passed, it would be safe if we keep the original one.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-704009560


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34039/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-704076658


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-726445288


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-714659652


   **[Test build #130170 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130170/testReport)** for PR 29950 at commit [`4990375`](https://github.com/apache/spark/commit/4990375fd1ed3acbd56e00817e7fd167a63031d6).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-704013622






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #29950:
URL: https://github.com/apache/spark/pull/29950#discussion_r509035110



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala
##########
@@ -124,14 +126,34 @@ object ScanOperation extends OperationHelper with PredicateHelper {
     }.exists(!_.deterministic))
   }
 
+  def moreThanMaxAllowedCommonOutput(
+      expr: Seq[NamedExpression],
+      aliases: AttributeMap[Expression]): Boolean = {
+    val exprMap = mutable.HashMap.empty[Attribute, Int]
+
+    expr.foreach(_.collect {
+      case a: Attribute if aliases.contains(a) => exprMap.update(a, exprMap.getOrElse(a, 0) + 1)
+    })
+
+    val commonOutputs = if (exprMap.size > 0) {
+      exprMap.maxBy(_._2)._2
+    } else {
+      0
+    }
+
+    commonOutputs > SQLConf.get.maxCommonExprsInCollapseProject
+  }
+
   private def collectProjectsAndFilters(plan: LogicalPlan): ScanReturnType = {
     plan match {
       case Project(fields, child) =>
         collectProjectsAndFilters(child) match {
           case Some((_, filters, other, aliases)) =>
             // Follow CollapseProject and only keep going if the collected Projects
-            // do not have common non-deterministic expressions.
-            if (!hasCommonNonDeterministic(fields, aliases)) {
+            // do not have common non-deterministic expressions, and do not have more than
+            // maximum allowed common outputs.
+            if (!hasCommonNonDeterministic(fields, aliases) &&
+                !moreThanMaxAllowedCommonOutput(fields, aliases)) {

Review comment:
       I'm wondering how we decide to use `PhysicalOperation` or `ScanOperation`?
   
   I can see `ScanOperation` is used for re-construct Project/Filter around scan node. `PhysicalOperation` has mixed use cases, including re-construct Project/Filter around scan like `ScanOperation` (so can we replace `PhysicalOperation` with `ScanOperation` in these cases?), and others e.g. `OptimizeMetadataOnlyQuery`.
   
    




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-706383984






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #29950:
URL: https://github.com/apache/spark/pull/29950#discussion_r502313107



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -732,10 +732,12 @@ object ColumnPruning extends Rule[LogicalPlan] {
  *    `GlobalLimit(LocalLimit)` pattern is also considered.
  */
 object CollapseProject extends Rule[LogicalPlan] {
-
   def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
     case p1 @ Project(_, p2: Project) =>
-      if (haveCommonNonDeterministicOutput(p1.projectList, p2.projectList)) {
+      val maxCommonExprs = SQLConf.get.maxCommonExprsInCollapseProject
+
+      if (haveCommonNonDeterministicOutput(p1.projectList, p2.projectList) ||
+        getLargestNumOfCommonOutput(p1.projectList, p2.projectList) > maxCommonExprs) {

Review comment:
       indentation?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-712381928


   **[Test build #130013 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130013/testReport)** for PR 29950 at commit [`c2c01e4`](https://github.com/apache/spark/commit/c2c01e4a1c3d4e5e68058e1739df9323e74acf3e).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-706393689






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-706368133


   **[Test build #129595 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129595/testReport)** for PR 29950 at commit [`76509b3`](https://github.com/apache/spark/commit/76509b36cf2afef38a2794422cf5684f86bd1cf6).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-714699703


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34777/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-704680294


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #29950:
URL: https://github.com/apache/spark/pull/29950#discussion_r502313752



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala
##########
@@ -124,14 +128,34 @@ object ScanOperation extends OperationHelper with PredicateHelper {
     }.exists(!_.deterministic))
   }
 
+  def moreThanMaxAllowedCommonOutput(
+       expr: Seq[NamedExpression],

Review comment:
       indentation? It seems that there is one more space here.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] tanelk commented on a change in pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
tanelk commented on a change in pull request #29950:
URL: https://github.com/apache/spark/pull/29950#discussion_r501976989



##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/CollapseProjectSuite.scala
##########
@@ -170,4 +171,34 @@ class CollapseProjectSuite extends PlanTest {
     val expected = Sample(0.0, 0.6, false, 11L, relation.select('a as 'c)).analyze
     comparePlans(optimized, expected)
   }
+
+  test("SPARK-32945: avoid collapsing projects if reaching max allowed common exprs") {
+    val options = Map.empty[String, String]
+    val schema = StructType.fromDDL("a int, b int, c string, d long")
+
+    Seq("1", "2", "3", "4").foreach { maxCommonExprs =>
+      withSQLConf(SQLConf.MAX_COMMON_EXPRS_IN_COLLAPSE_PROJECT.key -> maxCommonExprs) {
+        // If we collapse two Projects, `JsonToStructs` will be repeated three times.
+        val relation = LocalRelation('json.string)
+        val query = relation.select(
+          JsonToStructs(schema, options, 'json).as("struct"))
+          .select(
+            GetStructField('struct, 0).as("a"),
+            GetStructField('struct, 1).as("b"),
+            GetStructField('struct, 2).as("c")).analyze

Review comment:
       When using the dataset API, then it would be very common to chain `withColumn` calls:
   ```
   dataset
       .withColumn("json", ...)
       .withColumn("a", col("json").getField("a"))
       .withColumn("b", col("json").getField("b"))
       .withColumn("c", col("json").getField("c"))
   ```
   
   In that case the query should look more like this:
   ```
           val query = relation
             .select('json, JsonToStructs(schema, options, 'json).as("struct"))
             .select('json, 'struct, GetStructField('struct, 0).as("a"))
             .select('json, 'struct, 'a, GetStructField('struct, 1).as("b"))
             .select('json, 'struct, 'a, 'b, GetStructField('struct, 2).as("c"))
             .analyze
   ```
   
   The `CollapseProject` rule uses `transformUp`. It seems that in that case we do not get the expected results from this optimization.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-725898102


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35578/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-704680127


   **[Test build #129486 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129486/testReport)** for PR 29950 at commit [`1b567e7`](https://github.com/apache/spark/commit/1b567e7a46dbaf926e708fdd9f3228530fd2c5d9).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #29950:
URL: https://github.com/apache/spark/pull/29950#discussion_r502311982



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##########
@@ -1926,6 +1926,19 @@ object SQLConf {
     .booleanConf
     .createWithDefault(true)
 
+  val MAX_COMMON_EXPRS_IN_COLLAPSE_PROJECT =
+    buildConf("spark.sql.optimizer.maxCommonExprsInCollapseProject")
+      .doc("An integer number indicates the maximum allowed number of a common expression " +
+        "can be collapsed into upper Project from lower Project by optimizer rule " +
+        "`CollapseProject`. Normally `CollapseProject` will collapse adjacent Project " +
+        "and merge expressions. But in some edge cases, expensive expressions might be " +
+        "duplicated many times in merged Project by this optimization. This config sets " +
+        "a maximum number. Once an expression is duplicated more than this number " +
+        "if merging two Project, Spark SQL will skip the merging.")
+      .version("3.1.0")
+      .intConf
+      .createWithDefault(20)

Review comment:
       Just a question. Is there a reason to choose `20`?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-704013611


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34039/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] maropu commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

Posted by GitBox <gi...@apache.org>.
maropu commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-704032203


   Related to https://github.com/apache/spark/pull/29094 ?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org