You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/09/20 01:45:35 UTC

[GitHub] [spark] viirya opened a new pull request #29812: [SPARK-32941][SQL] Optimize WithFields expression chain

viirya opened a new pull request #29812:
URL: https://github.com/apache/spark/pull/29812


   <!--
   Thanks for sending a pull request!  Here are some tips for you:
     1. If this is your first time, please read our contributor guidelines: https://spark.apache.org/contributing.html
     2. Ensure you have added or run the appropriate tests for your PR: https://spark.apache.org/developer-tools.html
     3. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP][SPARK-XXXX] Your PR title ...'.
     4. Be sure to keep the PR description updated to reflect all changes.
     5. Please write your PR title to summarize what this PR proposes.
     6. If possible, provide a concise example to reproduce the issue for a faster review.
     7. If you want to add a new configuration, please read the guideline first for naming configurations in
        'core/src/main/scala/org/apache/spark/internal/config/ConfigEntry.scala'.
   -->
   
   ### What changes were proposed in this pull request?
   <!--
   Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. 
   If possible, please consider writing useful notes for better and faster reviews in your PR. See the examples below.
     1. If you refactor some codes with changing classes, showing the class hierarchy will help reviewers.
     2. If you fix some SQL features, you can provide some references of other DBMSes.
     3. If there is design documentation, please add the link.
     4. If there is a discussion in the mailing list, please add the link.
   -->
   
   This patch proposes to add more optimization to `WithFields` expression chain.
   
   ### Why are the changes needed?
   <!--
   Please clarify why the changes are needed. For instance,
     1. If you propose a new API, clarify the use case for a new API.
     2. If you fix a bug, you can clarify why it is a bug.
   -->
   
   `WithFields` can manipulate complex nested data, but using `WithFields` can easily create inefficient expression chain. We should optimize it further.
   
   ### Does this PR introduce _any_ user-facing change?
   <!--
   Note that it means *any* user-facing change including all aspects such as the documentation fix.
   If yes, please clarify the previous behavior and the change this PR proposes - provide the console output, description and/or an example to show the behavior difference if possible.
   If possible, please also clarify if this is a user-facing change compared to the released Spark versions or within the unreleased branches such as master.
   If no, write 'No'.
   -->
   
   No
   
   ### How was this patch tested?
   <!--
   If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible.
   If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
   If tests were not added, please describe why they were not added and/or why it was difficult to add.
   -->
   
   Unit test.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #29812: [SPARK-32941][SQL] Optimize WithFields expression chain

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29812:
URL: https://github.com/apache/spark/pull/29812#issuecomment-695811553


   **[Test build #128918 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128918/testReport)** for PR 29812 at commit [`74cf2dd`](https://github.com/apache/spark/commit/74cf2ddd5524a9572193803bf9d753b8c888af7c).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29812: [SPARK-32941][SQL] Optimize WithFields expression chain

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29812:
URL: https://github.com/apache/spark/pull/29812#issuecomment-695701924


   **[Test build #128902 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128902/testReport)** for PR 29812 at commit [`8662d76`](https://github.com/apache/spark/commit/8662d7669f7632d4efb9a201253bc73823c5dfdd).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29812: [SPARK-32941][SQL] Optimize WithFields expression chain

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29812:
URL: https://github.com/apache/spark/pull/29812#issuecomment-695474669






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29812: [SPARK-32941][SQL] Optimize WithFields expression chain

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29812:
URL: https://github.com/apache/spark/pull/29812#issuecomment-697020330


   **[Test build #128996 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128996/testReport)** for PR 29812 at commit [`cb8872c`](https://github.com/apache/spark/commit/cb8872c3e203cf2ae5025f836b1eb53004b0e7f8).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29812: [SPARK-32941][SQL] Optimize WithFields expression chain

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29812:
URL: https://github.com/apache/spark/pull/29812#issuecomment-695749093


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #29812: [SPARK-32941][SQL] Optimize WithFields expression chain

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #29812:
URL: https://github.com/apache/spark/pull/29812#discussion_r492414412



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/WithFields.scala
##########
@@ -17,16 +17,29 @@
 
 package org.apache.spark.sql.catalyst.optimizer
 
-import org.apache.spark.sql.catalyst.expressions.WithFields
+import scala.collection.mutable
+
+import org.apache.spark.sql.catalyst.expressions.{Expression, GetStructField, WithFields}
 import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
 import org.apache.spark.sql.catalyst.rules.Rule
 
 
 /**
- * Combines all adjacent [[WithFields]] expression into a single [[WithFields]] expression.
+ * Optimizes [[WithFields]] expression chains.
  */
-object CombineWithFields extends Rule[LogicalPlan] {
+object OptimizeWithFields extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
+    case WithFields(structExpr, names, values) if names.distinct.length != names.length =>

Review comment:
       We don't run this rule just once, so the order should be fine.

##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/WithFields.scala
##########
@@ -17,16 +17,29 @@
 
 package org.apache.spark.sql.catalyst.optimizer
 
-import org.apache.spark.sql.catalyst.expressions.WithFields
+import scala.collection.mutable
+
+import org.apache.spark.sql.catalyst.expressions.{Expression, GetStructField, WithFields}
 import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
 import org.apache.spark.sql.catalyst.rules.Rule
 
 
 /**
- * Combines all adjacent [[WithFields]] expression into a single [[WithFields]] expression.
+ * Optimizes [[WithFields]] expression chains.
  */
-object CombineWithFields extends Rule[LogicalPlan] {
+object OptimizeWithFields extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
+    case WithFields(structExpr, names, values) if names.distinct.length != names.length =>
+      val newNames = mutable.ArrayBuffer.empty[String]
+      val newValues = mutable.ArrayBuffer.empty[Expression]
+      names.zip(values).reverse.foreach { case (name, value) =>
+        if (!newNames.contains(name)) {

Review comment:
       ok.

##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/WithFields.scala
##########
@@ -17,16 +17,29 @@
 
 package org.apache.spark.sql.catalyst.optimizer
 
-import org.apache.spark.sql.catalyst.expressions.WithFields
+import scala.collection.mutable
+
+import org.apache.spark.sql.catalyst.expressions.{Expression, GetStructField, WithFields}
 import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
 import org.apache.spark.sql.catalyst.rules.Rule
 
 
 /**
- * Combines all adjacent [[WithFields]] expression into a single [[WithFields]] expression.
+ * Optimizes [[WithFields]] expression chains.
  */
-object CombineWithFields extends Rule[LogicalPlan] {
+object OptimizeWithFields extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
+    case WithFields(structExpr, names, values) if names.distinct.length != names.length =>
+      val newNames = mutable.ArrayBuffer.empty[String]
+      val newValues = mutable.ArrayBuffer.empty[Expression]
+      names.zip(values).reverse.foreach { case (name, value) =>
+        if (!newNames.contains(name)) {
+          newNames += name
+          newValues += value
+        }
+      }
+      WithFields(structExpr, names = newNames.reverse.toSeq, valExprs = newValues.reverse.toSeq)

Review comment:
       You are right. It is eventually the same. This is coming from improving scalability of #29587. This is applied during I fixed the scalability issue. I found this is useful to reduce the complex of `WithFields` expression tree.
   
   I will run these rules in #29587 to simplify expression tree before optimizer.

##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/WithFields.scala
##########
@@ -17,16 +17,29 @@
 
 package org.apache.spark.sql.catalyst.optimizer
 
-import org.apache.spark.sql.catalyst.expressions.WithFields
+import scala.collection.mutable
+
+import org.apache.spark.sql.catalyst.expressions.{Expression, GetStructField, WithFields}
 import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
 import org.apache.spark.sql.catalyst.rules.Rule
 
 
 /**
- * Combines all adjacent [[WithFields]] expression into a single [[WithFields]] expression.
+ * Optimizes [[WithFields]] expression chains.
  */
-object CombineWithFields extends Rule[LogicalPlan] {
+object OptimizeWithFields extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
+    case WithFields(structExpr, names, values) if names.distinct.length != names.length =>
+      val newNames = mutable.ArrayBuffer.empty[String]
+      val newValues = mutable.ArrayBuffer.empty[Expression]
+      names.zip(values).reverse.foreach { case (name, value) =>
+        if (!newNames.contains(name)) {
+          newNames += name
+          newValues += value
+        }
+      }
+      WithFields(structExpr, names = newNames.reverse.toSeq, valExprs = newValues.reverse.toSeq)

Review comment:
       You are right. It is eventually the same. But for some cases, before we extend `WithFields`, the expression tree might be very complex. This is coming from improving scalability of #29587. This is applied during I fixed the scalability issue. I found this is useful to reduce the complex of `WithFields` expression tree.
   
   I will run these rules in #29587 to simplify expression tree before optimizer.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29812: [SPARK-32941][SQL] Optimize WithFields expression chain

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29812:
URL: https://github.com/apache/spark/pull/29812#issuecomment-695810832






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29812: [SPARK-32941][SQL] Optimize WithFields expression chain

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29812:
URL: https://github.com/apache/spark/pull/29812#issuecomment-695753753






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #29812: [SPARK-32941][SQL] Optimize UpdateFields expression chain and put the rule early in Analysis phase

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29812:
URL: https://github.com/apache/spark/pull/29812#issuecomment-711369142


   **[Test build #129964 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129964/testReport)** for PR 29812 at commit [`82ad8c8`](https://github.com/apache/spark/commit/82ad8c85701aa93b2601e6720ee2ee7661030596).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29812: [SPARK-32941][SQL] Optimize WithFields expression chain

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29812:
URL: https://github.com/apache/spark/pull/29812#issuecomment-695472488


   **[Test build #128901 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128901/testReport)** for PR 29812 at commit [`4182311`](https://github.com/apache/spark/commit/418231180deee557cd86cc54972400a7349fe13f).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #29812: [SPARK-32941][SQL] Optimize WithFields expression chain

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #29812:
URL: https://github.com/apache/spark/pull/29812#discussion_r493073140



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/WithFields.scala
##########
@@ -17,16 +17,32 @@
 
 package org.apache.spark.sql.catalyst.optimizer
 
-import org.apache.spark.sql.catalyst.expressions.WithFields
+import scala.collection.mutable
+
+import org.apache.spark.sql.catalyst.expressions.{Expression, WithFields}
 import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
 import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.internal.SQLConf
 
 
 /**
- * Combines all adjacent [[WithFields]] expression into a single [[WithFields]] expression.
+ * Optimizes [[WithFields]] expression chains.
  */
-object CombineWithFields extends Rule[LogicalPlan] {
+object OptimizeWithFields extends Rule[LogicalPlan] {
+  lazy val resolver = SQLConf.get.resolver
+
   def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
+    case WithFields(structExpr, names, values) if names.distinct.length != names.length =>
+      val newNames = mutable.ArrayBuffer.empty[String]
+      val newValues = mutable.ArrayBuffer.empty[Expression]
+      names.zip(values).reverse.foreach { case (name, value) =>
+        if (newNames.find(resolver(_, name)).isEmpty) {

Review comment:
       Added a set for case-sensitive case.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29812: [SPARK-32941][SQL] Optimize UpdateFields expression chain and put the rule early in Analysis phase

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29812:
URL: https://github.com/apache/spark/pull/29812#issuecomment-711447824






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on pull request #29812: [SPARK-32941][SQL] Optimize UpdateFields expression chain and put the rule early in Analysis phase

Posted by GitBox <gi...@apache.org>.
viirya commented on pull request #29812:
URL: https://github.com/apache/spark/pull/29812#issuecomment-713217464


   Thanks @maropu 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #29812: [SPARK-32941][SQL] Optimize WithFields expression chain

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #29812:
URL: https://github.com/apache/spark/pull/29812#discussion_r493111576



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/WithFields.scala
##########
@@ -17,16 +17,29 @@
 
 package org.apache.spark.sql.catalyst.optimizer
 
-import org.apache.spark.sql.catalyst.expressions.WithFields
+import scala.collection.mutable
+
+import org.apache.spark.sql.catalyst.expressions.{Expression, GetStructField, WithFields}
 import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
 import org.apache.spark.sql.catalyst.rules.Rule
 
 
 /**
- * Combines all adjacent [[WithFields]] expression into a single [[WithFields]] expression.
+ * Optimizes [[WithFields]] expression chains.
  */
-object CombineWithFields extends Rule[LogicalPlan] {
+object OptimizeWithFields extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
+    case WithFields(structExpr, names, values) if names.distinct.length != names.length =>
+      val newNames = mutable.ArrayBuffer.empty[String]
+      val newValues = mutable.ArrayBuffer.empty[Expression]
+      names.zip(values).reverse.foreach { case (name, value) =>
+        if (!newNames.contains(name)) {
+          newNames += name
+          newValues += value
+        }
+      }
+      WithFields(structExpr, names = newNames.reverse.toSeq, valExprs = newValues.reverse.toSeq)

Review comment:
       Actually I'd like to run these rules to simplify `WithFields` tree early in analysis stage. After #29587, I thought that it is very likely to write bad `WithFields` tree. Once hitting that, it is very hard to debug and the analyzer/optimizer spend a lot of time traversing expression tree. So I think it is very useful keep this rule to simplify the expression tree, but I don't think we want to do `ReplaceWithFieldsExpression` in analysis stage.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29812: [SPARK-32941][SQL] Optimize WithFields expression chain

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29812:
URL: https://github.com/apache/spark/pull/29812#issuecomment-695512726


   **[Test build #128904 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128904/testReport)** for PR 29812 at commit [`0217130`](https://github.com/apache/spark/commit/0217130b38ccb390bb385a750768c8ef985051a8).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29812: [SPARK-32941][SQL] Optimize WithFields expression chain

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29812:
URL: https://github.com/apache/spark/pull/29812#issuecomment-696460856


   **[Test build #128955 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128955/testReport)** for PR 29812 at commit [`00acff9`](https://github.com/apache/spark/commit/00acff9c5bacfee8fb86a5e95ac97bf8a1f3cd76).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #29812: [SPARK-32941][SQL] Optimize WithFields expression chain

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #29812:
URL: https://github.com/apache/spark/pull/29812#discussion_r492425468



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/WithFields.scala
##########
@@ -17,16 +17,29 @@
 
 package org.apache.spark.sql.catalyst.optimizer
 
-import org.apache.spark.sql.catalyst.expressions.WithFields
+import scala.collection.mutable
+
+import org.apache.spark.sql.catalyst.expressions.{Expression, GetStructField, WithFields}
 import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
 import org.apache.spark.sql.catalyst.rules.Rule
 
 
 /**
- * Combines all adjacent [[WithFields]] expression into a single [[WithFields]] expression.
+ * Optimizes [[WithFields]] expression chains.
  */
-object CombineWithFields extends Rule[LogicalPlan] {
+object OptimizeWithFields extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
+    case WithFields(structExpr, names, values) if names.distinct.length != names.length =>
+      val newNames = mutable.ArrayBuffer.empty[String]
+      val newValues = mutable.ArrayBuffer.empty[Expression]
+      names.zip(values).reverse.foreach { case (name, value) =>
+        if (!newNames.contains(name)) {
+          newNames += name
+          newValues += value
+        }
+      }
+      WithFields(structExpr, names = newNames.reverse.toSeq, valExprs = newValues.reverse.toSeq)

Review comment:
       You are right. It is eventually the same. This is coming from improving scalability of #29587. This is applied during I fixed the scalability issue. I found this is useful to reduce the complex of `WithFields` expression tree.
   
   I will run these rules in #29587 to simplify expression tree before optimizer.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29812: [SPARK-32941][SQL] Optimize WithFields expression chain

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29812:
URL: https://github.com/apache/spark/pull/29812#issuecomment-696517456


   **[Test build #128955 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128955/testReport)** for PR 29812 at commit [`00acff9`](https://github.com/apache/spark/commit/00acff9c5bacfee8fb86a5e95ac97bf8a1f3cd76).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] fqaiser94 commented on a change in pull request #29812: [SPARK-32941][SQL] Optimize WithFields expression chain

Posted by GitBox <gi...@apache.org>.
fqaiser94 commented on a change in pull request #29812:
URL: https://github.com/apache/spark/pull/29812#discussion_r493106387



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/WithFields.scala
##########
@@ -17,16 +17,29 @@
 
 package org.apache.spark.sql.catalyst.optimizer
 
-import org.apache.spark.sql.catalyst.expressions.WithFields
+import scala.collection.mutable
+
+import org.apache.spark.sql.catalyst.expressions.{Expression, GetStructField, WithFields}
 import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
 import org.apache.spark.sql.catalyst.rules.Rule
 
 
 /**
- * Combines all adjacent [[WithFields]] expression into a single [[WithFields]] expression.
+ * Optimizes [[WithFields]] expression chains.
  */
-object CombineWithFields extends Rule[LogicalPlan] {
+object OptimizeWithFields extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
+    case WithFields(structExpr, names, values) if names.distinct.length != names.length =>
+      val newNames = mutable.ArrayBuffer.empty[String]
+      val newValues = mutable.ArrayBuffer.empty[Expression]
+      names.zip(values).reverse.foreach { case (name, value) =>
+        if (!newNames.contains(name)) {
+          newNames += name
+          newValues += value
+        }
+      }
+      WithFields(structExpr, names = newNames.reverse.toSeq, valExprs = newValues.reverse.toSeq)

Review comment:
       Okay, so I took a look at the PR you linked and left a related [comment](https://github.com/apache/spark/pull/29587/files#r493098043) there. I don't think you actually need this optimization for #29587  
   
   This optimization is only useful if someone uses `WithFields` to update the same field multiple times. However, it would simply be better to **not** update the same field multiple times. At the very least, we should not do this when we re-use this Expression internally within Spark. 
   
   Unfortunately, "bad" end-users might still update the same field multiple times. Assuming we should optimize for such users (not sure), since this batch is only applied half-way through the optimization cycle anyway, I think we could just move up the `Batch("ReplaceWithFieldsExpression", Once, ReplaceWithFieldsExpression)` to get the same benefit (which is just simplified tree). What do you reckon?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29812: [SPARK-32941][SQL] Optimize UpdateFields expression chain and put the rule early in Analysis phase

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29812:
URL: https://github.com/apache/spark/pull/29812#issuecomment-711421951






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #29812: [SPARK-32941][SQL] Optimize WithFields expression chain

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29812:
URL: https://github.com/apache/spark/pull/29812#issuecomment-695472488


   **[Test build #128901 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128901/testReport)** for PR 29812 at commit [`4182311`](https://github.com/apache/spark/commit/418231180deee557cd86cc54972400a7349fe13f).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #29812: [SPARK-32941][SQL] Optimize WithFields expression chain

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #29812:
URL: https://github.com/apache/spark/pull/29812#discussion_r493124788



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/WithFields.scala
##########
@@ -17,16 +17,29 @@
 
 package org.apache.spark.sql.catalyst.optimizer
 
-import org.apache.spark.sql.catalyst.expressions.WithFields
+import scala.collection.mutable
+
+import org.apache.spark.sql.catalyst.expressions.{Expression, GetStructField, WithFields}
 import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
 import org.apache.spark.sql.catalyst.rules.Rule
 
 
 /**
- * Combines all adjacent [[WithFields]] expression into a single [[WithFields]] expression.
+ * Optimizes [[WithFields]] expression chains.
  */
-object CombineWithFields extends Rule[LogicalPlan] {
+object OptimizeWithFields extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
+    case WithFields(structExpr, names, values) if names.distinct.length != names.length =>
+      val newNames = mutable.ArrayBuffer.empty[String]
+      val newValues = mutable.ArrayBuffer.empty[Expression]
+      names.zip(values).reverse.foreach { case (name, value) =>
+        if (!newNames.contains(name)) {
+          newNames += name
+          newValues += value
+        }
+      }
+      WithFields(structExpr, names = newNames.reverse.toSeq, valExprs = newValues.reverse.toSeq)

Review comment:
       I'm fine to wait until #29795.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29812: [SPARK-32941][SQL] Optimize WithFields expression chain

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29812:
URL: https://github.com/apache/spark/pull/29812#issuecomment-695749093






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29812: [SPARK-32941][SQL] Optimize WithFields expression chain

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29812:
URL: https://github.com/apache/spark/pull/29812#issuecomment-695702580


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #29812: [SPARK-32941][SQL] Optimize WithFields expression chain

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #29812:
URL: https://github.com/apache/spark/pull/29812#discussion_r492425468



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/WithFields.scala
##########
@@ -17,16 +17,29 @@
 
 package org.apache.spark.sql.catalyst.optimizer
 
-import org.apache.spark.sql.catalyst.expressions.WithFields
+import scala.collection.mutable
+
+import org.apache.spark.sql.catalyst.expressions.{Expression, GetStructField, WithFields}
 import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
 import org.apache.spark.sql.catalyst.rules.Rule
 
 
 /**
- * Combines all adjacent [[WithFields]] expression into a single [[WithFields]] expression.
+ * Optimizes [[WithFields]] expression chains.
  */
-object CombineWithFields extends Rule[LogicalPlan] {
+object OptimizeWithFields extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
+    case WithFields(structExpr, names, values) if names.distinct.length != names.length =>
+      val newNames = mutable.ArrayBuffer.empty[String]
+      val newValues = mutable.ArrayBuffer.empty[Expression]
+      names.zip(values).reverse.foreach { case (name, value) =>
+        if (!newNames.contains(name)) {
+          newNames += name
+          newValues += value
+        }
+      }
+      WithFields(structExpr, names = newNames.reverse.toSeq, valExprs = newValues.reverse.toSeq)

Review comment:
       You are right. It is eventually the same. But for some cases, before we extend `WithFields`, the expression tree might be very complex. This is coming from improving scalability of #29587. This is applied during I fixed the scalability issue. I found this is useful to reduce the complex of `WithFields` expression tree.
   
   I will run these rules in #29587 to simplify expression tree before entering optimizer.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29812: [SPARK-32941][SQL] Optimize WithFields expression chain

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29812:
URL: https://github.com/apache/spark/pull/29812#issuecomment-696461134






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #29812: [SPARK-32941][SQL] Optimize UpdateFields expression chain and put the rule early in Analysis phase

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #29812:
URL: https://github.com/apache/spark/pull/29812#discussion_r507239350



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/UpdateFields.scala
##########
@@ -17,19 +17,61 @@
 
 package org.apache.spark.sql.catalyst.optimizer
 
-import org.apache.spark.sql.catalyst.expressions.UpdateFields
+import java.util.Locale
+
+import scala.collection.mutable
+
+import org.apache.spark.sql.catalyst.expressions.{Expression, UpdateFields, WithField}
 import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
 import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.internal.SQLConf
 
 
 /**
- * Combines all adjacent [[UpdateFields]] expression into a single [[UpdateFields]] expression.
+ * Optimizes [[UpdateFields]] expression chains.
  */
-object CombineUpdateFields extends Rule[LogicalPlan] {
-  def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
+object OptimizeUpdateFields extends Rule[LogicalPlan] {
+  val optimizeUpdateFields: PartialFunction[Expression, Expression] = {
+    case UpdateFields(structExpr, fieldOps)
+      if fieldOps.forall(_.isInstanceOf[WithField]) &&
+        fieldOps.map(_.asInstanceOf[WithField].name.toLowerCase(Locale.ROOT)).distinct.length !=

Review comment:
       For example, for case-sensitive case, `fieldOps.map(_.asInstanceOf[WithField].name).distinct.length != fieldOps.length` should be used.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29812: [SPARK-32941][SQL] Optimize WithFields expression chain

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29812:
URL: https://github.com/apache/spark/pull/29812#issuecomment-695514525






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29812: [SPARK-32941][SQL] Optimize WithFields expression chain

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29812:
URL: https://github.com/apache/spark/pull/29812#issuecomment-695474669






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #29812: [SPARK-32941][SQL] Optimize WithFields expression chain

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29812:
URL: https://github.com/apache/spark/pull/29812#issuecomment-696460856


   **[Test build #128955 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128955/testReport)** for PR 29812 at commit [`00acff9`](https://github.com/apache/spark/commit/00acff9c5bacfee8fb86a5e95ac97bf8a1f3cd76).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29812: [SPARK-32941][SQL] Optimize WithFields expression chain

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29812:
URL: https://github.com/apache/spark/pull/29812#issuecomment-695811553






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29812: [SPARK-32941][SQL] Optimize WithFields expression chain

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29812:
URL: https://github.com/apache/spark/pull/29812#issuecomment-695839443






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29812: [SPARK-32941][SQL] Optimize WithFields expression chain

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29812:
URL: https://github.com/apache/spark/pull/29812#issuecomment-696518033






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] fqaiser94 commented on a change in pull request #29812: [SPARK-32941][SQL] Optimize WithFields expression chain

Posted by GitBox <gi...@apache.org>.
fqaiser94 commented on a change in pull request #29812:
URL: https://github.com/apache/spark/pull/29812#discussion_r491929282



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/WithFields.scala
##########
@@ -17,16 +17,29 @@
 
 package org.apache.spark.sql.catalyst.optimizer
 
-import org.apache.spark.sql.catalyst.expressions.WithFields
+import scala.collection.mutable
+
+import org.apache.spark.sql.catalyst.expressions.{Expression, GetStructField, WithFields}
 import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
 import org.apache.spark.sql.catalyst.rules.Rule
 
 
 /**
- * Combines all adjacent [[WithFields]] expression into a single [[WithFields]] expression.
+ * Optimizes [[WithFields]] expression chains.
  */
-object CombineWithFields extends Rule[LogicalPlan] {
+object OptimizeWithFields extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
+    case WithFields(structExpr, names, values) if names.distinct.length != names.length =>
+      val newNames = mutable.ArrayBuffer.empty[String]
+      val newValues = mutable.ArrayBuffer.empty[Expression]
+      names.zip(values).reverse.foreach { case (name, value) =>
+        if (!newNames.contains(name)) {

Review comment:
       should use `resolver` here otherwise I think we will have correct-ness issues.

##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/WithFields.scala
##########
@@ -17,16 +17,29 @@
 
 package org.apache.spark.sql.catalyst.optimizer
 
-import org.apache.spark.sql.catalyst.expressions.WithFields
+import scala.collection.mutable
+
+import org.apache.spark.sql.catalyst.expressions.{Expression, GetStructField, WithFields}
 import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
 import org.apache.spark.sql.catalyst.rules.Rule
 
 
 /**
- * Combines all adjacent [[WithFields]] expression into a single [[WithFields]] expression.
+ * Optimizes [[WithFields]] expression chains.
  */
-object CombineWithFields extends Rule[LogicalPlan] {
+object OptimizeWithFields extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
+    case WithFields(structExpr, names, values) if names.distinct.length != names.length =>
+      val newNames = mutable.ArrayBuffer.empty[String]
+      val newValues = mutable.ArrayBuffer.empty[Expression]
+      names.zip(values).reverse.foreach { case (name, value) =>
+        if (!newNames.contains(name)) {
+          newNames += name
+          newValues += value
+        }
+      }
+      WithFields(structExpr, names = newNames.reverse.toSeq, valExprs = newValues.reverse.toSeq)

Review comment:
       For my understanding, can you explain how we expect to benefit from this optimization? 
   
   I ask because we do this kind of deduplication inside of `WithFields` already as part of the `foldLeft` operation [here](https://github.com/apache/spark/blob/d01594e8d186e63a6c3ce361e756565e830d5237/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala#L578). It will only keep the last `valExpr` for each `name`. So I think the optimized logical plan will be the same with or without this optimization in all scenarios? CMIIW

##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/WithFields.scala
##########
@@ -17,16 +17,29 @@
 
 package org.apache.spark.sql.catalyst.optimizer
 
-import org.apache.spark.sql.catalyst.expressions.WithFields
+import scala.collection.mutable
+
+import org.apache.spark.sql.catalyst.expressions.{Expression, GetStructField, WithFields}
 import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
 import org.apache.spark.sql.catalyst.rules.Rule
 
 
 /**
- * Combines all adjacent [[WithFields]] expression into a single [[WithFields]] expression.
+ * Optimizes [[WithFields]] expression chains.
  */
-object CombineWithFields extends Rule[LogicalPlan] {
+object OptimizeWithFields extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
+    case WithFields(structExpr, names, values) if names.distinct.length != names.length =>

Review comment:
       could this `case` statement be after the next `case` statement? So that we combine the chains first before deduplicating?

##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/WithFields.scala
##########
@@ -17,16 +17,29 @@
 
 package org.apache.spark.sql.catalyst.optimizer
 
-import org.apache.spark.sql.catalyst.expressions.WithFields
+import scala.collection.mutable
+
+import org.apache.spark.sql.catalyst.expressions.{Expression, GetStructField, WithFields}
 import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
 import org.apache.spark.sql.catalyst.rules.Rule
 
 
 /**
- * Combines all adjacent [[WithFields]] expression into a single [[WithFields]] expression.
+ * Optimizes [[WithFields]] expression chains.
  */
-object CombineWithFields extends Rule[LogicalPlan] {
+object OptimizeWithFields extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
+    case WithFields(structExpr, names, values) if names.distinct.length != names.length =>
+      val newNames = mutable.ArrayBuffer.empty[String]
+      val newValues = mutable.ArrayBuffer.empty[Expression]
+      names.zip(values).reverse.foreach { case (name, value) =>
+        if (!newNames.contains(name)) {
+          newNames += name
+          newValues += value
+        }
+      }
+      WithFields(structExpr, names = newNames.reverse.toSeq, valExprs = newValues.reverse.toSeq)

Review comment:
       Okay, so I took a look at the PR you linked and left a related [comment](https://github.com/apache/spark/pull/29587/files#r493098043) there. I don't think you actually need this optimization for `unionByName`. 
   
   This optimization is only useful if someone uses `WithFields` to update the same field multiple times. However, it would simply be better to **not** update the same field multiple times. At the very least, we should not do this when we re-use this Expression internally within Spark. 
   
   Unfortunately, "bad" end-users might still update the same field multiple times. Assuming we should optimize for such users (not sure), since this batch is only applied half-way through the optimization cycle, I think we could just move up the `Batch("ReplaceWithFieldsExpression", Once, ReplaceWithFieldsExpression)` to get the same benefit (which is just simplified tree). What do you reckon?

##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/WithFields.scala
##########
@@ -17,16 +17,29 @@
 
 package org.apache.spark.sql.catalyst.optimizer
 
-import org.apache.spark.sql.catalyst.expressions.WithFields
+import scala.collection.mutable
+
+import org.apache.spark.sql.catalyst.expressions.{Expression, GetStructField, WithFields}
 import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
 import org.apache.spark.sql.catalyst.rules.Rule
 
 
 /**
- * Combines all adjacent [[WithFields]] expression into a single [[WithFields]] expression.
+ * Optimizes [[WithFields]] expression chains.
  */
-object CombineWithFields extends Rule[LogicalPlan] {
+object OptimizeWithFields extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
+    case WithFields(structExpr, names, values) if names.distinct.length != names.length =>
+      val newNames = mutable.ArrayBuffer.empty[String]
+      val newValues = mutable.ArrayBuffer.empty[Expression]
+      names.zip(values).reverse.foreach { case (name, value) =>
+        if (!newNames.contains(name)) {
+          newNames += name
+          newValues += value
+        }
+      }
+      WithFields(structExpr, names = newNames.reverse.toSeq, valExprs = newValues.reverse.toSeq)

Review comment:
       Okay, so I took a look at the PR you linked and left a related [comment](https://github.com/apache/spark/pull/29587/files#r493098043) there. I don't think you actually need this optimization for `unionByName`. 
   
   This optimization is only useful if someone uses `WithFields` to update the same field multiple times. However, it would simply be better to **not** update the same field multiple times. At the very least, we should not do this when we re-use this Expression internally within Spark. 
   
   Unfortunately, "bad" end-users might still update the same field multiple times. Assuming we should optimize for such users (not sure), since this batch is only applied half-way through the optimization cycle anyway, I think we could just move up the `Batch("ReplaceWithFieldsExpression", Once, ReplaceWithFieldsExpression)` to get the same benefit (which is just simplified tree). What do you reckon?

##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/WithFields.scala
##########
@@ -17,16 +17,29 @@
 
 package org.apache.spark.sql.catalyst.optimizer
 
-import org.apache.spark.sql.catalyst.expressions.WithFields
+import scala.collection.mutable
+
+import org.apache.spark.sql.catalyst.expressions.{Expression, GetStructField, WithFields}
 import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
 import org.apache.spark.sql.catalyst.rules.Rule
 
 
 /**
- * Combines all adjacent [[WithFields]] expression into a single [[WithFields]] expression.
+ * Optimizes [[WithFields]] expression chains.
  */
-object CombineWithFields extends Rule[LogicalPlan] {
+object OptimizeWithFields extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
+    case WithFields(structExpr, names, values) if names.distinct.length != names.length =>
+      val newNames = mutable.ArrayBuffer.empty[String]
+      val newValues = mutable.ArrayBuffer.empty[Expression]
+      names.zip(values).reverse.foreach { case (name, value) =>
+        if (!newNames.contains(name)) {
+          newNames += name
+          newValues += value
+        }
+      }
+      WithFields(structExpr, names = newNames.reverse.toSeq, valExprs = newValues.reverse.toSeq)

Review comment:
       Okay, so I took a look at the PR you linked and left a related [comment](https://github.com/apache/spark/pull/29587/files#r493098043) there. I don't think you actually need this optimization for #29587  
   
   This optimization is only useful if someone uses `WithFields` to update the same field multiple times. However, it would simply be better to **not** update the same field multiple times. At the very least, we should not do this when we re-use this Expression internally within Spark. 
   
   Unfortunately, "bad" end-users might still update the same field multiple times. Assuming we should optimize for such users (not sure), since this batch is only applied half-way through the optimization cycle anyway, I think we could just move up the `Batch("ReplaceWithFieldsExpression", Once, ReplaceWithFieldsExpression)` to get the same benefit (which is just simplified tree). What do you reckon?

##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/WithFields.scala
##########
@@ -17,16 +17,29 @@
 
 package org.apache.spark.sql.catalyst.optimizer
 
-import org.apache.spark.sql.catalyst.expressions.WithFields
+import scala.collection.mutable
+
+import org.apache.spark.sql.catalyst.expressions.{Expression, GetStructField, WithFields}
 import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
 import org.apache.spark.sql.catalyst.rules.Rule
 
 
 /**
- * Combines all adjacent [[WithFields]] expression into a single [[WithFields]] expression.
+ * Optimizes [[WithFields]] expression chains.
  */
-object CombineWithFields extends Rule[LogicalPlan] {
+object OptimizeWithFields extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
+    case WithFields(structExpr, names, values) if names.distinct.length != names.length =>
+      val newNames = mutable.ArrayBuffer.empty[String]
+      val newValues = mutable.ArrayBuffer.empty[Expression]
+      names.zip(values).reverse.foreach { case (name, value) =>
+        if (!newNames.contains(name)) {
+          newNames += name
+          newValues += value
+        }
+      }
+      WithFields(structExpr, names = newNames.reverse.toSeq, valExprs = newValues.reverse.toSeq)

Review comment:
       ahh I see, yes, in the analysis stage this would likely be helpful!
   
   Okay in that case, could this PR wait till #29795 goes in? I'm refactoring `WithFields` so this optimization would need to change anyway. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #29812: [SPARK-32941][SQL] Optimize UpdateFields expression chain and put the rule early in Analysis phase

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #29812:
URL: https://github.com/apache/spark/pull/29812#discussion_r507214407



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/UpdateFields.scala
##########
@@ -17,19 +17,61 @@
 
 package org.apache.spark.sql.catalyst.optimizer
 
-import org.apache.spark.sql.catalyst.expressions.UpdateFields
+import java.util.Locale
+
+import scala.collection.mutable
+
+import org.apache.spark.sql.catalyst.expressions.{Expression, UpdateFields, WithField}
 import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
 import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.internal.SQLConf
 
 
 /**
- * Combines all adjacent [[UpdateFields]] expression into a single [[UpdateFields]] expression.
+ * Optimizes [[UpdateFields]] expression chains.
  */
-object CombineUpdateFields extends Rule[LogicalPlan] {
-  def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
+object OptimizeUpdateFields extends Rule[LogicalPlan] {
+  val optimizeUpdateFields: PartialFunction[Expression, Expression] = {
+    case UpdateFields(structExpr, fieldOps)
+      if fieldOps.forall(_.isInstanceOf[WithField]) &&
+        fieldOps.map(_.asInstanceOf[WithField].name.toLowerCase(Locale.ROOT)).distinct.length !=

Review comment:
       The `if` condition should cover both case-sensitive and case-insensitive cases now. I compare names in lowercase in the condition.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29812: [SPARK-32941][SQL] Optimize UpdateFields expression chain and put the rule early in Analysis phase

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29812:
URL: https://github.com/apache/spark/pull/29812#issuecomment-711625346






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29812: [SPARK-32941][SQL] Optimize WithFields expression chain

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29812:
URL: https://github.com/apache/spark/pull/29812#issuecomment-695522714


   **[Test build #128905 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128905/testReport)** for PR 29812 at commit [`74cf2dd`](https://github.com/apache/spark/commit/74cf2ddd5524a9572193803bf9d753b8c888af7c).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29812: [SPARK-32941][SQL] Optimize UpdateFields expression chain and put the rule early in Analysis phase

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29812:
URL: https://github.com/apache/spark/pull/29812#issuecomment-711447823






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29812: [SPARK-32941][SQL] Optimize WithFields expression chain

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29812:
URL: https://github.com/apache/spark/pull/29812#issuecomment-695492705


   **[Test build #128902 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128902/testReport)** for PR 29812 at commit [`8662d76`](https://github.com/apache/spark/commit/8662d7669f7632d4efb9a201253bc73823c5dfdd).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29812: [SPARK-32941][SQL] Optimize UpdateFields expression chain and put the rule early in Analysis phase

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29812:
URL: https://github.com/apache/spark/pull/29812#issuecomment-711421951






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29812: [SPARK-32941][SQL] Optimize WithFields expression chain

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29812:
URL: https://github.com/apache/spark/pull/29812#issuecomment-697110388


   **[Test build #128996 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128996/testReport)** for PR 29812 at commit [`cb8872c`](https://github.com/apache/spark/commit/cb8872c3e203cf2ae5025f836b1eb53004b0e7f8).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on pull request #29812: [SPARK-32941][SQL] Optimize WithFields expression chain

Posted by GitBox <gi...@apache.org>.
viirya commented on pull request #29812:
URL: https://github.com/apache/spark/pull/29812#issuecomment-695810677






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29812: [SPARK-32941][SQL] Optimize WithFields expression chain

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29812:
URL: https://github.com/apache/spark/pull/29812#issuecomment-696460856






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29812: [SPARK-32941][SQL] Optimize WithFields expression chain

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29812:
URL: https://github.com/apache/spark/pull/29812#issuecomment-696518033






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] fqaiser94 commented on a change in pull request #29812: [SPARK-32941][SQL] Optimize WithFields expression chain

Posted by GitBox <gi...@apache.org>.
fqaiser94 commented on a change in pull request #29812:
URL: https://github.com/apache/spark/pull/29812#discussion_r491929282



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/WithFields.scala
##########
@@ -17,16 +17,29 @@
 
 package org.apache.spark.sql.catalyst.optimizer
 
-import org.apache.spark.sql.catalyst.expressions.WithFields
+import scala.collection.mutable
+
+import org.apache.spark.sql.catalyst.expressions.{Expression, GetStructField, WithFields}
 import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
 import org.apache.spark.sql.catalyst.rules.Rule
 
 
 /**
- * Combines all adjacent [[WithFields]] expression into a single [[WithFields]] expression.
+ * Optimizes [[WithFields]] expression chains.
  */
-object CombineWithFields extends Rule[LogicalPlan] {
+object OptimizeWithFields extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
+    case WithFields(structExpr, names, values) if names.distinct.length != names.length =>
+      val newNames = mutable.ArrayBuffer.empty[String]
+      val newValues = mutable.ArrayBuffer.empty[Expression]
+      names.zip(values).reverse.foreach { case (name, value) =>
+        if (!newNames.contains(name)) {

Review comment:
       should use `resolver` here otherwise I think we will have correct-ness issues.

##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/WithFields.scala
##########
@@ -17,16 +17,29 @@
 
 package org.apache.spark.sql.catalyst.optimizer
 
-import org.apache.spark.sql.catalyst.expressions.WithFields
+import scala.collection.mutable
+
+import org.apache.spark.sql.catalyst.expressions.{Expression, GetStructField, WithFields}
 import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
 import org.apache.spark.sql.catalyst.rules.Rule
 
 
 /**
- * Combines all adjacent [[WithFields]] expression into a single [[WithFields]] expression.
+ * Optimizes [[WithFields]] expression chains.
  */
-object CombineWithFields extends Rule[LogicalPlan] {
+object OptimizeWithFields extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
+    case WithFields(structExpr, names, values) if names.distinct.length != names.length =>
+      val newNames = mutable.ArrayBuffer.empty[String]
+      val newValues = mutable.ArrayBuffer.empty[Expression]
+      names.zip(values).reverse.foreach { case (name, value) =>
+        if (!newNames.contains(name)) {
+          newNames += name
+          newValues += value
+        }
+      }
+      WithFields(structExpr, names = newNames.reverse.toSeq, valExprs = newValues.reverse.toSeq)

Review comment:
       For my understanding, can you explain how we expect to benefit from this optimization? 
   
   I ask because we do this kind of deduplication inside of `WithFields` already as part of the `foldLeft` operation [here](https://github.com/apache/spark/blob/d01594e8d186e63a6c3ce361e756565e830d5237/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala#L578). It will only keep the last `valExpr` for each `name`. So I think the optimized logical plan will be the same with or without this optimization in all scenarios? CMIIW

##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/WithFields.scala
##########
@@ -17,16 +17,29 @@
 
 package org.apache.spark.sql.catalyst.optimizer
 
-import org.apache.spark.sql.catalyst.expressions.WithFields
+import scala.collection.mutable
+
+import org.apache.spark.sql.catalyst.expressions.{Expression, GetStructField, WithFields}
 import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
 import org.apache.spark.sql.catalyst.rules.Rule
 
 
 /**
- * Combines all adjacent [[WithFields]] expression into a single [[WithFields]] expression.
+ * Optimizes [[WithFields]] expression chains.
  */
-object CombineWithFields extends Rule[LogicalPlan] {
+object OptimizeWithFields extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
+    case WithFields(structExpr, names, values) if names.distinct.length != names.length =>

Review comment:
       could this `case` statement be after the next `case` statement? So that we combine the chains first before deduplicating?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on pull request #29812: [SPARK-32941][SQL] Optimize WithFields expression chain

Posted by GitBox <gi...@apache.org>.
viirya commented on pull request #29812:
URL: https://github.com/apache/spark/pull/29812#issuecomment-695810755


   retest this please


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29812: [SPARK-32941][SQL] Optimize WithFields expression chain

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29812:
URL: https://github.com/apache/spark/pull/29812#issuecomment-695811698






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29812: [SPARK-32941][SQL] Optimize UpdateFields expression chain and put the rule early in Analysis phase

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29812:
URL: https://github.com/apache/spark/pull/29812#issuecomment-711623497


   **[Test build #129972 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129972/testReport)** for PR 29812 at commit [`f41900c`](https://github.com/apache/spark/commit/f41900cb3855d56dcb0ba7d5f448482baef1c3fa).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] maropu commented on pull request #29812: [SPARK-32941][SQL] Optimize UpdateFields expression chain and put the rule early in Analysis phase

Posted by GitBox <gi...@apache.org>.
maropu commented on pull request #29812:
URL: https://github.com/apache/spark/pull/29812#issuecomment-713215442


   (late LGTM)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #29812: [SPARK-32941][SQL] Optimize UpdateFields expression chain and put the rule early in Analysis phase

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #29812:
URL: https://github.com/apache/spark/pull/29812#discussion_r507238910



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/UpdateFields.scala
##########
@@ -17,19 +17,61 @@
 
 package org.apache.spark.sql.catalyst.optimizer
 
-import org.apache.spark.sql.catalyst.expressions.UpdateFields
+import java.util.Locale
+
+import scala.collection.mutable
+
+import org.apache.spark.sql.catalyst.expressions.{Expression, UpdateFields, WithField}
 import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
 import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.internal.SQLConf
 
 
 /**
- * Combines all adjacent [[UpdateFields]] expression into a single [[UpdateFields]] expression.
+ * Optimizes [[UpdateFields]] expression chains.
  */
-object CombineUpdateFields extends Rule[LogicalPlan] {
-  def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
+object OptimizeUpdateFields extends Rule[LogicalPlan] {
+  val optimizeUpdateFields: PartialFunction[Expression, Expression] = {
+    case UpdateFields(structExpr, fieldOps)
+      if fieldOps.forall(_.isInstanceOf[WithField]) &&
+        fieldOps.map(_.asInstanceOf[WithField].name.toLowerCase(Locale.ROOT)).distinct.length !=

Review comment:
       No, what I meant is that we don't need to execute line 39~69 at all.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29812: [SPARK-32941][SQL] Optimize WithFields expression chain

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29812:
URL: https://github.com/apache/spark/pull/29812#issuecomment-696461134






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on pull request #29812: [SPARK-32941][SQL] Optimize WithFields expression chain

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #29812:
URL: https://github.com/apache/spark/pull/29812#issuecomment-695915645


   LGTM, cc @fqaiser94


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #29812: [SPARK-32941][SQL] Optimize WithFields expression chain

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29812:
URL: https://github.com/apache/spark/pull/29812#issuecomment-695492705


   **[Test build #128902 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128902/testReport)** for PR 29812 at commit [`8662d76`](https://github.com/apache/spark/commit/8662d7669f7632d4efb9a201253bc73823c5dfdd).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29812: [SPARK-32941][SQL] Optimize WithFields expression chain

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29812:
URL: https://github.com/apache/spark/pull/29812#issuecomment-695749096


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/128901/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on pull request #29812: [SPARK-32941][SQL] Optimize WithFields expression chain

Posted by GitBox <gi...@apache.org>.
viirya commented on pull request #29812:
URL: https://github.com/apache/spark/pull/29812#issuecomment-695810677


   GitHub Actions was passed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29812: [SPARK-32941][SQL] Optimize WithFields expression chain

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29812:
URL: https://github.com/apache/spark/pull/29812#issuecomment-711369142


   **[Test build #129964 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129964/testReport)** for PR 29812 at commit [`82ad8c8`](https://github.com/apache/spark/commit/82ad8c85701aa93b2601e6720ee2ee7661030596).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29812: [SPARK-32941][SQL] Optimize WithFields expression chain

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29812:
URL: https://github.com/apache/spark/pull/29812#issuecomment-695494603






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29812: [SPARK-32941][SQL] Optimize WithFields expression chain

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29812:
URL: https://github.com/apache/spark/pull/29812#issuecomment-695811553


   **[Test build #128918 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128918/testReport)** for PR 29812 at commit [`74cf2dd`](https://github.com/apache/spark/commit/74cf2ddd5524a9572193803bf9d753b8c888af7c).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29812: [SPARK-32941][SQL] Optimize UpdateFields expression chain and put the rule early in Analysis phase

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29812:
URL: https://github.com/apache/spark/pull/29812#issuecomment-711471650


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34580/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29812: [SPARK-32941][SQL] Optimize WithFields expression chain

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29812:
URL: https://github.com/apache/spark/pull/29812#issuecomment-695810832






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a change in pull request #29812: [SPARK-32941][SQL] Optimize UpdateFields expression chain and put the rule early in Analysis phase

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #29812:
URL: https://github.com/apache/spark/pull/29812#discussion_r508195964



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/UpdateFields.scala
##########
@@ -17,19 +17,68 @@
 
 package org.apache.spark.sql.catalyst.optimizer
 
-import org.apache.spark.sql.catalyst.expressions.UpdateFields
+import java.util.Locale
+
+import scala.collection.mutable
+
+import org.apache.spark.sql.catalyst.expressions.{Expression, UpdateFields, WithField}
 import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
 import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.internal.SQLConf
 
 
 /**
- * Combines all adjacent [[UpdateFields]] expression into a single [[UpdateFields]] expression.
+ * Optimizes [[UpdateFields]] expression chains.
  */
-object CombineUpdateFields extends Rule[LogicalPlan] {
-  def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
+object OptimizeUpdateFields extends Rule[LogicalPlan] {
+  private def canOptimize(names: Seq[String]): Boolean = {
+    if (SQLConf.get.caseSensitiveAnalysis) {
+      names.distinct.length != names.length
+    } else {
+      names.map(_.toLowerCase(Locale.ROOT)).distinct.length != names.length
+    }
+  }
+
+  val optimizeUpdateFields: PartialFunction[Expression, Expression] = {
+    case UpdateFields(structExpr, fieldOps)
+      if fieldOps.forall(_.isInstanceOf[WithField]) &&
+        canOptimize(fieldOps.map(_.asInstanceOf[WithField].name)) =>
+      val caseSensitive = SQLConf.get.caseSensitiveAnalysis
+
+      val withFields = fieldOps.map(_.asInstanceOf[WithField])
+      val names = withFields.map(_.name)
+      val values = withFields.map(_.valExpr)
+
+      val newNames = mutable.ArrayBuffer.empty[String]
+      val newValues = mutable.ArrayBuffer.empty[Expression]
+
+      if (caseSensitive) {
+        names.zip(values).reverse.foreach { case (name, value) =>

Review comment:
       I wonder if we could just do like: `collection.immutable.ListMap(names.zip(values): _*)` which will keep the last win here and keep the order of fields to use later. But I guess it's no big deal. Just saying.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #29812: [SPARK-32941][SQL] Optimize UpdateFields expression chain and put the rule early in Analysis phase

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #29812:
URL: https://github.com/apache/spark/pull/29812#issuecomment-712320670


   Merged to master.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29812: [SPARK-32941][SQL] Optimize WithFields expression chain

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29812:
URL: https://github.com/apache/spark/pull/29812#issuecomment-695702607


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/128902/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #29812: [SPARK-32941][SQL] Optimize WithFields expression chain

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29812:
URL: https://github.com/apache/spark/pull/29812#issuecomment-696460856






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29812: [SPARK-32941][SQL] Optimize WithFields expression chain

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29812:
URL: https://github.com/apache/spark/pull/29812#issuecomment-695494603






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #29812: [SPARK-32941][SQL] Optimize WithFields expression chain

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29812:
URL: https://github.com/apache/spark/pull/29812#issuecomment-697020330


   **[Test build #128996 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128996/testReport)** for PR 29812 at commit [`cb8872c`](https://github.com/apache/spark/commit/cb8872c3e203cf2ae5025f836b1eb53004b0e7f8).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29812: [SPARK-32941][SQL] Optimize WithFields expression chain

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29812:
URL: https://github.com/apache/spark/pull/29812#issuecomment-711413494






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #29812: [SPARK-32941][SQL] Optimize WithFields expression chain

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #29812:
URL: https://github.com/apache/spark/pull/29812#discussion_r492414412



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/WithFields.scala
##########
@@ -17,16 +17,29 @@
 
 package org.apache.spark.sql.catalyst.optimizer
 
-import org.apache.spark.sql.catalyst.expressions.WithFields
+import scala.collection.mutable
+
+import org.apache.spark.sql.catalyst.expressions.{Expression, GetStructField, WithFields}
 import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
 import org.apache.spark.sql.catalyst.rules.Rule
 
 
 /**
- * Combines all adjacent [[WithFields]] expression into a single [[WithFields]] expression.
+ * Optimizes [[WithFields]] expression chains.
  */
-object CombineWithFields extends Rule[LogicalPlan] {
+object OptimizeWithFields extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
+    case WithFields(structExpr, names, values) if names.distinct.length != names.length =>

Review comment:
       We don't run this rule just once, so the order should be fine.

##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/WithFields.scala
##########
@@ -17,16 +17,29 @@
 
 package org.apache.spark.sql.catalyst.optimizer
 
-import org.apache.spark.sql.catalyst.expressions.WithFields
+import scala.collection.mutable
+
+import org.apache.spark.sql.catalyst.expressions.{Expression, GetStructField, WithFields}
 import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
 import org.apache.spark.sql.catalyst.rules.Rule
 
 
 /**
- * Combines all adjacent [[WithFields]] expression into a single [[WithFields]] expression.
+ * Optimizes [[WithFields]] expression chains.
  */
-object CombineWithFields extends Rule[LogicalPlan] {
+object OptimizeWithFields extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
+    case WithFields(structExpr, names, values) if names.distinct.length != names.length =>
+      val newNames = mutable.ArrayBuffer.empty[String]
+      val newValues = mutable.ArrayBuffer.empty[Expression]
+      names.zip(values).reverse.foreach { case (name, value) =>
+        if (!newNames.contains(name)) {

Review comment:
       ok.

##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/WithFields.scala
##########
@@ -17,16 +17,29 @@
 
 package org.apache.spark.sql.catalyst.optimizer
 
-import org.apache.spark.sql.catalyst.expressions.WithFields
+import scala.collection.mutable
+
+import org.apache.spark.sql.catalyst.expressions.{Expression, GetStructField, WithFields}
 import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
 import org.apache.spark.sql.catalyst.rules.Rule
 
 
 /**
- * Combines all adjacent [[WithFields]] expression into a single [[WithFields]] expression.
+ * Optimizes [[WithFields]] expression chains.
  */
-object CombineWithFields extends Rule[LogicalPlan] {
+object OptimizeWithFields extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
+    case WithFields(structExpr, names, values) if names.distinct.length != names.length =>
+      val newNames = mutable.ArrayBuffer.empty[String]
+      val newValues = mutable.ArrayBuffer.empty[Expression]
+      names.zip(values).reverse.foreach { case (name, value) =>
+        if (!newNames.contains(name)) {
+          newNames += name
+          newValues += value
+        }
+      }
+      WithFields(structExpr, names = newNames.reverse.toSeq, valExprs = newValues.reverse.toSeq)

Review comment:
       You are right. It is eventually the same. This is coming from improving scalability of #29587. This is applied during I fixed the scalability issue. I found this is useful to reduce the complex of `WithFields` expression tree.
   
   I will run these rules in #29587 to simplify expression tree before optimizer.

##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/WithFields.scala
##########
@@ -17,16 +17,29 @@
 
 package org.apache.spark.sql.catalyst.optimizer
 
-import org.apache.spark.sql.catalyst.expressions.WithFields
+import scala.collection.mutable
+
+import org.apache.spark.sql.catalyst.expressions.{Expression, GetStructField, WithFields}
 import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
 import org.apache.spark.sql.catalyst.rules.Rule
 
 
 /**
- * Combines all adjacent [[WithFields]] expression into a single [[WithFields]] expression.
+ * Optimizes [[WithFields]] expression chains.
  */
-object CombineWithFields extends Rule[LogicalPlan] {
+object OptimizeWithFields extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
+    case WithFields(structExpr, names, values) if names.distinct.length != names.length =>
+      val newNames = mutable.ArrayBuffer.empty[String]
+      val newValues = mutable.ArrayBuffer.empty[Expression]
+      names.zip(values).reverse.foreach { case (name, value) =>
+        if (!newNames.contains(name)) {
+          newNames += name
+          newValues += value
+        }
+      }
+      WithFields(structExpr, names = newNames.reverse.toSeq, valExprs = newValues.reverse.toSeq)

Review comment:
       You are right. It is eventually the same. But for some cases, before we extend `WithFields`, the expression tree might be very complex. This is coming from improving scalability of #29587. This is applied during I fixed the scalability issue. I found this is useful to reduce the complex of `WithFields` expression tree.
   
   I will run these rules in #29587 to simplify expression tree before optimizer.

##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/WithFields.scala
##########
@@ -17,16 +17,29 @@
 
 package org.apache.spark.sql.catalyst.optimizer
 
-import org.apache.spark.sql.catalyst.expressions.WithFields
+import scala.collection.mutable
+
+import org.apache.spark.sql.catalyst.expressions.{Expression, GetStructField, WithFields}
 import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
 import org.apache.spark.sql.catalyst.rules.Rule
 
 
 /**
- * Combines all adjacent [[WithFields]] expression into a single [[WithFields]] expression.
+ * Optimizes [[WithFields]] expression chains.
  */
-object CombineWithFields extends Rule[LogicalPlan] {
+object OptimizeWithFields extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
+    case WithFields(structExpr, names, values) if names.distinct.length != names.length =>
+      val newNames = mutable.ArrayBuffer.empty[String]
+      val newValues = mutable.ArrayBuffer.empty[Expression]
+      names.zip(values).reverse.foreach { case (name, value) =>
+        if (!newNames.contains(name)) {
+          newNames += name
+          newValues += value
+        }
+      }
+      WithFields(structExpr, names = newNames.reverse.toSeq, valExprs = newValues.reverse.toSeq)

Review comment:
       You are right. It is eventually the same. But for some cases, before we extend `WithFields`, the expression tree might be very complex. This is coming from improving scalability of #29587. This is applied during I fixed the scalability issue. I found this is useful to reduce the complex of `WithFields` expression tree.
   
   I will run these rules in #29587 to simplify expression tree before entering optimizer.

##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/WithFields.scala
##########
@@ -17,16 +17,32 @@
 
 package org.apache.spark.sql.catalyst.optimizer
 
-import org.apache.spark.sql.catalyst.expressions.WithFields
+import scala.collection.mutable
+
+import org.apache.spark.sql.catalyst.expressions.{Expression, WithFields}
 import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
 import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.internal.SQLConf
 
 
 /**
- * Combines all adjacent [[WithFields]] expression into a single [[WithFields]] expression.
+ * Optimizes [[WithFields]] expression chains.
  */
-object CombineWithFields extends Rule[LogicalPlan] {
+object OptimizeWithFields extends Rule[LogicalPlan] {
+  lazy val resolver = SQLConf.get.resolver
+
   def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
+    case WithFields(structExpr, names, values) if names.distinct.length != names.length =>
+      val newNames = mutable.ArrayBuffer.empty[String]
+      val newValues = mutable.ArrayBuffer.empty[Expression]
+      names.zip(values).reverse.foreach { case (name, value) =>
+        if (newNames.find(resolver(_, name)).isEmpty) {

Review comment:
       Added a set for case-sensitive case.

##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/WithFields.scala
##########
@@ -17,16 +17,29 @@
 
 package org.apache.spark.sql.catalyst.optimizer
 
-import org.apache.spark.sql.catalyst.expressions.WithFields
+import scala.collection.mutable
+
+import org.apache.spark.sql.catalyst.expressions.{Expression, GetStructField, WithFields}
 import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
 import org.apache.spark.sql.catalyst.rules.Rule
 
 
 /**
- * Combines all adjacent [[WithFields]] expression into a single [[WithFields]] expression.
+ * Optimizes [[WithFields]] expression chains.
  */
-object CombineWithFields extends Rule[LogicalPlan] {
+object OptimizeWithFields extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
+    case WithFields(structExpr, names, values) if names.distinct.length != names.length =>
+      val newNames = mutable.ArrayBuffer.empty[String]
+      val newValues = mutable.ArrayBuffer.empty[Expression]
+      names.zip(values).reverse.foreach { case (name, value) =>
+        if (!newNames.contains(name)) {
+          newNames += name
+          newValues += value
+        }
+      }
+      WithFields(structExpr, names = newNames.reverse.toSeq, valExprs = newValues.reverse.toSeq)

Review comment:
       Actually I'd like to run these rules to simplify `WithFields` tree early in analysis stage. After #29587, I thought that it is very likely to write bad `WithFields` tree. Once hitting that, it is very hard to debug and the analyzer/optimizer spend a lot of time traversing expression tree. So I think it is very useful keep this rule to simplify the expression tree, but I don't think we want to do `ReplaceWithFieldsExpression` in analysis stage.

##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/WithFields.scala
##########
@@ -17,16 +17,29 @@
 
 package org.apache.spark.sql.catalyst.optimizer
 
-import org.apache.spark.sql.catalyst.expressions.WithFields
+import scala.collection.mutable
+
+import org.apache.spark.sql.catalyst.expressions.{Expression, GetStructField, WithFields}
 import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
 import org.apache.spark.sql.catalyst.rules.Rule
 
 
 /**
- * Combines all adjacent [[WithFields]] expression into a single [[WithFields]] expression.
+ * Optimizes [[WithFields]] expression chains.
  */
-object CombineWithFields extends Rule[LogicalPlan] {
+object OptimizeWithFields extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
+    case WithFields(structExpr, names, values) if names.distinct.length != names.length =>
+      val newNames = mutable.ArrayBuffer.empty[String]
+      val newValues = mutable.ArrayBuffer.empty[Expression]
+      names.zip(values).reverse.foreach { case (name, value) =>
+        if (!newNames.contains(name)) {
+          newNames += name
+          newValues += value
+        }
+      }
+      WithFields(structExpr, names = newNames.reverse.toSeq, valExprs = newValues.reverse.toSeq)

Review comment:
       Actually I'd like to run these rules to simplify `WithFields` tree early in analysis stage. During fixing scale issue of #29587, I thought that it is very likely to write bad `WithFields` tree. Once hitting that, it is very hard to debug and the analyzer/optimizer spend a lot of time traversing expression tree. So I think it is very useful keep this rule to simplify the expression tree, but I don't think we want to do `ReplaceWithFieldsExpression` in analysis stage.

##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/WithFields.scala
##########
@@ -17,16 +17,29 @@
 
 package org.apache.spark.sql.catalyst.optimizer
 
-import org.apache.spark.sql.catalyst.expressions.WithFields
+import scala.collection.mutable
+
+import org.apache.spark.sql.catalyst.expressions.{Expression, GetStructField, WithFields}
 import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
 import org.apache.spark.sql.catalyst.rules.Rule
 
 
 /**
- * Combines all adjacent [[WithFields]] expression into a single [[WithFields]] expression.
+ * Optimizes [[WithFields]] expression chains.
  */
-object CombineWithFields extends Rule[LogicalPlan] {
+object OptimizeWithFields extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
+    case WithFields(structExpr, names, values) if names.distinct.length != names.length =>
+      val newNames = mutable.ArrayBuffer.empty[String]
+      val newValues = mutable.ArrayBuffer.empty[Expression]
+      names.zip(values).reverse.foreach { case (name, value) =>
+        if (!newNames.contains(name)) {
+          newNames += name
+          newValues += value
+        }
+      }
+      WithFields(structExpr, names = newNames.reverse.toSeq, valExprs = newValues.reverse.toSeq)

Review comment:
       I'm fine to wait until #29795.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29812: [SPARK-32941][SQL] Optimize UpdateFields expression chain and put the rule early in Analysis phase

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29812:
URL: https://github.com/apache/spark/pull/29812#issuecomment-711421949


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34574/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29812: [SPARK-32941][SQL] Optimize WithFields expression chain

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29812:
URL: https://github.com/apache/spark/pull/29812#issuecomment-696461134






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #29812: [SPARK-32941][SQL] Optimize WithFields expression chain

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29812:
URL: https://github.com/apache/spark/pull/29812#issuecomment-695512726






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29812: [SPARK-32941][SQL] Optimize WithFields expression chain

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29812:
URL: https://github.com/apache/spark/pull/29812#issuecomment-695839268


   **[Test build #128918 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128918/testReport)** for PR 29812 at commit [`74cf2dd`](https://github.com/apache/spark/commit/74cf2ddd5524a9572193803bf9d753b8c888af7c).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29812: [SPARK-32941][SQL] Optimize WithFields expression chain

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29812:
URL: https://github.com/apache/spark/pull/29812#issuecomment-695514525






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29812: [SPARK-32941][SQL] Optimize UpdateFields expression chain and put the rule early in Analysis phase

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29812:
URL: https://github.com/apache/spark/pull/29812#issuecomment-711438636






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #29812: [SPARK-32941][SQL] Optimize UpdateFields expression chain and put the rule early in Analysis phase

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #29812:
URL: https://github.com/apache/spark/pull/29812#discussion_r507324046



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/UpdateFields.scala
##########
@@ -17,19 +17,61 @@
 
 package org.apache.spark.sql.catalyst.optimizer
 
-import org.apache.spark.sql.catalyst.expressions.UpdateFields
+import java.util.Locale
+
+import scala.collection.mutable
+
+import org.apache.spark.sql.catalyst.expressions.{Expression, UpdateFields, WithField}
 import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
 import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.internal.SQLConf
 
 
 /**
- * Combines all adjacent [[UpdateFields]] expression into a single [[UpdateFields]] expression.
+ * Optimizes [[UpdateFields]] expression chains.
  */
-object CombineUpdateFields extends Rule[LogicalPlan] {
-  def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
+object OptimizeUpdateFields extends Rule[LogicalPlan] {
+  val optimizeUpdateFields: PartialFunction[Expression, Expression] = {
+    case UpdateFields(structExpr, fieldOps)
+      if fieldOps.forall(_.isInstanceOf[WithField]) &&
+        fieldOps.map(_.asInstanceOf[WithField].name.toLowerCase(Locale.ROOT)).distinct.length !=

Review comment:
       I see. Updated. Thanks.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29812: [SPARK-32941][SQL] Optimize WithFields expression chain

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29812:
URL: https://github.com/apache/spark/pull/29812#issuecomment-695810832






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29812: [SPARK-32941][SQL] Optimize UpdateFields expression chain and put the rule early in Analysis phase

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29812:
URL: https://github.com/apache/spark/pull/29812#issuecomment-711461907


   **[Test build #129972 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129972/testReport)** for PR 29812 at commit [`f41900c`](https://github.com/apache/spark/commit/f41900cb3855d56dcb0ba7d5f448482baef1c3fa).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29812: [SPARK-32941][SQL] Optimize WithFields expression chain

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29812:
URL: https://github.com/apache/spark/pull/29812#issuecomment-711403356


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34570/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] fqaiser94 commented on a change in pull request #29812: [SPARK-32941][SQL] Optimize WithFields expression chain

Posted by GitBox <gi...@apache.org>.
fqaiser94 commented on a change in pull request #29812:
URL: https://github.com/apache/spark/pull/29812#discussion_r491929282



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/WithFields.scala
##########
@@ -17,16 +17,29 @@
 
 package org.apache.spark.sql.catalyst.optimizer
 
-import org.apache.spark.sql.catalyst.expressions.WithFields
+import scala.collection.mutable
+
+import org.apache.spark.sql.catalyst.expressions.{Expression, GetStructField, WithFields}
 import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
 import org.apache.spark.sql.catalyst.rules.Rule
 
 
 /**
- * Combines all adjacent [[WithFields]] expression into a single [[WithFields]] expression.
+ * Optimizes [[WithFields]] expression chains.
  */
-object CombineWithFields extends Rule[LogicalPlan] {
+object OptimizeWithFields extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
+    case WithFields(structExpr, names, values) if names.distinct.length != names.length =>
+      val newNames = mutable.ArrayBuffer.empty[String]
+      val newValues = mutable.ArrayBuffer.empty[Expression]
+      names.zip(values).reverse.foreach { case (name, value) =>
+        if (!newNames.contains(name)) {

Review comment:
       should use `resolver` here otherwise I think we will have correct-ness issues.

##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/WithFields.scala
##########
@@ -17,16 +17,29 @@
 
 package org.apache.spark.sql.catalyst.optimizer
 
-import org.apache.spark.sql.catalyst.expressions.WithFields
+import scala.collection.mutable
+
+import org.apache.spark.sql.catalyst.expressions.{Expression, GetStructField, WithFields}
 import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
 import org.apache.spark.sql.catalyst.rules.Rule
 
 
 /**
- * Combines all adjacent [[WithFields]] expression into a single [[WithFields]] expression.
+ * Optimizes [[WithFields]] expression chains.
  */
-object CombineWithFields extends Rule[LogicalPlan] {
+object OptimizeWithFields extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
+    case WithFields(structExpr, names, values) if names.distinct.length != names.length =>
+      val newNames = mutable.ArrayBuffer.empty[String]
+      val newValues = mutable.ArrayBuffer.empty[Expression]
+      names.zip(values).reverse.foreach { case (name, value) =>
+        if (!newNames.contains(name)) {
+          newNames += name
+          newValues += value
+        }
+      }
+      WithFields(structExpr, names = newNames.reverse.toSeq, valExprs = newValues.reverse.toSeq)

Review comment:
       For my understanding, can you explain how we expect to benefit from this optimization? 
   
   I ask because we do this kind of deduplication inside of `WithFields` already as part of the `foldLeft` operation [here](https://github.com/apache/spark/blob/d01594e8d186e63a6c3ce361e756565e830d5237/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala#L578). It will only keep the last `valExpr` for each `name`. So I think the optimized logical plan will be the same with or without this optimization in all scenarios? CMIIW

##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/WithFields.scala
##########
@@ -17,16 +17,29 @@
 
 package org.apache.spark.sql.catalyst.optimizer
 
-import org.apache.spark.sql.catalyst.expressions.WithFields
+import scala.collection.mutable
+
+import org.apache.spark.sql.catalyst.expressions.{Expression, GetStructField, WithFields}
 import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
 import org.apache.spark.sql.catalyst.rules.Rule
 
 
 /**
- * Combines all adjacent [[WithFields]] expression into a single [[WithFields]] expression.
+ * Optimizes [[WithFields]] expression chains.
  */
-object CombineWithFields extends Rule[LogicalPlan] {
+object OptimizeWithFields extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
+    case WithFields(structExpr, names, values) if names.distinct.length != names.length =>

Review comment:
       could this `case` statement be after the next `case` statement? So that we combine the chains first before deduplicating?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #29812: [SPARK-32941][SQL] Optimize UpdateFields expression chain and put the rule early in Analysis phase

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29812:
URL: https://github.com/apache/spark/pull/29812#issuecomment-711461907


   **[Test build #129972 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129972/testReport)** for PR 29812 at commit [`f41900c`](https://github.com/apache/spark/commit/f41900cb3855d56dcb0ba7d5f448482baef1c3fa).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #29812: [SPARK-32941][SQL] Optimize WithFields expression chain

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #29812:
URL: https://github.com/apache/spark/pull/29812#discussion_r492561560



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/WithFields.scala
##########
@@ -17,16 +17,32 @@
 
 package org.apache.spark.sql.catalyst.optimizer
 
-import org.apache.spark.sql.catalyst.expressions.WithFields
+import scala.collection.mutable
+
+import org.apache.spark.sql.catalyst.expressions.{Expression, WithFields}
 import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
 import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.internal.SQLConf
 
 
 /**
- * Combines all adjacent [[WithFields]] expression into a single [[WithFields]] expression.
+ * Optimizes [[WithFields]] expression chains.
  */
-object CombineWithFields extends Rule[LogicalPlan] {
+object OptimizeWithFields extends Rule[LogicalPlan] {
+  lazy val resolver = SQLConf.get.resolver
+
   def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
+    case WithFields(structExpr, names, values) if names.distinct.length != names.length =>
+      val newNames = mutable.ArrayBuffer.empty[String]
+      val newValues = mutable.ArrayBuffer.empty[Expression]
+      names.zip(values).reverse.foreach { case (name, value) =>
+        if (newNames.find(resolver(_, name)).isEmpty) {

Review comment:
       this is a bit inefficient. Shall we build a set with lowercased names if case sensitivity is false?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29812: [SPARK-32941][SQL] Optimize WithFields expression chain

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29812:
URL: https://github.com/apache/spark/pull/29812#issuecomment-695753902






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #29812: [SPARK-32941][SQL] Optimize UpdateFields expression chain and put the rule early in Analysis phase

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #29812:
URL: https://github.com/apache/spark/pull/29812#discussion_r507238910



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/UpdateFields.scala
##########
@@ -17,19 +17,61 @@
 
 package org.apache.spark.sql.catalyst.optimizer
 
-import org.apache.spark.sql.catalyst.expressions.UpdateFields
+import java.util.Locale
+
+import scala.collection.mutable
+
+import org.apache.spark.sql.catalyst.expressions.{Expression, UpdateFields, WithField}
 import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
 import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.internal.SQLConf
 
 
 /**
- * Combines all adjacent [[UpdateFields]] expression into a single [[UpdateFields]] expression.
+ * Optimizes [[UpdateFields]] expression chains.
  */
-object CombineUpdateFields extends Rule[LogicalPlan] {
-  def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
+object OptimizeUpdateFields extends Rule[LogicalPlan] {
+  val optimizeUpdateFields: PartialFunction[Expression, Expression] = {
+    case UpdateFields(structExpr, fieldOps)
+      if fieldOps.forall(_.isInstanceOf[WithField]) &&
+        fieldOps.map(_.asInstanceOf[WithField].name.toLowerCase(Locale.ROOT)).distinct.length !=

Review comment:
       No, what I mean is that we don't need to execute line 39~69 at all.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29812: [SPARK-32941][SQL] Optimize UpdateFields expression chain and put the rule early in Analysis phase

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29812:
URL: https://github.com/apache/spark/pull/29812#issuecomment-711447288


   **[Test build #129967 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129967/testReport)** for PR 29812 at commit [`38bdefd`](https://github.com/apache/spark/commit/38bdefde56e039c0380a290b939a82f206487d80).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #29812: [SPARK-32941][SQL] Optimize WithFields expression chain

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #29812:
URL: https://github.com/apache/spark/pull/29812#discussion_r492425468



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/WithFields.scala
##########
@@ -17,16 +17,29 @@
 
 package org.apache.spark.sql.catalyst.optimizer
 
-import org.apache.spark.sql.catalyst.expressions.WithFields
+import scala.collection.mutable
+
+import org.apache.spark.sql.catalyst.expressions.{Expression, GetStructField, WithFields}
 import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
 import org.apache.spark.sql.catalyst.rules.Rule
 
 
 /**
- * Combines all adjacent [[WithFields]] expression into a single [[WithFields]] expression.
+ * Optimizes [[WithFields]] expression chains.
  */
-object CombineWithFields extends Rule[LogicalPlan] {
+object OptimizeWithFields extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
+    case WithFields(structExpr, names, values) if names.distinct.length != names.length =>
+      val newNames = mutable.ArrayBuffer.empty[String]
+      val newValues = mutable.ArrayBuffer.empty[Expression]
+      names.zip(values).reverse.foreach { case (name, value) =>
+        if (!newNames.contains(name)) {
+          newNames += name
+          newValues += value
+        }
+      }
+      WithFields(structExpr, names = newNames.reverse.toSeq, valExprs = newValues.reverse.toSeq)

Review comment:
       You are right. It is eventually the same. But for some cases, before we extend `WithFields`, the expression tree might be very complex. This is coming from improving scalability of #29587. This is applied during I fixed the scalability issue. I found this is useful to reduce the complex of `WithFields` expression tree.
   
   I will run these rules in #29587 to simplify expression tree before optimizer.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29812: [SPARK-32941][SQL] Optimize UpdateFields expression chain and put the rule early in Analysis phase

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29812:
URL: https://github.com/apache/spark/pull/29812#issuecomment-711477116






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29812: [SPARK-32941][SQL] Optimize UpdateFields expression chain and put the rule early in Analysis phase

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29812:
URL: https://github.com/apache/spark/pull/29812#issuecomment-711438383


   **[Test build #129964 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129964/testReport)** for PR 29812 at commit [`82ad8c8`](https://github.com/apache/spark/commit/82ad8c85701aa93b2601e6720ee2ee7661030596).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29812: [SPARK-32941][SQL] Optimize WithFields expression chain

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29812:
URL: https://github.com/apache/spark/pull/29812#issuecomment-697020765






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #29812: [SPARK-32941][SQL] Optimize WithFields expression chain

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #29812:
URL: https://github.com/apache/spark/pull/29812#discussion_r493111576



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/WithFields.scala
##########
@@ -17,16 +17,29 @@
 
 package org.apache.spark.sql.catalyst.optimizer
 
-import org.apache.spark.sql.catalyst.expressions.WithFields
+import scala.collection.mutable
+
+import org.apache.spark.sql.catalyst.expressions.{Expression, GetStructField, WithFields}
 import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
 import org.apache.spark.sql.catalyst.rules.Rule
 
 
 /**
- * Combines all adjacent [[WithFields]] expression into a single [[WithFields]] expression.
+ * Optimizes [[WithFields]] expression chains.
  */
-object CombineWithFields extends Rule[LogicalPlan] {
+object OptimizeWithFields extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
+    case WithFields(structExpr, names, values) if names.distinct.length != names.length =>
+      val newNames = mutable.ArrayBuffer.empty[String]
+      val newValues = mutable.ArrayBuffer.empty[Expression]
+      names.zip(values).reverse.foreach { case (name, value) =>
+        if (!newNames.contains(name)) {
+          newNames += name
+          newValues += value
+        }
+      }
+      WithFields(structExpr, names = newNames.reverse.toSeq, valExprs = newValues.reverse.toSeq)

Review comment:
       Actually I'd like to run these rules to simplify `WithFields` tree early in analysis stage. During fixing scale issue of #29587, I thought that it is very likely to write bad `WithFields` tree. Once hitting that, it is very hard to debug and the analyzer/optimizer spend a lot of time traversing expression tree. So I think it is very useful keep this rule to simplify the expression tree, but I don't think we want to do `ReplaceWithFieldsExpression` in analysis stage.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on pull request #29812: [SPARK-32941][SQL] Optimize UpdateFields expression chain and put the rule early in Analysis phase

Posted by GitBox <gi...@apache.org>.
viirya commented on pull request #29812:
URL: https://github.com/apache/spark/pull/29812#issuecomment-712607215


   @HyukjinKwon Thanks for the suggestion. I updated this PR description.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29812: [SPARK-32941][SQL] Optimize WithFields expression chain

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29812:
URL: https://github.com/apache/spark/pull/29812#issuecomment-697110934






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] fqaiser94 commented on a change in pull request #29812: [SPARK-32941][SQL] Optimize WithFields expression chain

Posted by GitBox <gi...@apache.org>.
fqaiser94 commented on a change in pull request #29812:
URL: https://github.com/apache/spark/pull/29812#discussion_r493106387



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/WithFields.scala
##########
@@ -17,16 +17,29 @@
 
 package org.apache.spark.sql.catalyst.optimizer
 
-import org.apache.spark.sql.catalyst.expressions.WithFields
+import scala.collection.mutable
+
+import org.apache.spark.sql.catalyst.expressions.{Expression, GetStructField, WithFields}
 import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
 import org.apache.spark.sql.catalyst.rules.Rule
 
 
 /**
- * Combines all adjacent [[WithFields]] expression into a single [[WithFields]] expression.
+ * Optimizes [[WithFields]] expression chains.
  */
-object CombineWithFields extends Rule[LogicalPlan] {
+object OptimizeWithFields extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
+    case WithFields(structExpr, names, values) if names.distinct.length != names.length =>
+      val newNames = mutable.ArrayBuffer.empty[String]
+      val newValues = mutable.ArrayBuffer.empty[Expression]
+      names.zip(values).reverse.foreach { case (name, value) =>
+        if (!newNames.contains(name)) {
+          newNames += name
+          newValues += value
+        }
+      }
+      WithFields(structExpr, names = newNames.reverse.toSeq, valExprs = newValues.reverse.toSeq)

Review comment:
       Okay, so I took a look at the PR you linked and left a related [comment](https://github.com/apache/spark/pull/29587/files#r493098043) there. I don't think you actually need this optimization for `unionByName`. 
   
   This optimization is only useful if someone uses `WithFields` to update the same field multiple times. However, it would simply be better to **not** update the same field multiple times. At the very least, we should not do this when we re-use this Expression internally within Spark. 
   
   Unfortunately, "bad" end-users might still update the same field multiple times. Assuming we should optimize for such users (not sure), since this batch is only applied half-way through the optimization cycle, I think we could just move up the `Batch("ReplaceWithFieldsExpression", Once, ReplaceWithFieldsExpression)` to get the same benefit (which is just simplified tree). What do you reckon?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #29812: [SPARK-32941][SQL] Optimize WithFields expression chain

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #29812:
URL: https://github.com/apache/spark/pull/29812#discussion_r492561560



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/WithFields.scala
##########
@@ -17,16 +17,32 @@
 
 package org.apache.spark.sql.catalyst.optimizer
 
-import org.apache.spark.sql.catalyst.expressions.WithFields
+import scala.collection.mutable
+
+import org.apache.spark.sql.catalyst.expressions.{Expression, WithFields}
 import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
 import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.internal.SQLConf
 
 
 /**
- * Combines all adjacent [[WithFields]] expression into a single [[WithFields]] expression.
+ * Optimizes [[WithFields]] expression chains.
  */
-object CombineWithFields extends Rule[LogicalPlan] {
+object OptimizeWithFields extends Rule[LogicalPlan] {
+  lazy val resolver = SQLConf.get.resolver
+
   def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
+    case WithFields(structExpr, names, values) if names.distinct.length != names.length =>
+      val newNames = mutable.ArrayBuffer.empty[String]
+      val newValues = mutable.ArrayBuffer.empty[Expression]
+      names.zip(values).reverse.foreach { case (name, value) =>
+        if (newNames.find(resolver(_, name)).isEmpty) {

Review comment:
       this is a bit inefficient. Shall we build a set with lowercased names if case sensitivity is false?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29812: [SPARK-32941][SQL] Optimize UpdateFields expression chain and put the rule early in Analysis phase

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29812:
URL: https://github.com/apache/spark/pull/29812#issuecomment-711477098


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34580/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29812: [SPARK-32941][SQL] Optimize WithFields expression chain

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29812:
URL: https://github.com/apache/spark/pull/29812#issuecomment-697020765






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #29812: [SPARK-32941][SQL] Optimize UpdateFields expression chain and put the rule early in Analysis phase

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29812:
URL: https://github.com/apache/spark/pull/29812#issuecomment-711413149


   **[Test build #129967 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129967/testReport)** for PR 29812 at commit [`38bdefd`](https://github.com/apache/spark/commit/38bdefde56e039c0380a290b939a82f206487d80).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #29812: [SPARK-32941][SQL] Optimize WithFields expression chain

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #29812:
URL: https://github.com/apache/spark/pull/29812#discussion_r492414412



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/WithFields.scala
##########
@@ -17,16 +17,29 @@
 
 package org.apache.spark.sql.catalyst.optimizer
 
-import org.apache.spark.sql.catalyst.expressions.WithFields
+import scala.collection.mutable
+
+import org.apache.spark.sql.catalyst.expressions.{Expression, GetStructField, WithFields}
 import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
 import org.apache.spark.sql.catalyst.rules.Rule
 
 
 /**
- * Combines all adjacent [[WithFields]] expression into a single [[WithFields]] expression.
+ * Optimizes [[WithFields]] expression chains.
  */
-object CombineWithFields extends Rule[LogicalPlan] {
+object OptimizeWithFields extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
+    case WithFields(structExpr, names, values) if names.distinct.length != names.length =>

Review comment:
       We don't run this rule just once, so the order should be fine.

##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/WithFields.scala
##########
@@ -17,16 +17,29 @@
 
 package org.apache.spark.sql.catalyst.optimizer
 
-import org.apache.spark.sql.catalyst.expressions.WithFields
+import scala.collection.mutable
+
+import org.apache.spark.sql.catalyst.expressions.{Expression, GetStructField, WithFields}
 import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
 import org.apache.spark.sql.catalyst.rules.Rule
 
 
 /**
- * Combines all adjacent [[WithFields]] expression into a single [[WithFields]] expression.
+ * Optimizes [[WithFields]] expression chains.
  */
-object CombineWithFields extends Rule[LogicalPlan] {
+object OptimizeWithFields extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
+    case WithFields(structExpr, names, values) if names.distinct.length != names.length =>
+      val newNames = mutable.ArrayBuffer.empty[String]
+      val newValues = mutable.ArrayBuffer.empty[Expression]
+      names.zip(values).reverse.foreach { case (name, value) =>
+        if (!newNames.contains(name)) {

Review comment:
       ok.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29812: [SPARK-32941][SQL] Optimize WithFields expression chain

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29812:
URL: https://github.com/apache/spark/pull/29812#issuecomment-695811698






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29812: [SPARK-32941][SQL] Optimize WithFields expression chain

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29812:
URL: https://github.com/apache/spark/pull/29812#issuecomment-695702580






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #29812: [SPARK-32941][SQL] Optimize WithFields expression chain

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29812:
URL: https://github.com/apache/spark/pull/29812#issuecomment-695811553


   **[Test build #128918 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128918/testReport)** for PR 29812 at commit [`74cf2dd`](https://github.com/apache/spark/commit/74cf2ddd5524a9572193803bf9d753b8c888af7c).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29812: [SPARK-32941][SQL] Optimize UpdateFields expression chain and put the rule early in Analysis phase

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29812:
URL: https://github.com/apache/spark/pull/29812#issuecomment-711418705


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34574/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29812: [SPARK-32941][SQL] Optimize WithFields expression chain

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29812:
URL: https://github.com/apache/spark/pull/29812#issuecomment-696461134






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] fqaiser94 commented on a change in pull request #29812: [SPARK-32941][SQL] Optimize WithFields expression chain

Posted by GitBox <gi...@apache.org>.
fqaiser94 commented on a change in pull request #29812:
URL: https://github.com/apache/spark/pull/29812#discussion_r493114672



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/WithFields.scala
##########
@@ -17,16 +17,29 @@
 
 package org.apache.spark.sql.catalyst.optimizer
 
-import org.apache.spark.sql.catalyst.expressions.WithFields
+import scala.collection.mutable
+
+import org.apache.spark.sql.catalyst.expressions.{Expression, GetStructField, WithFields}
 import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
 import org.apache.spark.sql.catalyst.rules.Rule
 
 
 /**
- * Combines all adjacent [[WithFields]] expression into a single [[WithFields]] expression.
+ * Optimizes [[WithFields]] expression chains.
  */
-object CombineWithFields extends Rule[LogicalPlan] {
+object OptimizeWithFields extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
+    case WithFields(structExpr, names, values) if names.distinct.length != names.length =>
+      val newNames = mutable.ArrayBuffer.empty[String]
+      val newValues = mutable.ArrayBuffer.empty[Expression]
+      names.zip(values).reverse.foreach { case (name, value) =>
+        if (!newNames.contains(name)) {
+          newNames += name
+          newValues += value
+        }
+      }
+      WithFields(structExpr, names = newNames.reverse.toSeq, valExprs = newValues.reverse.toSeq)

Review comment:
       ahh I see, yes, in the analysis stage this would likely be helpful!
   
   Okay in that case, could this PR wait till #29795 goes in? I'm refactoring `WithFields` so this optimization would need to change anyway. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29812: [SPARK-32941][SQL] Optimize UpdateFields expression chain and put the rule early in Analysis phase

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29812:
URL: https://github.com/apache/spark/pull/29812#issuecomment-711625346






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] fqaiser94 commented on a change in pull request #29812: [SPARK-32941][SQL] Optimize WithFields expression chain

Posted by GitBox <gi...@apache.org>.
fqaiser94 commented on a change in pull request #29812:
URL: https://github.com/apache/spark/pull/29812#discussion_r493106387



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/WithFields.scala
##########
@@ -17,16 +17,29 @@
 
 package org.apache.spark.sql.catalyst.optimizer
 
-import org.apache.spark.sql.catalyst.expressions.WithFields
+import scala.collection.mutable
+
+import org.apache.spark.sql.catalyst.expressions.{Expression, GetStructField, WithFields}
 import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
 import org.apache.spark.sql.catalyst.rules.Rule
 
 
 /**
- * Combines all adjacent [[WithFields]] expression into a single [[WithFields]] expression.
+ * Optimizes [[WithFields]] expression chains.
  */
-object CombineWithFields extends Rule[LogicalPlan] {
+object OptimizeWithFields extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
+    case WithFields(structExpr, names, values) if names.distinct.length != names.length =>
+      val newNames = mutable.ArrayBuffer.empty[String]
+      val newValues = mutable.ArrayBuffer.empty[Expression]
+      names.zip(values).reverse.foreach { case (name, value) =>
+        if (!newNames.contains(name)) {
+          newNames += name
+          newValues += value
+        }
+      }
+      WithFields(structExpr, names = newNames.reverse.toSeq, valExprs = newValues.reverse.toSeq)

Review comment:
       Okay, so I took a look at the PR you linked and left a related [comment](https://github.com/apache/spark/pull/29587/files#r493098043) there. I don't think you actually need this optimization for `unionByName`. 
   
   This optimization is only useful if someone uses `WithFields` to update the same field multiple times. However, it would simply be better to **not** update the same field multiple times. At the very least, we should not do this when we re-use this Expression internally within Spark. 
   
   Unfortunately, "bad" end-users might still update the same field multiple times. Assuming we should optimize for such users (not sure), since this batch is only applied half-way through the optimization cycle anyway, I think we could just move up the `Batch("ReplaceWithFieldsExpression", Once, ReplaceWithFieldsExpression)` to get the same benefit (which is just simplified tree). What do you reckon?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29812: [SPARK-32941][SQL] Optimize WithFields expression chain

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29812:
URL: https://github.com/apache/spark/pull/29812#issuecomment-711413490


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34570/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on pull request #29812: [SPARK-32941][SQL] Optimize WithFields expression chain

Posted by GitBox <gi...@apache.org>.
viirya commented on pull request #29812:
URL: https://github.com/apache/spark/pull/29812#issuecomment-695810703


   cc @cloud-fan @dongjoon-hyun @maropu 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun closed pull request #29812: [SPARK-32941][SQL] Optimize UpdateFields expression chain and put the rule early in Analysis phase

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun closed pull request #29812:
URL: https://github.com/apache/spark/pull/29812


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #29812: [SPARK-32941][SQL] Optimize UpdateFields expression chain and put the rule early in Analysis phase

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #29812:
URL: https://github.com/apache/spark/pull/29812#discussion_r507209842



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/UpdateFields.scala
##########
@@ -17,19 +17,61 @@
 
 package org.apache.spark.sql.catalyst.optimizer
 
-import org.apache.spark.sql.catalyst.expressions.UpdateFields
+import java.util.Locale
+
+import scala.collection.mutable
+
+import org.apache.spark.sql.catalyst.expressions.{Expression, UpdateFields, WithField}
 import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
 import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.internal.SQLConf
 
 
 /**
- * Combines all adjacent [[UpdateFields]] expression into a single [[UpdateFields]] expression.
+ * Optimizes [[UpdateFields]] expression chains.
  */
-object CombineUpdateFields extends Rule[LogicalPlan] {
-  def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
+object OptimizeUpdateFields extends Rule[LogicalPlan] {
+  val optimizeUpdateFields: PartialFunction[Expression, Expression] = {
+    case UpdateFields(structExpr, fieldOps)
+      if fieldOps.forall(_.isInstanceOf[WithField]) &&
+        fieldOps.map(_.asInstanceOf[WithField].name.toLowerCase(Locale.ROOT)).distinct.length !=

Review comment:
       In case of `case-sensitive` mode, this seems to allow unnecessarily computation. Can we improve this `if` statement to handle both `case-sensitive` and `case-insensitive` together?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29812: [SPARK-32941][SQL] Optimize WithFields expression chain

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29812:
URL: https://github.com/apache/spark/pull/29812#issuecomment-697110934






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29812: [SPARK-32941][SQL] Optimize WithFields expression chain

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29812:
URL: https://github.com/apache/spark/pull/29812#issuecomment-711413494






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29812: [SPARK-32941][SQL] Optimize WithFields expression chain

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29812:
URL: https://github.com/apache/spark/pull/29812#issuecomment-695810832






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29812: [SPARK-32941][SQL] Optimize UpdateFields expression chain and put the rule early in Analysis phase

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29812:
URL: https://github.com/apache/spark/pull/29812#issuecomment-711477116






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29812: [SPARK-32941][SQL] Optimize WithFields expression chain

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29812:
URL: https://github.com/apache/spark/pull/29812#issuecomment-695839443






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29812: [SPARK-32941][SQL] Optimize WithFields expression chain

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29812:
URL: https://github.com/apache/spark/pull/29812#issuecomment-695748996


   **[Test build #128901 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128901/testReport)** for PR 29812 at commit [`4182311`](https://github.com/apache/spark/commit/418231180deee557cd86cc54972400a7349fe13f).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on pull request #29812: [SPARK-32941][SQL] Optimize WithFields expression chain

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #29812:
URL: https://github.com/apache/spark/pull/29812#issuecomment-695915645


   LGTM, cc @fqaiser94


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29812: [SPARK-32941][SQL] Optimize WithFields expression chain

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29812:
URL: https://github.com/apache/spark/pull/29812#issuecomment-695753920


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/128904/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29812: [SPARK-32941][SQL] Optimize WithFields expression chain

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29812:
URL: https://github.com/apache/spark/pull/29812#issuecomment-711413149


   **[Test build #129967 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129967/testReport)** for PR 29812 at commit [`38bdefd`](https://github.com/apache/spark/commit/38bdefde56e039c0380a290b939a82f206487d80).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29812: [SPARK-32941][SQL] Optimize UpdateFields expression chain and put the rule early in Analysis phase

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29812:
URL: https://github.com/apache/spark/pull/29812#issuecomment-711438636






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29812: [SPARK-32941][SQL] Optimize WithFields expression chain

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29812:
URL: https://github.com/apache/spark/pull/29812#issuecomment-695753902






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on pull request #29812: [SPARK-32941][SQL] Optimize UpdateFields expression chain and put the rule early in Analysis phase

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #29812:
URL: https://github.com/apache/spark/pull/29812#issuecomment-712573143


   @viirya, BTW, do you mind fixing the PR description to explain what this PR specifically improves?
   
   > This patch proposes to add more optimization to `UpdateFields` expression chain.
   
   Seems like this PR does not describe what exactly optimizes. Is my understanding correct that this PR proposes two separate optimizations?
   
   - Deduplicates `WithField` at `UpdateFields`
   - Respect nullability in input struct at `GetStructField(UpdateFields(..., struct))`, and unwrap if-else.
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org