Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/09/13 17:29:53 UTC

[GitHub] [spark] sarutak opened a new pull request #33981: [SPARK-36733][SQL] Fix a perf issue in SchemaPruning when a struct has many fields

sarutak opened a new pull request #33981:
URL: https://github.com/apache/spark/pull/33981


   ### What changes were proposed in this pull request?
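   This PR fixes a performance issue in `SchemaPruning` when a struct has many fields (e.g. more than 10K fields).
   The root cause is that `SchemaPruning.sortLeftFieldsByRight` scans `leftStruct.fieldNames` linearly for every field name in `rightStruct`, which takes O(N^2) time overall: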
   ```
    val filteredRightFieldNames = rightStruct.fieldNames
       .filter(name => leftStruct.fieldNames.exists(resolver(_, name))) 
   ```
   
   To fix this issue, this PR proposes using a `TreeMap` so that each field lookup takes O(log N) time.
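   
   As a rough illustration of the idea (not the code in this PR), the lookup can be sketched in plain Scala without Spark. `Field`, `caseInsensitive`, and `filterByLeft` are hypothetical names introduced here, and the ordering is only an assumed stand-in for whatever resolver-aware ordering the real change uses.
   
   ```scala
   import java.util.Locale
   import scala.collection.immutable.TreeMap
   
   object TreeMapLookupSketch {
     // Hypothetical stand-in for a struct field; not Spark's StructField.
     final case class Field(name: String, dataType: String)
   
     // Assumed case-insensitive ordering on field names.
     val caseInsensitive: Ordering[String] =
       Ordering.by((s: String) => s.toLowerCase(Locale.ROOT))
   
     // Keep the right-hand names that also exist on the left, in right-hand
     // order, carrying the left-hand type. Building the map is O(N log N) and
     // each lookup is O(log N), instead of a linear scan per field.
     def filterByLeft(left: Seq[Field], right: Seq[Field]): Seq[Field] = {
       val leftByName = TreeMap(left.map(f => f.name -> f): _*)(caseInsensitive)
       right.flatMap { r =>
         leftByName.get(r.name).map(l => Field(r.name, l.dataType))
       }
     }
   }
   ```
   
   With N fields on each side, this does N lookups of O(log N) each rather than N linear scans.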
   
   ### Why are the changes needed?
   To fix a perf issue.
   
   ### Does this PR introduce _any_ user-facing change?
   No. The logic should be identical.
   
   ### How was this patch tested?
   I confirmed that the following microbenchmark finishes within a few seconds.
   ```
   import org.apache.spark.sql.catalyst.expressions.SchemaPruning
   import org.apache.spark.sql.types._
   
   var struct1 = new StructType()
   (1 to 50000).foreach { i =>
     struct1 = struct1.add(new StructField(i + "", IntegerType))
   }
   
   var struct2 = new StructType()
   (50001 to 100000).foreach { i =>
     struct2 = struct2.add(new StructField(i + "", IntegerType))
   }
   
   SchemaPruning.sortLeftFieldsByRight(struct1, struct2)
   SchemaPruning.sortLeftFieldsByRight(struct2, struct2)
   ```
   
   The correctness should be checked by existing tests.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #33981: [SPARK-36733][SQL] Fix a perf issue in SchemaPruning when a struct has many fields

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #33981:
URL: https://github.com/apache/spark/pull/33981#issuecomment-918965059


   **[Test build #143250 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143250/testReport)** for PR 33981 at commit [`bff1505`](https://github.com/apache/spark/commit/bff1505812e521e2365abd6d7ae8787298c71389).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] SparkQA commented on pull request #33981: [SPARK-36733][SQL] Fix a perf issue in SchemaPruning when a struct has many fields

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #33981:
URL: https://github.com/apache/spark/pull/33981#issuecomment-918629850


   **[Test build #143216 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143216/testReport)** for PR 33981 at commit [`80bf864`](https://github.com/apache/spark/commit/80bf864d0aca12f883ff1dad70ba81e6c73aedd7).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] SparkQA commented on pull request #33981: [SPARK-36733][SQL] Fix a perf issue in SchemaPruning when a struct has many fields

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #33981:
URL: https://github.com/apache/spark/pull/33981#issuecomment-918944846


   **[Test build #143251 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143251/testReport)** for PR 33981 at commit [`eeb0dc6`](https://github.com/apache/spark/commit/eeb0dc68a9f55a1ea2fa1270f9831144a39ec67f).



[GitHub] [spark] dongjoon-hyun commented on pull request #33981: [SPARK-36733][SQL] Fix a perf issue in SchemaPruning when a struct has many fields

Posted by GitBox <gi...@apache.org>.

dongjoon-hyun commented on pull request #33981:
URL: https://github.com/apache/spark/pull/33981#issuecomment-918782803


   cc @sunchao 



[GitHub] [spark] AmplabJenkins commented on pull request #33981: [SPARK-36733][SQL] Fix a perf issue in SchemaPruning when a struct has many fields

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #33981:
URL: https://github.com/apache/spark/pull/33981#issuecomment-918990912


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/47754/
   



[GitHub] [spark] SparkQA commented on pull request #33981: [SPARK-36733][SQL] Fix a perf issue in SchemaPruning when a struct has many fields

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #33981:
URL: https://github.com/apache/spark/pull/33981#issuecomment-919037223


   **[Test build #143254 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143254/testReport)** for PR 33981 at commit [`eeb0dc6`](https://github.com/apache/spark/commit/eeb0dc68a9f55a1ea2fa1270f9831144a39ec67f).



[GitHub] [spark] HyukjinKwon commented on pull request #33981: [SPARK-36733][SQL] Fix a perf issue in SchemaPruning when a struct has many fields

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on pull request #33981:
URL: https://github.com/apache/spark/pull/33981#issuecomment-919021724


   retest this please



[GitHub] [spark] AmplabJenkins commented on pull request #33981: [SPARK-36733][SQL] Fix a perf issue in SchemaPruning when a struct has many fields

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #33981:
URL: https://github.com/apache/spark/pull/33981#issuecomment-919165999


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143251/
   



[GitHub] [spark] AmplabJenkins commented on pull request #33981: [SPARK-36733][SQL] Fix a perf issue in SchemaPruning when a struct has many fields

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #33981:
URL: https://github.com/apache/spark/pull/33981#issuecomment-918636832


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143216/
   



[GitHub] [spark] AmplabJenkins removed a comment on pull request #33981: [SPARK-36733][SQL] Fix a perf issue in SchemaPruning when a struct has many fields

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #33981:
URL: https://github.com/apache/spark/pull/33981#issuecomment-918965626







[GitHub] [spark] HyukjinKwon commented on pull request #33981: [SPARK-36733][SQL] Fix a perf issue in SchemaPruning when a struct has many fields

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on pull request #33981:
URL: https://github.com/apache/spark/pull/33981#issuecomment-919627843


   Merged to master.



[GitHub] [spark] sunchao commented on a change in pull request #33981: [SPARK-36733][SQL] Fix a perf issue in SchemaPruning when a struct has many fields

Posted by GitBox <gi...@apache.org>.

sunchao commented on a change in pull request #33981:
URL: https://github.com/apache/spark/pull/33981#discussion_r707963448



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SchemaPruning.scala
##########
@@ -65,16 +67,19 @@ object SchemaPruning extends SQLConfHelper {
           sortLeftFieldsByRight(leftValueType, rightValueType),
           containsNull)
       case (leftStruct: StructType, rightStruct: StructType) =>
-        val resolver = conf.resolver
-        val filteredRightFieldNames = rightStruct.fieldNames
-          .filter(name => leftStruct.fieldNames.exists(resolver(_, name)))
-        val sortedLeftFields = filteredRightFieldNames.map { fieldName =>
-          val resolvedLeftStruct = leftStruct.find(p => resolver(p.name, fieldName)).get
-          val leftFieldType = resolvedLeftStruct.dataType
-          val rightFieldType = rightStruct(fieldName).dataType
-          val sortedLeftFieldType = sortLeftFieldsByRight(leftFieldType, rightFieldType)
-          StructField(fieldName, sortedLeftFieldType, nullable = resolvedLeftStruct.nullable,
-            metadata = resolvedLeftStruct.metadata)
+        val leftStructTreeMap =
+          TreeMap(leftStruct.map(_.name).zip(leftStruct): _*)(conf.fieldNameOrdering)

Review comment:
       I wonder if we can use a `HashMap` for this:
   
   ```scala
     private def formatFieldName(name: String): String =
       if (conf.caseSensitiveAnalysis) name else name.toLowerCase(Locale.ROOT)
   
     ...
   
     val leftStructTreeMap =
       HashMap(leftStruct.map(f => formatFieldName(f.name)).zip(leftStruct): _*)
   ```
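   
   A self-contained sketch of that suggestion, in plain Scala without Spark, might look like the following. `Field`, `indexByName`, `lookup`, and the `caseSensitive` parameter are hypothetical names used only for illustration, and the sketch assumes that case-insensitive resolution is equivalent to comparing lower-cased names.
   
   ```scala
   import java.util.Locale
   import scala.collection.immutable.HashMap
   
   object HashMapLookupSketch {
     // Hypothetical stand-in for a struct field; not Spark's StructField.
     final case class Field(name: String, dataType: String)
   
     // `caseSensitive` stands in for the session's case-sensitivity setting.
     def formatFieldName(name: String, caseSensitive: Boolean): String =
       if (caseSensitive) name else name.toLowerCase(Locale.ROOT)
   
     // Index fields by their normalized name.
     def indexByName(fields: Seq[Field], caseSensitive: Boolean): HashMap[String, Field] =
       HashMap(fields.map(f => formatFieldName(f.name, caseSensitive) -> f): _*)
   
     // Normalize the probe name the same way before looking it up.
     def lookup(index: HashMap[String, Field], name: String,
         caseSensitive: Boolean): Option[Field] =
       index.get(formatFieldName(name, caseSensitive))
   }
   ```
   
   The trade-off is that every key and every probe must be normalized the same way, whereas a `TreeMap` pushes that concern into its `Ordering`.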





[GitHub] [spark] SparkQA commented on pull request #33981: [SPARK-36733][SQL] Fix a perf issue in SchemaPruning when a struct has many fields

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #33981:
URL: https://github.com/apache/spark/pull/33981#issuecomment-918984433


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47754/
   



[GitHub] [spark] SparkQA commented on pull request #33981: [SPARK-36733][SQL] Fix a perf issue in SchemaPruning when a struct has many fields

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #33981:
URL: https://github.com/apache/spark/pull/33981#issuecomment-919070183


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47757/
   



[GitHub] [spark] AmplabJenkins removed a comment on pull request #33981: [SPARK-36733][SQL] Fix a perf issue in SchemaPruning when a struct has many fields

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #33981:
URL: https://github.com/apache/spark/pull/33981#issuecomment-919275298


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143254/
   



[GitHub] [spark] AmplabJenkins commented on pull request #33981: [SPARK-36733][SQL] Fix a perf issue in SchemaPruning when a struct has many fields

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #33981:
URL: https://github.com/apache/spark/pull/33981#issuecomment-918473336


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/47718/
   



[GitHub] [spark] AmplabJenkins commented on pull request #33981: [SPARK-36733][SQL] Fix a perf issue in SchemaPruning when a struct has many fields

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #33981:
URL: https://github.com/apache/spark/pull/33981#issuecomment-918965626


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143250/
   



[GitHub] [spark] SparkQA removed a comment on pull request #33981: [SPARK-36733][SQL] Fix a perf issue in SchemaPruning when a struct has many fields

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #33981:
URL: https://github.com/apache/spark/pull/33981#issuecomment-918899848


   **[Test build #143250 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143250/testReport)** for PR 33981 at commit [`bff1505`](https://github.com/apache/spark/commit/bff1505812e521e2365abd6d7ae8787298c71389).



[GitHub] [spark] AmplabJenkins removed a comment on pull request #33981: [SPARK-36733][SQL] Fix a perf issue in SchemaPruning when a struct has many fields

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #33981:
URL: https://github.com/apache/spark/pull/33981#issuecomment-918473336


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/47718/
   



[GitHub] [spark] taroplus commented on a change in pull request #33981: [SPARK-36733][SQL] Fix a perf issue in SchemaPruning when a struct has many fields

Posted by GitBox <gi...@apache.org>.

taroplus commented on a change in pull request #33981:
URL: https://github.com/apache/spark/pull/33981#discussion_r707887231



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SchemaPruning.scala
##########
@@ -65,16 +67,19 @@ object SchemaPruning extends SQLConfHelper {
           sortLeftFieldsByRight(leftValueType, rightValueType),
           containsNull)
       case (leftStruct: StructType, rightStruct: StructType) =>
-        val resolver = conf.resolver
-        val filteredRightFieldNames = rightStruct.fieldNames
-          .filter(name => leftStruct.fieldNames.exists(resolver(_, name)))
-        val sortedLeftFields = filteredRightFieldNames.map { fieldName =>
-          val resolvedLeftStruct = leftStruct.find(p => resolver(p.name, fieldName)).get
-          val leftFieldType = resolvedLeftStruct.dataType
-          val rightFieldType = rightStruct(fieldName).dataType
-          val sortedLeftFieldType = sortLeftFieldsByRight(leftFieldType, rightFieldType)
-          StructField(fieldName, sortedLeftFieldType, nullable = resolvedLeftStruct.nullable,
-            metadata = resolvedLeftStruct.metadata)
+        val leftStructTreeMap =

Review comment:
       Based on my use cases, left and right are often the same thing. Can we have a condition like
   ```
   case (_, _) if left == right => left
   ```
   I'm not sure we need to recreate the whole thing in those situations.
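   
   To make the short-circuit concrete, here is a minimal sketch on a toy type model rather than Spark's `DataType`. `DType`, `IntT`, `StructT`, and this `sortLeftByRight` are hypothetical and only mirror the shape of the real function. The equality check still walks both structures, but it avoids building a lookup map and a new struct when nothing needs reordering.
   
   ```scala
   object ShortCircuitSketch {
     // Toy type model; not Spark's DataType hierarchy.
     sealed trait DType
     case object IntT extends DType
     final case class StructT(fields: Seq[(String, DType)]) extends DType
   
     def sortLeftByRight(left: DType, right: DType): DType = (left, right) match {
       // Identical schemas: return the left side unchanged.
       case _ if left == right => left
       case (StructT(l), StructT(r)) =>
         val leftByName = l.toMap
         // Keep right-hand names that exist on the left, in right-hand order.
         StructT(r.collect {
           case (name, rt) if leftByName.contains(name) =>
             name -> sortLeftByRight(leftByName(name), rt)
         })
       case _ => left
     }
   }
   ```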





[GitHub] [spark] SparkQA removed a comment on pull request #33981: [SPARK-36733][SQL] Fix a perf issue in SchemaPruning when a struct has many fields

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #33981:
URL: https://github.com/apache/spark/pull/33981#issuecomment-918944846


   **[Test build #143251 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143251/testReport)** for PR 33981 at commit [`eeb0dc6`](https://github.com/apache/spark/commit/eeb0dc68a9f55a1ea2fa1270f9831144a39ec67f).



[GitHub] [spark] SparkQA commented on pull request #33981: [SPARK-36733][SQL] Fix a perf issue in SchemaPruning when a struct has many fields

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #33981:
URL: https://github.com/apache/spark/pull/33981#issuecomment-918936268


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47753/
   



[GitHub] [spark] AmplabJenkins removed a comment on pull request #33981: [SPARK-36733][SQL] Fix a perf issue in SchemaPruning when a struct has many fields

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #33981:
URL: https://github.com/apache/spark/pull/33981#issuecomment-919165999


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143251/
   



[GitHub] [spark] SparkQA commented on pull request #33981: [SPARK-36733][SQL] Fix a perf issue in SchemaPruning when a struct has many fields

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #33981:
URL: https://github.com/apache/spark/pull/33981#issuecomment-918466217


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47718/
   



[GitHub] [spark] AmplabJenkins removed a comment on pull request #33981: [SPARK-36733][SQL] Fix a perf issue in SchemaPruning when a struct has many fields

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #33981:
URL: https://github.com/apache/spark/pull/33981#issuecomment-918942411


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/47753/
   



[GitHub] [spark] SparkQA commented on pull request #33981: [SPARK-36733][SQL] Fix a perf issue in SchemaPruning when a struct has many fields

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #33981:
URL: https://github.com/apache/spark/pull/33981#issuecomment-919270920


   **[Test build #143254 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143254/testReport)** for PR 33981 at commit [`eeb0dc6`](https://github.com/apache/spark/commit/eeb0dc68a9f55a1ea2fa1270f9831144a39ec67f).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] AmplabJenkins commented on pull request #33981: [SPARK-36733][SQL] Fix a perf issue in SchemaPruning when a struct has many fields

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #33981:
URL: https://github.com/apache/spark/pull/33981#issuecomment-919275298


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143254/
   



[GitHub] [spark] SparkQA commented on pull request #33981: [SPARK-36733][SQL] Fix a perf issue in SchemaPruning when a struct has many fields

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #33981:
URL: https://github.com/apache/spark/pull/33981#issuecomment-919076897


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47757/
   



[GitHub] [spark] SparkQA commented on pull request #33981: [SPARK-36733][SQL] Fix a perf issue in SchemaPruning when a struct has many fields

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #33981:
URL: https://github.com/apache/spark/pull/33981#issuecomment-918930808


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47753/
   



[GitHub] [spark] SparkQA commented on pull request #33981: [SPARK-36733][SQL] Fix a perf issue in SchemaPruning when a struct has many fields

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #33981:
URL: https://github.com/apache/spark/pull/33981#issuecomment-919163817


   **[Test build #143251 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143251/testReport)** for PR 33981 at commit [`eeb0dc6`](https://github.com/apache/spark/commit/eeb0dc68a9f55a1ea2fa1270f9831144a39ec67f).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] SparkQA commented on pull request #33981: [SPARK-36733][SQL] Fix a perf issue in SchemaPruning when a struct has many fields

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #33981:
URL: https://github.com/apache/spark/pull/33981#issuecomment-918990884


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47754/
   



[GitHub] [spark] SparkQA commented on pull request #33981: [SPARK-36733][SQL] Fix a perf issue in SchemaPruning when a struct has many fields

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #33981:
URL: https://github.com/apache/spark/pull/33981#issuecomment-918424792


   **[Test build #143216 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143216/testReport)** for PR 33981 at commit [`80bf864`](https://github.com/apache/spark/commit/80bf864d0aca12f883ff1dad70ba81e6c73aedd7).



[GitHub] [spark] AmplabJenkins removed a comment on pull request #33981: [SPARK-36733][SQL] Fix a perf issue in SchemaPruning when a struct has many fields

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #33981:
URL: https://github.com/apache/spark/pull/33981#issuecomment-918636832


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143216/
   



[GitHub] [spark] SparkQA removed a comment on pull request #33981: [SPARK-36733][SQL] Fix a perf issue in SchemaPruning when a struct has many fields

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #33981:
URL: https://github.com/apache/spark/pull/33981#issuecomment-918424792


   **[Test build #143216 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143216/testReport)** for PR 33981 at commit [`80bf864`](https://github.com/apache/spark/commit/80bf864d0aca12f883ff1dad70ba81e6c73aedd7).



[GitHub] [spark] AmplabJenkins commented on pull request #33981: [SPARK-36733][SQL] Fix a perf issue in SchemaPruning when a struct has many fields

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #33981:
URL: https://github.com/apache/spark/pull/33981#issuecomment-919076939


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/47757/
   



[GitHub] [spark] taroplus commented on a change in pull request #33981: [SPARK-36733][SQL] Fix a perf issue in SchemaPruning when a struct has many fields

Posted by GitBox <gi...@apache.org>.

taroplus commented on a change in pull request #33981:
URL: https://github.com/apache/spark/pull/33981#discussion_r707887231



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SchemaPruning.scala
##########
@@ -65,16 +67,19 @@ object SchemaPruning extends SQLConfHelper {
           sortLeftFieldsByRight(leftValueType, rightValueType),
           containsNull)
       case (leftStruct: StructType, rightStruct: StructType) =>
-        val resolver = conf.resolver
-        val filteredRightFieldNames = rightStruct.fieldNames
-          .filter(name => leftStruct.fieldNames.exists(resolver(_, name)))
-        val sortedLeftFields = filteredRightFieldNames.map { fieldName =>
-          val resolvedLeftStruct = leftStruct.find(p => resolver(p.name, fieldName)).get
-          val leftFieldType = resolvedLeftStruct.dataType
-          val rightFieldType = rightStruct(fieldName).dataType
-          val sortedLeftFieldType = sortLeftFieldsByRight(leftFieldType, rightFieldType)
-          StructField(fieldName, sortedLeftFieldType, nullable = resolvedLeftStruct.nullable,
-            metadata = resolvedLeftStruct.metadata)
+        val leftStructTreeMap =

Review comment:
       In my use cases, left and right are often identical. Can we add a condition like
   ```
   case _ if left == right => left
   ```
   I'm not sure we need to rebuild the whole structure in those situations.
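
A minimal sketch of the suggested fast path, for illustration only. It does not use Spark's classes: `DType`, `Field`, `Struct`, and `SortSketch` are hypothetical stand-ins, and field names are matched exactly rather than through the configured resolver.

```scala
// Simplified stand-ins for a schema type hierarchy (hypothetical, not Spark's classes).
sealed trait DType
case object IntT extends DType
case object StrT extends DType
final case class Field(name: String, dataType: DType)
final case class Struct(fields: Vector[Field]) extends DType

object SortSketch {
  // Reorders the fields of `left` to follow the field order of `right`,
  // dropping right-hand fields that have no counterpart in `left`.
  def sortLeftByRight(left: DType, right: DType): DType = (left, right) match {
    // Fast path: structurally equal schemas are returned as-is, with no rebuilding.
    case _ if left == right => left
    case (l: Struct, r: Struct) =>
      // Exact-name index; an assumption of this sketch, not the resolver-based matching.
      val byName = l.fields.map(f => f.name -> f).toMap
      Struct(r.fields.flatMap { rf =>
        byName.get(rf.name).map { lf =>
          Field(lf.name, sortLeftByRight(lf.dataType, rf.dataType))
        }
      })
    case _ => left
  }
}
```

Note that the equality check itself walks both schemas, so it is not free when the two sides differ; it pays off when they are frequently the same, as described above.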





[GitHub] [spark] taroplus commented on a change in pull request #33981: [SPARK-36733][SQL] Fix a perf issue in SchemaPruning when a struct has many fields

Posted by GitBox <gi...@apache.org>.

taroplus commented on a change in pull request #33981:
URL: https://github.com/apache/spark/pull/33981#discussion_r707907989



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##########
@@ -3768,6 +3768,24 @@ class SQLConf extends Serializable with Logging {
     }
   }
 
+  /**
+   * Returns the [[Comparator]] for the current configuration,
+   * which can be used to compare two identifiers.
+   */
+  private[sql] def comparator: Comparator = {
+    if (caseSensitiveAnalysis) {
+      org.apache.spark.sql.catalyst.analysis.caseSensitiveComparator
+    } else {
+      org.apache.spark.sql.catalyst.analysis.caseInsensitiveComparator
+    }
+  }
+
+  private[sql] val fieldNameOrdering = new Ordering[String] {

Review comment:
       Not sure how much difference it makes, but checking the config (caseSensitiveAnalysis) on every comparison doesn't look necessary. Can we create one ordering per config instead?
   ```
     private[sql] def fieldNameOrdering: Ordering[String] = {
       if (caseSensitiveAnalysis) {
         (a: String, b: String) => a.compareTo(b)
       } else {
         (a: String, b: String) => a.compareToIgnoreCase(b)
       }
     }
   ```
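
To illustrate how an ordering chosen once per configuration can drive a `TreeMap` lookup, here is a small standalone sketch. `FieldLookupSketch`, `index`, and the `caseSensitive` flag are hypothetical names standing in for the configuration check; this is not the code from the PR.

```scala
import scala.collection.immutable.TreeMap

object FieldLookupSketch {
  // Chooses the ordering once, up front, instead of checking the flag on every comparison.
  def fieldNameOrdering(caseSensitive: Boolean): Ordering[String] =
    if (caseSensitive) {
      Ordering.String
    } else {
      new Ordering[String] {
        override def compare(a: String, b: String): Int = a.compareToIgnoreCase(b)
      }
    }

  // Maps each field name to its position; lookups take a logarithmic number of comparisons.
  def index(names: Seq[String], caseSensitive: Boolean): TreeMap[String, Int] =
    TreeMap(names.zipWithIndex: _*)(fieldNameOrdering(caseSensitive))
}
```

With the case-insensitive ordering, `index(Seq("Id", "Name"), caseSensitive = false).get("id")` finds the entry stored under `"Id"`. One caveat of this sketch: names that differ only by case collapse into a single key, and whether `compareToIgnoreCase` agrees with the resolver in every edge case is not verified here.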





[GitHub] [spark] HyukjinKwon closed pull request #33981: [SPARK-36733][SQL] Fix a perf issue in SchemaPruning when a struct has many fields

Posted by GitBox <gi...@apache.org>.

HyukjinKwon closed pull request #33981:
URL: https://github.com/apache/spark/pull/33981


   



[GitHub] [spark] SparkQA removed a comment on pull request #33981: [SPARK-36733][SQL] Fix a perf issue in SchemaPruning when a struct has many fields

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #33981:
URL: https://github.com/apache/spark/pull/33981#issuecomment-919037223


   **[Test build #143254 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143254/testReport)** for PR 33981 at commit [`eeb0dc6`](https://github.com/apache/spark/commit/eeb0dc68a9f55a1ea2fa1270f9831144a39ec67f).



[GitHub] [spark] sarutak commented on a change in pull request #33981: [SPARK-36733][SQL] Fix a perf issue in SchemaPruning when a struct has many fields

Posted by GitBox <gi...@apache.org>.

sarutak commented on a change in pull request #33981:
URL: https://github.com/apache/spark/pull/33981#discussion_r707941027



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##########
@@ -3768,6 +3768,24 @@ class SQLConf extends Serializable with Logging {
     }
   }
 
+  /**
+   * Returns the [[Comparator]] for the current configuration,
+   * which can be used to compare two identifiers.
+   */
+  private[sql] def comparator: Comparator = {
+    if (caseSensitiveAnalysis) {
+      org.apache.spark.sql.catalyst.analysis.caseSensitiveComparator
+    } else {
+      org.apache.spark.sql.catalyst.analysis.caseInsensitiveComparator
+    }
+  }
+
+  private[sql] val fieldNameOrdering = new Ordering[String] {

Review comment:
       Thank you. It's reasonable.





[GitHub] [spark] AmplabJenkins commented on pull request #33981: [SPARK-36733][SQL] Fix a perf issue in SchemaPruning when a struct has many fields

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #33981:
URL: https://github.com/apache/spark/pull/33981#issuecomment-918942411


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/47753/
   



[GitHub] [spark] SparkQA commented on pull request #33981: [SPARK-36733][SQL] Fix a perf issue in SchemaPruning when a struct has many fields

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #33981:
URL: https://github.com/apache/spark/pull/33981#issuecomment-918473173


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47718/
   



[GitHub] [spark] SparkQA commented on pull request #33981: [SPARK-36733][SQL] Fix a perf issue in SchemaPruning when a struct has many fields

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #33981:
URL: https://github.com/apache/spark/pull/33981#issuecomment-918899848


   **[Test build #143250 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143250/testReport)** for PR 33981 at commit [`bff1505`](https://github.com/apache/spark/commit/bff1505812e521e2365abd6d7ae8787298c71389).



[GitHub] [spark] sarutak commented on a change in pull request #33981: [SPARK-36733][SQL] Fix a perf issue in SchemaPruning when a struct has many fields

Posted by GitBox <gi...@apache.org>.

sarutak commented on a change in pull request #33981:
URL: https://github.com/apache/spark/pull/33981#discussion_r707987742



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SchemaPruning.scala
##########
@@ -65,16 +67,19 @@ object SchemaPruning extends SQLConfHelper {
           sortLeftFieldsByRight(leftValueType, rightValueType),
           containsNull)
       case (leftStruct: StructType, rightStruct: StructType) =>
-        val resolver = conf.resolver
-        val filteredRightFieldNames = rightStruct.fieldNames
-          .filter(name => leftStruct.fieldNames.exists(resolver(_, name)))
-        val sortedLeftFields = filteredRightFieldNames.map { fieldName =>
-          val resolvedLeftStruct = leftStruct.find(p => resolver(p.name, fieldName)).get
-          val leftFieldType = resolvedLeftStruct.dataType
-          val rightFieldType = rightStruct(fieldName).dataType
-          val sortedLeftFieldType = sortLeftFieldsByRight(leftFieldType, rightFieldType)
-          StructField(fieldName, sortedLeftFieldType, nullable = resolvedLeftStruct.nullable,
-            metadata = resolvedLeftStruct.metadata)
+        val leftStructTreeMap =
+          TreeMap(leftStruct.map(_.name).zip(leftStruct): _*)(conf.fieldNameOrdering)

Review comment:
       Seems better. Thanks.





[GitHub] [spark] AmplabJenkins removed a comment on pull request #33981: [SPARK-36733][SQL] Fix a perf issue in SchemaPruning when a struct has many fields

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #33981:
URL: https://github.com/apache/spark/pull/33981#issuecomment-919076939


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/47757/
   

