Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/04/26 20:01:42 UTC

[GitHub] [spark] sunchao opened a new pull request #32354: [SPARK-35232][SQL] Nested column pruning should retain column metadata

sunchao opened a new pull request #32354:
URL: https://github.com/apache/spark/pull/32354


   
   ### What changes were proposed in this pull request?
   
   Retain column metadata when constructing `StructField` during nested column pruning.
   
   To test this change, the PR also adds column projection logic to `InMemoryTable`. Without the fix, `DSV2CharVarcharDDLTestSuite` will fail.
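
The effect of the change can be illustrated with a minimal sketch. It deliberately uses simplified stand-in case classes (`Field`, `Attr`) rather than Spark's own `StructField` and `Attribute`, and the metadata key `rawType` is hypothetical; the only point is that omitting the metadata argument silently falls back to an empty default.

```scala
// Simplified stand-ins for illustration only; these are NOT Spark's classes.
object MetadataPruningSketch {
  // Models a schema field whose metadata defaults to empty, as a
  // constructor with a default metadata argument would behave.
  final case class Field(
      name: String,
      dataType: String,
      nullable: Boolean = true,
      metadata: Map[String, String] = Map.empty)

  // Models a resolved attribute that carries metadata.
  final case class Attr(
      name: String,
      dataType: String,
      nullable: Boolean,
      metadata: Map[String, String])

  // Shape of the old code path: the metadata argument is omitted,
  // so the resulting field has empty metadata.
  def toFieldDroppingMetadata(att: Attr): Field =
    Field(att.name, att.dataType, att.nullable)

  // Shape of the fixed code path: metadata is passed through explicitly.
  def toFieldRetainingMetadata(att: Attr): Field =
    Field(att.name, att.dataType, att.nullable, att.metadata)
}
```

With an attribute such as `Attr("c", "string", nullable = true, Map("rawType" -> "char(5)"))`, the first conversion yields a field with empty metadata, while the second keeps the `rawType` entry.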
   
   ### Why are the changes needed?
   
   Column metadata is used in a few places, for example to reconstruct CHAR/VARCHAR type information (see [SPARK-33901](https://issues.apache.org/jira/browse/SPARK-33901)). Therefore, we should retain it during nested column pruning.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No
   
   ### How was this patch tested?
   
   Existing tests.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #32354: [SPARK-35232][SQL] Nested column pruning should retain column metadata

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #32354:
URL: https://github.com/apache/spark/pull/32354#issuecomment-834153699


   **[Test build #138241 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138241/testReport)** for PR 32354 at commit [`50dc32d`](https://github.com/apache/spark/commit/50dc32d89d3129a8e3d8e5019c4d7888ede30b4f).




[GitHub] [spark] SparkQA commented on pull request #32354: [SPARK-35232][SQL] Nested column pruning should retain column metadata

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #32354:
URL: https://github.com/apache/spark/pull/32354#issuecomment-830781264


   **[Test build #138142 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138142/testReport)** for PR 32354 at commit [`7fd0360`](https://github.com/apache/spark/commit/7fd0360a2b536744635bbd6b58f20279883a557e).




[GitHub] [spark] SparkQA commented on pull request #32354: [SPARK-35232][SQL] Nested column pruning should retain column metadata

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #32354:
URL: https://github.com/apache/spark/pull/32354#issuecomment-834625687


   **[Test build #138259 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138259/testReport)** for PR 32354 at commit [`a67dde0`](https://github.com/apache/spark/commit/a67dde0b2ddea64bf1eb5bb67e0b1cd938146ee6).




[GitHub] [spark] sunchao commented on pull request #32354: [SPARK-35232][SQL] Nested column pruning should retain column metadata

Posted by GitBox <gi...@apache.org>.
sunchao commented on pull request #32354:
URL: https://github.com/apache/spark/pull/32354#issuecomment-830270564


   kindly ping @viirya @cloud-fan @yaooqinn - it'd be great to get a review from you :)




[GitHub] [spark] AmplabJenkins commented on pull request #32354: [SPARK-35232][SQL] Nested column pruning should retain column metadata

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #32354:
URL: https://github.com/apache/spark/pull/32354#issuecomment-827156937


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42491/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #32354: [SPARK-35232][SQL] Nested column pruning should retain column metadata

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #32354:
URL: https://github.com/apache/spark/pull/32354#issuecomment-832363551


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138160/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #32354: [SPARK-35232][SQL] Nested column pruning should retain column metadata

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #32354:
URL: https://github.com/apache/spark/pull/32354#issuecomment-834191115


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42763/
   




[GitHub] [spark] AmplabJenkins commented on pull request #32354: [SPARK-35232][SQL] Nested column pruning should retain column metadata

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #32354:
URL: https://github.com/apache/spark/pull/32354#issuecomment-827204785


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42492/
   




[GitHub] [spark] SparkQA removed a comment on pull request #32354: [SPARK-35232][SQL] Nested column pruning should retain column metadata

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #32354:
URL: https://github.com/apache/spark/pull/32354#issuecomment-832328037


   **[Test build #138160 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138160/testReport)** for PR 32354 at commit [`50dc32d`](https://github.com/apache/spark/commit/50dc32d89d3129a8e3d8e5019c4d7888ede30b4f).




[GitHub] [spark] sunchao commented on a change in pull request #32354: [SPARK-35232][SQL] Nested column pruning should retain column metadata

Posted by GitBox <gi...@apache.org>.
sunchao commented on a change in pull request #32354:
URL: https://github.com/apache/spark/pull/32354#discussion_r624355101



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SchemaPruning.scala
##########
@@ -127,7 +128,8 @@ object SchemaPruning extends SQLConfHelper {
   private def getRootFields(expr: Expression): Seq[RootField] = {
     expr match {
       case att: Attribute =>
-        RootField(StructField(att.name, att.dataType, att.nullable), derivedFromAtt = true) :: Nil
+        RootField(StructField(att.name, att.dataType, att.nullable, att.metadata),

Review comment:
       Good point. Let me add a test.






[GitHub] [spark] sunchao commented on a change in pull request #32354: [SPARK-35232][SQL] Nested column pruning should retain column metadata

Posted by GitBox <gi...@apache.org>.
sunchao commented on a change in pull request #32354:
URL: https://github.com/apache/spark/pull/32354#discussion_r625977975



##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/connector/catalog/InMemoryTable.scala
##########
@@ -504,9 +509,39 @@ private class BufferedRowsReader(
     index < partition.rows.length
   }
 
-  override def get(): InternalRow = addMetadata(partition.rows(index))
+  override def get(): InternalRow = {
+    val originalRow = partition.rows(index)
+    val values = new Array[Any](nonMetadataColumns.length)
+    nonMetadataColumns.zipWithIndex.foreach { case (col, idx) =>
+      values(idx) = extractFieldValue(col, tableSchema, originalRow)
+    }
+    addMetadata(new GenericInternalRow(values))
+  }
 
   override def close(): Unit = {}
+
+  private def extractFieldValue(
+      field: StructField,
+      schema: StructType,
+      row: InternalRow): Any = {
+    val index = schema.fieldIndex(field.name)

Review comment:
       Good question. Looking at `PushDownUtils.pruneColumns`, I see that we apply `SQLConf.resolver` when nested column pruning is enabled, but apparently not when it is disabled. IMO we should have a better contract between Spark and data source implementors w.r.t. `SupportsPushDownRequiredColumns.pruneColumns`: Spark should guarantee that the `requiredSchema` passed to the method is a "subset" of the relation's schema (e.g., the table schema).
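
One way to read the concern above is that an exact-name lookup such as `schema.fieldIndex(field.name)` only works if the requested field names match the table schema exactly. The sketch below is a hedged illustration in plain Scala, not Spark code: `Resolver` mirrors the shape of a name-matching function, and `findFieldIndex` is a hypothetical helper showing how a lookup behaves under case-sensitive versus case-insensitive matching.

```scala
// Illustrative only; names and helpers here are assumptions, not Spark APIs.
object FieldLookupSketch {
  // A resolver decides whether two identifiers refer to the same column.
  type Resolver = (String, String) => Boolean

  val caseSensitive: Resolver = (a, b) => a == b
  val caseInsensitive: Resolver = (a, b) => a.equalsIgnoreCase(b)

  // Returns the index of the first table field that the resolver matches
  // against the requested name, or None when nothing matches.
  def findFieldIndex(
      tableFields: Seq[String],
      requested: String,
      resolver: Resolver): Option[Int] = {
    val idx = tableFields.indexWhere(f => resolver(f, requested))
    if (idx >= 0) Some(idx) else None
  }
}
```

If the required schema is guaranteed to be a subset of the table schema, with names already normalized, the exact-match lookup is sufficient; otherwise the data source has to apply the same resolver itself.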






[GitHub] [spark] viirya closed pull request #32354: [SPARK-35232][SQL] Nested column pruning should retain column metadata

Posted by GitBox <gi...@apache.org>.
viirya closed pull request #32354:
URL: https://github.com/apache/spark/pull/32354


   




[GitHub] [spark] SparkQA commented on pull request #32354: [SPARK-35232][SQL] Nested column pruning should retain column metadata

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #32354:
URL: https://github.com/apache/spark/pull/32354#issuecomment-827147500


   Kubernetes integration test unable to build dist.
   
   exiting with code: 1
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42491/
   




[GitHub] [spark] SparkQA commented on pull request #32354: [SPARK-35232][SQL] Nested column pruning should retain column metadata

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #32354:
URL: https://github.com/apache/spark/pull/32354#issuecomment-832357747


   **[Test build #138160 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138160/testReport)** for PR 32354 at commit [`50dc32d`](https://github.com/apache/spark/commit/50dc32d89d3129a8e3d8e5019c4d7888ede30b4f).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] AmplabJenkins removed a comment on pull request #32354: [SPARK-35232][SQL] Nested column pruning should retain column metadata

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #32354:
URL: https://github.com/apache/spark/pull/32354#issuecomment-832394425


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42681/
   




[GitHub] [spark] SparkQA commented on pull request #32354: [SPARK-35232][SQL] Nested column pruning should retain column metadata

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #32354:
URL: https://github.com/apache/spark/pull/32354#issuecomment-827283692


   **[Test build #137972 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137972/testReport)** for PR 32354 at commit [`f90d882`](https://github.com/apache/spark/commit/f90d8822ccad024f9b95356736b0e83e4a3a06df).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] SparkQA commented on pull request #32354: [SPARK-35232][SQL] Nested column pruning should retain column metadata

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #32354:
URL: https://github.com/apache/spark/pull/32354#issuecomment-834334579


   **[Test build #138241 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138241/testReport)** for PR 32354 at commit [`50dc32d`](https://github.com/apache/spark/commit/50dc32d89d3129a8e3d8e5019c4d7888ede30b4f).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] AmplabJenkins commented on pull request #32354: [SPARK-35232][SQL] Nested column pruning should retain column metadata

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #32354:
URL: https://github.com/apache/spark/pull/32354#issuecomment-834361684


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138241/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #32354: [SPARK-35232][SQL] Nested column pruning should retain column metadata

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #32354:
URL: https://github.com/apache/spark/pull/32354#issuecomment-827286545


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/137972/
   




[GitHub] [spark] SparkQA commented on pull request #32354: [SPARK-35232][SQL] Nested column pruning should retain column metadata

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #32354:
URL: https://github.com/apache/spark/pull/32354#issuecomment-827124720


   **[Test build #137971 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137971/testReport)** for PR 32354 at commit [`36b8f8e`](https://github.com/apache/spark/commit/36b8f8eacee376743b81541fb5d38cd38a1d9b16).




[GitHub] [spark] sunchao commented on pull request #32354: [SPARK-35232][SQL] Nested column pruning should retain column metadata

Posted by GitBox <gi...@apache.org>.
sunchao commented on pull request #32354:
URL: https://github.com/apache/spark/pull/32354#issuecomment-835163544


   Thanks @viirya for the review!




[GitHub] [spark] AmplabJenkins commented on pull request #32354: [SPARK-35232][SQL] Nested column pruning should retain column metadata

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #32354:
URL: https://github.com/apache/spark/pull/32354#issuecomment-827286545


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/137972/
   




[GitHub] [spark] viirya commented on pull request #32354: [SPARK-35232][SQL] Nested column pruning should retain column metadata

Posted by GitBox <gi...@apache.org>.
viirya commented on pull request #32354:
URL: https://github.com/apache/spark/pull/32354#issuecomment-835115714


   Thanks @sunchao! Merging to master.




[GitHub] [spark] SparkQA commented on pull request #32354: [SPARK-35232][SQL] Nested column pruning should retain column metadata

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #32354:
URL: https://github.com/apache/spark/pull/32354#issuecomment-827183325


   **[Test build #137972 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137972/testReport)** for PR 32354 at commit [`f90d882`](https://github.com/apache/spark/commit/f90d8822ccad024f9b95356736b0e83e4a3a06df).




[GitHub] [spark] SparkQA commented on pull request #32354: [SPARK-35232][SQL] Nested column pruning should retain column metadata

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #32354:
URL: https://github.com/apache/spark/pull/32354#issuecomment-827200133


   Kubernetes integration test unable to build dist.
   
   exiting with code: 1
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42492/
   




[GitHub] [spark] SparkQA commented on pull request #32354: [SPARK-35232][SQL] Nested column pruning should retain column metadata

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #32354:
URL: https://github.com/apache/spark/pull/32354#issuecomment-832328037


   **[Test build #138160 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138160/testReport)** for PR 32354 at commit [`50dc32d`](https://github.com/apache/spark/commit/50dc32d89d3129a8e3d8e5019c4d7888ede30b4f).




[GitHub] [spark] AmplabJenkins commented on pull request #32354: [SPARK-35232][SQL] Nested column pruning should retain column metadata

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #32354:
URL: https://github.com/apache/spark/pull/32354#issuecomment-832394425


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42681/
   




[GitHub] [spark] sunchao edited a comment on pull request #32354: [SPARK-35232][SQL] Nested column pruning should retain column metadata

Posted by GitBox <gi...@apache.org>.
sunchao edited a comment on pull request #32354:
URL: https://github.com/apache/spark/pull/32354#issuecomment-827151576


   @viirya without the change in `InMemoryTable`, the test would not fail. This is because `InMemoryTable` doesn't support column pruning at the moment: `InMemoryBatchScan` just returns the table schema (which has the metadata) as the read schema, whereas a more realistic data source would use the `requestedSchema` (e.g., [Iceberg](https://github.com/apache/iceberg/blob/master/spark3/src/main/java/org/apache/iceberg/spark/source/SparkScanBuilder.java#L132)).
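
To make the distinction concrete, here is a rough sketch of what column projection in a test data source amounts to. It is plain Scala with hypothetical names (`projectRow`, column names as strings) and is not the actual `InMemoryTable` code: a reader that honors the required columns returns only those values, in the requested order, instead of the full table row.

```scala
// Illustrative only; a simplified model of projecting a row onto required columns.
object ColumnProjectionSketch {
  // Picks the values of `requiredColumns` out of `row`, whose layout is
  // described by `tableColumns`. Fails fast if a required column is unknown.
  def projectRow(
      tableColumns: Seq[String],
      requiredColumns: Seq[String],
      row: Seq[Any]): Seq[Any] =
    requiredColumns.map { col =>
      val idx = tableColumns.indexOf(col)
      require(idx >= 0, s"Column $col is not in the table schema")
      row(idx)
    }
}
```

A source that skips this step and always reports the full table schema never exercises the pruned schema, which is why the missing metadata would go unnoticed in such a test.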




[GitHub] [spark] AmplabJenkins removed a comment on pull request #32354: [SPARK-35232][SQL] Nested column pruning should retain column metadata

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #32354:
URL: https://github.com/apache/spark/pull/32354#issuecomment-827223645


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/137971/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #32354: [SPARK-35232][SQL] Nested column pruning should retain column metadata

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #32354:
URL: https://github.com/apache/spark/pull/32354#issuecomment-827156937


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42491/
   




[GitHub] [spark] viirya commented on pull request #32354: [SPARK-35232][SQL] Nested column pruning should retain column metadata

Posted by GitBox <gi...@apache.org>.
viirya commented on pull request #32354:
URL: https://github.com/apache/spark/pull/32354#issuecomment-834635302


   I will merge this once CI passes.




[GitHub] [spark] SparkQA removed a comment on pull request #32354: [SPARK-35232][SQL] Nested column pruning should retain column metadata

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #32354:
URL: https://github.com/apache/spark/pull/32354#issuecomment-834625687


   **[Test build #138259 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138259/testReport)** for PR 32354 at commit [`a67dde0`](https://github.com/apache/spark/commit/a67dde0b2ddea64bf1eb5bb67e0b1cd938146ee6).




[GitHub] [spark] viirya commented on a change in pull request #32354: [SPARK-35232][SQL] Nested column pruning should retain column metadata

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #32354:
URL: https://github.com/apache/spark/pull/32354#discussion_r625530081



##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/connector/catalog/InMemoryTable.scala
##########
@@ -237,29 +237,29 @@ class InMemoryTable(
     private var schema: StructType = tableSchema
 
     override def build: Scan =
-      new InMemoryBatchScan(data.map(_.asInstanceOf[InputPartition]), schema)
+      new InMemoryBatchScan(data.map(_.asInstanceOf[InputPartition]), schema, tableSchema)
 
     override def pruneColumns(requiredSchema: StructType): Unit = {
-      // if metadata columns are projected, return the table schema and metadata columns
-      val hasMetadataColumns = requiredSchema.map(_.name).exists(metadataColumnNames.contains)
-      if (hasMetadataColumns) {
-        schema = StructType(tableSchema ++ metadataColumnNames
-            .flatMap(name => metadataColumns.find(_.name == name))
-            .map(col => StructField(col.name, col.dataType, col.isNullable)))
-      }
+      schema = StructType(requiredSchema.filter { f =>
+        (metadataColumnNames ++ tableSchema.map(_.name)).contains(f.name)

Review comment:
       Can `(metadataColumnNames ++ tableSchema.map(_.name))` be moved out of the `filter` loop?
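
       A minimal sketch of that hoisting, assuming Spark's `org.apache.spark.sql.types` classes are on the classpath. The object name `PruneColumnsSketch`, the helper `prune`, and the sample `tableSchema` and `metadataColumnNames` values are hypothetical stand-ins for illustration, not the code in `InMemoryTable`:

       ```scala
       import org.apache.spark.sql.types._

       object PruneColumnsSketch {
         // Hypothetical stand-ins for the members of the same name in InMemoryTable.
         val tableSchema: StructType = new StructType()
           .add("id", LongType)
           .add("data", StringType)
         val metadataColumnNames: Seq[String] = Seq("index", "_partition")

         // Build the set of allowed names once, outside the filter predicate,
         // so it is not re-concatenated for every field of the required schema.
         private val allowedNames: Set[String] =
           (metadataColumnNames ++ tableSchema.map(_.name)).toSet

         // Keep only the requested fields that are table columns or metadata columns.
         def prune(requiredSchema: StructType): StructType =
           StructType(requiredSchema.filter(f => allowedNames.contains(f.name)))
       }
       ```

       Using a `Set` also makes each membership check constant-time, although for a test table with a handful of columns the difference is mostly about readability.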






[GitHub] [spark] viirya commented on a change in pull request #32354: [SPARK-35232][SQL] Nested column pruning should retain column metadata

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #32354:
URL: https://github.com/apache/spark/pull/32354#discussion_r628351335



##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/connector/catalog/InMemoryTable.scala
##########
@@ -504,9 +508,39 @@ private class BufferedRowsReader(
     index < partition.rows.length
   }
 
-  override def get(): InternalRow = addMetadata(partition.rows(index))
+  override def get(): InternalRow = {
+    val originalRow = partition.rows(index)
+    val values = new Array[Any](nonMetadataColumns.length)
+    nonMetadataColumns.zipWithIndex.foreach { case (col, idx) =>
+      values(idx) = extractFieldValue(col, tableSchema, originalRow)
+    }
+    addMetadata(new GenericInternalRow(values))
+  }
 
   override def close(): Unit = {}
+
+  private def extractFieldValue(
+      field: StructField,
+      schema: StructType,
+      row: InternalRow): Any = {
+    val index = schema.fieldIndex(field.name)
+    field.dataType match {
+      case StructType(fields) =>
+        val childRow = row.toSeq(schema)(index).asInstanceOf[InternalRow]
+        if (childRow == null) {
+          return null
+        }

Review comment:
       nit: we can use `row.isNullAt(index)`.
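
       A rough illustration of the suggested null check, assuming `spark-catalyst` is on the classpath. The object `NullCheckSketch` and the helper `childRowOrNull` are hypothetical names and not part of the patch:

       ```scala
       import org.apache.spark.sql.catalyst.InternalRow

       object NullCheckSketch {
         // Returns the nested row stored at `index`, or null when that slot is null.
         // `isNullAt` checks the slot directly, so the row does not have to be
         // converted with `toSeq(schema)` just to test for null.
         def childRowOrNull(row: InternalRow, index: Int, numFields: Int): InternalRow =
           if (row.isNullAt(index)) null else row.getStruct(index, numFields)
       }
       ```

       For example, given `InternalRow(InternalRow(1, 2), null)`, the helper returns the nested row for index 0 and `null` for index 1.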






[GitHub] [spark] SparkQA commented on pull request #32354: [SPARK-35232][SQL] Nested column pruning should retain column metadata

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #32354:
URL: https://github.com/apache/spark/pull/32354#issuecomment-830793653








[GitHub] [spark] AmplabJenkins commented on pull request #32354: [SPARK-35232][SQL] Nested column pruning should retain column metadata

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #32354:
URL: https://github.com/apache/spark/pull/32354#issuecomment-834667061


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42781/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #32354: [SPARK-35232][SQL] Nested column pruning should retain column metadata

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #32354:
URL: https://github.com/apache/spark/pull/32354#issuecomment-834667061


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42781/
   




[GitHub] [spark] SparkQA removed a comment on pull request #32354: [SPARK-35232][SQL] Nested column pruning should retain column metadata

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #32354:
URL: https://github.com/apache/spark/pull/32354#issuecomment-827183325


   **[Test build #137972 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137972/testReport)** for PR 32354 at commit [`f90d882`](https://github.com/apache/spark/commit/f90d8822ccad024f9b95356736b0e83e4a3a06df).




[GitHub] [spark] viirya commented on pull request #32354: [SPARK-35232][SQL] Nested column pruning should retain column metadata

Posted by GitBox <gi...@apache.org>.
viirya commented on pull request #32354:
URL: https://github.com/apache/spark/pull/32354#issuecomment-827139573


   > Retain column metadata during the process of nested column pruning, when constructing StructField.
   > To test the above change, this also added the logic of column projection in InMemoryTable. Without the fix DSV2CharVarcharDDLTestSuite will fail.
   
   Does this mean that if we apply only the `SchemaPruning` change, without the `InMemoryTable` change, `DSV2CharVarcharDDLTestSuite` will fail?
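
   For context, the idea quoted above (retaining metadata when a pruned `StructField` is constructed) can be sketched as follows. This assumes Spark's `org.apache.spark.sql.types` classes are available. The object `RetainMetadataSketch` and both helper names are hypothetical, and this is not the actual `SchemaPruning` code:

   ```scala
   import org.apache.spark.sql.types._

   object RetainMetadataSketch {
     // Rebuilds a field around a pruned data type but loses its metadata,
     // because StructField's metadata parameter defaults to Metadata.empty.
     def pruneDroppingMetadata(f: StructField, pruned: DataType): StructField =
       StructField(f.name, pruned, f.nullable)

     // Same rebuild, but carries the original field's metadata over.
     def pruneRetainingMetadata(f: StructField, pruned: DataType): StructField =
       StructField(f.name, pruned, f.nullable, f.metadata)
   }
   ```

   A data source that builds its read schema from the pruned schema only sees metadata preserved this way, whereas one that ignores pruning and returns the full table schema would hide the difference.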




[GitHub] [spark] SparkQA commented on pull request #32354: [SPARK-35232][SQL] Nested column pruning should retain column metadata

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #32354:
URL: https://github.com/apache/spark/pull/32354#issuecomment-834667006








[GitHub] [spark] AmplabJenkins removed a comment on pull request #32354: [SPARK-35232][SQL] Nested column pruning should retain column metadata

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #32354:
URL: https://github.com/apache/spark/pull/32354#issuecomment-830830241


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42663/
   




[GitHub] [spark] sunchao commented on pull request #32354: [SPARK-35232][SQL] Nested column pruning should retain column metadata

Posted by GitBox <gi...@apache.org>.
sunchao commented on pull request #32354:
URL: https://github.com/apache/spark/pull/32354#issuecomment-827154440


   Also, it seems [SPARK-33901](https://issues.apache.org/jira/browse/SPARK-33901) (#30918) doesn't work for nested types. For instance, this test:
   ```scala
     test("SPARK-XXXXX: create table like should not change table's schema (nested type)") {
       withTable("tt") {
         sql(s"CREATE TABLE tt(s1 struct<i: CHAR(5), j: int>, s2 struct<c: VARCHAR(4), d: int>) " +
           s"USING $format")
         withView("t") {
           sql("CREATE VIEW t AS SELECT s1.i, s2.c FROM tt")
           checkTableSchemaTypeStr(Seq(Row("char(5)"), Row("varchar(4)")))
         }
       }
     }
   ```
   doesn't work even with this fix. It seems this is related to how the metadata is handled: in a nested schema, the metadata is associated with the top-level field (e.g., `s1`, `s2`) instead of the leaf nodes (e.g., `i`, `c`). cc @yaooqinn @cloud-fan




[GitHub] [spark] SparkQA commented on pull request #32354: [SPARK-35232][SQL] Nested column pruning should retain column metadata

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #32354:
URL: https://github.com/apache/spark/pull/32354#issuecomment-834188152








[GitHub] [spark] AmplabJenkins commented on pull request #32354: [SPARK-35232][SQL] Nested column pruning should retain column metadata

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #32354:
URL: https://github.com/apache/spark/pull/32354#issuecomment-834191115


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42763/
   




[GitHub] [spark] sunchao commented on a change in pull request #32354: [SPARK-35232][SQL] Nested column pruning should retain column metadata

Posted by GitBox <gi...@apache.org>.
sunchao commented on a change in pull request #32354:
URL: https://github.com/apache/spark/pull/32354#discussion_r625992301



##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/connector/catalog/InMemoryTable.scala
##########
@@ -504,9 +509,39 @@ private class BufferedRowsReader(
     index < partition.rows.length
   }
 
-  override def get(): InternalRow = addMetadata(partition.rows(index))
+  override def get(): InternalRow = {
+    val originalRow = partition.rows(index)
+    val values = new Array[Any](nonMetadataColumns.length)
+    nonMetadataColumns.zipWithIndex.foreach { case (col, idx) =>
+      values(idx) = extractFieldValue(col, tableSchema, originalRow)
+    }
+    addMetadata(new GenericInternalRow(values))
+  }
 
   override def close(): Unit = {}
+
+  private def extractFieldValue(
+      field: StructField,
+      schema: StructType,
+      row: InternalRow): Any = {
+    val index = schema.fieldIndex(field.name)

Review comment:
       Oh actually, in both cases the `requiredSchema` is normalized according to case sensitivity. This is done in `V2ScanRelationPushDown` (see `normalizedProjects` there).






[GitHub] [spark] AmplabJenkins removed a comment on pull request #32354: [SPARK-35232][SQL] Nested column pruning should retain column metadata

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #32354:
URL: https://github.com/apache/spark/pull/32354#issuecomment-834361684


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138241/
   




[GitHub] [spark] viirya commented on pull request #32354: [SPARK-35232][SQL] Nested column pruning should retain column metadata

Posted by GitBox <gi...@apache.org>.
viirya commented on pull request #32354:
URL: https://github.com/apache/spark/pull/32354#issuecomment-834148230


   retest this please




[GitHub] [spark] SparkQA commented on pull request #32354: [SPARK-35232][SQL] Nested column pruning should retain column metadata

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #32354:
URL: https://github.com/apache/spark/pull/32354#issuecomment-834811277


   **[Test build #138259 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138259/testReport)** for PR 32354 at commit [`a67dde0`](https://github.com/apache/spark/commit/a67dde0b2ddea64bf1eb5bb67e0b1cd938146ee6).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] SparkQA removed a comment on pull request #32354: [SPARK-35232][SQL] Nested column pruning should retain column metadata

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #32354:
URL: https://github.com/apache/spark/pull/32354#issuecomment-834153699


   **[Test build #138241 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138241/testReport)** for PR 32354 at commit [`50dc32d`](https://github.com/apache/spark/commit/50dc32d89d3129a8e3d8e5019c4d7888ede30b4f).




[GitHub] [spark] AmplabJenkins commented on pull request #32354: [SPARK-35232][SQL] Nested column pruning should retain column metadata

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #32354:
URL: https://github.com/apache/spark/pull/32354#issuecomment-834812445


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138259/
   




[GitHub] [spark] AmplabJenkins commented on pull request #32354: [SPARK-35232][SQL] Nested column pruning should retain column metadata

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #32354:
URL: https://github.com/apache/spark/pull/32354#issuecomment-827223645


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/137971/
   




[GitHub] [spark] sunchao commented on pull request #32354: [SPARK-35232][SQL] Nested column pruning should retain column metadata

Posted by GitBox <gi...@apache.org>.
sunchao commented on pull request #32354:
URL: https://github.com/apache/spark/pull/32354#issuecomment-830485283


   > Oh, you mean that by applying `InMemoryTable` related change here, `DSV2CharVarcharDDLTestSuite` will fail because nested column pruning doesn't retain metadata now. Right?
   
   Yup exactly.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #32354: [SPARK-35232][SQL] Nested column pruning should retain column metadata

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #32354:
URL: https://github.com/apache/spark/pull/32354#issuecomment-827213425


   **[Test build #137971 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137971/testReport)** for PR 32354 at commit [`36b8f8e`](https://github.com/apache/spark/commit/36b8f8eacee376743b81541fb5d38cd38a1d9b16).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] sunchao commented on pull request #32354: [SPARK-35232][SQL] Nested column pruning should retain column metadata

Posted by GitBox <gi...@apache.org>.
sunchao commented on pull request #32354:
URL: https://github.com/apache/spark/pull/32354#issuecomment-827151576


   @viirya Without the change in `InMemoryTable`, the test will not fail. `InMemoryTable` doesn't implement column pruning at the moment: it just returns the table schema as the read schema in `InMemoryBatchScan`. A more realistic data source would use the `requestedSchema` (e.g., [Iceberg](https://github.com/apache/iceberg/blob/master/spark3/src/main/java/org/apache/iceberg/spark/source/SparkScanBuilder.java#L132)).




[GitHub] [spark] AmplabJenkins commented on pull request #32354: [SPARK-35232][SQL] Nested column pruning should retain column metadata

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #32354:
URL: https://github.com/apache/spark/pull/32354#issuecomment-832363551


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138160/
   




[GitHub] [spark] AmplabJenkins commented on pull request #32354: [SPARK-35232][SQL] Nested column pruning should retain column metadata

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #32354:
URL: https://github.com/apache/spark/pull/32354#issuecomment-830830241


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42663/
   




[GitHub] [spark] sunchao commented on a change in pull request #32354: [SPARK-35232][SQL] Nested column pruning should retain column metadata

Posted by GitBox <gi...@apache.org>.
sunchao commented on a change in pull request #32354:
URL: https://github.com/apache/spark/pull/32354#discussion_r625985268



##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/connector/catalog/InMemoryTable.scala
##########
@@ -504,9 +509,39 @@ private class BufferedRowsReader(
     index < partition.rows.length
   }
 
-  override def get(): InternalRow = addMetadata(partition.rows(index))
+  override def get(): InternalRow = {
+    val originalRow = partition.rows(index)
+    val values = new Array[Any](nonMetadataColumns.length)
+    nonMetadataColumns.zipWithIndex.foreach { case (col, idx) =>
+      values(idx) = extractFieldValue(col, tableSchema, originalRow)
+    }
+    addMetadata(new GenericInternalRow(values))
+  }
 
   override def close(): Unit = {}
+
+  private def extractFieldValue(
+      field: StructField,
+      schema: StructType,
+      row: InternalRow): Any = {
+    val index = schema.fieldIndex(field.name)
+    field.dataType match {
+      case StructType(fields) =>
+        val childRow = row.toSeq(schema)(index).asInstanceOf[InternalRow]
+        if (childRow == null) {
+          return null
+        }
+        val childSchema = schema(index).dataType.asInstanceOf[StructType]
+        val resultValue = new Array[Any](fields.length)
+        fields.zipWithIndex.foreach { case (childField, idx) =>
+          val childValue = extractFieldValue(childField, childSchema, childRow)
+          resultValue(idx) = childValue
+        }
+        new GenericInternalRow(resultValue)
+      case dt =>
+        row.get(index, CharVarcharUtils.replaceCharVarcharWithString(dt))
+    }

Review comment:
       Yes, I think this is not actually required. My bad.






[GitHub] [spark] AmplabJenkins removed a comment on pull request #32354: [SPARK-35232][SQL] Nested column pruning should retain column metadata

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #32354:
URL: https://github.com/apache/spark/pull/32354#issuecomment-834812445


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138259/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #32354: [SPARK-35232][SQL] Nested column pruning should retain column metadata

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #32354:
URL: https://github.com/apache/spark/pull/32354#issuecomment-827204785


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42492/
   




[GitHub] [spark] SparkQA removed a comment on pull request #32354: [SPARK-35232][SQL] Nested column pruning should retain column metadata

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #32354:
URL: https://github.com/apache/spark/pull/32354#issuecomment-827124720


   **[Test build #137971 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137971/testReport)** for PR 32354 at commit [`36b8f8e`](https://github.com/apache/spark/commit/36b8f8eacee376743b81541fb5d38cd38a1d9b16).




[GitHub] [spark] viirya commented on a change in pull request #32354: [SPARK-35232][SQL] Nested column pruning should retain column metadata

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #32354:
URL: https://github.com/apache/spark/pull/32354#discussion_r625531893



##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/connector/catalog/InMemoryTable.scala
##########
@@ -504,9 +509,39 @@ private class BufferedRowsReader(
     index < partition.rows.length
   }
 
-  override def get(): InternalRow = addMetadata(partition.rows(index))
+  override def get(): InternalRow = {
+    val originalRow = partition.rows(index)
+    val values = new Array[Any](nonMetadataColumns.length)
+    nonMetadataColumns.zipWithIndex.foreach { case (col, idx) =>
+      values(idx) = extractFieldValue(col, tableSchema, originalRow)
+    }
+    addMetadata(new GenericInternalRow(values))
+  }
 
   override def close(): Unit = {}
+
+  private def extractFieldValue(
+      field: StructField,
+      schema: StructType,
+      row: InternalRow): Any = {
+    val index = schema.fieldIndex(field.name)

Review comment:
       Is case-sensitivity a problem here?
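
A minimal, dependency-free sketch of the concern raised here, assuming the lookup is an exact string match (which is how `StructType.fieldIndex` appears to behave). `FieldLookupSketch`, `exactIndex`, and `resolvedIndex` are hypothetical names, not Spark APIs, and field names are modeled as plain strings.

```scala
object FieldLookupSketch {
  // Exact-match lookup: "name" and "Name" are treated as different fields.
  def exactIndex(fieldNames: Seq[String], name: String): Option[Int] = {
    val i = fieldNames.indexOf(name)
    if (i >= 0) Some(i) else None
  }

  // Lookup with an explicit case-sensitivity switch (hypothetical helper).
  def resolvedIndex(
      fieldNames: Seq[String],
      name: String,
      caseSensitive: Boolean): Option[Int] = {
    val i = fieldNames.indexWhere { f =>
      if (caseSensitive) f == name else f.equalsIgnoreCase(name)
    }
    if (i >= 0) Some(i) else None
  }
}
```

With field names `Seq("id", "Name")`, an exact lookup of `name` finds nothing, while the case-insensitive variant returns index 1. Whether this matters for a test-only class depends on whether the pruned schema can carry names whose case differs from the table schema.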






[GitHub] [spark] viirya commented on a change in pull request #32354: [SPARK-35232][SQL] Nested column pruning should retain column metadata

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #32354:
URL: https://github.com/apache/spark/pull/32354#discussion_r624215231



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SchemaPruning.scala
##########
@@ -127,7 +128,8 @@ object SchemaPruning extends SQLConfHelper {
   private def getRootFields(expr: Expression): Seq[RootField] = {
     expr match {
       case att: Attribute =>
-        RootField(StructField(att.name, att.dataType, att.nullable), derivedFromAtt = true) :: Nil
+        RootField(StructField(att.name, att.dataType, att.nullable, att.metadata),

Review comment:
       Can we add a unit test in `SchemaPruningSuite`? We can make `getRootFields` `private[spark]` so that it can be tested.
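
A minimal sketch of the property such a test could check, using simplified stand-ins rather than Spark's `Attribute` and `StructField`. `Attr`, `Field`, and both helper functions are hypothetical; only the shape of the change (passing the attribute's metadata through when building the field) comes from the diff above.

```scala
object MetadataPruningSketch {
  // Simplified stand-ins for an attribute and a struct field (not Spark types).
  final case class Attr(
      name: String,
      dataType: String,
      nullable: Boolean,
      metadata: Map[String, String])

  final case class Field(
      name: String,
      dataType: String,
      nullable: Boolean,
      metadata: Map[String, String] = Map.empty)

  // Behavior before the change: metadata falls back to the empty default.
  def rootFieldDroppingMetadata(att: Attr): Field =
    Field(att.name, att.dataType, att.nullable)

  // Behavior after the change: metadata is carried over from the attribute.
  def rootFieldRetainingMetadata(att: Attr): Field =
    Field(att.name, att.dataType, att.nullable, att.metadata)
}
```

A test along these lines would build an attribute with non-empty metadata and assert that the resulting root field still carries it.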






[GitHub] [spark] viirya commented on pull request #32354: [SPARK-35232][SQL] Nested column pruning should retain column metadata

Posted by GitBox <gi...@apache.org>.
viirya commented on pull request #32354:
URL: https://github.com/apache/spark/pull/32354#issuecomment-830402838


   > To test the above change, this also added the logic of column projection in InMemoryTable. Without the fix DSV2CharVarcharDDLTestSuite will fail.
   
   Oh, you mean that with the `InMemoryTable`-related change applied here, `DSV2CharVarcharDDLTestSuite` will fail because nested column pruning currently doesn't retain metadata. Right?




[GitHub] [spark] viirya commented on a change in pull request #32354: [SPARK-35232][SQL] Nested column pruning should retain column metadata

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #32354:
URL: https://github.com/apache/spark/pull/32354#discussion_r625532589



##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/connector/catalog/InMemoryTable.scala
##########
@@ -504,9 +509,39 @@ private class BufferedRowsReader(
     index < partition.rows.length
   }
 
-  override def get(): InternalRow = addMetadata(partition.rows(index))
+  override def get(): InternalRow = {
+    val originalRow = partition.rows(index)
+    val values = new Array[Any](nonMetadataColumns.length)
+    nonMetadataColumns.zipWithIndex.foreach { case (col, idx) =>
+      values(idx) = extractFieldValue(col, tableSchema, originalRow)
+    }
+    addMetadata(new GenericInternalRow(values))
+  }
 
   override def close(): Unit = {}
+
+  private def extractFieldValue(
+      field: StructField,
+      schema: StructType,
+      row: InternalRow): Any = {
+    val index = schema.fieldIndex(field.name)
+    field.dataType match {
+      case StructType(fields) =>
+        val childRow = row.toSeq(schema)(index).asInstanceOf[InternalRow]
+        if (childRow == null) {
+          return null
+        }
+        val childSchema = schema(index).dataType.asInstanceOf[StructType]
+        val resultValue = new Array[Any](fields.length)
+        fields.zipWithIndex.foreach { case (childField, idx) =>
+          val childValue = extractFieldValue(childField, childSchema, childRow)
+          resultValue(idx) = childValue
+        }
+        new GenericInternalRow(resultValue)
+      case dt =>
+        row.get(index, CharVarcharUtils.replaceCharVarcharWithString(dt))
+    }

Review comment:
       Looks like we didn't do the char/varchar-to-string conversion before?






[GitHub] [spark] viirya edited a comment on pull request #32354: [SPARK-35232][SQL] Nested column pruning should retain column metadata

Posted by GitBox <gi...@apache.org>.
viirya edited a comment on pull request #32354:
URL: https://github.com/apache/spark/pull/32354#issuecomment-827139573


   > Retain column metadata during the process of nested column pruning, when constructing StructField.
   > To test the above change, this also added the logic of column projection in InMemoryTable. Without the fix DSV2CharVarcharDDLTestSuite will fail.
   
   Does this mean that `DSV2CharVarcharDDLTestSuite` will fail if only the `SchemaPruning` change is applied, without the `InMemoryTable` change?
   
   But I just ran `DSV2CharVarcharDDLTestSuite` and it still passed?




[GitHub] [spark] SparkQA commented on pull request #32354: [SPARK-35232][SQL] Nested column pruning should retain column metadata

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #32354:
URL: https://github.com/apache/spark/pull/32354#issuecomment-832393163








[GitHub] [spark] sunchao commented on pull request #32354: [SPARK-35232][SQL] Nested column pruning should retain column metadata

Posted by GitBox <gi...@apache.org>.
sunchao commented on pull request #32354:
URL: https://github.com/apache/spark/pull/32354#issuecomment-834142995


   Thanks. They don't seem related. I tested them locally and all passed.

