Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/03/29 09:14:26 UTC

[GitHub] [spark] wangyum opened a new pull request #31993: [SPARK-34897][SQL] Add workaround to error message when OrcUtils.requestedColumnIds fails

wangyum opened a new pull request #31993:
URL: https://github.com/apache/spark/pull/31993


   ### What changes were proposed in this pull request?
   
   This PR adds a workaround (`set spark.sql.optimizer.nestedSchemaPruning.enabled=false`) to the error message shown when `OrcUtils.requestedColumnIds` fails. For example:
   ```scala
   spark.sql(
     """
       |CREATE TABLE `t1` (
       |  `_col0` INT,
       |  `_col1` STRING,
       |  `_col2` STRUCT<`c1`: STRING, `c2`: STRING, `c3`: STRING, `c4`: BIGINT>,
       |  `_col3` STRING)
       |USING orc
       |PARTITIONED BY (_col3)
       |""".stripMargin)
   
   spark.sql("INSERT INTO `t1` values(1, '2', null, '2021-02-01')")
   
   spark.sql("SELECT _col2.c1, _col0 FROM `t1` WHERE _col3 = '2021-02-01'").show
   ```
   
   Before this PR:
   ```
   java.lang.AssertionError: assertion failed: The given data schema struct<_col0:int,_col2:struct<c1:string>> has less fields than the actual ORC physical schema, no idea which columns were dropped, fail to read.
   	at scala.Predef$.assert(Predef.scala:223)
   	at org.apache.spark.sql.execution.datasources.orc.OrcUtils$.requestedColumnIds(OrcUtils.scala:160)
   ```
   After this PR:
   ```
   java.lang.AssertionError: assertion failed: The given data schema struct<_col0:int,_col2:struct<c1:string>> has less fields than the actual ORC physical schema, no idea which columns were dropped, fail to read. Try to disable spark.sql.optimizer.nestedSchemaPruning.enabled to workaround this issue.
   	at scala.Predef$.assert(Predef.scala:223)
   	at org.apache.spark.sql.execution.datasources.orc.OrcUtils$.requestedColumnIds(OrcUtils.scala:160)
   ```
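   
   For context, the assertion can be modeled in a few lines of plain Scala. This is a simplified sketch and not the Spark source: `ByIndexCheckSketch` and `canMapByIndex` are hypothetical names, and it assumes that the physical ORC fields are matched to the data schema by index, so the data schema must not be shorter than the physical schema.
   
   ```scala
   // Simplified model of the length check; names here are hypothetical, not Spark APIs.
   object ByIndexCheckSketch {
     // Index-based matching needs a data-schema slot for every physical field.
     def canMapByIndex(orcFieldNames: Seq[String], dataFieldNames: Seq[String]): Boolean =
       orcFieldNames.length <= dataFieldNames.length

     def main(args: Array[String]): Unit = {
       val physical = Seq("_col0", "_col1", "_col2") // fields stored in the ORC file
       val pruned = Seq("_col0", "_col2")            // data schema from the error message
       println(canMapByIndex(physical, pruned))      // false: `_col1` was dropped
       println(canMapByIndex(physical, physical))    // true: nothing was dropped
     }
   }
   ```
   
   With the pruned schema from the example, the check fails because three physical fields cannot be lined up against two requested ones.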
   
   
   ### Why are the changes needed?
   
   The failure message gives no hint on how to proceed, so this change adds a workaround to it.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   Manual test


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-822197386








[GitHub] [spark] cloud-fan commented on a change in pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31993:
URL: https://github.com/apache/spark/pull/31993#discussion_r614089538



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SchemaPruning.scala
##########
@@ -21,8 +21,9 @@ import org.apache.spark.sql.types._
 
 object SchemaPruning {
   /**
-   * Filters the schema by the requested fields. For example, if the schema is struct<a:int, b:int>,
-   * and given requested field are "a", the field "b" is pruned in the returned schema.
+   * Prunes the nested schema by the requested fields. For example, if the schema is
+   * struct<a:int, b:int>, and given requested field are "a", the field "b" is pruned in the
+   * returned schema.

Review comment:
       Let's call out explicitly that the top-level fields are not pruned here.






[GitHub] [spark] SparkQA commented on pull request #31993: [SPARK-34897][SQL] Add workaround to error message when OrcUtils.requestedColumnIds fails

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-809494294


   **[Test build #136644 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136644/testReport)** for PR 31993 at commit [`2a3f136`](https://github.com/apache/spark/commit/2a3f1367d6d90f34f2373c378910d4366052726a).
    * This patch **fails PySpark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] AmplabJenkins commented on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-823719172


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42235/
   




[GitHub] [spark] SparkQA removed a comment on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-822221636


   **[Test build #137582 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137582/testReport)** for PR 31993 at commit [`c04864a`](https://github.com/apache/spark/commit/c04864a92c7db8c5845e2f0ac13b77718f3e3e85).




[GitHub] [spark] SparkQA commented on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-820850952


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42024/
   




[GitHub] [spark] cloud-fan commented on pull request #31993: [SPARK-34897][SQL] Add workaround to error message when OrcUtils.requestedColumnIds fails

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-810765696


   Sorry, I may be missing something. Why is it only a problem in nested column pruning but not in column pruning?




[GitHub] [spark] SparkQA removed a comment on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-823944803


   **[Test build #137730 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137730/testReport)** for PR 31993 at commit [`4d0b510`](https://github.com/apache/spark/commit/4d0b510e10da4fe1fca07583ddecc5f5fe6d6392).




[GitHub] [spark] AmplabJenkins commented on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-821096738


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/137470/
   




[GitHub] [spark] SparkQA commented on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-823788856


   **[Test build #137707 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137707/testReport)** for PR 31993 at commit [`6112c9d`](https://github.com/apache/spark/commit/6112c9dd357da127646e164cdf6797f5801c1049).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] cloud-fan commented on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-823851109


   @wangyum there are conflicts




[GitHub] [spark] cloud-fan commented on a change in pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31993:
URL: https://github.com/apache/spark/pull/31993#discussion_r613388938



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SchemaPruning.scala
##########
@@ -32,11 +33,10 @@ object SchemaPruning {
     // in the resulting schema may differ from their ordering in the logical relation's
     // original schema
     val mergedSchema = requestedRootFields
-      .map { case root: RootField => StructType(Array(root.field)) }
+      .map { root: RootField => StructType(Array(root.field)) }
       .reduceLeft(_ merge _)
-    val dataSchemaFieldNames = dataSchema.fieldNames.toSet
     val mergedDataSchema =
-      StructType(mergedSchema.filter(f => dataSchemaFieldNames.contains(f.name)))
+      StructType(dataSchema.map(s => mergedSchema.find(_.name.equals(s.name)).getOrElse(s)))

Review comment:
       It seems we don't prune anything from the root fields now.
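   
   The two `mergedDataSchema` expressions in the diff above can be compared with a small plain-Scala model. This is a sketch under stated assumptions: `Field` is a hypothetical stand-in for `StructField` with the type kept as a string, and `oldMergedDataSchema` and `newMergedDataSchema` are illustrative names, not Spark APIs.
   
   ```scala
   object RootFieldPruningSketch {
     // Hypothetical stand-in for StructField: a name plus a printable type.
     final case class Field(name: String, dataType: String)

     // Old expression: keep only the requested fields that exist in the data schema.
     def oldMergedDataSchema(dataSchema: Seq[Field], mergedSchema: Seq[Field]): Seq[Field] = {
       val dataSchemaFieldNames = dataSchema.map(_.name).toSet
       mergedSchema.filter(f => dataSchemaFieldNames.contains(f.name))
     }

     // New expression: walk the data schema and swap in the pruned field when one was requested.
     def newMergedDataSchema(dataSchema: Seq[Field], mergedSchema: Seq[Field]): Seq[Field] =
       dataSchema.map(s => mergedSchema.find(_.name == s.name).getOrElse(s))

     def main(args: Array[String]): Unit = {
       val dataSchema = Seq(
         Field("_col0", "int"),
         Field("_col1", "string"),
         Field("_col2", "struct<c1:string,c2:string,c3:string,c4:bigint>"))
       val merged = Seq(Field("_col0", "int"), Field("_col2", "struct<c1:string>"))
       println(oldMergedDataSchema(dataSchema, merged).map(_.name)) // List(_col0, _col2)
       println(newMergedDataSchema(dataSchema, merged).map(_.name)) // List(_col0, _col1, _col2)
     }
   }
   ```
   
   In this model the new expression keeps every top-level field and still narrows `_col2` to `struct<c1:string>`, while the old one drops `_col1`.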






[GitHub] [spark] SparkQA commented on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-822238041


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42135/
   




[GitHub] [spark] wangyum commented on a change in pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning

Posted by GitBox <gi...@apache.org>.
wangyum commented on a change in pull request #31993:
URL: https://github.com/apache/spark/pull/31993#discussion_r617212439



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/HadoopFsRelation.scala
##########
@@ -40,7 +40,7 @@ import org.apache.spark.sql.types.{StructField, StructType}
 case class HadoopFsRelation(
     location: FileIndex,
     partitionSchema: StructType,
-    dataSchema: StructType,
+    dataSchema: StructType, // The top-level columns should not be pruned. Please see SPARK-34897.

Review comment:
       Done.






[GitHub] [spark] AmplabJenkins commented on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-824135235


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/137730/
   




[GitHub] [spark] AmplabJenkins commented on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-820850963


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42024/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-820985909


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42045/
   




[GitHub] [spark] cloud-fan commented on pull request #31993: [SPARK-34897][SQL] Add workaround to error message when OrcUtils.requestedColumnIds fails

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-809945765


   Can we automatically disable nested column pruning on the executor side when we find that the ORC file schema is in the by-position style?
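   
   One way to picture this suggestion is the sketch below. It is only an illustration under stated assumptions: `isByPositionSchema` and `useNestedPruning` are hypothetical helpers, and "by-position style" is taken to mean that every physical field name starts with `_col`.
   
   ```scala
   object ByPositionFallbackSketch {
     // Assumption: a by-position schema uses placeholder names such as _col0, _col1, ...
     def isByPositionSchema(orcFieldNames: Seq[String]): Boolean =
       orcFieldNames.nonEmpty && orcFieldNames.forall(_.startsWith("_col"))

     // Hypothetical decision: keep nested pruning only when fields can be matched by name.
     def useNestedPruning(enabled: Boolean, orcFieldNames: Seq[String]): Boolean =
       enabled && !isByPositionSchema(orcFieldNames)

     def main(args: Array[String]): Unit = {
       println(useNestedPruning(enabled = true, Seq("_col0", "_col1", "_col2"))) // false
       println(useNestedPruning(enabled = true, Seq("id", "name")))              // true
     }
   }
   ```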




[GitHub] [spark] wangyum commented on a change in pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning

Posted by GitBox <gi...@apache.org>.
wangyum commented on a change in pull request #31993:
URL: https://github.com/apache/spark/pull/31993#discussion_r613916765



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SchemaPruning.scala
##########
@@ -32,11 +33,10 @@ object SchemaPruning {
     // in the resulting schema may differ from their ordering in the logical relation's
     // original schema
     val mergedSchema = requestedRootFields
-      .map { case root: RootField => StructType(Array(root.field)) }
+      .map { root: RootField => StructType(Array(root.field)) }
       .reduceLeft(_ merge _)
-    val dataSchemaFieldNames = dataSchema.fieldNames.toSet
     val mergedDataSchema =
-      StructType(mergedSchema.filter(f => dataSchemaFieldNames.contains(f.name)))
+      StructType(dataSchema.map(s => mergedSchema.find(_.name.equals(s.name)).getOrElse(s)))

Review comment:
       It removed the  ``` `_col1` STRING ```. Nested column pruning still works. Please see these test suites: `ParquetSchemaPruningSuite`, `OrcV1SchemaPruningSuite` and `OrcV2SchemaPruningSuite`.
   
   That is why it's only a problem in nested column pruning but not in column pruning.






[GitHub] [spark] viirya commented on a change in pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #31993:
URL: https://github.com/apache/spark/pull/31993#discussion_r614534596



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SchemaPruning.scala
##########
@@ -21,9 +21,12 @@ import org.apache.spark.sql.types._
 
 object SchemaPruning {
   /**
-   * Filters the schema by the requested fields. For example, if the schema is struct<a:int, b:int>,
-   * and given requested field are "a", the field "b" is pruned in the returned schema.
-   * Note that schema field ordering at original schema is still preserved in pruned schema.
+   * Prunes the nested schema by the requested fields. For example, if the schema is
+   * struct<a:int, b:int>, and given requested field are "a", the field "b" is pruned in the
+   * returned schema.
+   * Note that:
+   *   1. The schema field ordering at original schema is still preserved in pruned schema.
+   *   2. The top-level fields are not pruned here.

Review comment:
       I think v2 column pruning should be handled by `V2ScanRelationPushDown`.






[GitHub] [spark] cloud-fan commented on a change in pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31993:
URL: https://github.com/apache/spark/pull/31993#discussion_r613385216



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SchemaPruning.scala
##########
@@ -21,8 +21,9 @@ import org.apache.spark.sql.types._
 
 object SchemaPruning {
   /**
-   * Filters the schema by the requested fields. For example, if the schema is struct<a:int, b:int>,
-   * and given requested field are "a", the field "b" is pruned in the returned schema.

Review comment:
       The old comment looks correct.






[GitHub] [spark] AmplabJenkins commented on pull request #31993: [SPARK-34897][SQL] Add workaround to error message when OrcUtils.requestedColumnIds fails

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-809528555


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136644/
   




[GitHub] [spark] SparkQA commented on pull request #31993: [SPARK-34897][SQL] Add workaround to error message when OrcUtils.requestedColumnIds fails

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-809402566


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41226/
   




[GitHub] [spark] SparkQA commented on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-824116359


   **[Test build #137730 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137730/testReport)** for PR 31993 at commit [`4d0b510`](https://github.com/apache/spark/commit/4d0b510e10da4fe1fca07583ddecc5f5fe6d6392).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] SparkQA commented on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-821095667


   **[Test build #137470 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137470/testReport)** for PR 31993 at commit [`c04864a`](https://github.com/apache/spark/commit/c04864a92c7db8c5845e2f0ac13b77718f3e3e85).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] SparkQA commented on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-823944803


   **[Test build #137730 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137730/testReport)** for PR 31993 at commit [`4d0b510`](https://github.com/apache/spark/commit/4d0b510e10da4fe1fca07583ddecc5f5fe6d6392).




[GitHub] [spark] sandeep-katta commented on a change in pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning

Posted by GitBox <gi...@apache.org>.
sandeep-katta commented on a change in pull request #31993:
URL: https://github.com/apache/spark/pull/31993#discussion_r617398009



##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/SchemaPruningSuite.scala
##########
@@ -60,44 +57,76 @@ class SchemaPruningSuite extends SparkFunSuite with SQLHelper {
       arrayOfStruct :: StructField("b", structOfStruct) :: StructField("c", IntegerType) ::
         mapOfStruct :: Nil)
 
-    testPrunedSchema(complexStruct, StructField("a", ArrayType(StructType.fromDDL("b int"))),
-      StructField("b", StructType.fromDDL("a int")))
     testPrunedSchema(complexStruct,
-      StructField("a", ArrayType(StructType.fromDDL("b int, c string"))),
-      StructField("b", StructType.fromDDL("b int")))
+      Seq(StructField("a", ArrayType(StructType.fromDDL("b int"))),
+        StructField("b", StructType.fromDDL("a int"))),
+      StructType(
+        StructField("a", ArrayType(StructType.fromDDL("b int"))) ::
+          StructField("b", StructType.fromDDL("a int")) ::
+          StructField("c", IntegerType) ::
+          mapOfStruct :: Nil))
+    testPrunedSchema(complexStruct,
+      Seq(StructField("a", ArrayType(StructType.fromDDL("b int, c string"))),
+        StructField("b", StructType.fromDDL("b int"))),
+      StructType(
+        StructField("a", ArrayType(StructType.fromDDL("b int, c string"))) ::
+          StructField("b", StructType.fromDDL("b int")) ::
+          StructField("c", IntegerType) ::
+          mapOfStruct :: Nil))
 
     val selectFieldInMap = StructField("d", MapType(StructType.fromDDL("a int, b int"),
       StructType.fromDDL("e int, f string")))
-    testPrunedSchema(complexStruct, StructField("c", IntegerType), selectFieldInMap)
+    testPrunedSchema(complexStruct,
+      Seq(StructField("c", IntegerType), selectFieldInMap),
+      StructType(
+        arrayOfStruct ::
+          StructField("b", structOfStruct) ::
+          StructField("c", IntegerType) ::
+          selectFieldInMap :: Nil))
   }
 
   test("SPARK-35096: test case insensitivity of pruned schema") {
-    Seq(true, false).foreach(isCaseSensitive => {
+    val upperCaseSchema = StructType.fromDDL("A struct<A:int, B:int>, B int")
+    val lowerCaseSchema = StructType.fromDDL("a struct<a:int, b:int>, b int")
+    val upperCaseRequestedFields = Seq(StructField("A", StructType.fromDDL("A int")))
+    val lowerCaseRequestedFields = Seq(StructField("a", StructType.fromDDL("a int")))
+
+    Seq(true, false).foreach { isCaseSensitive =>
       withSQLConf(CASE_SENSITIVE.key -> isCaseSensitive.toString) {
         if (isCaseSensitive) {
-          // Schema is case-sensitive
-          val requestedFields = getRootFields(StructField("id", IntegerType))
-          val prunedSchema = SchemaPruning.pruneDataSchema(
-            StructType.fromDDL("ID int, name String"), requestedFields)
-          assert(prunedSchema == StructType(Seq.empty))
-          // Root fields are case-sensitive
-          val rootFieldsSchema = SchemaPruning.pruneDataSchema(
-            StructType.fromDDL("id int, name String"),
-            getRootFields(StructField("ID", IntegerType)))
-          assert(rootFieldsSchema == StructType(StructType(Seq.empty)))
+          testPrunedSchema(
+            upperCaseSchema,
+            upperCaseRequestedFields,
+            StructType.fromDDL("A struct<A:int>, B int"))
+          testPrunedSchema(
+            upperCaseSchema,
+            lowerCaseRequestedFields,
+            upperCaseSchema)
+
+          testPrunedSchema(
+            lowerCaseSchema,
+            upperCaseRequestedFields,
+            lowerCaseSchema)
+          testPrunedSchema(
+            lowerCaseSchema,
+            lowerCaseRequestedFields,
+            StructType.fromDDL("a struct<a:int>, b int"))
         } else {
-          // Schema is case-insensitive
-          val prunedSchema = SchemaPruning.pruneDataSchema(
-            StructType.fromDDL("ID int, name String"),
-            getRootFields(StructField("id", IntegerType)))
-          assert(prunedSchema == StructType(StructField("ID", IntegerType) :: Nil))
-          // Root fields are case-insensitive
-          val rootFieldsSchema = SchemaPruning.pruneDataSchema(
-            StructType.fromDDL("id int, name String"),
-            getRootFields(StructField("ID", IntegerType)))
-          assert(rootFieldsSchema == StructType(StructField("id", IntegerType) :: Nil))
+          Seq(upperCaseRequestedFields, lowerCaseRequestedFields).foreach { requestedFields =>
+            testPrunedSchema(
+              upperCaseSchema,
+              requestedFields,
+              StructType.fromDDL("A struct<A:int>, B int"))
+          }
+
+          Seq(upperCaseRequestedFields, lowerCaseRequestedFields).foreach { requestedFields =>
+            testPrunedSchema(
+              lowerCaseSchema,
+              requestedFields,
+              StructType.fromDDL("a struct<a:int>, b int"))
+          }
         }
       }
-    })
+    }

Review comment:
       Tests LGTM, thanks for adding more scenarios.






[GitHub] [spark] SparkQA commented on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-820982478








[GitHub] [spark] cloud-fan commented on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-824091442


   How far shall we backport? To 3.0?




[GitHub] [spark] wangyum commented on a change in pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning

Posted by GitBox <gi...@apache.org>.
wangyum commented on a change in pull request #31993:
URL: https://github.com/apache/spark/pull/31993#discussion_r617384262



##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/SchemaPruningSuite.scala
##########
@@ -60,44 +57,76 @@ class SchemaPruningSuite extends SparkFunSuite with SQLHelper {
       arrayOfStruct :: StructField("b", structOfStruct) :: StructField("c", IntegerType) ::
         mapOfStruct :: Nil)
 
-    testPrunedSchema(complexStruct, StructField("a", ArrayType(StructType.fromDDL("b int"))),
-      StructField("b", StructType.fromDDL("a int")))
     testPrunedSchema(complexStruct,
-      StructField("a", ArrayType(StructType.fromDDL("b int, c string"))),
-      StructField("b", StructType.fromDDL("b int")))
+      Seq(StructField("a", ArrayType(StructType.fromDDL("b int"))),
+        StructField("b", StructType.fromDDL("a int"))),
+      StructType(
+        StructField("a", ArrayType(StructType.fromDDL("b int"))) ::
+          StructField("b", StructType.fromDDL("a int")) ::
+          StructField("c", IntegerType) ::
+          mapOfStruct :: Nil))
+    testPrunedSchema(complexStruct,
+      Seq(StructField("a", ArrayType(StructType.fromDDL("b int, c string"))),
+        StructField("b", StructType.fromDDL("b int"))),
+      StructType(
+        StructField("a", ArrayType(StructType.fromDDL("b int, c string"))) ::
+          StructField("b", StructType.fromDDL("b int")) ::
+          StructField("c", IntegerType) ::
+          mapOfStruct :: Nil))
 
     val selectFieldInMap = StructField("d", MapType(StructType.fromDDL("a int, b int"),
       StructType.fromDDL("e int, f string")))
-    testPrunedSchema(complexStruct, StructField("c", IntegerType), selectFieldInMap)
+    testPrunedSchema(complexStruct,
+      Seq(StructField("c", IntegerType), selectFieldInMap),
+      StructType(
+        arrayOfStruct ::
+          StructField("b", structOfStruct) ::
+          StructField("c", IntegerType) ::
+          selectFieldInMap :: Nil))
   }
 
   test("SPARK-35096: test case insensitivity of pruned schema") {
-    Seq(true, false).foreach(isCaseSensitive => {
+    val upperCaseSchema = StructType.fromDDL("A struct<A:int, B:int>, B int")
+    val lowerCaseSchema = StructType.fromDDL("a struct<a:int, b:int>, b int")
+    val upperCaseRequestedFields = Seq(StructField("A", StructType.fromDDL("A int")))
+    val lowerCaseRequestedFields = Seq(StructField("a", StructType.fromDDL("a int")))
+
+    Seq(true, false).foreach { isCaseSensitive =>
       withSQLConf(CASE_SENSITIVE.key -> isCaseSensitive.toString) {
         if (isCaseSensitive) {
-          // Schema is case-sensitive
-          val requestedFields = getRootFields(StructField("id", IntegerType))
-          val prunedSchema = SchemaPruning.pruneDataSchema(
-            StructType.fromDDL("ID int, name String"), requestedFields)
-          assert(prunedSchema == StructType(Seq.empty))
-          // Root fields are case-sensitive
-          val rootFieldsSchema = SchemaPruning.pruneDataSchema(
-            StructType.fromDDL("id int, name String"),
-            getRootFields(StructField("ID", IntegerType)))
-          assert(rootFieldsSchema == StructType(StructType(Seq.empty)))
+          testPrunedSchema(
+            upperCaseSchema,
+            upperCaseRequestedFields,
+            StructType.fromDDL("A struct<A:int>, B int"))
+          testPrunedSchema(
+            upperCaseSchema,
+            lowerCaseRequestedFields,
+            upperCaseSchema)
+
+          testPrunedSchema(
+            lowerCaseSchema,
+            upperCaseRequestedFields,
+            lowerCaseSchema)
+          testPrunedSchema(
+            lowerCaseSchema,
+            lowerCaseRequestedFields,
+            StructType.fromDDL("a struct<a:int>, b int"))
         } else {
-          // Schema is case-insensitive
-          val prunedSchema = SchemaPruning.pruneDataSchema(
-            StructType.fromDDL("ID int, name String"),
-            getRootFields(StructField("id", IntegerType)))
-          assert(prunedSchema == StructType(StructField("ID", IntegerType) :: Nil))
-          // Root fields are case-insensitive
-          val rootFieldsSchema = SchemaPruning.pruneDataSchema(
-            StructType.fromDDL("id int, name String"),
-            getRootFields(StructField("ID", IntegerType)))
-          assert(rootFieldsSchema == StructType(StructField("id", IntegerType) :: Nil))
+          Seq(upperCaseRequestedFields, lowerCaseRequestedFields).foreach { requestedFields =>
+            testPrunedSchema(
+              upperCaseSchema,
+              requestedFields,
+              StructType.fromDDL("A struct<A:int>, B int"))
+          }
+
+          Seq(upperCaseRequestedFields, lowerCaseRequestedFields).foreach { requestedFields =>
+            testPrunedSchema(
+              lowerCaseSchema,
+              requestedFields,
+              StructType.fromDDL("a struct<a:int>, b int"))
+          }
         }
       }
-    })
+    }

Review comment:
       cc @sandeep-katta






[GitHub] [spark] AmplabJenkins commented on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-823975173


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42257/
   




[GitHub] [spark] cloud-fan commented on a change in pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31993:
URL: https://github.com/apache/spark/pull/31993#discussion_r615556485



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/HadoopFsRelation.scala
##########
@@ -40,7 +40,7 @@ import org.apache.spark.sql.types.{StructField, StructType}
 case class HadoopFsRelation(
     location: FileIndex,
     partitionSchema: StructType,
-    dataSchema: StructType,
+    dataSchema: StructType, // The top-level columns should not be pruned. Please see SPARK-34897.

Review comment:
       Can we add more details? For example:
   ```
   // The top-level columns in `dataSchema` should match the actual physical file schema, otherwise
   // the ORC data source may not work with the by-ordinal mode.
   ```
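
   To illustrate why by-ordinal matching needs every top-level column, here is a minimal sketch, assuming plain name lists instead of Spark's `StructType`. `ByOrdinalSketch` and its `requestedColumnIds` are hypothetical stand-ins that only approximate the check in `OrcUtils.requestedColumnIds`; they are not the actual implementation.

   ```scala
   import scala.util.Try

   // Hypothetical, simplified model of by-ordinal column resolution.
   object ByOrdinalSketch {
     // For each required column, return its position in the physical file,
     // or -1 if the file has no column at that position.
     def requestedColumnIds(
         orcFieldNames: Seq[String],
         dataSchemaNames: Seq[String],
         requiredNames: Seq[String]): Seq[Int] = {
       // Positions are only meaningful if the data schema covers every physical column.
       require(orcFieldNames.length <= dataSchemaNames.length,
         "data schema has fewer fields than the physical ORC schema")
       requiredNames.map { name =>
         val index = dataSchemaNames.indexOf(name)
         if (index >= 0 && index < orcFieldNames.length) index else -1
       }
     }

     def main(args: Array[String]): Unit = {
       val physical = Seq("_col0", "_col1", "_col2")
       val required = Seq("_col0", "_col2")
       // Full top-level schema: ordinals line up with the file.
       println(requestedColumnIds(physical, physical, required)) // List(0, 2)
       // Top-level column `_col1` pruned from the data schema: the check fails.
       println(Try(requestedColumnIds(physical, required, required)).isFailure) // true
     }
   }
   ```

   With `_col1` dropped from the data schema, `_col2` would sit at ordinal 1 instead of 2, so there is no safe way to map it back to the file.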






[GitHub] [spark] AmplabJenkins commented on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-822487303


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/137582/
   




[GitHub] [spark] AmplabJenkins commented on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-823800994


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/137707/
   




[GitHub] [spark] wangyum commented on a change in pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning

Posted by GitBox <gi...@apache.org>.
wangyum commented on a change in pull request #31993:
URL: https://github.com/apache/spark/pull/31993#discussion_r613652117



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SchemaPruning.scala
##########
@@ -32,11 +33,10 @@ object SchemaPruning {
     // in the resulting schema may differ from their ordering in the logical relation's
     // original schema
     val mergedSchema = requestedRootFields
-      .map { case root: RootField => StructType(Array(root.field)) }
+      .map { root: RootField => StructType(Array(root.field)) }
       .reduceLeft(_ merge _)
-    val dataSchemaFieldNames = dataSchema.fieldNames.toSet
     val mergedDataSchema =
-      StructType(mergedSchema.filter(f => dataSchemaFieldNames.contains(f.name)))
+      StructType(dataSchema.map(s => mergedSchema.find(_.name.equals(s.name)).getOrElse(s)))

Review comment:
       ```scala
   spark.sql(
     """
       |CREATE TABLE t1 (
       |  _col0 INT,
       |  _col1 STRING,
       |  _col2 STRUCT<c1: STRING, c2: STRING, c3: STRING, c4: BIGINT>)
       |USING ORC
       |""".stripMargin)
   
   
   spark.sql("SELECT _col0, _col2.c1 FROM t1").show
   ```
   Before this PR, `pruneDataSchema` returns:
   ```
   `_col0` INT,`_col2` STRUCT<`c1`: STRING>
   ```
   
   After this PR, `pruneDataSchema` returns:
   ```
   `_col0` INT,`_col1` STRING,`_col2` STRUCT<`c1`: STRING>
   ```
   
   It only prunes nested schemas; top-level columns are kept.
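
   The difference between the two versions can be sketched as follows. This is a minimal illustration, assuming a simplified `Field` case class in place of Spark's `StructField`; `PruneSketch`, `pruneOld`, and `pruneNew` are hypothetical names that mirror the removed and added lines of the diff above.

   ```scala
   // Hypothetical stand-in for Spark's StructField; the type is kept as a plain string.
   final case class Field(name: String, dataType: String)

   object PruneSketch {
     // Previous behavior: keep only the requested fields that exist in the data schema,
     // so unrequested top-level columns disappear.
     def pruneOld(dataSchema: Seq[Field], merged: Seq[Field]): Seq[Field] = {
       val names = dataSchema.map(_.name).toSet
       merged.filter(f => names.contains(f.name))
     }

     // New behavior: walk the data schema in order and swap in the pruned
     // version of a field only when it was requested.
     def pruneNew(dataSchema: Seq[Field], merged: Seq[Field]): Seq[Field] =
       dataSchema.map(s => merged.find(_.name == s.name).getOrElse(s))

     def main(args: Array[String]): Unit = {
       val dataSchema = Seq(
         Field("_col0", "INT"),
         Field("_col1", "STRING"),
         Field("_col2", "STRUCT<c1: STRING, c2: STRING, c3: STRING, c4: BIGINT>"))
       val requested = Seq(Field("_col0", "INT"), Field("_col2", "STRUCT<c1: STRING>"))
       println(pruneOld(dataSchema, requested).map(_.name).mkString(",")) // _col0,_col2
       println(pruneNew(dataSchema, requested).map(_.name).mkString(",")) // _col0,_col1,_col2
     }
   }
   ```

   In the new version `_col1` keeps its position, so the ordinals of the top-level columns still line up with the physical file.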






[GitHub] [spark] SparkQA removed a comment on pull request #31993: [SPARK-34897][SQL] Add workaround to error message when OrcUtils.requestedColumnIds fails

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-809231987


   **[Test build #136644 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136644/testReport)** for PR 31993 at commit [`2a3f136`](https://github.com/apache/spark/commit/2a3f1367d6d90f34f2373c378910d4366052726a).




[GitHub] [spark] SparkQA commented on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-822221636


   **[Test build #137582 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137582/testReport)** for PR 31993 at commit [`c04864a`](https://github.com/apache/spark/commit/c04864a92c7db8c5845e2f0ac13b77718f3e3e85).




[GitHub] [spark] SparkQA commented on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-820953513


   **[Test build #137470 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137470/testReport)** for PR 31993 at commit [`c04864a`](https://github.com/apache/spark/commit/c04864a92c7db8c5845e2f0ac13b77718f3e3e85).




[GitHub] [spark] wangyum commented on a change in pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning

Posted by GitBox <gi...@apache.org>.
wangyum commented on a change in pull request #31993:
URL: https://github.com/apache/spark/pull/31993#discussion_r614521422



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SchemaPruning.scala
##########
@@ -21,9 +21,12 @@ import org.apache.spark.sql.types._
 
 object SchemaPruning {
   /**
-   * Filters the schema by the requested fields. For example, if the schema is struct<a:int, b:int>,
-   * and given requested field are "a", the field "b" is pruned in the returned schema.
-   * Note that schema field ordering at original schema is still preserved in pruned schema.
+   * Prunes the nested schema by the requested fields. For example, if the schema is
+   * struct<a:int, b:int>, and given requested field are "a", the field "b" is pruned in the
+   * returned schema.
+   * Note that:
+   *   1. The schema field ordering at original schema is still preserved in pruned schema.
+   *   2. The top-level fields are not pruned here.

Review comment:
       No. Top-level columns are pruned by:
   https://github.com/apache/spark/blob/7a5647a93aaea9d1d78d9262e24fc8c010db04d0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala#L208-L213






[GitHub] [spark] AmplabJenkins removed a comment on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-820850963








[GitHub] [spark] AmplabJenkins removed a comment on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-823975173


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42257/
   




[GitHub] [spark] AmplabJenkins commented on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-822197432


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42128/
   




[GitHub] [spark] AmplabJenkins commented on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-817284156


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/137176/
   




[GitHub] [spark] cloud-fan commented on pull request #31993: [SPARK-34897][SQL] Add workaround to error message when OrcUtils.requestedColumnIds fails

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-809253890


   Isn't it a bug? cc @viirya




[GitHub] [spark] cloud-fan commented on a change in pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31993:
URL: https://github.com/apache/spark/pull/31993#discussion_r615553908



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SchemaPruning.scala
##########
@@ -21,9 +21,12 @@ import org.apache.spark.sql.types._
 
 object SchemaPruning {
   /**
-   * Filters the schema by the requested fields. For example, if the schema is struct<a:int, b:int>,
-   * and given requested field are "a", the field "b" is pruned in the returned schema.
-   * Note that schema field ordering at original schema is still preserved in pruned schema.
+   * Prunes the nested schema by the requested fields. For example, if the schema is
+   * struct<a:int, b:int>, and given requested field are "a", the field "b" is pruned in the

Review comment:
       The example is incorrect now: this method no longer prunes top-level fields.






[GitHub] [spark] SparkQA commented on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-823701348


   **[Test build #137707 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137707/testReport)** for PR 31993 at commit [`6112c9d`](https://github.com/apache/spark/commit/6112c9dd357da127646e164cdf6797f5801c1049).




[GitHub] [spark] cloud-fan commented on a change in pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31993:
URL: https://github.com/apache/spark/pull/31993#discussion_r613895096



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SchemaPruning.scala
##########
@@ -32,11 +33,10 @@ object SchemaPruning {
     // in the resulting schema may differ from their ordering in the logical relation's
     // original schema
     val mergedSchema = requestedRootFields
-      .map { case root: RootField => StructType(Array(root.field)) }
+      .map { root: RootField => StructType(Array(root.field)) }
       .reduceLeft(_ merge _)
-    val dataSchemaFieldNames = dataSchema.fieldNames.toSet
     val mergedDataSchema =
-      StructType(mergedSchema.filter(f => dataSchemaFieldNames.contains(f.name)))
+      StructType(dataSchema.map(s => mergedSchema.find(_.name.equals(s.name)).getOrElse(s)))

Review comment:
       What's wrong with the previous behavior? We can't sacrifice performance in all cases just because the ORC by-ordinal case is problematic.






[GitHub] [spark] AmplabJenkins commented on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-822349306


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/137558/
   




[GitHub] [spark] wangyum commented on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning

Posted by GitBox <gi...@apache.org>.
wangyum commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-824113226


   Yes. to 3.0.




[GitHub] [spark] viirya commented on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning

Posted by GitBox <gi...@apache.org>.
viirya commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-824230580


   @wangyum There are conflicts in 3.1/3.0. Can you create backport PRs? Thanks.




[GitHub] [spark] SparkQA commented on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-820849046


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42024/
   




[GitHub] [spark] cloud-fan commented on a change in pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31993:
URL: https://github.com/apache/spark/pull/31993#discussion_r613943891



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SchemaPruning.scala
##########
@@ -32,11 +33,10 @@ object SchemaPruning {
     // in the resulting schema may differ from their ordering in the logical relation's
     // original schema
     val mergedSchema = requestedRootFields
-      .map { case root: RootField => StructType(Array(root.field)) }
+      .map { root: RootField => StructType(Array(root.field)) }
       .reduceLeft(_ merge _)
-    val dataSchemaFieldNames = dataSchema.fieldNames.toSet
     val mergedDataSchema =
-      StructType(mergedSchema.filter(f => dataSchemaFieldNames.contains(f.name)))
+      StructType(dataSchema.map(s => mergedSchema.find(_.name.equals(s.name)).getOrElse(s)))

Review comment:
       is it because column pruning will be done by other rules so we don't need to consider it here?






[GitHub] [spark] AmplabJenkins removed a comment on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-822257596


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42135/
   




[GitHub] [spark] SparkQA removed a comment on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-820833460


   **[Test build #137449 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137449/testReport)** for PR 31993 at commit [`a966bac`](https://github.com/apache/spark/commit/a966bac379ff6fed20c57cf3c748cada9da28b4c).




[GitHub] [spark] cloud-fan commented on a change in pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31993:
URL: https://github.com/apache/spark/pull/31993#discussion_r615553908



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SchemaPruning.scala
##########
@@ -21,9 +21,12 @@ import org.apache.spark.sql.types._
 
 object SchemaPruning {
   /**
-   * Filters the schema by the requested fields. For example, if the schema is struct<a:int, b:int>,
-   * and given requested field are "a", the field "b" is pruned in the returned schema.
-   * Note that schema field ordering at original schema is still preserved in pruned schema.
+   * Prunes the nested schema by the requested fields. For example, if the schema is
+   * struct<a:int, b:int>, and given requested field are "a", the field "b" is pruned in the

Review comment:
       The example is incorrect now. This method doesn't prune root fields.






[GitHub] [spark] cloud-fan commented on a change in pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31993:
URL: https://github.com/apache/spark/pull/31993#discussion_r614090769



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SchemaPruning.scala
##########
@@ -32,11 +33,10 @@ object SchemaPruning {
     // in the resulting schema may differ from their ordering in the logical relation's
     // original schema
     val mergedSchema = requestedRootFields
-      .map { case root: RootField => StructType(Array(root.field)) }
+      .map { root: RootField => StructType(Array(root.field)) }
       .reduceLeft(_ merge _)
-    val dataSchemaFieldNames = dataSchema.fieldNames.toSet
     val mergedDataSchema =
-      StructType(mergedSchema.filter(f => dataSchemaFieldNames.contains(f.name)))
+      StructType(dataSchema.map(s => mergedSchema.find(_.name.equals(s.name)).getOrElse(s)))

Review comment:
       Can you provide the full code workflow to explain why this causes issues in ORC? I'm still not very sure.






[GitHub] [spark] viirya edited a comment on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning

Posted by GitBox <gi...@apache.org>.
viirya edited a comment on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-824229170


   Thanks! Merging to master.




[GitHub] [spark] cloud-fan commented on a change in pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31993:
URL: https://github.com/apache/spark/pull/31993#discussion_r617276236



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SchemaPruning.scala
##########
@@ -21,9 +21,12 @@ import org.apache.spark.sql.types._
 
 object SchemaPruning {
   /**
-   * Filters the schema by the requested fields. For example, if the schema is struct<a:int, b:int>,
-   * and given requested field are "a", the field "b" is pruned in the returned schema.
-   * Note that schema field ordering at original schema is still preserved in pruned schema.
+   * Prunes the nested schema by the requested fields. For example, if the schema is:
+   * `id int, struct<a:int, b:int>`, and given requested field are "a", the field "b" is pruned

Review comment:
       `and given requested field are "a"` -> `and given requested field "s.a"`






[GitHub] [spark] wangyum commented on a change in pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning

Posted by GitBox <gi...@apache.org>.
wangyum commented on a change in pull request #31993:
URL: https://github.com/apache/spark/pull/31993#discussion_r614080724



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SchemaPruning.scala
##########
@@ -32,11 +33,10 @@ object SchemaPruning {
     // in the resulting schema may differ from their ordering in the logical relation's
     // original schema
     val mergedSchema = requestedRootFields
-      .map { case root: RootField => StructType(Array(root.field)) }
+      .map { root: RootField => StructType(Array(root.field)) }
       .reduceLeft(_ merge _)
-    val dataSchemaFieldNames = dataSchema.fieldNames.toSet
     val mergedDataSchema =
-      StructType(mergedSchema.filter(f => dataSchemaFieldNames.contains(f.name)))
+      StructType(dataSchema.map(s => mergedSchema.find(_.name.equals(s.name)).getOrElse(s)))

Review comment:
       > is it because column pruning will be done by other rules so we don't need to consider it here?
   
   Yes.
   
   https://github.com/apache/spark/blob/7a5647a93aaea9d1d78d9262e24fc8c010db04d0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala#L208-L213
   
   https://github.com/apache/spark/blob/e64eb75aede71a5403a4d4436e63b1fcfdeca14d/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/PushDownUtils.scala#L96-L97






[GitHub] [spark] viirya commented on a change in pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #31993:
URL: https://github.com/apache/spark/pull/31993#discussion_r614491534



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SchemaPruning.scala
##########
@@ -32,11 +33,10 @@ object SchemaPruning {
     // in the resulting schema may differ from their ordering in the logical relation's
     // original schema
     val mergedSchema = requestedRootFields
-      .map { case root: RootField => StructType(Array(root.field)) }
+      .map { root: RootField => StructType(Array(root.field)) }
       .reduceLeft(_ merge _)
-    val dataSchemaFieldNames = dataSchema.fieldNames.toSet
     val mergedDataSchema =
-      StructType(mergedSchema.filter(f => dataSchemaFieldNames.contains(f.name)))
+      StructType(dataSchema.map(s => mergedSchema.find(_.name.equals(s.name)).getOrElse(s)))

Review comment:
       > > is it because column pruning will be done by other rules so we don't need to consider it here?
   > 
   > Yes.
   > 
   
   Hmm? In `PushDownUtils.pruneColumns`, if nested column pruning is enabled, Spark only runs the nested column pruning path, not the quoted L96-97.
   
   






[GitHub] [spark] AmplabJenkins removed a comment on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-824135235


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/137730/
   




[GitHub] [spark] SparkQA commented on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-823972587








[GitHub] [spark] wangyum commented on pull request #31993: [SPARK-34897][SQL] Add workaround to error message when OrcUtils.requestedColumnIds fails

Posted by GitBox <gi...@apache.org>.
wangyum commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-809944891


   It is a Hive ORC table in our production environment.




[GitHub] [spark] wangyum commented on a change in pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning

Posted by GitBox <gi...@apache.org>.
wangyum commented on a change in pull request #31993:
URL: https://github.com/apache/spark/pull/31993#discussion_r614484444



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SchemaPruning.scala
##########
@@ -32,11 +33,10 @@ object SchemaPruning {
     // in the resulting schema may differ from their ordering in the logical relation's
     // original schema
     val mergedSchema = requestedRootFields
-      .map { case root: RootField => StructType(Array(root.field)) }
+      .map { root: RootField => StructType(Array(root.field)) }
       .reduceLeft(_ merge _)
-    val dataSchemaFieldNames = dataSchema.fieldNames.toSet
     val mergedDataSchema =
-      StructType(mergedSchema.filter(f => dataSchemaFieldNames.contains(f.name)))
+      StructType(dataSchema.map(s => mergedSchema.find(_.name.equals(s.name)).getOrElse(s)))

Review comment:
       1. Prune the nested schema:
   https://github.com/apache/spark/blob/0f2c0b53e8fb18c86c67b5dd679c006db93f94a5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SchemaPruning.scala#L28-L43
   
   2. Use this pruned nested schema to build the `dataSchema` in `Relation`:
   https://github.com/apache/spark/blob/25e7d1ceee8c9f4ecb4ab796f51e9bcbc0500fae/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/SchemaPruning.scala#L81-L86
   
   3. `readDataColumns` performs the complete column pruning:
   https://github.com/apache/spark/blob/7a5647a93aaea9d1d78d9262e24fc8c010db04d0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala#L208-L226
   
   4. `dataSchema` comes from `relation.dataSchema`, which is the pruned nested schema:
   https://github.com/apache/spark/blob/935aa8c8db6824648483f26c2889c33030985259/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L398-L407
   
   5. `OrcUtils.requestedColumnIds` uses this pruned nested schema:
   https://github.com/apache/spark/blob/1fc66f68703e0b14e03fe0ed5ca93e9af20f41a9/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala#L193-L197
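
The last step of the workflow above can be illustrated with a minimal sketch. It is not Spark's actual `OrcUtils.requestedColumnIds`: schemas are modelled as plain lists of field names, the object name `OrcByOrdinalSketch` is hypothetical, and it assumes the reader matches columns by position because the physical ORC field names (`_col0`, `_col1`, ...) carry no real names. Only the assertion text is taken from the error message reported in this PR.

```scala
object OrcByOrdinalSketch {
  // Simplified, hypothetical model of by-ordinal matching: for each required
  // column, return its position in the physical ORC schema, or -1 if the
  // column is unknown or the file does not contain it.
  def requestedColumnIds(
      orcFieldNames: Seq[String],
      dataSchema: Seq[String],
      requiredSchema: Seq[String]): Seq[Int] = {
    // Positions can only be lined up if the data schema covers every
    // physical column; a data schema with dropped root fields breaks this.
    assert(orcFieldNames.length <= dataSchema.length,
      s"The given data schema ${dataSchema.mkString(",")} has less fields than the " +
        "actual ORC physical schema, no idea which columns were dropped, fail to read.")
    requiredSchema.map { name =>
      val index = dataSchema.indexOf(name)
      if (index < orcFieldNames.length) index else -1
    }
  }
}
```

Under these assumptions, a data schema that still lists all three root columns lets `_col0` and `_col2` resolve to positions 0 and 2, whereas a data schema from which `_col1` was dropped has fewer fields than the physical schema and trips the assertion.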






[GitHub] [spark] AmplabJenkins removed a comment on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-821096738


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/137470/
   




[GitHub] [spark] SparkQA commented on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-820908865


   **[Test build #137449 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137449/testReport)** for PR 31993 at commit [`a966bac`](https://github.com/apache/spark/commit/a966bac379ff6fed20c57cf3c748cada9da28b4c).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] SparkQA commented on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-817280194


   **[Test build #137176 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137176/testReport)** for PR 31993 at commit [`e64eb75`](https://github.com/apache/spark/commit/e64eb75aede71a5403a4d4436e63b1fcfdeca14d).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] AmplabJenkins removed a comment on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-822349306


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/137558/
   




[GitHub] [spark] SparkQA commented on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-817256491


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41754/
   




[GitHub] [spark] wangyum commented on a change in pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning

Posted by GitBox <gi...@apache.org>.
wangyum commented on a change in pull request #31993:
URL: https://github.com/apache/spark/pull/31993#discussion_r613652117



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SchemaPruning.scala
##########
@@ -32,11 +33,10 @@ object SchemaPruning {
     // in the resulting schema may differ from their ordering in the logical relation's
     // original schema
     val mergedSchema = requestedRootFields
-      .map { case root: RootField => StructType(Array(root.field)) }
+      .map { root: RootField => StructType(Array(root.field)) }
       .reduceLeft(_ merge _)
-    val dataSchemaFieldNames = dataSchema.fieldNames.toSet
     val mergedDataSchema =
-      StructType(mergedSchema.filter(f => dataSchemaFieldNames.contains(f.name)))
+      StructType(dataSchema.map(s => mergedSchema.find(_.name.equals(s.name)).getOrElse(s)))

Review comment:
       ```scala
   spark.sql(
     """
       |CREATE TABLE t1 (
       |  _col0 INT,
       |  _col1 STRING,
       |  _col2 STRUCT<c1: STRING, c2: STRING, c3: STRING, c4: BIGINT>)
       |USING ORC
       |""".stripMargin)
   
   
   spark.sql("SELECT _col0, _col2.c1 FROM t1").show
   ```
   
   The original schema is:
   ```
   `_col0` INT,`_col1` STRING,`_col2` STRUCT<`c1`: STRING, `c2`: STRING, `c3`: STRING, `c4`: BIGINT> 
   ```
   
   Before this PR, `pruneDataSchema` returns:
   ```
   `_col0` INT,`_col2` STRUCT<`c1`: STRING>
   ```
   
   After this PR, `pruneDataSchema` returns:
   ```
   `_col0` INT,`_col1` STRING,`_col2` STRUCT<`c1`: STRING>
   ```
   
   It only prunes nested schemas; root fields such as `_col1` are kept.
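
The before/after behaviour can be reproduced with a small sketch that mirrors the two expressions in the diff. It does not use Spark's `StructType`: `Field` and `PruneDataSchemaSketch` are hypothetical stand-ins, and data types are kept as plain strings purely for illustration.

```scala
object PruneDataSchemaSketch {
  // Hypothetical stand-in for a schema field: a name plus its type as text.
  final case class Field(name: String, dataType: String)

  // Before the PR: keep only the requested (merged) fields that exist in the
  // data schema, so unrequested root fields disappear.
  def pruneBefore(dataSchema: Seq[Field], merged: Seq[Field]): Seq[Field] = {
    val names = dataSchema.map(_.name).toSet
    merged.filter(f => names.contains(f.name))
  }

  // After the PR: walk the data schema in order, swap in the pruned version
  // of a field when one was requested, and keep the field unchanged otherwise.
  def pruneAfter(dataSchema: Seq[Field], merged: Seq[Field]): Seq[Field] =
    dataSchema.map(s => merged.find(_.name == s.name).getOrElse(s))

  val dataSchema: Seq[Field] = Seq(
    Field("_col0", "int"),
    Field("_col1", "string"),
    Field("_col2", "struct<c1:string,c2:string,c3:string,c4:bigint>"))

  // Fields requested by `SELECT _col0, _col2.c1 FROM t1`.
  val merged: Seq[Field] = Seq(
    Field("_col0", "int"),
    Field("_col2", "struct<c1:string>"))
}
```

With these inputs, `pruneBefore` yields `_col0, _col2` while `pruneAfter` yields `_col0, _col1, _col2` with `_col2` narrowed to `struct<c1:string>`, so the root field count and order still match the original table.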






[GitHub] [spark] SparkQA removed a comment on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-820953513


   **[Test build #137470 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137470/testReport)** for PR 31993 at commit [`c04864a`](https://github.com/apache/spark/commit/c04864a92c7db8c5845e2f0ac13b77718f3e3e85).




[GitHub] [spark] cloud-fan commented on a change in pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31993:
URL: https://github.com/apache/spark/pull/31993#discussion_r617276637



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SchemaPruning.scala
##########
@@ -21,9 +21,12 @@ import org.apache.spark.sql.types._
 
 object SchemaPruning {
   /**
-   * Filters the schema by the requested fields. For example, if the schema is struct<a:int, b:int>,
-   * and given requested field are "a", the field "b" is pruned in the returned schema.
-   * Note that schema field ordering at original schema is still preserved in pruned schema.
+   * Prunes the nested schema by the requested fields. For example, if the schema is:
+   * `id int, struct<a:int, b:int>`, and given requested field are "a", the field "b" is pruned
+   * in the returned schema: `id int, struct<a:int>`.

Review comment:
       ditto, `id int, s struct<a:int>`






[GitHub] [spark] SparkQA commented on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-822165657


   **[Test build #137558 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137558/testReport)** for PR 31993 at commit [`c04864a`](https://github.com/apache/spark/commit/c04864a92c7db8c5845e2f0ac13b77718f3e3e85).




[GitHub] [spark] AmplabJenkins removed a comment on pull request #31993: [SPARK-34897][SQL] Add workaround to error message when OrcUtils.requestedColumnIds fails

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-809430329


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41226/
   




[GitHub] [spark] viirya commented on a change in pull request #31993: [SPARK-34897][SQL] Add workaround to error message when OrcUtils.requestedColumnIds fails

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #31993:
URL: https://github.com/apache/spark/pull/31993#discussion_r603811269



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala
##########
@@ -157,7 +158,8 @@ object OrcUtils extends Logging {
         // In these cases we map the physical schema to the data schema by index.
         assert(orcFieldNames.length <= dataSchema.length, "The given data schema " +
           s"${dataSchema.catalogString} has less fields than the actual ORC physical schema, " +
-          "no idea which columns were dropped, fail to read.")
+          "no idea which columns were dropped, fail to read. Try to disable " +
+          s"${NESTED_SCHEMA_PRUNING_ENABLED.key} to workaround this issue.")

Review comment:
       +1






[GitHub] [spark] AmplabJenkins removed a comment on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-822197432


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42128/
   




[GitHub] [spark] SparkQA commented on pull request #31993: [SPARK-34897][SQL] Add workaround to error message when OrcUtils.requestedColumnIds fails

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-809231987


   **[Test build #136644 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136644/testReport)** for PR 31993 at commit [`2a3f136`](https://github.com/apache/spark/commit/2a3f1367d6d90f34f2373c378910d4366052726a).




[GitHub] [spark] SparkQA commented on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-822464363


   **[Test build #137582 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137582/testReport)** for PR 31993 at commit [`c04864a`](https://github.com/apache/spark/commit/c04864a92c7db8c5845e2f0ac13b77718f3e3e85).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] dongjoon-hyun commented on a change in pull request #31993: [SPARK-34897][SQL] Add workaround to error message when OrcUtils.requestedColumnIds fails

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #31993:
URL: https://github.com/apache/spark/pull/31993#discussion_r603399343



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala
##########
@@ -157,7 +158,8 @@ object OrcUtils extends Logging {
         // In these cases we map the physical schema to the data schema by index.
         assert(orcFieldNames.length <= dataSchema.length, "The given data schema " +
           s"${dataSchema.catalogString} has less fields than the actual ORC physical schema, " +
-          "no idea which columns were dropped, fail to read.")
+          "no idea which columns were dropped, fail to read. Try to disable " +
+          s"${NESTED_SCHEMA_PRUNING_ENABLED.key} to workaround this issue.")

Review comment:
       Although `spark.sql.optimizer.nestedSchemaPruning.enabled` is true by default, can we omit this configuration hint from the message when `spark.sql.optimizer.nestedSchemaPruning.enabled=false`, @wangyum?
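
A minimal sketch of the suggestion above, assuming a hypothetical helper `failureMessage` that is not part of Spark: the workaround hint is appended only while nested schema pruning is enabled. The object name, the helper, and the `nestedPruningEnabled` parameter are illustrative; only the configuration key and the message text come from the diff.

```scala
object PruningHintSketch {
  // Configuration key quoted in the review comment.
  val PruningKey = "spark.sql.optimizer.nestedSchemaPruning.enabled"

  // Hypothetical helper (an assumption, not Spark code): builds the assertion
  // message and adds the workaround hint only when pruning is currently enabled.
  def failureMessage(schemaString: String, nestedPruningEnabled: Boolean): String = {
    val base = s"The given data schema $schemaString has less fields than the actual ORC " +
      "physical schema, no idea which columns were dropped, fail to read."
    if (nestedPruningEnabled) s"$base Try to disable $PruningKey to workaround this issue."
    else base
  }

  def main(args: Array[String]): Unit = {
    val schema = "struct<_col0:int,_col2:struct<c1:string>>"
    // With pruning enabled, the message ends with the workaround hint.
    println(failureMessage(schema, nestedPruningEnabled = true))
    // With pruning already disabled, the hint would be misleading, so it is left out.
    println(failureMessage(schema, nestedPruningEnabled = false))
  }
}
```

In real code the flag would presumably be read from the active SQL configuration rather than passed in; it is a plain parameter here to keep the sketch self-contained.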






[GitHub] [spark] AmplabJenkins removed a comment on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-823719172


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42235/
   




[GitHub] [spark] AmplabJenkins commented on pull request #31993: [SPARK-34897][SQL] Add workaround to error message when OrcUtils.requestedColumnIds fails

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-809430329


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41226/
   




[GitHub] [spark] viirya commented on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning

Posted by GitBox <gi...@apache.org>.
viirya commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-824229170


   Thanks! Merging to master/3.1/3.0.




[GitHub] [spark] SparkQA commented on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-822347639


   **[Test build #137558 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137558/testReport)** for PR 31993 at commit [`c04864a`](https://github.com/apache/spark/commit/c04864a92c7db8c5845e2f0ac13b77718f3e3e85).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] sandeep-katta0102 commented on a change in pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning

Posted by GitBox <gi...@apache.org>.
sandeep-katta0102 commented on a change in pull request #31993:
URL: https://github.com/apache/spark/pull/31993#discussion_r617395326



##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/SchemaPruningSuite.scala
##########
@@ -60,44 +57,76 @@ class SchemaPruningSuite extends SparkFunSuite with SQLHelper {
       arrayOfStruct :: StructField("b", structOfStruct) :: StructField("c", IntegerType) ::
         mapOfStruct :: Nil)
 
-    testPrunedSchema(complexStruct, StructField("a", ArrayType(StructType.fromDDL("b int"))),
-      StructField("b", StructType.fromDDL("a int")))
     testPrunedSchema(complexStruct,
-      StructField("a", ArrayType(StructType.fromDDL("b int, c string"))),
-      StructField("b", StructType.fromDDL("b int")))
+      Seq(StructField("a", ArrayType(StructType.fromDDL("b int"))),
+        StructField("b", StructType.fromDDL("a int"))),
+      StructType(
+        StructField("a", ArrayType(StructType.fromDDL("b int"))) ::
+          StructField("b", StructType.fromDDL("a int")) ::
+          StructField("c", IntegerType) ::
+          mapOfStruct :: Nil))
+    testPrunedSchema(complexStruct,
+      Seq(StructField("a", ArrayType(StructType.fromDDL("b int, c string"))),
+        StructField("b", StructType.fromDDL("b int"))),
+      StructType(
+        StructField("a", ArrayType(StructType.fromDDL("b int, c string"))) ::
+          StructField("b", StructType.fromDDL("b int")) ::
+          StructField("c", IntegerType) ::
+          mapOfStruct :: Nil))
 
     val selectFieldInMap = StructField("d", MapType(StructType.fromDDL("a int, b int"),
       StructType.fromDDL("e int, f string")))
-    testPrunedSchema(complexStruct, StructField("c", IntegerType), selectFieldInMap)
+    testPrunedSchema(complexStruct,
+      Seq(StructField("c", IntegerType), selectFieldInMap),
+      StructType(
+        arrayOfStruct ::
+          StructField("b", structOfStruct) ::
+          StructField("c", IntegerType) ::
+          selectFieldInMap :: Nil))
   }
 
   test("SPARK-35096: test case insensitivity of pruned schema") {
-    Seq(true, false).foreach(isCaseSensitive => {
+    val upperCaseSchema = StructType.fromDDL("A struct<A:int, B:int>, B int")
+    val lowerCaseSchema = StructType.fromDDL("a struct<a:int, b:int>, b int")
+    val upperCaseRequestedFields = Seq(StructField("A", StructType.fromDDL("A int")))
+    val lowerCaseRequestedFields = Seq(StructField("a", StructType.fromDDL("a int")))
+
+    Seq(true, false).foreach { isCaseSensitive =>
       withSQLConf(CASE_SENSITIVE.key -> isCaseSensitive.toString) {
         if (isCaseSensitive) {
-          // Schema is case-sensitive
-          val requestedFields = getRootFields(StructField("id", IntegerType))
-          val prunedSchema = SchemaPruning.pruneDataSchema(
-            StructType.fromDDL("ID int, name String"), requestedFields)
-          assert(prunedSchema == StructType(Seq.empty))
-          // Root fields are case-sensitive
-          val rootFieldsSchema = SchemaPruning.pruneDataSchema(
-            StructType.fromDDL("id int, name String"),
-            getRootFields(StructField("ID", IntegerType)))
-          assert(rootFieldsSchema == StructType(StructType(Seq.empty)))
+          testPrunedSchema(
+            upperCaseSchema,
+            upperCaseRequestedFields,
+            StructType.fromDDL("A struct<A:int>, B int"))
+          testPrunedSchema(
+            upperCaseSchema,
+            lowerCaseRequestedFields,
+            upperCaseSchema)
+
+          testPrunedSchema(
+            lowerCaseSchema,
+            upperCaseRequestedFields,
+            lowerCaseSchema)
+          testPrunedSchema(
+            lowerCaseSchema,
+            lowerCaseRequestedFields,
+            StructType.fromDDL("a struct<a:int>, b int"))
         } else {
-          // Schema is case-insensitive
-          val prunedSchema = SchemaPruning.pruneDataSchema(
-            StructType.fromDDL("ID int, name String"),
-            getRootFields(StructField("id", IntegerType)))
-          assert(prunedSchema == StructType(StructField("ID", IntegerType) :: Nil))
-          // Root fields are case-insensitive
-          val rootFieldsSchema = SchemaPruning.pruneDataSchema(
-            StructType.fromDDL("id int, name String"),
-            getRootFields(StructField("ID", IntegerType)))
-          assert(rootFieldsSchema == StructType(StructField("id", IntegerType) :: Nil))
+          Seq(upperCaseRequestedFields, lowerCaseRequestedFields).foreach { requestedFields =>
+            testPrunedSchema(
+              upperCaseSchema,
+              requestedFields,
+              StructType.fromDDL("A struct<A:int>, B int"))
+          }
+
+          Seq(upperCaseRequestedFields, lowerCaseRequestedFields).foreach { requestedFields =>
+            testPrunedSchema(
+              lowerCaseSchema,
+              requestedFields,
+              StructType.fromDDL("a struct<a:int>, b int"))
+          }
         }
       }
-    })
+    }

Review comment:
       Tests LGTM, thanks for adding more scenarios






[GitHub] [spark] SparkQA removed a comment on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-822165657


   **[Test build #137558 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137558/testReport)** for PR 31993 at commit [`c04864a`](https://github.com/apache/spark/commit/c04864a92c7db8c5845e2f0ac13b77718f3e3e85).




[GitHub] [spark] AmplabJenkins commented on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-817257211


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41754/
   




[GitHub] [spark] cloud-fan commented on a change in pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31993:
URL: https://github.com/apache/spark/pull/31993#discussion_r613389446



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SchemaPruning.scala
##########
@@ -32,11 +33,10 @@ object SchemaPruning {
     // in the resulting schema may differ from their ordering in the logical relation's
     // original schema
     val mergedSchema = requestedRootFields
-      .map { case root: RootField => StructType(Array(root.field)) }
+      .map { root: RootField => StructType(Array(root.field)) }
       .reduceLeft(_ merge _)
-    val dataSchemaFieldNames = dataSchema.fieldNames.toSet
     val mergedDataSchema =
-      StructType(mergedSchema.filter(f => dataSchemaFieldNames.contains(f.name)))
+      StructType(dataSchema.map(s => mergedSchema.find(_.name.equals(s.name)).getOrElse(s)))

Review comment:
       If this is the case, please update the documentation of this method.






[GitHub] [spark] AmplabJenkins commented on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-820922563


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/137449/
   




[GitHub] [spark] viirya commented on a change in pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #31993:
URL: https://github.com/apache/spark/pull/31993#discussion_r614490313



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SchemaPruning.scala
##########
@@ -32,11 +33,10 @@ object SchemaPruning {
     // in the resulting schema may differ from their ordering in the logical relation's
     // original schema
     val mergedSchema = requestedRootFields
-      .map { case root: RootField => StructType(Array(root.field)) }
+      .map { root: RootField => StructType(Array(root.field)) }
       .reduceLeft(_ merge _)
-    val dataSchemaFieldNames = dataSchema.fieldNames.toSet
     val mergedDataSchema =
-      StructType(mergedSchema.filter(f => dataSchemaFieldNames.contains(f.name)))
+      StructType(dataSchema.map(s => mergedSchema.find(_.name.equals(s.name)).getOrElse(s)))

Review comment:
       This is because `requestedColumnIds` checks whether the given data schema has fewer fields than the physical schema in the ORC file.
   
   Under nested column pruning, Spark lets the data source use the pruned schema as the data schema when reading files. In the example above, Spark prunes `_col1`. But the ORC file has three top-level fields, `_col0`, `_col1`, and `_col2`, so the check in `requestedColumnIds` fails in this case.
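
To make the failing condition concrete, here is a simplified stand-in for the length check, using plain field-name lists instead of Spark's schema types. The object and method names are assumptions for illustration; only the comparison `orcFieldNames.length <= dataSchema.length` is taken from the quoted diff.

```scala
object OrcLengthCheckSketch {
  // Illustrative model of the assertion condition in the quoted diff:
  // mapping by index is only attempted when the physical ORC schema has
  // no more top-level fields than the data schema handed to the reader.
  def canMapByIndex(orcFieldNames: Seq[String], dataSchemaFieldNames: Seq[String]): Boolean =
    orcFieldNames.length <= dataSchemaFieldNames.length

  def main(args: Array[String]): Unit = {
    // Top-level fields physically stored in the ORC file from the example.
    val physical = Seq("_col0", "_col1", "_col2")

    // Data schema after top-level pruning dropped `_col1`: 3 <= 2 is false,
    // which corresponds to the assertion failure.
    val prunedTopLevel = Seq("_col0", "_col2")
    println(canMapByIndex(physical, prunedTopLevel))

    // Data schema that keeps every top-level field: 3 <= 3 is true.
    val topLevelKept = Seq("_col0", "_col1", "_col2")
    println(canMapByIndex(physical, topLevelKept))
  }
}
```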






[GitHub] [spark] SparkQA commented on pull request #31993: [SPARK-34897][SQL] Add workaround to error message when OrcUtils.requestedColumnIds fails

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-809395032


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41226/
   




[GitHub] [spark] cloud-fan commented on a change in pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31993:
URL: https://github.com/apache/spark/pull/31993#discussion_r617275883



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SchemaPruning.scala
##########
@@ -21,9 +21,12 @@ import org.apache.spark.sql.types._
 
 object SchemaPruning {
   /**
-   * Filters the schema by the requested fields. For example, if the schema is struct<a:int, b:int>,
-   * and given requested field are "a", the field "b" is pruned in the returned schema.
-   * Note that schema field ordering at original schema is still preserved in pruned schema.
+   * Prunes the nested schema by the requested fields. For example, if the schema is:
+   * `id int, struct<a:int, b:int>`, and given requested field are "a", the field "b" is pruned

Review comment:
       Top-level columns need to have a name: `id int, s struct<a:int, b:int>`






[GitHub] [spark] wangyum commented on a change in pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning

Posted by GitBox <gi...@apache.org>.
wangyum commented on a change in pull request #31993:
URL: https://github.com/apache/spark/pull/31993#discussion_r617212769



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SchemaPruning.scala
##########
@@ -21,9 +21,12 @@ import org.apache.spark.sql.types._
 
 object SchemaPruning {
   /**
-   * Filters the schema by the requested fields. For example, if the schema is struct<a:int, b:int>,
-   * and given requested field are "a", the field "b" is pruned in the returned schema.
-   * Note that schema field ordering at original schema is still preserved in pruned schema.
+   * Prunes the nested schema by the requested fields. For example, if the schema is
+   * struct<a:int, b:int>, and given requested field are "a", the field "b" is pruned in the

Review comment:
       Changed the example to `id int, struct<a:int, b:int>`.






[GitHub] [spark] cloud-fan commented on a change in pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31993:
URL: https://github.com/apache/spark/pull/31993#discussion_r613388072



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SchemaPruning.scala
##########
@@ -32,11 +33,10 @@ object SchemaPruning {
     // in the resulting schema may differ from their ordering in the logical relation's
     // original schema
     val mergedSchema = requestedRootFields
-      .map { case root: RootField => StructType(Array(root.field)) }
+      .map { root: RootField => StructType(Array(root.field)) }
       .reduceLeft(_ merge _)
-    val dataSchemaFieldNames = dataSchema.fieldNames.toSet
     val mergedDataSchema =
-      StructType(mergedSchema.filter(f => dataSchemaFieldNames.contains(f.name)))
+      StructType(dataSchema.map(s => mergedSchema.find(_.name.equals(s.name)).getOrElse(s)))

Review comment:
       What's the actual difference? Can you give a simple example?
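
One way to picture the difference between the removed and added lines in the diff, as a rough sketch only: the `Field` case class below is a hypothetical stand-in for Spark's `StructField`, and `oldMerge` / `newMerge` model just this single step, not the rest of the pruning method.

```scala
object MergedSchemaSketch {
  // Hypothetical minimal field model (an assumption, not Spark's StructField):
  // a name plus optional nested children.
  final case class Field(name: String, children: Seq[Field] = Nil)

  // Removed line in the diff: keep only the requested (merged) fields whose
  // names exist in the data schema, so unrequested top-level fields vanish.
  def oldMerge(dataSchema: Seq[Field], merged: Seq[Field]): Seq[Field] = {
    val dataSchemaFieldNames = dataSchema.map(_.name).toSet
    merged.filter(f => dataSchemaFieldNames.contains(f.name))
  }

  // Added line in the diff: walk the data schema, substitute the requested
  // (possibly nested-pruned) field when there is one, otherwise keep the
  // original field. Every top-level field survives, in data-schema order.
  def newMerge(dataSchema: Seq[Field], merged: Seq[Field]): Seq[Field] =
    dataSchema.map(s => merged.find(_.name == s.name).getOrElse(s))

  def main(args: Array[String]): Unit = {
    val col2Full = Field("_col2", Seq(Field("c1"), Field("c2"), Field("c3"), Field("c4")))
    val dataSchema = Seq(Field("_col0"), Field("_col1"), col2Full)
    // Requested fields for a query that reads only `_col2.c1` and `_col0`.
    val merged = Seq(Field("_col2", Seq(Field("c1"))), Field("_col0"))

    // Old step: `_col1` is dropped from the result.
    println(oldMerge(dataSchema, merged).map(_.name))
    // New step: `_col1` is kept, while `_col2` is narrowed to the requested child.
    println(newMerge(dataSchema, merged).map(f => (f.name, f.children.map(_.name))))
  }
}
```

With the new step the number of top-level fields always matches the data schema, which is what lets an index-based mapping against the physical file line up.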






[GitHub] [spark] wangyum commented on a change in pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning

Posted by GitBox <gi...@apache.org>.
wangyum commented on a change in pull request #31993:
URL: https://github.com/apache/spark/pull/31993#discussion_r614539225



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SchemaPruning.scala
##########
@@ -21,9 +21,12 @@ import org.apache.spark.sql.types._
 
 object SchemaPruning {
   /**
-   * Filters the schema by the requested fields. For example, if the schema is struct<a:int, b:int>,
-   * and given requested field are "a", the field "b" is pruned in the returned schema.
-   * Note that schema field ordering at original schema is still preserved in pruned schema.
+   * Prunes the nested schema by the requested fields. For example, if the schema is
+   * struct<a:int, b:int>, and given requested field are "a", the field "b" is pruned in the
+   * returned schema.
+   * Note that:
+   *   1. The schema field ordering at original schema is still preserved in pruned schema.
+   *   2. The top-level fields are not pruned here.

Review comment:
       Yes, I have updated the v2 part:
   https://github.com/apache/spark/blob/a966bac379ff6fed20c57cf3c748cada9da28b4c/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/PushDownUtils.scala#L79-L110






[GitHub] [spark] AmplabJenkins removed a comment on pull request #31993: [SPARK-34897][SQL] Add workaround to error message when OrcUtils.requestedColumnIds fails

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-809528555


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136644/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-817284156


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/137176/
   




[GitHub] [spark] SparkQA removed a comment on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-817251437


   **[Test build #137176 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137176/testReport)** for PR 31993 at commit [`e64eb75`](https://github.com/apache/spark/commit/e64eb75aede71a5403a4d4436e63b1fcfdeca14d).




[GitHub] [spark] AmplabJenkins commented on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-820985909


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42045/
   




[GitHub] [spark] wangyum commented on a change in pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning

Posted by GitBox <gi...@apache.org>.
wangyum commented on a change in pull request #31993:
URL: https://github.com/apache/spark/pull/31993#discussion_r612033519



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/PushDownUtils.scala
##########
@@ -89,14 +93,12 @@ object PushDownUtils extends PredicateHelper {
         } else {
           new StructType()
         }
-        r.pruneColumns(prunedSchema)
+        val neededFieldNames = neededOutput.map(_.name).toSet
+        r.pruneColumns(StructType(prunedSchema.filter(f => neededFieldNames.contains(f.name))))

Review comment:
       Moved the [filter logic from `SchemaPruning`](https://github.com/apache/spark/blob/0f2c0b53e8fb18c86c67b5dd679c006db93f94a5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SchemaPruning.scala#L38-L39) to `PushDownUtils` to support DataSource V2 column pruning.
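
As a rough illustration of the added `filter` call in the diff above: once nested pruning keeps every top-level field, the V2 path has to drop the unneeded ones itself before calling `pruneColumns`. `Field` and the literal values are hypothetical stand-ins; only the `filter` expression mirrors the diff:

```scala
object PushDownFilterSketch {
  // Hypothetical stand-in for Spark's StructField.
  final case class Field(name: String, dataType: String)

  // Assumed result of nested pruning after this change: all top-level fields
  // are kept and only the struct `_col2` has been narrowed.
  val prunedSchema: Seq[Field] = Seq(
    Field("_col0", "int"),
    Field("_col1", "string"),
    Field("_col2", "struct<c1:string>"))

  // Stand-in for `neededOutput.map(_.name).toSet` in the diff.
  val neededFieldNames: Set[String] = Set("_col0", "_col2")

  // Same shape as the added line: keep only the top-level fields the query needs.
  val pushedSchema: Seq[Field] =
    prunedSchema.filter(f => neededFieldNames.contains(f.name))
}
```

Here `pushedSchema` contains `_col0` and the narrowed `_col2`, in the order of `prunedSchema`.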
   






[GitHub] [spark] SparkQA commented on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-817255876


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41754/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-817257211


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41754/
   




[GitHub] [spark] cloud-fan commented on a change in pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31993:
URL: https://github.com/apache/spark/pull/31993#discussion_r613943448



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SchemaPruning.scala
##########
@@ -32,11 +33,10 @@ object SchemaPruning {
     // in the resulting schema may differ from their ordering in the logical relation's
     // original schema
     val mergedSchema = requestedRootFields
-      .map { case root: RootField => StructType(Array(root.field)) }
+      .map { root: RootField => StructType(Array(root.field)) }
       .reduceLeft(_ merge _)
-    val dataSchemaFieldNames = dataSchema.fieldNames.toSet
     val mergedDataSchema =
-      StructType(mergedSchema.filter(f => dataSchemaFieldNames.contains(f.name)))
+      StructType(dataSchema.map(s => mergedSchema.find(_.name.equals(s.name)).getOrElse(s)))

Review comment:
       I don't know the details well enough to understand why nested column pruning still works after the change here. @viirya can you take a look?






[GitHub] [spark] AmplabJenkins commented on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-822257596


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42135/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-822487303


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/137582/
   




[GitHub] [spark] SparkQA commented on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-817251437


   **[Test build #137176 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137176/testReport)** for PR 31993 at commit [`e64eb75`](https://github.com/apache/spark/commit/e64eb75aede71a5403a4d4436e63b1fcfdeca14d).




[GitHub] [spark] SparkQA commented on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-823715879


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42235/
   




[GitHub] [spark] cloud-fan commented on a change in pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31993:
URL: https://github.com/apache/spark/pull/31993#discussion_r617276414



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SchemaPruning.scala
##########
@@ -21,9 +21,12 @@ import org.apache.spark.sql.types._
 
 object SchemaPruning {
   /**
-   * Filters the schema by the requested fields. For example, if the schema is struct<a:int, b:int>,
-   * and given requested field are "a", the field "b" is pruned in the returned schema.
-   * Note that schema field ordering at original schema is still preserved in pruned schema.
+   * Prunes the nested schema by the requested fields. For example, if the schema is:
+   * `id int, struct<a:int, b:int>`, and given requested field are "a", the field "b" is pruned

Review comment:
       `the field "b" is pruned` -> `the inner field "b" ...`






[GitHub] [spark] wangyum commented on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning

Posted by GitBox <gi...@apache.org>.
wangyum commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-817253696


   > Sorry I may miss something. Why it's only a problem in nested column pruning but not column pruning?
   
   Nested column pruning removed the field:
   https://github.com/apache/spark/blob/0f2c0b53e8fb18c86c67b5dd679c006db93f94a5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SchemaPruning.scala#L28-L42
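
A simplified model of why a dropped top-level field matters, inferred from the assertion message in the PR description rather than copied from `OrcUtils`. It assumes the file's physical columns carry no usable names, so they can only be matched to the data schema by position. `positionalColumnIds` and its parameters are hypothetical names:

```scala
object PositionalOrcSketch {
  // Hypothetical helper: resolve each required column to its position in the
  // data schema, and refuse to guess when the data schema has fewer fields
  // than the physical file.
  def positionalColumnIds(
      physicalFieldCount: Int,
      dataSchema: Seq[String],
      requiredColumns: Seq[String]): Either[String, Seq[Int]] = {
    if (physicalFieldCount > dataSchema.length) {
      Left(s"data schema has ${dataSchema.length} fields, file has $physicalFieldCount")
    } else {
      Right(requiredColumns.map(name => dataSchema.indexOf(name)))
    }
  }

  // `_col1` was removed from the data schema, so positions cannot be trusted.
  val pruned: Either[String, Seq[Int]] =
    positionalColumnIds(3, Seq("_col0", "_col2"), Seq("_col0", "_col2"))

  // All top-level fields are kept, so `_col2` resolves to position 2.
  val full: Either[String, Seq[Int]] =
    positionalColumnIds(3, Seq("_col0", "_col1", "_col2"), Seq("_col0", "_col2"))
}
```

In this model `pruned` is a `Left`, which corresponds to the failing case in the PR description, and `full` is `Right(Seq(0, 2))`.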




[GitHub] [spark] SparkQA commented on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-823717509


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42235/
   




[GitHub] [spark] viirya closed pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning

Posted by GitBox <gi...@apache.org>.
viirya closed pull request #31993:
URL: https://github.com/apache/spark/pull/31993


   




[GitHub] [spark] wangyum commented on pull request #31993: [SPARK-34897][SQL] Add workaround to error message when OrcUtils.requestedColumnIds fails

Posted by GitBox <gi...@apache.org>.
wangyum commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-810724594


   Can we disable column pruning when it is a Hive ORC table?
   https://github.com/apache/spark/blob/25e7d1ceee8c9f4ecb4ab796f51e9bcbc0500fae/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/SchemaPruning.scala#L98-L100
   Update `canPruneRelation` to:
   ```scala
     private def canPruneRelation(fsRelation: HadoopFsRelation): Boolean = {
       fsRelation.fileFormat match {
         case _: ParquetFileFormat => true
         case _: OrcFileFormat =>
           fsRelation.location match {
             // Skip pruning for ORC tables whose catalog provider is Hive.
             case c: CatalogFileIndex =>
               !c.table.provider.contains(DDLUtils.HIVE_PROVIDER)
             case _ => true
           }
         // Any other file format is not pruned; this also avoids a MatchError.
         case _ => false
       }
     }
   ```
   




[GitHub] [spark] viirya commented on a change in pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #31993:
URL: https://github.com/apache/spark/pull/31993#discussion_r614488314



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SchemaPruning.scala
##########
@@ -21,9 +21,12 @@ import org.apache.spark.sql.types._
 
 object SchemaPruning {
   /**
-   * Filters the schema by the requested fields. For example, if the schema is struct<a:int, b:int>,
-   * and given requested field are "a", the field "b" is pruned in the returned schema.
-   * Note that schema field ordering at original schema is still preserved in pruned schema.
+   * Prunes the nested schema by the requested fields. For example, if the schema is
+   * struct<a:int, b:int>, and given requested field are "a", the field "b" is pruned in the
+   * returned schema.
+   * Note that:
+   *   1. The schema field ordering at original schema is still preserved in pruned schema.
+   *   2. The top-level fields are not pruned here.

Review comment:
       Hmm, doesn't it mean we miss the chance to prune top-level columns?






[GitHub] [spark] viirya commented on pull request #31993: [SPARK-34897][SQL] Add workaround to error message when OrcUtils.requestedColumnIds fails

Posted by GitBox <gi...@apache.org>.
viirya commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-809943947


   As the nested column pruning rule is far from the point where we get the physical information of ORC files, and this should be a narrow case, it looks okay to me to inform users of a possible workaround here.
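
For reference, a sketch of applying the workaround named in the error message. It assumes a `spark-shell` session where `spark` is the active `SparkSession` and the table `t1` from the PR description already exists:

```scala
// Assumption: run inside spark-shell, with `spark` bound to the session
// and table `t1` created as in the PR description.
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", "false")

// The query from the PR description, now planned without nested schema pruning.
spark.sql("SELECT _col2.c1, _col0 FROM `t1` WHERE _col3 = '2021-02-01'").show()
```

The setting applies to the current session only; the SQL form `set spark.sql.optimizer.nestedSchemaPruning.enabled=false` from the PR description is equivalent.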




[GitHub] [spark] wangyum commented on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning

Posted by GitBox <gi...@apache.org>.
wangyum commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-823940506


   > @wangyum there are conflicts
   
   Fixed.
   




[GitHub] [spark] SparkQA commented on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-822243416


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42135/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-823800994


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/137707/
   




[GitHub] [spark] SparkQA commented on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-820833460


   **[Test build #137449 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137449/testReport)** for PR 31993 at commit [`a966bac`](https://github.com/apache/spark/commit/a966bac379ff6fed20c57cf3c748cada9da28b4c).




[GitHub] [spark] SparkQA removed a comment on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-823701348


   **[Test build #137707 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137707/testReport)** for PR 31993 at commit [`6112c9d`](https://github.com/apache/spark/commit/6112c9dd357da127646e164cdf6797f5801c1049).

