Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/03/29 09:14:26 UTC
[GitHub] [spark] wangyum opened a new pull request #31993: [SPARK-34897][SQL] Add workaround to error message when OrcUtils.requestedColumnIds fails
wangyum opened a new pull request #31993:
URL: https://github.com/apache/spark/pull/31993
### What changes were proposed in this pull request?
This PR adds a workaround (`set spark.sql.optimizer.nestedSchemaPruning.enabled=false`) to the error message raised when `OrcUtils.requestedColumnIds` fails. For example:
```scala
spark.sql(
"""
|CREATE TABLE `t1` (
| `_col0` INT,
| `_col1` STRING,
| `_col2` STRUCT<`c1`: STRING, `c2`: STRING, `c3`: STRING, `c4`: BIGINT>,
| `_col3` STRING)
|USING orc
|PARTITIONED BY (_col3)
|""".stripMargin)
spark.sql("INSERT INTO `t1` values(1, '2', null, '2021-02-01')")
spark.sql("SELECT _col2.c1, _col0 FROM `t1` WHERE _col3 = '2021-02-01'").show
```
Before this PR:
```
java.lang.AssertionError: assertion failed: The given data schema struct<_col0:int,_col2:struct<c1:string>> has less fields than the actual ORC physical schema, no idea which columns were dropped, fail to read.
at scala.Predef$.assert(Predef.scala:223)
at org.apache.spark.sql.execution.datasources.orc.OrcUtils$.requestedColumnIds(OrcUtils.scala:160)
```
After this PR:
```
java.lang.AssertionError: assertion failed: The given data schema struct<_col0:int,_col2:struct<c1:string>> has less fields than the actual ORC physical schema, no idea which columns were dropped, fail to read. Try to disable spark.sql.optimizer.nestedSchemaPruning.enabled to workaround this issue.
at scala.Predef$.assert(Predef.scala:223)
at org.apache.spark.sql.execution.datasources.orc.OrcUtils$.requestedColumnIds(OrcUtils.scala:160)
```
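The failing assertion can be pictured with a small model. This is only a sketch under stated assumptions, not the actual `OrcUtils` code: it assumes the ORC file stores the three data columns under the names `_col0`, `_col1` and `_col2` (the partition column `_col3` is assumed not to be in the file), and that matching by index is refused once the requested data schema has fewer fields than the file. The object `FieldCountCheck` and the helper `canMapByIndex` are hypothetical names.

```scala
// Hypothetical model of the field-count check; all names are illustrative only.
object FieldCountCheck {
  // True when every physical field can be paired with a data-schema field by
  // position, i.e. the data schema is at least as long as the file's schema.
  def canMapByIndex(physicalFields: Seq[String], dataSchemaFields: Seq[String]): Boolean =
    physicalFields.length <= dataSchemaFields.length

  // Assumed top-level field names stored in the ORC file.
  val physical: Seq[String] = Seq("_col0", "_col1", "_col2")

  // Data schema before pruning, and after pruning dropped the unused _col1.
  val fullDataSchema: Seq[String] = Seq("_col0", "_col1", "_col2")
  val prunedDataSchema: Seq[String] = Seq("_col0", "_col2")

  def main(args: Array[String]): Unit = {
    println(canMapByIndex(physical, fullDataSchema))
    println(canMapByIndex(physical, prunedDataSchema))
  }
}
```

With the pruned schema, two requested fields have to be matched against three physical ones, so the position of the dropped column cannot be inferred.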
### Why are the changes needed?
Add a workaround.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Manual test
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-822197386
[GitHub] [spark] cloud-fan commented on a change in pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning
Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31993:
URL: https://github.com/apache/spark/pull/31993#discussion_r614089538
##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SchemaPruning.scala
##########
@@ -21,8 +21,9 @@ import org.apache.spark.sql.types._
object SchemaPruning {
/**
- * Filters the schema by the requested fields. For example, if the schema is struct<a:int, b:int>,
- * and given requested field are "a", the field "b" is pruned in the returned schema.
+ * Prunes the nested schema by the requested fields. For example, if the schema is
+ * struct<a:int, b:int>, and given requested field are "a", the field "b" is pruned in the
+ * returned schema.
Review comment:
Let's call out explicitly that the top-level fields are not pruned here.
[GitHub] [spark] SparkQA commented on pull request #31993: [SPARK-34897][SQL] Add workaround to error message when OrcUtils.requestedColumnIds fails
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-809494294
**[Test build #136644 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136644/testReport)** for PR 31993 at commit [`2a3f136`](https://github.com/apache/spark/commit/2a3f1367d6d90f34f2373c378910d4366052726a).
* This patch **fails PySpark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] AmplabJenkins commented on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-823719172
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42235/
[GitHub] [spark] SparkQA removed a comment on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-822221636
**[Test build #137582 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137582/testReport)** for PR 31993 at commit [`c04864a`](https://github.com/apache/spark/commit/c04864a92c7db8c5845e2f0ac13b77718f3e3e85).
[GitHub] [spark] SparkQA commented on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-820850952
Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42024/
[GitHub] [spark] cloud-fan commented on pull request #31993: [SPARK-34897][SQL] Add workaround to error message when OrcUtils.requestedColumnIds fails
Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-810765696
Sorry, I may be missing something. Why is this only a problem with nested column pruning and not with regular column pruning?
[GitHub] [spark] SparkQA removed a comment on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-823944803
**[Test build #137730 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137730/testReport)** for PR 31993 at commit [`4d0b510`](https://github.com/apache/spark/commit/4d0b510e10da4fe1fca07583ddecc5f5fe6d6392).
[GitHub] [spark] AmplabJenkins commented on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-821096738
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/137470/
[GitHub] [spark] SparkQA commented on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-823788856
**[Test build #137707 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137707/testReport)** for PR 31993 at commit [`6112c9d`](https://github.com/apache/spark/commit/6112c9dd357da127646e164cdf6797f5801c1049).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] cloud-fan commented on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning
Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-823851109
@wangyum there are conflicts
[GitHub] [spark] cloud-fan commented on a change in pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning
Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31993:
URL: https://github.com/apache/spark/pull/31993#discussion_r613388938
##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SchemaPruning.scala
##########
@@ -32,11 +33,10 @@ object SchemaPruning {
// in the resulting schema may differ from their ordering in the logical relation's
// original schema
val mergedSchema = requestedRootFields
- .map { case root: RootField => StructType(Array(root.field)) }
+ .map { root: RootField => StructType(Array(root.field)) }
.reduceLeft(_ merge _)
- val dataSchemaFieldNames = dataSchema.fieldNames.toSet
val mergedDataSchema =
- StructType(mergedSchema.filter(f => dataSchemaFieldNames.contains(f.name)))
+ StructType(dataSchema.map(s => mergedSchema.find(_.name.equals(s.name)).getOrElse(s)))
Review comment:
It seems we don't prune anything from the root fields now.
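The effect of this hunk on the root fields can be sketched with plain collections. This is a simplified model, not the Spark implementation: `Field` is a hypothetical stand-in for `StructField` with the type rendered as a string, `filterRequested` mirrors the removed `filter` line, `keepTopLevel` mirrors the added `map`/`find`/`getOrElse` line, and any later reordering step is left out.

```scala
// Simplified stand-in for StructField: a name plus a type rendered as a string.
final case class Field(name: String, dataType: String)

object MergeModes {
  // Removed line: keep only the requested fields that also exist in the data schema.
  def filterRequested(dataSchema: Seq[Field], merged: Seq[Field]): Seq[Field] = {
    val names = dataSchema.map(_.name).toSet
    merged.filter(f => names.contains(f.name))
  }

  // Added line: walk the data schema in order, substituting the pruned version of
  // a field when one was requested and keeping the original field otherwise.
  def keepTopLevel(dataSchema: Seq[Field], merged: Seq[Field]): Seq[Field] =
    dataSchema.map(s => merged.find(_.name == s.name).getOrElse(s))

  // Data columns of the table from the PR description.
  val dataSchema: Seq[Field] = Seq(
    Field("_col0", "int"),
    Field("_col1", "string"),
    Field("_col2", "struct<c1:string,c2:string,c3:string,c4:bigint>"))

  // Fields requested by SELECT _col2.c1, _col0, with the nested struct already pruned.
  val merged: Seq[Field] = Seq(Field("_col2", "struct<c1:string>"), Field("_col0", "int"))
}
```

In this model `keepTopLevel` returns all three root fields, with only `_col2` narrowed, whereas `filterRequested` returns two.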
[GitHub] [spark] SparkQA commented on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-822238041
Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42135/
[GitHub] [spark] wangyum commented on a change in pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning
Posted by GitBox <gi...@apache.org>.
wangyum commented on a change in pull request #31993:
URL: https://github.com/apache/spark/pull/31993#discussion_r617212439
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/HadoopFsRelation.scala
##########
@@ -40,7 +40,7 @@ import org.apache.spark.sql.types.{StructField, StructType}
case class HadoopFsRelation(
location: FileIndex,
partitionSchema: StructType,
- dataSchema: StructType,
+ dataSchema: StructType, // The top-level columns should not be pruned. Please see SPARK-34897.
Review comment:
Done.
[GitHub] [spark] AmplabJenkins commented on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-824135235
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/137730/
[GitHub] [spark] AmplabJenkins commented on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-820850963
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42024/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-820985909
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42045/
[GitHub] [spark] cloud-fan commented on pull request #31993: [SPARK-34897][SQL] Add workaround to error message when OrcUtils.requestedColumnIds fails
Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-809945765
Can we automatically disable nested column pruning on the executor side when we find that the ORC file schema uses the by-position style?
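One way to picture this suggestion is the sketch below. It rests on assumptions that the thread does not confirm: that "by-position style" means every top-level field in the file is named `_col<N>`, and that the fallback is simply to read with the unpruned schema. `PositionalSchema`, `isByPosition` and `schemaToRead` are hypothetical names, not Spark APIs.

```scala
// Hypothetical helper; the naming rule below is an assumption, not a Spark API.
object PositionalSchema {
  // Assumption: a by-position file names every top-level field _col<N>.
  def isByPosition(physicalFields: Seq[String]): Boolean =
    physicalFields.nonEmpty && physicalFields.forall(_.matches("_col\\d+"))

  // Choose the schema to read with: fall back to the unpruned schema when the
  // file is positional, otherwise use the pruned one.
  def schemaToRead(
      physicalFields: Seq[String],
      full: Seq[String],
      pruned: Seq[String]): Seq[String] =
    if (isByPosition(physicalFields)) full else pruned
}
```

A caller would pass the file's field names together with both candidate schemas, for example `PositionalSchema.schemaToRead(Seq("_col0", "_col1"), full, pruned)`.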
[GitHub] [spark] wangyum commented on a change in pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning
Posted by GitBox <gi...@apache.org>.
wangyum commented on a change in pull request #31993:
URL: https://github.com/apache/spark/pull/31993#discussion_r613916765
##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SchemaPruning.scala
##########
@@ -32,11 +33,10 @@ object SchemaPruning {
// in the resulting schema may differ from their ordering in the logical relation's
// original schema
val mergedSchema = requestedRootFields
- .map { case root: RootField => StructType(Array(root.field)) }
+ .map { root: RootField => StructType(Array(root.field)) }
.reduceLeft(_ merge _)
- val dataSchemaFieldNames = dataSchema.fieldNames.toSet
val mergedDataSchema =
- StructType(mergedSchema.filter(f => dataSchemaFieldNames.contains(f.name)))
+ StructType(dataSchema.map(s => mergedSchema.find(_.name.equals(s.name)).getOrElse(s)))
Review comment:
It removed the ``` `_col1` STRING ``` column. Nested column pruning still works. Please see these test suites: `ParquetSchemaPruningSuite`, `OrcV1SchemaPruningSuite` and `OrcV2SchemaPruningSuite`.
That is why it is only a problem with nested column pruning and not with column pruning.
[GitHub] [spark] viirya commented on a change in pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning
Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #31993:
URL: https://github.com/apache/spark/pull/31993#discussion_r614534596
##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SchemaPruning.scala
##########
@@ -21,9 +21,12 @@ import org.apache.spark.sql.types._
object SchemaPruning {
/**
- * Filters the schema by the requested fields. For example, if the schema is struct<a:int, b:int>,
- * and given requested field are "a", the field "b" is pruned in the returned schema.
- * Note that schema field ordering at original schema is still preserved in pruned schema.
+ * Prunes the nested schema by the requested fields. For example, if the schema is
+ * struct<a:int, b:int>, and given requested field are "a", the field "b" is pruned in the
+ * returned schema.
+ * Note that:
+ * 1. The schema field ordering at original schema is still preserved in pruned schema.
+ * 2. The top-level fields are not pruned here.
Review comment:
I think v2 column pruning should be handled by `V2ScanRelationPushDown`.
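The two notes in the revised doc comment (original ordering preserved, top-level fields not pruned) can be illustrated with a tiny recursive model. This is a sketch under assumptions, not the code in `SchemaPruning.scala`: `DType`, `IntT`, `StringT`, `StructT` and `NestedPrune` are hypothetical stand-ins for Spark's type classes, and arrays and maps are left out.

```scala
// Hypothetical miniature type model; not Spark's DataType hierarchy.
sealed trait DType
case object IntT extends DType
case object StringT extends DType
final case class StructT(fields: Seq[(String, DType)]) extends DType

object NestedPrune {
  // Shrink `original` to the shape of `requested`, keeping the original field order.
  private def pruneType(original: DType, requested: DType): DType =
    (original, requested) match {
      case (StructT(orig), StructT(req)) =>
        val wanted = req.toMap
        StructT(orig.collect {
          case (name, t) if wanted.contains(name) => name -> pruneType(t, wanted(name))
        })
      case _ => original
    }

  // Top-level fields are never dropped; only the nested contents of requested
  // fields shrink, and fields that were not requested are kept unchanged.
  def prune(schema: StructT, requested: Seq[(String, DType)]): StructT = {
    val wanted = requested.toMap
    StructT(schema.fields.map { case (name, t) =>
      name -> wanted.get(name).map(pruneType(t, _)).getOrElse(t)
    })
  }
}
```

For a schema `a struct<x:int, y:string>, b int` and a request for `a.y`, this model returns `a struct<y:string>, b int`: the nested field `x` is pruned while the top-level field `b` stays.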
[GitHub] [spark] cloud-fan commented on a change in pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning
Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31993:
URL: https://github.com/apache/spark/pull/31993#discussion_r613385216
##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SchemaPruning.scala
##########
@@ -21,8 +21,9 @@ import org.apache.spark.sql.types._
object SchemaPruning {
/**
- * Filters the schema by the requested fields. For example, if the schema is struct<a:int, b:int>,
- * and given requested field are "a", the field "b" is pruned in the returned schema.
Review comment:
The old comment looks correct.
[GitHub] [spark] AmplabJenkins commented on pull request #31993: [SPARK-34897][SQL] Add workaround to error message when OrcUtils.requestedColumnIds fails
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-809528555
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136644/
[GitHub] [spark] SparkQA commented on pull request #31993: [SPARK-34897][SQL] Add workaround to error message when OrcUtils.requestedColumnIds fails
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-809402566
Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41226/
[GitHub] [spark] SparkQA commented on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-824116359
**[Test build #137730 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137730/testReport)** for PR 31993 at commit [`4d0b510`](https://github.com/apache/spark/commit/4d0b510e10da4fe1fca07583ddecc5f5fe6d6392).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] SparkQA commented on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-821095667
**[Test build #137470 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137470/testReport)** for PR 31993 at commit [`c04864a`](https://github.com/apache/spark/commit/c04864a92c7db8c5845e2f0ac13b77718f3e3e85).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] SparkQA commented on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-823944803
**[Test build #137730 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137730/testReport)** for PR 31993 at commit [`4d0b510`](https://github.com/apache/spark/commit/4d0b510e10da4fe1fca07583ddecc5f5fe6d6392).
[GitHub] [spark] sandeep-katta commented on a change in pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning
Posted by GitBox <gi...@apache.org>.
sandeep-katta commented on a change in pull request #31993:
URL: https://github.com/apache/spark/pull/31993#discussion_r617398009
##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/SchemaPruningSuite.scala
##########
@@ -60,44 +57,76 @@ class SchemaPruningSuite extends SparkFunSuite with SQLHelper {
arrayOfStruct :: StructField("b", structOfStruct) :: StructField("c", IntegerType) ::
mapOfStruct :: Nil)
- testPrunedSchema(complexStruct, StructField("a", ArrayType(StructType.fromDDL("b int"))),
- StructField("b", StructType.fromDDL("a int")))
testPrunedSchema(complexStruct,
- StructField("a", ArrayType(StructType.fromDDL("b int, c string"))),
- StructField("b", StructType.fromDDL("b int")))
+ Seq(StructField("a", ArrayType(StructType.fromDDL("b int"))),
+ StructField("b", StructType.fromDDL("a int"))),
+ StructType(
+ StructField("a", ArrayType(StructType.fromDDL("b int"))) ::
+ StructField("b", StructType.fromDDL("a int")) ::
+ StructField("c", IntegerType) ::
+ mapOfStruct :: Nil))
+ testPrunedSchema(complexStruct,
+ Seq(StructField("a", ArrayType(StructType.fromDDL("b int, c string"))),
+ StructField("b", StructType.fromDDL("b int"))),
+ StructType(
+ StructField("a", ArrayType(StructType.fromDDL("b int, c string"))) ::
+ StructField("b", StructType.fromDDL("b int")) ::
+ StructField("c", IntegerType) ::
+ mapOfStruct :: Nil))
val selectFieldInMap = StructField("d", MapType(StructType.fromDDL("a int, b int"),
StructType.fromDDL("e int, f string")))
- testPrunedSchema(complexStruct, StructField("c", IntegerType), selectFieldInMap)
+ testPrunedSchema(complexStruct,
+ Seq(StructField("c", IntegerType), selectFieldInMap),
+ StructType(
+ arrayOfStruct ::
+ StructField("b", structOfStruct) ::
+ StructField("c", IntegerType) ::
+ selectFieldInMap :: Nil))
}
test("SPARK-35096: test case insensitivity of pruned schema") {
- Seq(true, false).foreach(isCaseSensitive => {
+ val upperCaseSchema = StructType.fromDDL("A struct<A:int, B:int>, B int")
+ val lowerCaseSchema = StructType.fromDDL("a struct<a:int, b:int>, b int")
+ val upperCaseRequestedFields = Seq(StructField("A", StructType.fromDDL("A int")))
+ val lowerCaseRequestedFields = Seq(StructField("a", StructType.fromDDL("a int")))
+
+ Seq(true, false).foreach { isCaseSensitive =>
withSQLConf(CASE_SENSITIVE.key -> isCaseSensitive.toString) {
if (isCaseSensitive) {
- // Schema is case-sensitive
- val requestedFields = getRootFields(StructField("id", IntegerType))
- val prunedSchema = SchemaPruning.pruneDataSchema(
- StructType.fromDDL("ID int, name String"), requestedFields)
- assert(prunedSchema == StructType(Seq.empty))
- // Root fields are case-sensitive
- val rootFieldsSchema = SchemaPruning.pruneDataSchema(
- StructType.fromDDL("id int, name String"),
- getRootFields(StructField("ID", IntegerType)))
- assert(rootFieldsSchema == StructType(StructType(Seq.empty)))
+ testPrunedSchema(
+ upperCaseSchema,
+ upperCaseRequestedFields,
+ StructType.fromDDL("A struct<A:int>, B int"))
+ testPrunedSchema(
+ upperCaseSchema,
+ lowerCaseRequestedFields,
+ upperCaseSchema)
+
+ testPrunedSchema(
+ lowerCaseSchema,
+ upperCaseRequestedFields,
+ lowerCaseSchema)
+ testPrunedSchema(
+ lowerCaseSchema,
+ lowerCaseRequestedFields,
+ StructType.fromDDL("a struct<a:int>, b int"))
} else {
- // Schema is case-insensitive
- val prunedSchema = SchemaPruning.pruneDataSchema(
- StructType.fromDDL("ID int, name String"),
- getRootFields(StructField("id", IntegerType)))
- assert(prunedSchema == StructType(StructField("ID", IntegerType) :: Nil))
- // Root fields are case-insensitive
- val rootFieldsSchema = SchemaPruning.pruneDataSchema(
- StructType.fromDDL("id int, name String"),
- getRootFields(StructField("ID", IntegerType)))
- assert(rootFieldsSchema == StructType(StructField("id", IntegerType) :: Nil))
+ Seq(upperCaseRequestedFields, lowerCaseRequestedFields).foreach { requestedFields =>
+ testPrunedSchema(
+ upperCaseSchema,
+ requestedFields,
+ StructType.fromDDL("A struct<A:int>, B int"))
+ }
+
+ Seq(upperCaseRequestedFields, lowerCaseRequestedFields).foreach { requestedFields =>
+ testPrunedSchema(
+ lowerCaseSchema,
+ requestedFields,
+ StructType.fromDDL("a struct<a:int>, b int"))
+ }
}
}
- })
+ }
Review comment:
Tests LGTM, thanks for adding more scenarios.
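The scenarios added in this test can be approximated by the sketch below. It is a simplified model and not Spark code: a schema is reduced to a list of top-level names paired with their nested field names, and `CaseSensitivitySketch`, `prune`, and the two resolvers are hypothetical names introduced only for illustration. It assumes, as the expected schemas in the diff suggest, that matched fields keep the data schema's spelling and that unmatched top-level fields are left untouched.

```scala
object CaseSensitivitySketch {
  // Top-level column name -> nested field names (Nil for a primitive column).
  type Schema = List[(String, List[String])]
  // Decides whether two field names refer to the same field.
  type Resolver = (String, String) => Boolean

  val caseSensitive: Resolver = (a, b) => a == b
  val caseInsensitive: Resolver = (a, b) => a.equalsIgnoreCase(b)

  // Keeps every top-level column of `data`. A struct column that matches a
  // requested column is narrowed to the requested nested fields, spelled as
  // in `data`; everything else is returned unchanged.
  def prune(data: Schema, requested: Schema, resolver: Resolver): Schema =
    data.map { case (name, nested) =>
      requested.find { case (reqName, _) => resolver(name, reqName) } match {
        case Some((_, reqNested)) if nested.nonEmpty =>
          name -> nested.filter(n => reqNested.exists(r => resolver(n, r)))
        case _ =>
          name -> nested
      }
    }

  // Model of "A struct<A:int, B:int>, B int" and a lower-case request for a.a.
  val upperCaseSchema: Schema = List("A" -> List("A", "B"), "B" -> Nil)
  val lowerCaseRequested: Schema = List("a" -> List("a"))
}
```

With the case-sensitive resolver the lower-case request matches nothing and the schema comes back unchanged; with the case-insensitive resolver the struct is narrowed to `A struct<A>` while `B` is kept.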
[GitHub] [spark] SparkQA commented on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-820982478
[GitHub] [spark] cloud-fan commented on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning
Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-824091442
How far shall we backport? To 3.0?
[GitHub] [spark] wangyum commented on a change in pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning
Posted by GitBox <gi...@apache.org>.
wangyum commented on a change in pull request #31993:
URL: https://github.com/apache/spark/pull/31993#discussion_r617384262
##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/SchemaPruningSuite.scala
##########
@@ -60,44 +57,76 @@ class SchemaPruningSuite extends SparkFunSuite with SQLHelper {
arrayOfStruct :: StructField("b", structOfStruct) :: StructField("c", IntegerType) ::
mapOfStruct :: Nil)
- testPrunedSchema(complexStruct, StructField("a", ArrayType(StructType.fromDDL("b int"))),
- StructField("b", StructType.fromDDL("a int")))
testPrunedSchema(complexStruct,
- StructField("a", ArrayType(StructType.fromDDL("b int, c string"))),
- StructField("b", StructType.fromDDL("b int")))
+ Seq(StructField("a", ArrayType(StructType.fromDDL("b int"))),
+ StructField("b", StructType.fromDDL("a int"))),
+ StructType(
+ StructField("a", ArrayType(StructType.fromDDL("b int"))) ::
+ StructField("b", StructType.fromDDL("a int")) ::
+ StructField("c", IntegerType) ::
+ mapOfStruct :: Nil))
+ testPrunedSchema(complexStruct,
+ Seq(StructField("a", ArrayType(StructType.fromDDL("b int, c string"))),
+ StructField("b", StructType.fromDDL("b int"))),
+ StructType(
+ StructField("a", ArrayType(StructType.fromDDL("b int, c string"))) ::
+ StructField("b", StructType.fromDDL("b int")) ::
+ StructField("c", IntegerType) ::
+ mapOfStruct :: Nil))
val selectFieldInMap = StructField("d", MapType(StructType.fromDDL("a int, b int"),
StructType.fromDDL("e int, f string")))
- testPrunedSchema(complexStruct, StructField("c", IntegerType), selectFieldInMap)
+ testPrunedSchema(complexStruct,
+ Seq(StructField("c", IntegerType), selectFieldInMap),
+ StructType(
+ arrayOfStruct ::
+ StructField("b", structOfStruct) ::
+ StructField("c", IntegerType) ::
+ selectFieldInMap :: Nil))
}
test("SPARK-35096: test case insensitivity of pruned schema") {
- Seq(true, false).foreach(isCaseSensitive => {
+ val upperCaseSchema = StructType.fromDDL("A struct<A:int, B:int>, B int")
+ val lowerCaseSchema = StructType.fromDDL("a struct<a:int, b:int>, b int")
+ val upperCaseRequestedFields = Seq(StructField("A", StructType.fromDDL("A int")))
+ val lowerCaseRequestedFields = Seq(StructField("a", StructType.fromDDL("a int")))
+
+ Seq(true, false).foreach { isCaseSensitive =>
withSQLConf(CASE_SENSITIVE.key -> isCaseSensitive.toString) {
if (isCaseSensitive) {
- // Schema is case-sensitive
- val requestedFields = getRootFields(StructField("id", IntegerType))
- val prunedSchema = SchemaPruning.pruneDataSchema(
- StructType.fromDDL("ID int, name String"), requestedFields)
- assert(prunedSchema == StructType(Seq.empty))
- // Root fields are case-sensitive
- val rootFieldsSchema = SchemaPruning.pruneDataSchema(
- StructType.fromDDL("id int, name String"),
- getRootFields(StructField("ID", IntegerType)))
- assert(rootFieldsSchema == StructType(StructType(Seq.empty)))
+ testPrunedSchema(
+ upperCaseSchema,
+ upperCaseRequestedFields,
+ StructType.fromDDL("A struct<A:int>, B int"))
+ testPrunedSchema(
+ upperCaseSchema,
+ lowerCaseRequestedFields,
+ upperCaseSchema)
+
+ testPrunedSchema(
+ lowerCaseSchema,
+ upperCaseRequestedFields,
+ lowerCaseSchema)
+ testPrunedSchema(
+ lowerCaseSchema,
+ lowerCaseRequestedFields,
+ StructType.fromDDL("a struct<a:int>, b int"))
} else {
- // Schema is case-insensitive
- val prunedSchema = SchemaPruning.pruneDataSchema(
- StructType.fromDDL("ID int, name String"),
- getRootFields(StructField("id", IntegerType)))
- assert(prunedSchema == StructType(StructField("ID", IntegerType) :: Nil))
- // Root fields are case-insensitive
- val rootFieldsSchema = SchemaPruning.pruneDataSchema(
- StructType.fromDDL("id int, name String"),
- getRootFields(StructField("ID", IntegerType)))
- assert(rootFieldsSchema == StructType(StructField("id", IntegerType) :: Nil))
+ Seq(upperCaseRequestedFields, lowerCaseRequestedFields).foreach { requestedFields =>
+ testPrunedSchema(
+ upperCaseSchema,
+ requestedFields,
+ StructType.fromDDL("A struct<A:int>, B int"))
+ }
+
+ Seq(upperCaseRequestedFields, lowerCaseRequestedFields).foreach { requestedFields =>
+ testPrunedSchema(
+ lowerCaseSchema,
+ requestedFields,
+ StructType.fromDDL("a struct<a:int>, b int"))
+ }
}
}
- })
+ }
Review comment:
cc @sandeep-katta
[GitHub] [spark] AmplabJenkins commented on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-823975173
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42257/
[GitHub] [spark] cloud-fan commented on a change in pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning
Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31993:
URL: https://github.com/apache/spark/pull/31993#discussion_r615556485
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/HadoopFsRelation.scala
##########
@@ -40,7 +40,7 @@ import org.apache.spark.sql.types.{StructField, StructType}
case class HadoopFsRelation(
location: FileIndex,
partitionSchema: StructType,
- dataSchema: StructType,
+ dataSchema: StructType, // The top-level columns should not be pruned. Please see SPARK-34897.
Review comment:
Can we add more details?
```
// The top-level columns in `dataSchema` should match the actual physical file schema, otherwise
// the ORC data source may not work with the by-ordinal mode.
```
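A minimal sketch of why the by-ordinal mode needs the full top-level schema, assuming an ORC file whose physical columns carry only positional names (`_col0`, `_col1`, ...) as in the PR description. `ByOrdinalSketch` and its `requestedColumnIds` are hypothetical stand-ins that only mimic a length check followed by an index lookup; they are not the real `OrcUtils` implementation.

```scala
object ByOrdinalSketch {
  // Resolves each required column to a physical column index by position.
  // Returns Left with a message when the data schema is shorter than the
  // physical schema, because positions can no longer be trusted.
  def requestedColumnIds(
      physicalNames: Seq[String],
      dataSchemaNames: Seq[String],
      requiredNames: Seq[String]): Either[String, Seq[Int]] =
    if (physicalNames.length > dataSchemaNames.length) {
      Left("data schema has fewer fields than the physical schema")
    } else {
      Right(requiredNames.map { name =>
        val index = dataSchemaNames.indexOf(name)
        // A column missing from this file (or from the schema) maps to -1.
        if (index < physicalNames.length) index else -1
      })
    }

  // Physical file columns for the table in the PR description
  // (the partition column _col3 is not stored in the file).
  val physical: Seq[String] = Seq("_col0", "_col1", "_col2")
}
```

If `_col1` has been pruned from the data schema, the lookup is rejected; with all top-level columns present, `_col2` and `_col0` resolve to positions 2 and 0.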
[GitHub] [spark] AmplabJenkins commented on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-822487303
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/137582/
[GitHub] [spark] AmplabJenkins commented on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-823800994
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/137707/
[GitHub] [spark] wangyum commented on a change in pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning
Posted by GitBox <gi...@apache.org>.
wangyum commented on a change in pull request #31993:
URL: https://github.com/apache/spark/pull/31993#discussion_r613652117
##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SchemaPruning.scala
##########
@@ -32,11 +33,10 @@ object SchemaPruning {
// in the resulting schema may differ from their ordering in the logical relation's
// original schema
val mergedSchema = requestedRootFields
- .map { case root: RootField => StructType(Array(root.field)) }
+ .map { root: RootField => StructType(Array(root.field)) }
.reduceLeft(_ merge _)
- val dataSchemaFieldNames = dataSchema.fieldNames.toSet
val mergedDataSchema =
- StructType(mergedSchema.filter(f => dataSchemaFieldNames.contains(f.name)))
+ StructType(dataSchema.map(s => mergedSchema.find(_.name.equals(s.name)).getOrElse(s)))
Review comment:
```scala
spark.sql(
"""
|CREATE TABLE t1 (
| _col0 INT,
| _col1 STRING,
| _col2 STRUCT<c1: STRING, c2: STRING, c3: STRING, c4: BIGINT>)
|USING ORC
|""".stripMargin)
spark.sql("SELECT _col0, _col2.c1 FROM t1").show
```
Before this PR, `pruneDataSchema` returns:
```
`_col0` INT,`_col2` STRUCT<`c1`: STRING>
```
After this PR, `pruneDataSchema` returns:
```
`_col0` INT,`_col1` STRING,`_col2` STRUCT<`c1`: STRING>
```
It only prunes nested schemas; top-level columns are kept.
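The difference between the two results can be sketched with a small model. The types below (`DType`, `StructT`, `Field`) and the object `PruneSketch` are simplified, hypothetical stand-ins for Spark's `StructType` and `StructField`; `pruneOld` and `pruneNew` mirror only the two changed lines of the diff and leave out the rest of `pruneDataSchema`.

```scala
// Simplified stand-ins for Spark's data types; not the real Spark classes.
sealed trait DType
case object IntT extends DType
case object StringT extends DType
case object LongT extends DType
final case class Field(name: String, dtype: DType)
final case class StructT(fields: List[Field]) extends DType

object PruneSketch {
  // Old line: keep only the requested top-level fields that exist in the data schema.
  def pruneOld(data: StructT, merged: StructT): StructT = {
    val dataNames = data.fields.map(_.name).toSet
    StructT(merged.fields.filter(f => dataNames.contains(f.name)))
  }

  // New line: walk the data schema, swap in the (possibly narrowed) requested
  // field when there is one, and keep every other top-level field as is.
  def pruneNew(data: StructT, merged: StructT): StructT =
    StructT(data.fields.map(s => merged.fields.find(_.name == s.name).getOrElse(s)))

  // Table t1 from the example above.
  val t1: StructT = StructT(List(
    Field("_col0", IntT),
    Field("_col1", StringT),
    Field("_col2", StructT(List(
      Field("c1", StringT), Field("c2", StringT),
      Field("c3", StringT), Field("c4", LongT))))))

  // Fields requested by SELECT _col0, _col2.c1.
  val requested: StructT = StructT(List(
    Field("_col0", IntT),
    Field("_col2", StructT(List(Field("c1", StringT))))))
}
```

In this model `pruneOld(t1, requested)` drops `_col1`, while `pruneNew(t1, requested)` keeps it and still narrows `_col2` to `c1`.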
[GitHub] [spark] SparkQA removed a comment on pull request #31993: [SPARK-34897][SQL] Add workaround to error message when OrcUtils.requestedColumnIds fails
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-809231987
**[Test build #136644 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136644/testReport)** for PR 31993 at commit [`2a3f136`](https://github.com/apache/spark/commit/2a3f1367d6d90f34f2373c378910d4366052726a).
[GitHub] [spark] SparkQA commented on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-822221636
**[Test build #137582 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137582/testReport)** for PR 31993 at commit [`c04864a`](https://github.com/apache/spark/commit/c04864a92c7db8c5845e2f0ac13b77718f3e3e85).
[GitHub] [spark] SparkQA commented on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-820953513
**[Test build #137470 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137470/testReport)** for PR 31993 at commit [`c04864a`](https://github.com/apache/spark/commit/c04864a92c7db8c5845e2f0ac13b77718f3e3e85).
[GitHub] [spark] wangyum commented on a change in pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning
Posted by GitBox <gi...@apache.org>.
wangyum commented on a change in pull request #31993:
URL: https://github.com/apache/spark/pull/31993#discussion_r614521422
##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SchemaPruning.scala
##########
@@ -21,9 +21,12 @@ import org.apache.spark.sql.types._
object SchemaPruning {
/**
- * Filters the schema by the requested fields. For example, if the schema is struct<a:int, b:int>,
- * and given requested field are "a", the field "b" is pruned in the returned schema.
- * Note that schema field ordering at original schema is still preserved in pruned schema.
+ * Prunes the nested schema by the requested fields. For example, if the schema is
+ * struct<a:int, b:int>, and given requested field are "a", the field "b" is pruned in the
+ * returned schema.
+ * Note that:
+ * 1. The schema field ordering at original schema is still preserved in pruned schema.
+ * 2. The top-level fields are not pruned here.
Review comment:
No. Top-level columns are pruned by:
https://github.com/apache/spark/blob/7a5647a93aaea9d1d78d9262e24fc8c010db04d0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala#L208-L213
[GitHub] [spark] AmplabJenkins removed a comment on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-820850963
[GitHub] [spark] AmplabJenkins removed a comment on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-823975173
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42257/
[GitHub] [spark] AmplabJenkins commented on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-822197432
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42128/
[GitHub] [spark] AmplabJenkins commented on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-817284156
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/137176/
[GitHub] [spark] cloud-fan commented on pull request #31993: [SPARK-34897][SQL] Add workaround to error message when OrcUtils.requestedColumnIds fails
Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-809253890
Isn't it a bug? cc @viirya
[GitHub] [spark] cloud-fan commented on a change in pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning
Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31993:
URL: https://github.com/apache/spark/pull/31993#discussion_r615553908
##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SchemaPruning.scala
##########
@@ -21,9 +21,12 @@ import org.apache.spark.sql.types._
object SchemaPruning {
/**
- * Filters the schema by the requested fields. For example, if the schema is struct<a:int, b:int>,
- * and given requested field are "a", the field "b" is pruned in the returned schema.
- * Note that schema field ordering at original schema is still preserved in pruned schema.
+ * Prunes the nested schema by the requested fields. For example, if the schema is
+ * struct<a:int, b:int>, and given requested field are "a", the field "b" is pruned in the
Review comment:
The example is incorrect now. This method doesn't prune top-level fields.
[GitHub] [spark] SparkQA commented on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-823701348
**[Test build #137707 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137707/testReport)** for PR 31993 at commit [`6112c9d`](https://github.com/apache/spark/commit/6112c9dd357da127646e164cdf6797f5801c1049).
[GitHub] [spark] cloud-fan commented on a change in pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning
Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31993:
URL: https://github.com/apache/spark/pull/31993#discussion_r613895096
##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SchemaPruning.scala
##########
@@ -32,11 +33,10 @@ object SchemaPruning {
// in the resulting schema may differ from their ordering in the logical relation's
// original schema
val mergedSchema = requestedRootFields
- .map { case root: RootField => StructType(Array(root.field)) }
+ .map { root: RootField => StructType(Array(root.field)) }
.reduceLeft(_ merge _)
- val dataSchemaFieldNames = dataSchema.fieldNames.toSet
val mergedDataSchema =
- StructType(mergedSchema.filter(f => dataSchemaFieldNames.contains(f.name)))
+ StructType(dataSchema.map(s => mergedSchema.find(_.name.equals(s.name)).getOrElse(s)))
Review comment:
What's wrong with the previous behavior? We can't sacrifice performance in all cases just because the ORC by-ordinal case is problematic.
[GitHub] [spark] AmplabJenkins commented on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-822349306
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/137558/
[GitHub] [spark] wangyum commented on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning
Posted by GitBox <gi...@apache.org>.
wangyum commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-824113226
Yes, to 3.0.
[GitHub] [spark] viirya commented on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning
Posted by GitBox <gi...@apache.org>.
viirya commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-824230580
@wangyum There are conflicts in 3.1/3.0. Can you create backport PRs? Thanks.
[GitHub] [spark] SparkQA commented on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-820849046
Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42024/
--
[GitHub] [spark] cloud-fan commented on a change in pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning
Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31993:
URL: https://github.com/apache/spark/pull/31993#discussion_r613943891
##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SchemaPruning.scala
##########
@@ -32,11 +33,10 @@ object SchemaPruning {
// in the resulting schema may differ from their ordering in the logical relation's
// original schema
val mergedSchema = requestedRootFields
- .map { case root: RootField => StructType(Array(root.field)) }
+ .map { root: RootField => StructType(Array(root.field)) }
.reduceLeft(_ merge _)
- val dataSchemaFieldNames = dataSchema.fieldNames.toSet
val mergedDataSchema =
- StructType(mergedSchema.filter(f => dataSchemaFieldNames.contains(f.name)))
+ StructType(dataSchema.map(s => mergedSchema.find(_.name.equals(s.name)).getOrElse(s)))
Review comment:
Is it because column pruning will be done by other rules, so we don't need to consider it here?
--
[GitHub] [spark] AmplabJenkins removed a comment on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-822257596
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42135/
--
[GitHub] [spark] SparkQA removed a comment on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-820833460
**[Test build #137449 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137449/testReport)** for PR 31993 at commit [`a966bac`](https://github.com/apache/spark/commit/a966bac379ff6fed20c57cf3c748cada9da28b4c).
--
[GitHub] [spark] cloud-fan commented on a change in pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning
Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31993:
URL: https://github.com/apache/spark/pull/31993#discussion_r615553908
##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SchemaPruning.scala
##########
@@ -21,9 +21,12 @@ import org.apache.spark.sql.types._
object SchemaPruning {
/**
- * Filters the schema by the requested fields. For example, if the schema is struct<a:int, b:int>,
- * and given requested field are "a", the field "b" is pruned in the returned schema.
- * Note that schema field ordering at original schema is still preserved in pruned schema.
+ * Prunes the nested schema by the requested fields. For example, if the schema is
+ * struct<a:int, b:int>, and given requested field are "a", the field "b" is pruned in the
Review comment:
The example is incorrect now: this method doesn't prune root fields.
--
[GitHub] [spark] cloud-fan commented on a change in pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning
Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31993:
URL: https://github.com/apache/spark/pull/31993#discussion_r614090769
##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SchemaPruning.scala
##########
@@ -32,11 +33,10 @@ object SchemaPruning {
// in the resulting schema may differ from their ordering in the logical relation's
// original schema
val mergedSchema = requestedRootFields
- .map { case root: RootField => StructType(Array(root.field)) }
+ .map { root: RootField => StructType(Array(root.field)) }
.reduceLeft(_ merge _)
- val dataSchemaFieldNames = dataSchema.fieldNames.toSet
val mergedDataSchema =
- StructType(mergedSchema.filter(f => dataSchemaFieldNames.contains(f.name)))
+ StructType(dataSchema.map(s => mergedSchema.find(_.name.equals(s.name)).getOrElse(s)))
Review comment:
Can you provide the full code workflow to explain why this causes issues in ORC? I'm still not very sure.
--
[GitHub] [spark] viirya edited a comment on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning
Posted by GitBox <gi...@apache.org>.
viirya edited a comment on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-824229170
Thanks! Merging to master.
--
[GitHub] [spark] cloud-fan commented on a change in pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning
Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31993:
URL: https://github.com/apache/spark/pull/31993#discussion_r617276236
##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SchemaPruning.scala
##########
@@ -21,9 +21,12 @@ import org.apache.spark.sql.types._
object SchemaPruning {
/**
- * Filters the schema by the requested fields. For example, if the schema is struct<a:int, b:int>,
- * and given requested field are "a", the field "b" is pruned in the returned schema.
- * Note that schema field ordering at original schema is still preserved in pruned schema.
+ * Prunes the nested schema by the requested fields. For example, if the schema is:
+ * `id int, struct<a:int, b:int>`, and given requested field are "a", the field "b" is pruned
Review comment:
`and given requested field are "a"` -> `and given requested field "s.a"`
--
[GitHub] [spark] wangyum commented on a change in pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning
Posted by GitBox <gi...@apache.org>.
wangyum commented on a change in pull request #31993:
URL: https://github.com/apache/spark/pull/31993#discussion_r614080724
##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SchemaPruning.scala
##########
@@ -32,11 +33,10 @@ object SchemaPruning {
// in the resulting schema may differ from their ordering in the logical relation's
// original schema
val mergedSchema = requestedRootFields
- .map { case root: RootField => StructType(Array(root.field)) }
+ .map { root: RootField => StructType(Array(root.field)) }
.reduceLeft(_ merge _)
- val dataSchemaFieldNames = dataSchema.fieldNames.toSet
val mergedDataSchema =
- StructType(mergedSchema.filter(f => dataSchemaFieldNames.contains(f.name)))
+ StructType(dataSchema.map(s => mergedSchema.find(_.name.equals(s.name)).getOrElse(s)))
Review comment:
> is it because column pruning will be done by other rules so we don't need to consider it here?
Yes.
https://github.com/apache/spark/blob/7a5647a93aaea9d1d78d9262e24fc8c010db04d0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala#L208-L213
https://github.com/apache/spark/blob/e64eb75aede71a5403a4d4436e63b1fcfdeca14d/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/PushDownUtils.scala#L96-L97
--
[GitHub] [spark] viirya commented on a change in pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning
Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #31993:
URL: https://github.com/apache/spark/pull/31993#discussion_r614491534
##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SchemaPruning.scala
##########
@@ -32,11 +33,10 @@ object SchemaPruning {
// in the resulting schema may differ from their ordering in the logical relation's
// original schema
val mergedSchema = requestedRootFields
- .map { case root: RootField => StructType(Array(root.field)) }
+ .map { root: RootField => StructType(Array(root.field)) }
.reduceLeft(_ merge _)
- val dataSchemaFieldNames = dataSchema.fieldNames.toSet
val mergedDataSchema =
- StructType(mergedSchema.filter(f => dataSchemaFieldNames.contains(f.name)))
+ StructType(dataSchema.map(s => mergedSchema.find(_.name.equals(s.name)).getOrElse(s)))
Review comment:
> > is it because column pruning will be done by other rules so we don't need to consider it here?
>
> Yes.
>
Hmm? In `PushDownUtils.pruneColumns`, if you enable nested column pruning, Spark will only run the path of nested column pruning, not the quoted L96-97.
--
[GitHub] [spark] AmplabJenkins removed a comment on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-824135235
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/137730/
--
[GitHub] [spark] SparkQA commented on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-823972587
--
[GitHub] [spark] wangyum commented on pull request #31993: [SPARK-34897][SQL] Add workaround to error message when OrcUtils.requestedColumnIds fails
Posted by GitBox <gi...@apache.org>.
wangyum commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-809944891
It is a Hive ORC table in our production environment.
--
[GitHub] [spark] wangyum commented on a change in pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning
Posted by GitBox <gi...@apache.org>.
wangyum commented on a change in pull request #31993:
URL: https://github.com/apache/spark/pull/31993#discussion_r614484444
##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SchemaPruning.scala
##########
@@ -32,11 +33,10 @@ object SchemaPruning {
// in the resulting schema may differ from their ordering in the logical relation's
// original schema
val mergedSchema = requestedRootFields
- .map { case root: RootField => StructType(Array(root.field)) }
+ .map { root: RootField => StructType(Array(root.field)) }
.reduceLeft(_ merge _)
- val dataSchemaFieldNames = dataSchema.fieldNames.toSet
val mergedDataSchema =
- StructType(mergedSchema.filter(f => dataSchemaFieldNames.contains(f.name)))
+ StructType(dataSchema.map(s => mergedSchema.find(_.name.equals(s.name)).getOrElse(s)))
Review comment:
1. Prune the nested schema:
https://github.com/apache/spark/blob/0f2c0b53e8fb18c86c67b5dd679c006db93f94a5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SchemaPruning.scala#L28-L43
2. Use this pruned nested schema to build the `dataSchema` in `Relation`:
https://github.com/apache/spark/blob/25e7d1ceee8c9f4ecb4ab796f51e9bcbc0500fae/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/SchemaPruning.scala#L81-L86
3. `readDataColumns` is where the complete column pruning happens:
https://github.com/apache/spark/blob/7a5647a93aaea9d1d78d9262e24fc8c010db04d0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala#L208-L226
4. `dataSchema` comes from `relation.dataSchema`, so it is the pruned nested schema:
https://github.com/apache/spark/blob/935aa8c8db6824648483f26c2889c33030985259/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L398-L407
5. `OrcUtils.requestedColumnIds` uses this pruned nested schema:
https://github.com/apache/spark/blob/1fc66f68703e0b14e03fe0ed5ca93e9af20f41a9/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala#L193-L197
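The last step of this workflow can be sketched roughly as follows. This is a simplified model of the by-index mapping and its precondition, not the actual `OrcUtils.requestedColumnIds` code: the object name, the function `requestedColumnIdsByIndex`, and the use of plain name lists instead of Spark schemas are assumptions made for illustration. It assumes a file whose physical field names are positional (`_col0`, `_col1`, ...), as in the Hive-written case.

```scala
object OrcColumnIdsSketch {
  // Simplified model of mapping the physical schema to the data schema by index.
  // Hypothetical helper; the real logic lives in OrcUtils.requestedColumnIds.
  def requestedColumnIdsByIndex(
      orcFieldNames: Seq[String],
      dataSchemaNames: Seq[String],
      requiredNames: Seq[String]): Seq[Int] = {
    // Same precondition as the assertion quoted in the PR description.
    assert(orcFieldNames.length <= dataSchemaNames.length,
      "The given data schema has less fields than the actual ORC physical schema")
    requiredNames.map { name =>
      val index = dataSchemaNames.indexOf(name)
      // A column positioned beyond the physical schema is reported as missing (-1).
      if (index < orcFieldNames.length) index else -1
    }
  }

  // Physical schema of the file: three positional columns.
  val orcFieldNames: Seq[String] = Seq("_col0", "_col1", "_col2")

  def main(args: Array[String]): Unit = {
    // Root fields kept in the data schema: positions line up with the file.
    println(requestedColumnIdsByIndex(
      orcFieldNames, Seq("_col0", "_col1", "_col2"), Seq("_col0", "_col2")))

    // `_col1` dropped from the data schema: the precondition fails.
    try {
      requestedColumnIdsByIndex(orcFieldNames, Seq("_col0", "_col2"), Seq("_col0", "_col2"))
    } catch {
      case e: AssertionError => println(e.getMessage)
    }
  }
}
```

In this model, a data schema that has lost a root field is shorter than the physical schema, so positions can no longer be matched; keeping unrequested root fields in the pruned schema avoids that.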
--
[GitHub] [spark] AmplabJenkins removed a comment on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-821096738
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/137470/
--
[GitHub] [spark] SparkQA commented on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-820908865
**[Test build #137449 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137449/testReport)** for PR 31993 at commit [`a966bac`](https://github.com/apache/spark/commit/a966bac379ff6fed20c57cf3c748cada9da28b4c).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
--
[GitHub] [spark] SparkQA commented on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-817280194
**[Test build #137176 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137176/testReport)** for PR 31993 at commit [`e64eb75`](https://github.com/apache/spark/commit/e64eb75aede71a5403a4d4436e63b1fcfdeca14d).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
--
[GitHub] [spark] AmplabJenkins removed a comment on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-822349306
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/137558/
--
[GitHub] [spark] SparkQA commented on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-817256491
Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41754/
--
[GitHub] [spark] wangyum commented on a change in pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning
Posted by GitBox <gi...@apache.org>.
wangyum commented on a change in pull request #31993:
URL: https://github.com/apache/spark/pull/31993#discussion_r613652117
##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SchemaPruning.scala
##########
@@ -32,11 +33,10 @@ object SchemaPruning {
// in the resulting schema may differ from their ordering in the logical relation's
// original schema
val mergedSchema = requestedRootFields
- .map { case root: RootField => StructType(Array(root.field)) }
+ .map { root: RootField => StructType(Array(root.field)) }
.reduceLeft(_ merge _)
- val dataSchemaFieldNames = dataSchema.fieldNames.toSet
val mergedDataSchema =
- StructType(mergedSchema.filter(f => dataSchemaFieldNames.contains(f.name)))
+ StructType(dataSchema.map(s => mergedSchema.find(_.name.equals(s.name)).getOrElse(s)))
Review comment:
```scala
spark.sql(
"""
|CREATE TABLE t1 (
| _col0 INT,
| _col1 STRING,
| _col2 STRUCT<c1: STRING, c2: STRING, c3: STRING, c4: BIGINT>)
|USING ORC
|""".stripMargin)
spark.sql("SELECT _col0, _col2.c1 FROM t1").show
```
The original schema is:
```
`_col0` INT,`_col1` STRING,`_col2` STRUCT<`c1`: STRING, `c2`: STRING, `c3`: STRING, `c4`: BIGINT>
```
Before this PR, `pruneDataSchema` returns:
```
`_col0` INT,`_col2` STRUCT<`c1`: STRING>
```
After this PR, `pruneDataSchema` returns:
```
`_col0` INT,`_col1` STRING,`_col2` STRUCT<`c1`: STRING>
```
It only prunes nested schemas; root fields such as `_col1` are kept.
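The difference between the two versions of the changed line can be reproduced without Spark. The sketch below is only an approximation: `RootField`-like entries are modelled by a hypothetical `Field` case class with a string data type, and `pruneBefore` / `pruneAfter` are illustrative names that mirror the removed and added lines of the diff.

```scala
// Simplified stand-in for Spark's StructField (hypothetical, for illustration only).
final case class Field(name: String, dataType: String)

object PruneDataSchemaSketch {
  // Removed lines: keep only the requested root fields that exist in dataSchema.
  def pruneBefore(dataSchema: Seq[Field], mergedSchema: Seq[Field]): Seq[Field] = {
    val dataSchemaFieldNames = dataSchema.map(_.name).toSet
    mergedSchema.filter(f => dataSchemaFieldNames.contains(f.name))
  }

  // Added line: walk dataSchema in order, swap in the pruned version of a root
  // field when one was requested, and keep every other root field unchanged.
  def pruneAfter(dataSchema: Seq[Field], mergedSchema: Seq[Field]): Seq[Field] =
    dataSchema.map(s => mergedSchema.find(_.name == s.name).getOrElse(s))

  val dataSchema: Seq[Field] = Seq(
    Field("_col0", "INT"),
    Field("_col1", "STRING"),
    Field("_col2", "STRUCT<c1: STRING, c2: STRING, c3: STRING, c4: BIGINT>"))

  // Merged requested root fields for `SELECT _col0, _col2.c1`.
  val mergedSchema: Seq[Field] = Seq(
    Field("_col0", "INT"),
    Field("_col2", "STRUCT<c1: STRING>"))

  def main(args: Array[String]): Unit = {
    println(pruneBefore(dataSchema, mergedSchema).map(_.name).mkString(", "))
    println(pruneAfter(dataSchema, mergedSchema).map(_.name).mkString(", "))
  }
}
```

Because `pruneAfter` iterates over `dataSchema` rather than over the merged schema, the field count and ordering of the original schema are preserved.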
--
[GitHub] [spark] SparkQA removed a comment on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-820953513
**[Test build #137470 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137470/testReport)** for PR 31993 at commit [`c04864a`](https://github.com/apache/spark/commit/c04864a92c7db8c5845e2f0ac13b77718f3e3e85).
--
[GitHub] [spark] cloud-fan commented on a change in pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning
Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31993:
URL: https://github.com/apache/spark/pull/31993#discussion_r617276637
##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SchemaPruning.scala
##########
@@ -21,9 +21,12 @@ import org.apache.spark.sql.types._
object SchemaPruning {
/**
- * Filters the schema by the requested fields. For example, if the schema is struct<a:int, b:int>,
- * and given requested field are "a", the field "b" is pruned in the returned schema.
- * Note that schema field ordering at original schema is still preserved in pruned schema.
+ * Prunes the nested schema by the requested fields. For example, if the schema is:
+ * `id int, struct<a:int, b:int>`, and given requested field are "a", the field "b" is pruned
+ * in the returned schema: `id int, struct<a:int>`.
Review comment:
ditto, `id int, s struct<a:int>`
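The corrected doc example (schema `id int, s struct<a:int, b:int>`, requested field `s.a`, result `id int, s struct<a:int>`) can be checked with a small model. Everything here is an assumption for illustration: `MiniType`, `MiniInt`, `MiniStruct`, and `pruneNested` are hypothetical names, and the sketch handles only one level of nesting rather than Spark's real types.

```scala
// Hypothetical miniature type model; Spark's real types live in org.apache.spark.sql.types.
sealed trait MiniType
case object MiniInt extends MiniType
final case class MiniStruct(fields: Seq[(String, MiniType)]) extends MiniType

object NestedPruneSketch {
  // Keeps every root field, but narrows the struct named `structName` to the child `child`.
  def pruneNested(schema: MiniStruct, structName: String, child: String): MiniStruct =
    MiniStruct(schema.fields.map {
      case (name, MiniStruct(children)) if name == structName =>
        (name, MiniStruct(children.filter(_._1 == child)))
      case other => other
    })

  // Renders a type in a catalog-string-like form.
  def render(t: MiniType): String = t match {
    case MiniInt => "int"
    case MiniStruct(fs) =>
      fs.map { case (n, ft) => s"$n:${render(ft)}" }.mkString("struct<", ",", ">")
  }

  val schema: MiniStruct = MiniStruct(Seq(
    ("id", MiniInt),
    ("s", MiniStruct(Seq(("a", MiniInt), ("b", MiniInt))))))

  def main(args: Array[String]): Unit =
    println(render(pruneNested(schema, "s", "a")))
}
```

The root field `id` survives even though it was not requested, which is the behaviour the review comments ask the doc comment to describe.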
--
[GitHub] [spark] SparkQA commented on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-822165657
**[Test build #137558 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137558/testReport)** for PR 31993 at commit [`c04864a`](https://github.com/apache/spark/commit/c04864a92c7db8c5845e2f0ac13b77718f3e3e85).
--
[GitHub] [spark] AmplabJenkins removed a comment on pull request #31993: [SPARK-34897][SQL] Add workaround to error message when OrcUtils.requestedColumnIds fails
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-809430329
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41226/
--
[GitHub] [spark] viirya commented on a change in pull request #31993: [SPARK-34897][SQL] Add workaround to error message when OrcUtils.requestedColumnIds fails
Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #31993:
URL: https://github.com/apache/spark/pull/31993#discussion_r603811269
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala
##########
@@ -157,7 +158,8 @@ object OrcUtils extends Logging {
// In these cases we map the physical schema to the data schema by index.
assert(orcFieldNames.length <= dataSchema.length, "The given data schema " +
s"${dataSchema.catalogString} has less fields than the actual ORC physical schema, " +
- "no idea which columns were dropped, fail to read.")
+ "no idea which columns were dropped, fail to read. Try to disable " +
+ s"${NESTED_SCHEMA_PRUNING_ENABLED.key} to workaround this issue.")
Review comment:
+1
--
[GitHub] [spark] AmplabJenkins removed a comment on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-822197432
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42128/
--
[GitHub] [spark] SparkQA commented on pull request #31993: [SPARK-34897][SQL] Add workaround to error message when OrcUtils.requestedColumnIds fails
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-809231987
**[Test build #136644 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136644/testReport)** for PR 31993 at commit [`2a3f136`](https://github.com/apache/spark/commit/2a3f1367d6d90f34f2373c378910d4366052726a).
--
[GitHub] [spark] SparkQA commented on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-822464363
**[Test build #137582 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137582/testReport)** for PR 31993 at commit [`c04864a`](https://github.com/apache/spark/commit/c04864a92c7db8c5845e2f0ac13b77718f3e3e85).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
--
[GitHub] [spark] dongjoon-hyun commented on a change in pull request #31993: [SPARK-34897][SQL] Add workaround to error message when OrcUtils.requestedColumnIds fails
Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #31993:
URL: https://github.com/apache/spark/pull/31993#discussion_r603399343
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala
##########
@@ -157,7 +158,8 @@ object OrcUtils extends Logging {
// In these cases we map the physical schema to the data schema by index.
assert(orcFieldNames.length <= dataSchema.length, "The given data schema " +
s"${dataSchema.catalogString} has less fields than the actual ORC physical schema, " +
- "no idea which columns were dropped, fail to read.")
+ "no idea which columns were dropped, fail to read. Try to disable " +
+ s"${NESTED_SCHEMA_PRUNING_ENABLED.key} to workaround this issue.")
Review comment:
Although `spark.sql.optimizer.nestedSchemaPruning.enabled` is true by default, can we hide this configuration hint when `spark.sql.optimizer.nestedSchemaPruning.enabled=false` is already set, @wangyum?
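A minimal sketch of what this suggestion could look like, assuming the caller can pass in the current value of the flag. `ErrorMessageSketch` and `schemaMismatchMessage` are hypothetical names used only for illustration; they are not part of `OrcUtils` or of this PR.

```scala
// Hypothetical helper, not Spark code: build the assertion message and append
// the workaround hint only when nested schema pruning is currently enabled.
object ErrorMessageSketch {
  // The configuration key quoted in the review comment above.
  val NestedSchemaPruningKey = "spark.sql.optimizer.nestedSchemaPruning.enabled"

  def schemaMismatchMessage(catalogString: String, pruningEnabled: Boolean): String = {
    val base = s"The given data schema $catalogString has less fields than the actual " +
      "ORC physical schema, no idea which columns were dropped, fail to read."
    if (pruningEnabled) {
      // Disabling the flag is only useful advice while it is still enabled.
      s"$base Try to disable $NestedSchemaPruningKey to workaround this issue."
    } else {
      base
    }
  }
}
```

With the flag already disabled, the message stays identical to the one produced before this PR, so the user is not told to change a setting that is already off.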
--
[GitHub] [spark] AmplabJenkins removed a comment on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-823719172
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42235/
--
[GitHub] [spark] AmplabJenkins commented on pull request #31993: [SPARK-34897][SQL] Add workaround to error message when OrcUtils.requestedColumnIds fails
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-809430329
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41226/
--
[GitHub] [spark] viirya commented on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning
Posted by GitBox <gi...@apache.org>.
viirya commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-824229170
Thanks! Merging to master/3.1/3.0.
--
[GitHub] [spark] SparkQA commented on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-822347639
**[Test build #137558 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137558/testReport)** for PR 31993 at commit [`c04864a`](https://github.com/apache/spark/commit/c04864a92c7db8c5845e2f0ac13b77718f3e3e85).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
--
[GitHub] [spark] sandeep-katta0102 commented on a change in pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning
Posted by GitBox <gi...@apache.org>.
sandeep-katta0102 commented on a change in pull request #31993:
URL: https://github.com/apache/spark/pull/31993#discussion_r617395326
##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/SchemaPruningSuite.scala
##########
@@ -60,44 +57,76 @@ class SchemaPruningSuite extends SparkFunSuite with SQLHelper {
arrayOfStruct :: StructField("b", structOfStruct) :: StructField("c", IntegerType) ::
mapOfStruct :: Nil)
- testPrunedSchema(complexStruct, StructField("a", ArrayType(StructType.fromDDL("b int"))),
- StructField("b", StructType.fromDDL("a int")))
testPrunedSchema(complexStruct,
- StructField("a", ArrayType(StructType.fromDDL("b int, c string"))),
- StructField("b", StructType.fromDDL("b int")))
+ Seq(StructField("a", ArrayType(StructType.fromDDL("b int"))),
+ StructField("b", StructType.fromDDL("a int"))),
+ StructType(
+ StructField("a", ArrayType(StructType.fromDDL("b int"))) ::
+ StructField("b", StructType.fromDDL("a int")) ::
+ StructField("c", IntegerType) ::
+ mapOfStruct :: Nil))
+ testPrunedSchema(complexStruct,
+ Seq(StructField("a", ArrayType(StructType.fromDDL("b int, c string"))),
+ StructField("b", StructType.fromDDL("b int"))),
+ StructType(
+ StructField("a", ArrayType(StructType.fromDDL("b int, c string"))) ::
+ StructField("b", StructType.fromDDL("b int")) ::
+ StructField("c", IntegerType) ::
+ mapOfStruct :: Nil))
val selectFieldInMap = StructField("d", MapType(StructType.fromDDL("a int, b int"),
StructType.fromDDL("e int, f string")))
- testPrunedSchema(complexStruct, StructField("c", IntegerType), selectFieldInMap)
+ testPrunedSchema(complexStruct,
+ Seq(StructField("c", IntegerType), selectFieldInMap),
+ StructType(
+ arrayOfStruct ::
+ StructField("b", structOfStruct) ::
+ StructField("c", IntegerType) ::
+ selectFieldInMap :: Nil))
}
test("SPARK-35096: test case insensitivity of pruned schema") {
- Seq(true, false).foreach(isCaseSensitive => {
+ val upperCaseSchema = StructType.fromDDL("A struct<A:int, B:int>, B int")
+ val lowerCaseSchema = StructType.fromDDL("a struct<a:int, b:int>, b int")
+ val upperCaseRequestedFields = Seq(StructField("A", StructType.fromDDL("A int")))
+ val lowerCaseRequestedFields = Seq(StructField("a", StructType.fromDDL("a int")))
+
+ Seq(true, false).foreach { isCaseSensitive =>
withSQLConf(CASE_SENSITIVE.key -> isCaseSensitive.toString) {
if (isCaseSensitive) {
- // Schema is case-sensitive
- val requestedFields = getRootFields(StructField("id", IntegerType))
- val prunedSchema = SchemaPruning.pruneDataSchema(
- StructType.fromDDL("ID int, name String"), requestedFields)
- assert(prunedSchema == StructType(Seq.empty))
- // Root fields are case-sensitive
- val rootFieldsSchema = SchemaPruning.pruneDataSchema(
- StructType.fromDDL("id int, name String"),
- getRootFields(StructField("ID", IntegerType)))
- assert(rootFieldsSchema == StructType(StructType(Seq.empty)))
+ testPrunedSchema(
+ upperCaseSchema,
+ upperCaseRequestedFields,
+ StructType.fromDDL("A struct<A:int>, B int"))
+ testPrunedSchema(
+ upperCaseSchema,
+ lowerCaseRequestedFields,
+ upperCaseSchema)
+
+ testPrunedSchema(
+ lowerCaseSchema,
+ upperCaseRequestedFields,
+ lowerCaseSchema)
+ testPrunedSchema(
+ lowerCaseSchema,
+ lowerCaseRequestedFields,
+ StructType.fromDDL("a struct<a:int>, b int"))
} else {
- // Schema is case-insensitive
- val prunedSchema = SchemaPruning.pruneDataSchema(
- StructType.fromDDL("ID int, name String"),
- getRootFields(StructField("id", IntegerType)))
- assert(prunedSchema == StructType(StructField("ID", IntegerType) :: Nil))
- // Root fields are case-insensitive
- val rootFieldsSchema = SchemaPruning.pruneDataSchema(
- StructType.fromDDL("id int, name String"),
- getRootFields(StructField("ID", IntegerType)))
- assert(rootFieldsSchema == StructType(StructField("id", IntegerType) :: Nil))
+ Seq(upperCaseRequestedFields, lowerCaseRequestedFields).foreach { requestedFields =>
+ testPrunedSchema(
+ upperCaseSchema,
+ requestedFields,
+ StructType.fromDDL("A struct<A:int>, B int"))
+ }
+
+ Seq(upperCaseRequestedFields, lowerCaseRequestedFields).foreach { requestedFields =>
+ testPrunedSchema(
+ lowerCaseSchema,
+ requestedFields,
+ StructType.fromDDL("a struct<a:int>, b int"))
+ }
}
}
- })
+ }
Review comment:
Tests LGTM, thanks for adding more scenarios
--
[GitHub] [spark] SparkQA removed a comment on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-822165657
**[Test build #137558 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137558/testReport)** for PR 31993 at commit [`c04864a`](https://github.com/apache/spark/commit/c04864a92c7db8c5845e2f0ac13b77718f3e3e85).
--
[GitHub] [spark] AmplabJenkins commented on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-817257211
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41754/
--
[GitHub] [spark] cloud-fan commented on a change in pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning
Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31993:
URL: https://github.com/apache/spark/pull/31993#discussion_r613389446
##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SchemaPruning.scala
##########
@@ -32,11 +33,10 @@ object SchemaPruning {
// in the resulting schema may differ from their ordering in the logical relation's
// original schema
val mergedSchema = requestedRootFields
- .map { case root: RootField => StructType(Array(root.field)) }
+ .map { root: RootField => StructType(Array(root.field)) }
.reduceLeft(_ merge _)
- val dataSchemaFieldNames = dataSchema.fieldNames.toSet
val mergedDataSchema =
- StructType(mergedSchema.filter(f => dataSchemaFieldNames.contains(f.name)))
+ StructType(dataSchema.map(s => mergedSchema.find(_.name.equals(s.name)).getOrElse(s)))
Review comment:
If this is the case, please update the documentation of this method.
--
[GitHub] [spark] AmplabJenkins commented on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-820922563
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/137449/
--
[GitHub] [spark] viirya commented on a change in pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning
Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #31993:
URL: https://github.com/apache/spark/pull/31993#discussion_r614490313
##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SchemaPruning.scala
##########
@@ -32,11 +33,10 @@ object SchemaPruning {
// in the resulting schema may differ from their ordering in the logical relation's
// original schema
val mergedSchema = requestedRootFields
- .map { case root: RootField => StructType(Array(root.field)) }
+ .map { root: RootField => StructType(Array(root.field)) }
.reduceLeft(_ merge _)
- val dataSchemaFieldNames = dataSchema.fieldNames.toSet
val mergedDataSchema =
- StructType(mergedSchema.filter(f => dataSchemaFieldNames.contains(f.name)))
+ StructType(dataSchema.map(s => mergedSchema.find(_.name.equals(s.name)).getOrElse(s)))
Review comment:
It is because `requestedColumnIds` checks whether the given data schema has fewer fields than the physical schema in the ORC file.
Under nested column pruning, Spark lets the data source use the pruned schema as the data schema to read files. E.g., in the example above, Spark prunes `_col1`. But the ORC file has three top-level fields, `_col0`, `_col1`, and `_col2`, so the check in `requestedColumnIds` fails in this case.
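The difference between the removed and the added line in the diff can be sketched without Spark. The following is a simplified model, not the real implementation: `Field` stands in for `StructField`, data types are plain strings, and `PruningSketch` with its method names is hypothetical. Only the two pruning expressions and the length condition are taken from the diffs quoted in this thread.

```scala
// Simplified stand-in for a top-level field; not Spark's StructField.
final case class Field(name: String, dataType: String)

object PruningSketch {
  // Removed line: keep only the requested fields that exist in the data
  // schema, so unrequested top-level fields such as `_col1` disappear.
  def pruneDroppingTopLevel(dataSchema: Seq[Field], merged: Seq[Field]): Seq[Field] = {
    val names = dataSchema.map(_.name).toSet
    merged.filter(f => names.contains(f.name))
  }

  // Added line: walk the data schema and substitute the pruned version of a
  // field when one was requested; otherwise keep the original field.
  def pruneKeepingTopLevel(dataSchema: Seq[Field], merged: Seq[Field]): Seq[Field] =
    dataSchema.map(s => merged.find(_.name == s.name).getOrElse(s))

  // Mirrors the condition asserted in `requestedColumnIds`.
  def passesLengthCheck(orcFieldNames: Seq[String], dataSchema: Seq[Field]): Boolean =
    orcFieldNames.length <= dataSchema.length
}
```

For the table in the PR description, the file has the three physical fields `_col0`, `_col1`, and `_col2`, while the query requests only `_col0` and `_col2.c1`. In this model the first variant returns two fields and fails the length check; the second returns three fields, with `_col2` narrowed to `struct<c1:string>`, and passes.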
--
[GitHub] [spark] SparkQA commented on pull request #31993: [SPARK-34897][SQL] Add workaround to error message when OrcUtils.requestedColumnIds fails
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-809395032
Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41226/
--
[GitHub] [spark] cloud-fan commented on a change in pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning
Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31993:
URL: https://github.com/apache/spark/pull/31993#discussion_r617275883
##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SchemaPruning.scala
##########
@@ -21,9 +21,12 @@ import org.apache.spark.sql.types._
object SchemaPruning {
/**
- * Filters the schema by the requested fields. For example, if the schema is struct<a:int, b:int>,
- * and given requested field are "a", the field "b" is pruned in the returned schema.
- * Note that schema field ordering at original schema is still preserved in pruned schema.
+ * Prunes the nested schema by the requested fields. For example, if the schema is:
+ * `id int, struct<a:int, b:int>`, and given requested field are "a", the field "b" is pruned
Review comment:
Top-level columns need to have a name, e.g. `id int, s struct<a:int, b:int>`.
--
[GitHub] [spark] wangyum commented on a change in pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning
Posted by GitBox <gi...@apache.org>.
wangyum commented on a change in pull request #31993:
URL: https://github.com/apache/spark/pull/31993#discussion_r617212769
##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SchemaPruning.scala
##########
@@ -21,9 +21,12 @@ import org.apache.spark.sql.types._
object SchemaPruning {
/**
- * Filters the schema by the requested fields. For example, if the schema is struct<a:int, b:int>,
- * and given requested field are "a", the field "b" is pruned in the returned schema.
- * Note that schema field ordering at original schema is still preserved in pruned schema.
+ * Prunes the nested schema by the requested fields. For example, if the schema is
+ * struct<a:int, b:int>, and given requested field are "a", the field "b" is pruned in the
Review comment:
Change the example to `id int, struct<a:int, b:int>`.
--
[GitHub] [spark] cloud-fan commented on a change in pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning
Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31993:
URL: https://github.com/apache/spark/pull/31993#discussion_r613388072
##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SchemaPruning.scala
##########
@@ -32,11 +33,10 @@ object SchemaPruning {
// in the resulting schema may differ from their ordering in the logical relation's
// original schema
val mergedSchema = requestedRootFields
- .map { case root: RootField => StructType(Array(root.field)) }
+ .map { root: RootField => StructType(Array(root.field)) }
.reduceLeft(_ merge _)
- val dataSchemaFieldNames = dataSchema.fieldNames.toSet
val mergedDataSchema =
- StructType(mergedSchema.filter(f => dataSchemaFieldNames.contains(f.name)))
+ StructType(dataSchema.map(s => mergedSchema.find(_.name.equals(s.name)).getOrElse(s)))
Review comment:
What's the actual difference? Can you give a simple example?
--
[GitHub] [spark] wangyum commented on a change in pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning
Posted by GitBox <gi...@apache.org>.
wangyum commented on a change in pull request #31993:
URL: https://github.com/apache/spark/pull/31993#discussion_r614539225
##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SchemaPruning.scala
##########
@@ -21,9 +21,12 @@ import org.apache.spark.sql.types._
object SchemaPruning {
/**
- * Filters the schema by the requested fields. For example, if the schema is struct<a:int, b:int>,
- * and given requested field are "a", the field "b" is pruned in the returned schema.
- * Note that schema field ordering at original schema is still preserved in pruned schema.
+ * Prunes the nested schema by the requested fields. For example, if the schema is
+ * struct<a:int, b:int>, and given requested field are "a", the field "b" is pruned in the
+ * returned schema.
+ * Note that:
+ * 1. The schema field ordering at original schema is still preserved in pruned schema.
+ * 2. The top-level fields are not pruned here.
Review comment:
Yes, I have updated the v2 part:
https://github.com/apache/spark/blob/a966bac379ff6fed20c57cf3c748cada9da28b4c/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/PushDownUtils.scala#L79-L110
--
[GitHub] [spark] AmplabJenkins removed a comment on pull request #31993: [SPARK-34897][SQL] Add workaround to error message when OrcUtils.requestedColumnIds fails
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-809528555
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136644/
--
[GitHub] [spark] AmplabJenkins removed a comment on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-817284156
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/137176/
--
[GitHub] [spark] SparkQA removed a comment on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-817251437
**[Test build #137176 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137176/testReport)** for PR 31993 at commit [`e64eb75`](https://github.com/apache/spark/commit/e64eb75aede71a5403a4d4436e63b1fcfdeca14d).
--
[GitHub] [spark] AmplabJenkins commented on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-820985909
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42045/
--
[GitHub] [spark] wangyum commented on a change in pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning
Posted by GitBox <gi...@apache.org>.
wangyum commented on a change in pull request #31993:
URL: https://github.com/apache/spark/pull/31993#discussion_r612033519
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/PushDownUtils.scala
##########
@@ -89,14 +93,12 @@ object PushDownUtils extends PredicateHelper {
} else {
new StructType()
}
- r.pruneColumns(prunedSchema)
+ val neededFieldNames = neededOutput.map(_.name).toSet
+ r.pruneColumns(StructType(prunedSchema.filter(f => neededFieldNames.contains(f.name))))
Review comment:
Move the [filter logic from `SchemaPruning`](https://github.com/apache/spark/blob/0f2c0b53e8fb18c86c67b5dd679c006db93f94a5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SchemaPruning.scala#L38-L39) to `PushDownUtils` to support datasource V2 column pruning.
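As a rough illustration of the relocated step, the sketch below applies the same name-based filter as the added lines in the diff, using plain Scala collections instead of `StructType`. `Column`, `V2PruningSketch`, and `pruneForV2` are hypothetical names; only the filter expression follows the diff.

```scala
// Simplified stand-in for a top-level column; not Spark's StructField.
final case class Column(name: String, dataType: String)

object V2PruningSketch {
  // The schema handed in still contains every top-level column, so the V2
  // path drops the unneeded ones itself before calling pruneColumns.
  def pruneForV2(prunedSchema: Seq[Column], neededOutputNames: Seq[String]): Seq[Column] = {
    val neededFieldNames = neededOutputNames.toSet
    prunedSchema.filter(f => neededFieldNames.contains(f.name))
  }
}
```

In this model the relative order of the remaining columns follows the input schema, because `filter` preserves order.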
--
[GitHub] [spark] SparkQA commented on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-817255876
Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41754/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-817257211
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41754/
[GitHub] [spark] cloud-fan commented on a change in pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning
Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31993:
URL: https://github.com/apache/spark/pull/31993#discussion_r613943448
##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SchemaPruning.scala
##########
@@ -32,11 +33,10 @@ object SchemaPruning {
// in the resulting schema may differ from their ordering in the logical relation's
// original schema
val mergedSchema = requestedRootFields
- .map { case root: RootField => StructType(Array(root.field)) }
+ .map { root: RootField => StructType(Array(root.field)) }
.reduceLeft(_ merge _)
- val dataSchemaFieldNames = dataSchema.fieldNames.toSet
val mergedDataSchema =
- StructType(mergedSchema.filter(f => dataSchemaFieldNames.contains(f.name)))
+ StructType(dataSchema.map(s => mergedSchema.find(_.name.equals(s.name)).getOrElse(s)))
Review comment:
I don't know the details enough to understand why nested column pruning still works after the change here. @viirya can you take a look?
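One way to read the diff above, sketched with simplified types. `SimpleField`, `oldMergedDataSchema`, and `newMergedDataSchema` are hypothetical names, and nested types are modeled as plain strings rather than Spark's `StructType`. The sketch assumes the merged schema already holds the nested-pruned version of each requested root field:

```scala
// Simplified stand-in for a top-level schema field; not Spark's StructField.
final case class SimpleField(name: String, dataType: String)

object MergedDataSchemaSketch {
  // Before the change: keep merged (pruned) fields that exist in the data
  // schema, so top-level fields that were not requested disappear.
  def oldMergedDataSchema(
      dataSchema: Seq[SimpleField],
      mergedSchema: Seq[SimpleField]): Seq[SimpleField] = {
    val dataSchemaFieldNames = dataSchema.map(_.name).toSet
    mergedSchema.filter(f => dataSchemaFieldNames.contains(f.name))
  }

  // After the change: walk the data schema in order and substitute the pruned
  // version of a field when one exists, otherwise keep the field unchanged.
  def newMergedDataSchema(
      dataSchema: Seq[SimpleField],
      mergedSchema: Seq[SimpleField]): Seq[SimpleField] =
    dataSchema.map(s => mergedSchema.find(_.name == s.name).getOrElse(s))

  def main(args: Array[String]): Unit = {
    val dataSchema = Seq(
      SimpleField("_col0", "int"),
      SimpleField("_col1", "string"),
      SimpleField("_col2", "struct<c1:string,c2:string,c3:string,c4:bigint>"))
    // Merged requested root fields for `SELECT _col2.c1, _col0`.
    val mergedSchema = Seq(
      SimpleField("_col2", "struct<c1:string>"),
      SimpleField("_col0", "int"))
    println(oldMergedDataSchema(dataSchema, mergedSchema))
    println(newMergedDataSchema(dataSchema, mergedSchema))
  }
}
```

Under these assumptions the new version still replaces `_col2` with its pruned struct, so nested pruning is preserved, but it no longer drops the unrequested top-level field `_col1`. Dropping top-level fields would then have to happen in a separate step.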
[GitHub] [spark] AmplabJenkins commented on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-822257596
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42135/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-822487303
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/137582/
[GitHub] [spark] SparkQA commented on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-817251437
**[Test build #137176 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137176/testReport)** for PR 31993 at commit [`e64eb75`](https://github.com/apache/spark/commit/e64eb75aede71a5403a4d4436e63b1fcfdeca14d).
[GitHub] [spark] SparkQA commented on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-823715879
Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42235/
[GitHub] [spark] cloud-fan commented on a change in pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning
Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31993:
URL: https://github.com/apache/spark/pull/31993#discussion_r617276414
##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SchemaPruning.scala
##########
@@ -21,9 +21,12 @@ import org.apache.spark.sql.types._
object SchemaPruning {
/**
- * Filters the schema by the requested fields. For example, if the schema is struct<a:int, b:int>,
- * and given requested field are "a", the field "b" is pruned in the returned schema.
- * Note that schema field ordering at original schema is still preserved in pruned schema.
+ * Prunes the nested schema by the requested fields. For example, if the schema is:
+ * `id int, struct<a:int, b:int>`, and given requested field are "a", the field "b" is pruned
Review comment:
`the field "b" is pruned` -> `the inner field "b" ...`
[GitHub] [spark] wangyum commented on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning
Posted by GitBox <gi...@apache.org>.
wangyum commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-817253696
> Sorry, I may be missing something. Why is it only a problem in nested column pruning but not in column pruning?
Nested column pruning removed the field:
https://github.com/apache/spark/blob/0f2c0b53e8fb18c86c67b5dd679c006db93f94a5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SchemaPruning.scala#L28-L42
[GitHub] [spark] SparkQA commented on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-823717509
Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42235/
[GitHub] [spark] viirya closed pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning
Posted by GitBox <gi...@apache.org>.
viirya closed pull request #31993:
URL: https://github.com/apache/spark/pull/31993
[GitHub] [spark] wangyum commented on pull request #31993: [SPARK-34897][SQL] Add workaround to error message when OrcUtils.requestedColumnIds fails
Posted by GitBox <gi...@apache.org>.
wangyum commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-810724594
Can we disable column pruning when it is a Hive ORC table?
https://github.com/apache/spark/blob/25e7d1ceee8c9f4ecb4ab796f51e9bcbc0500fae/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/SchemaPruning.scala#L98-L100
Update `canPruneRelation` to:
```scala
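private def canPruneRelation(fsRelation: HadoopFsRelation) = {
  fsRelation.fileFormat match {
    case _: ParquetFileFormat => true
    case _: OrcFileFormat =>
      fsRelation.location match {
        // Skip pruning for ORC tables whose catalog provider is Hive.
        case c: CatalogFileIndex =>
          !c.table.provider.contains(DDLUtils.HIVE_PROVIDER)
        case _ => true
      }
    // Any other file format is not prunable; this also avoids a MatchError.
    case _ => false
  }
}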
```
[GitHub] [spark] viirya commented on a change in pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning
Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #31993:
URL: https://github.com/apache/spark/pull/31993#discussion_r614488314
##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SchemaPruning.scala
##########
@@ -21,9 +21,12 @@ import org.apache.spark.sql.types._
object SchemaPruning {
/**
- * Filters the schema by the requested fields. For example, if the schema is struct<a:int, b:int>,
- * and given requested field are "a", the field "b" is pruned in the returned schema.
- * Note that schema field ordering at original schema is still preserved in pruned schema.
+ * Prunes the nested schema by the requested fields. For example, if the schema is
+ * struct<a:int, b:int>, and given requested field are "a", the field "b" is pruned in the
+ * returned schema.
+ * Note that:
+ * 1. The schema field ordering at original schema is still preserved in pruned schema.
+ * 2. The top-level fields are not pruned here.
Review comment:
Hmm, doesn't it mean we miss the chance to prune top-level columns?
[GitHub] [spark] viirya commented on pull request #31993: [SPARK-34897][SQL] Add workaround to error message when OrcUtils.requestedColumnIds fails
Posted by GitBox <gi...@apache.org>.
viirya commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-809943947
As the nested column pruning rule is far from the point where we get the physical information of ORC files, and this should be a narrow case, it looks okay to me to inform users of a possible workaround here.
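For reference, the workaround named in the PR description could be applied per session roughly as follows. This sketch assumes a `spark-shell` session where `spark` is the active `SparkSession` and the table `t1` from the PR description exists. It disables nested schema pruning for later queries in that session, trading the optimization for avoiding the assertion:

```scala
// Assumes `spark` is an existing SparkSession (for example, in spark-shell).
// Turn off nested schema pruning for this session.
spark.sql("SET spark.sql.optimizer.nestedSchemaPruning.enabled=false")

// Re-run the query from the PR description with pruning disabled.
spark.sql("SELECT _col2.c1, _col0 FROM `t1` WHERE _col3 = '2021-02-01'").show()
```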
[GitHub] [spark] wangyum commented on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning
Posted by GitBox <gi...@apache.org>.
wangyum commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-823940506
> @wangyum there are conflicts
Fixed.
[GitHub] [spark] SparkQA commented on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-822243416
Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42135/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-823800994
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/137707/
[GitHub] [spark] SparkQA commented on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-820833460
**[Test build #137449 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137449/testReport)** for PR 31993 at commit [`a966bac`](https://github.com/apache/spark/commit/a966bac379ff6fed20c57cf3c748cada9da28b4c).
[GitHub] [spark] SparkQA removed a comment on pull request #31993: [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31993:
URL: https://github.com/apache/spark/pull/31993#issuecomment-823701348
**[Test build #137707 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137707/testReport)** for PR 31993 at commit [`6112c9d`](https://github.com/apache/spark/commit/6112c9dd357da127646e164cdf6797f5801c1049).