You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/09/12 15:52:32 UTC

[GitHub] [spark] peter-toth opened a new pull request #29737: [WIP][SPARK-32864][SQL] Support ORC forced positional evolution

peter-toth opened a new pull request #29737:
URL: https://github.com/apache/spark/pull/29737


   ### What changes were proposed in this pull request?
   Add support for `orc.force.positional.evolution` config that forces ORC top level column matching by position rather than by name.
   
   ### Why are the changes needed?
   Hive/ORC does support it. 
   
   ### Does this PR introduce _any_ user-facing change?
   Yes.
   
   ### How was this patch tested?
   New UT.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #29737:
URL: https://github.com/apache/spark/pull/29737#discussion_r490904978



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala
##########
@@ -142,13 +142,17 @@ object OrcUtils extends Logging {
       reader: Reader,
       conf: Configuration): Option[(Array[Int], Boolean)] = {
     val orcFieldNames = reader.getSchema.getFieldNames.asScala
+    val forcePositionalEvolution = OrcConf.FORCE_POSITIONAL_EVOLUTION.getBoolean(conf)

Review comment:
       I mean, this doesn't look like an ORC-specific thing. Does hive only do this for ORC tables but not Parquet tables?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29737:
URL: https://github.com/apache/spark/pull/29737#issuecomment-759605593


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/38608/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] peter-toth commented on a change in pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
peter-toth commented on a change in pull request #29737:
URL: https://github.com/apache/spark/pull/29737#discussion_r490895572



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala
##########
@@ -142,13 +142,17 @@ object OrcUtils extends Logging {
       reader: Reader,
       conf: Configuration): Option[(Array[Int], Boolean)] = {
     val orcFieldNames = reader.getSchema.getFieldNames.asScala
+    val forcePositionalEvolution = OrcConf.FORCE_POSITIONAL_EVOLUTION.getBoolean(conf)

Review comment:
       Sorry, I'm not sure I get your point.
   In `requestedColumnIds()` we map `requiredSchema` to the schema in the ORC files (`orcFieldNames`) and actually Spark is prepared for that the schema in HMS and in the file doesn't always match: https://github.com/apache/spark/pull/29737/files#diff-3fb8426b690ab771c4f67f9cad336498L149 (the file is written by an old version of Hive).
   It turned out that a schema mismatch can happen with newer version of Hive (columns in the file doesn't start with `_col`) too. Because simply setting `orc.force.positional.evolution=true` and then doing a column rename in Hive also results mismatch in Spark and in that case Spark returns `null`s now.
   It seemed a good idea to add support for this setting to our data source but if that is not the good way to deal with the issue please let me know. (I've updated the PR description a bit to make it more clear what I'm trying to fix.)

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala
##########
@@ -142,13 +142,17 @@ object OrcUtils extends Logging {
       reader: Reader,
       conf: Configuration): Option[(Array[Int], Boolean)] = {
     val orcFieldNames = reader.getSchema.getFieldNames.asScala
+    val forcePositionalEvolution = OrcConf.FORCE_POSITIONAL_EVOLUTION.getBoolean(conf)

Review comment:
       Sorry, I'm not sure I get your point.
   In `requestedColumnIds()` we map `requiredSchema` to the schema in the ORC files (`orcFieldNames`) and actually Spark is prepared for that the schema in HMS and in the file doesn't always match: https://github.com/apache/spark/pull/29737/files#diff-3fb8426b690ab771c4f67f9cad336498L149 (the file is written by an old version of Hive).
   It turned out that a schema mismatch can happen with newer version of Hive (columns in the file doesn't start with `_col`) too. Because simply setting `orc.force.positional.evolution=true` and then doing a column rename in Hive also results mismatch in Spark and in that case Spark returns `null`s now.
   It seemed a good idea to add support for this setting to our data source but if that is not the good way to deal with the issue please let me know.
   (I've updated the PR description a bit to make it more clear what I'm trying to fix.)




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29737:
URL: https://github.com/apache/spark/pull/29737#issuecomment-694744145






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] peter-toth commented on pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
peter-toth commented on pull request #29737:
URL: https://github.com/apache/spark/pull/29737#issuecomment-692043158


   Thanks @HyukjinKwon.
   
   @dongjoon-hyun could you please take a look at this PR?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29737: [WIP][SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29737:
URL: https://github.com/apache/spark/pull/29737#issuecomment-691509378






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #29737: [WIP][SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29737:
URL: https://github.com/apache/spark/pull/29737#issuecomment-691509217






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #29737:
URL: https://github.com/apache/spark/pull/29737#discussion_r556095088



##########
File path: sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/HiveOrcQuerySuite.scala
##########
@@ -316,4 +317,33 @@ class HiveOrcQuerySuite extends OrcQueryTest with TestHiveSingleton {
       }
     }
   }
+
+  test("SPARK-32864: Support ORC forced positional evolution") {
+    Seq("native", "hive").foreach { orcImpl =>
+      Seq(true, false).foreach { forcePositionalEvolution =>
+        withSQLConf(SQLConf.ORC_IMPLEMENTATION.key -> orcImpl,
+          OrcConf.FORCE_POSITIONAL_EVOLUTION.getAttribute -> forcePositionalEvolution.toString) {

Review comment:
       I recall we add a prefix for third-party library configs like mentioned above. Is this an exception?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29737:
URL: https://github.com/apache/spark/pull/29737#issuecomment-759599713


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38608/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29737:
URL: https://github.com/apache/spark/pull/29737#issuecomment-759533409


   **[Test build #134021 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134021/testReport)** for PR 29737 at commit [`fbab1e4`](https://github.com/apache/spark/commit/fbab1e4c597ad2fede82892935e665832217ce1b).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun closed pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun closed pull request #29737:
URL: https://github.com/apache/spark/pull/29737


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29737:
URL: https://github.com/apache/spark/pull/29737#issuecomment-760713228


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/134070/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29737:
URL: https://github.com/apache/spark/pull/29737#issuecomment-758818174


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/38563/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29737: [WIP][SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29737:
URL: https://github.com/apache/spark/pull/29737#issuecomment-691509217






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29737:
URL: https://github.com/apache/spark/pull/29737#issuecomment-693363098


   **[Test build #128765 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128765/testReport)** for PR 29737 at commit [`77fd46a`](https://github.com/apache/spark/commit/77fd46ad8c3887334b98601c2cdeb0326ff8543f).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29737: [WIP][SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29737:
URL: https://github.com/apache/spark/pull/29737#issuecomment-691509217


   **[Test build #128593 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128593/testReport)** for PR 29737 at commit [`663a469`](https://github.com/apache/spark/commit/663a469d6a27f91555d10134f67db1629238601f).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29737:
URL: https://github.com/apache/spark/pull/29737#issuecomment-760230532


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38640/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29737:
URL: https://github.com/apache/spark/pull/29737#issuecomment-760276618


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/38640/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #29737:
URL: https://github.com/apache/spark/pull/29737#issuecomment-760607105


   Retest this please.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] peter-toth commented on a change in pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
peter-toth commented on a change in pull request #29737:
URL: https://github.com/apache/spark/pull/29737#discussion_r556617324



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala
##########
@@ -142,13 +142,17 @@ object OrcUtils extends Logging {
       reader: Reader,
       conf: Configuration): Option[(Array[Int], Boolean)] = {
     val orcFieldNames = reader.getSchema.getFieldNames.asScala
+    val forcePositionalEvolution = OrcConf.FORCE_POSITIONAL_EVOLUTION.getBoolean(conf)
     if (orcFieldNames.isEmpty) {
       // SPARK-8501: Some old empty ORC files always have an empty schema stored in their footer.
       None
     } else {
-      if (orcFieldNames.forall(_.startsWith("_col"))) {
-        // This is a ORC file written by Hive, no field names in the physical schema, assume the
-        // physical schema maps to the data scheme by index.
+      if (forcePositionalEvolution || orcFieldNames.forall(_.startsWith("_col"))) {
+        // This is either an ORC file written by an old version of Hive and there are no field
+        // names in the physical schema, or a file written by a newer version of Hive where
+        // `orc.force.positional.evolution=true` (possibly because columns were renamed so the

Review comment:
       Thanks, fixed it in https://github.com/apache/spark/pull/29737/commits/a074b63fb240c62c7fd09aca1cdc66300b36b2dc




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #29737:
URL: https://github.com/apache/spark/pull/29737#issuecomment-695963984


   @dongjoon-hyun Thanks for the references! It seems that the `orc.force.positional.evolution` config was designed for schema evolution, and I think it's better to design schema evolution natively at Spark side.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29737:
URL: https://github.com/apache/spark/pull/29737#issuecomment-759724004


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/134021/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29737:
URL: https://github.com/apache/spark/pull/29737#issuecomment-758818174


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/38563/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] peter-toth commented on a change in pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
peter-toth commented on a change in pull request #29737:
URL: https://github.com/apache/spark/pull/29737#discussion_r557401225



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/orc/OrcPartitionReaderFactory.scala
##########
@@ -81,6 +81,8 @@ case class OrcPartitionReaderFactory(
     val conf = broadcastedConf.value.value
 
     OrcConf.IS_SCHEMA_EVOLUTION_CASE_SENSITIVE.setBoolean(conf, isCaseSensitive)
+    OrcConf.FORCE_POSITIONAL_EVOLUTION.setBoolean(conf,
+      conf.getBoolean("spark.hadoop." + OrcConf.FORCE_POSITIONAL_EVOLUTION.getAttribute, false))

Review comment:
       I've reverted the last commit.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29737:
URL: https://github.com/apache/spark/pull/29737#issuecomment-759709605


   **[Test build #134021 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134021/testReport)** for PR 29737 at commit [`fbab1e4`](https://github.com/apache/spark/commit/fbab1e4c597ad2fede82892935e665832217ce1b).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29737:
URL: https://github.com/apache/spark/pull/29737#issuecomment-693586239


   **[Test build #128765 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128765/testReport)** for PR 29737 at commit [`77fd46a`](https://github.com/apache/spark/commit/77fd46ad8c3887334b98601c2cdeb0326ff8543f).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] peter-toth commented on a change in pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
peter-toth commented on a change in pull request #29737:
URL: https://github.com/apache/spark/pull/29737#discussion_r557539312



##########
File path: sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/HiveOrcQuerySuite.scala
##########
@@ -313,4 +314,71 @@ class HiveOrcQuerySuite extends OrcQueryTest with TestHiveSingleton {
       }
     }
   }
+
+  test("SPARK-32864: Support ORC forced positional evolution") {
+    Seq("native", "hive").foreach { orcImpl =>
+      Seq(true, false).foreach { forcePositionalEvolution =>
+        withSQLConf(SQLConf.ORC_IMPLEMENTATION.key -> orcImpl,
+          OrcConf.FORCE_POSITIONAL_EVOLUTION.getAttribute -> forcePositionalEvolution.toString) {
+          withTempPath { f =>
+            val path = f.getCanonicalPath
+            Seq((1, 2), (3, 4), (5, 6)).toDF("c1", "c2").write.orc(path)
+            val correctAnswer = Seq(Row(1, 2), Row(3, 4), Row(5, 6))
+            checkAnswer(spark.read.orc(path), correctAnswer)
+
+            withSQLConf(HiveUtils.CONVERT_METASTORE_ORC.key -> "true") {
+              withTable("t") {
+                sql(s"CREATE EXTERNAL TABLE t(c3 INT, c4 INT) STORED AS ORC LOCATION '$path'")

Review comment:
       Ok, changed in https://github.com/apache/spark/pull/29737/commits/51f503c5738a714d6ea77467ac8f7dfba231d989




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29737:
URL: https://github.com/apache/spark/pull/29737#issuecomment-760607924


   **[Test build #134070 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134070/testReport)** for PR 29737 at commit [`51f503c`](https://github.com/apache/spark/commit/51f503c5738a714d6ea77467ac8f7dfba231d989).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29737:
URL: https://github.com/apache/spark/pull/29737#issuecomment-693363098


   **[Test build #128765 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128765/testReport)** for PR 29737 at commit [`77fd46a`](https://github.com/apache/spark/commit/77fd46ad8c3887334b98601c2cdeb0326ff8543f).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on pull request #29737: [WIP][SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #29737:
URL: https://github.com/apache/spark/pull/29737#issuecomment-691753087


   cc @dongjoon-hyun FYI


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29737:
URL: https://github.com/apache/spark/pull/29737#issuecomment-694470723


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/128833/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29737:
URL: https://github.com/apache/spark/pull/29737#issuecomment-758774248


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38563/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29737:
URL: https://github.com/apache/spark/pull/29737#issuecomment-694743616


   **[Test build #128871 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128871/testReport)** for PR 29737 at commit [`7143cde`](https://github.com/apache/spark/commit/7143cdecb30c29098fc6d4296fa3f2da2b5129ec).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29737:
URL: https://github.com/apache/spark/pull/29737#issuecomment-693587864






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] peter-toth commented on a change in pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
peter-toth commented on a change in pull request #29737:
URL: https://github.com/apache/spark/pull/29737#discussion_r557538838



##########
File path: sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/HiveOrcQuerySuite.scala
##########
@@ -313,4 +314,71 @@ class HiveOrcQuerySuite extends OrcQueryTest with TestHiveSingleton {
       }
     }
   }
+
+  test("SPARK-32864: Support ORC forced positional evolution") {
+    Seq("native", "hive").foreach { orcImpl =>
+      Seq(true, false).foreach { forcePositionalEvolution =>
+        withSQLConf(SQLConf.ORC_IMPLEMENTATION.key -> orcImpl,
+          OrcConf.FORCE_POSITIONAL_EVOLUTION.getAttribute -> forcePositionalEvolution.toString) {
+          withTempPath { f =>
+            val path = f.getCanonicalPath
+            Seq((1, 2), (3, 4), (5, 6)).toDF("c1", "c2").write.orc(path)
+            val correctAnswer = Seq(Row(1, 2), Row(3, 4), Row(5, 6))
+            checkAnswer(spark.read.orc(path), correctAnswer)
+
+            withSQLConf(HiveUtils.CONVERT_METASTORE_ORC.key -> "true") {

Review comment:
       Sure, added in https://github.com/apache/spark/pull/29737/commits/51f503c5738a714d6ea77467ac8f7dfba231d989




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29737:
URL: https://github.com/apache/spark/pull/29737#issuecomment-694743616


   **[Test build #128871 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128871/testReport)** for PR 29737 at commit [`7143cde`](https://github.com/apache/spark/commit/7143cdecb30c29098fc6d4296fa3f2da2b5129ec).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] peter-toth commented on a change in pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
peter-toth commented on a change in pull request #29737:
URL: https://github.com/apache/spark/pull/29737#discussion_r556617988



##########
File path: sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/HiveOrcQuerySuite.scala
##########
@@ -316,4 +317,33 @@ class HiveOrcQuerySuite extends OrcQueryTest with TestHiveSingleton {
       }
     }
   }
+
+  test("SPARK-32864: Support ORC forced positional evolution") {
+    Seq("native", "hive").foreach { orcImpl =>
+      Seq(true, false).foreach { forcePositionalEvolution =>
+        withSQLConf(SQLConf.ORC_IMPLEMENTATION.key -> orcImpl,
+          OrcConf.FORCE_POSITIONAL_EVOLUTION.getAttribute -> forcePositionalEvolution.toString) {
+          withTempPath { f =>
+            val path = f.getCanonicalPath
+            Seq(1 -> 2).toDF("c1", "c2").write.orc(path)

Review comment:
       Yes, changed in https://github.com/apache/spark/pull/29737/commits/1765c4e6bb1eabd9aa7767212b109f44b50794a7




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29737:
URL: https://github.com/apache/spark/pull/29737#issuecomment-760276618


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/38640/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29737:
URL: https://github.com/apache/spark/pull/29737#issuecomment-760641018


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/38656/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] peter-toth removed a comment on pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
peter-toth removed a comment on pull request #29737:
URL: https://github.com/apache/spark/pull/29737#issuecomment-705421777


   retest this please


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29737:
URL: https://github.com/apache/spark/pull/29737#issuecomment-758918485


   **[Test build #133976 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/133976/testReport)** for PR 29737 at commit [`7143cde`](https://github.com/apache/spark/commit/7143cdecb30c29098fc6d4296fa3f2da2b5129ec).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29737: [WIP][SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29737:
URL: https://github.com/apache/spark/pull/29737#issuecomment-691509378






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] peter-toth commented on a change in pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
peter-toth commented on a change in pull request #29737:
URL: https://github.com/apache/spark/pull/29737#discussion_r490349666



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala
##########
@@ -142,13 +142,17 @@ object OrcUtils extends Logging {
       reader: Reader,
       conf: Configuration): Option[(Array[Int], Boolean)] = {
     val orcFieldNames = reader.getSchema.getFieldNames.asScala
+    val forcePositionalEvolution = OrcConf.FORCE_POSITIONAL_EVOLUTION.getBoolean(conf)
     if (orcFieldNames.isEmpty) {
       // SPARK-8501: Some old empty ORC files always have an empty schema stored in their footer.
       None
     } else {
-      if (orcFieldNames.forall(_.startsWith("_col"))) {
-        // This is a ORC file written by Hive, no field names in the physical schema, assume the
-        // physical schema maps to the data scheme by index.
+      if (forcePositionalEvolution || orcFieldNames.forall(_.startsWith("_col"))) {
+        // This is either an ORC file written by an old versions of Hive and there are no field
+        // names in the physical schema, or a file written by a newer versions of Hive where
+        // `orc.force.positional.evolution=true` and columns were renamed so the physical schema

Review comment:
       Indeed, thanks. Fixed the comment.

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala
##########
@@ -142,13 +142,17 @@ object OrcUtils extends Logging {
       reader: Reader,
       conf: Configuration): Option[(Array[Int], Boolean)] = {
     val orcFieldNames = reader.getSchema.getFieldNames.asScala
+    val forcePositionalEvolution = OrcConf.FORCE_POSITIONAL_EVOLUTION.getBoolean(conf)
     if (orcFieldNames.isEmpty) {
       // SPARK-8501: Some old empty ORC files always have an empty schema stored in their footer.
       None
     } else {
-      if (orcFieldNames.forall(_.startsWith("_col"))) {
-        // This is a ORC file written by Hive, no field names in the physical schema, assume the
-        // physical schema maps to the data scheme by index.
+      if (forcePositionalEvolution || orcFieldNames.forall(_.startsWith("_col"))) {
+        // This is either an ORC file written by an old versions of Hive and there are no field

Review comment:
       Fixed.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] peter-toth commented on pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
peter-toth commented on pull request #29737:
URL: https://github.com/apache/spark/pull/29737#issuecomment-696277091


   All right, thanks @dongjoon-hyun for the details. I guess I should close this PR for now.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29737:
URL: https://github.com/apache/spark/pull/29737#issuecomment-694744145






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #29737:
URL: https://github.com/apache/spark/pull/29737#issuecomment-758180875


   @redsanket .
   Apache Spark 3.2.0 is currently using 1.6.6 which is the latest one. I believe we can re-evaluate and resume this.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #29737:
URL: https://github.com/apache/spark/pull/29737#discussion_r556701853



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala
##########
@@ -160,6 +160,9 @@ class OrcFileFormat
     val capacity = sqlConf.orcVectorizedReaderBatchSize
 
     OrcConf.IS_SCHEMA_EVOLUTION_CASE_SENSITIVE.setBoolean(hadoopConf, sqlConf.caseSensitiveAnalysis)
+    OrcConf.FORCE_POSITIONAL_EVOLUTION.setBoolean(hadoopConf,
+      hadoopConf.getBoolean("spark.hadoop." + OrcConf.FORCE_POSITIONAL_EVOLUTION.getAttribute,
+        false))

Review comment:
       Oh I see. Yeah, you're right. The hadoop conf here is cleaned up. If users specify `spark.hadoop.orc.force.positional.evolution`, it will be removed the prefix and added into this conf.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] peter-toth edited a comment on pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
peter-toth edited a comment on pull request #29737:
URL: https://github.com/apache/spark/pull/29737#issuecomment-705340785






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #29737:
URL: https://github.com/apache/spark/pull/29737#issuecomment-696503430


   Thank you so much, @peter-toth and @cloud-fan .


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #29737:
URL: https://github.com/apache/spark/pull/29737#discussion_r557525716



##########
File path: sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/HiveOrcQuerySuite.scala
##########
@@ -313,4 +314,71 @@ class HiveOrcQuerySuite extends OrcQueryTest with TestHiveSingleton {
       }
     }
   }
+
+  test("SPARK-32864: Support ORC forced positional evolution") {
+    Seq("native", "hive").foreach { orcImpl =>
+      Seq(true, false).foreach { forcePositionalEvolution =>
+        withSQLConf(SQLConf.ORC_IMPLEMENTATION.key -> orcImpl,
+          OrcConf.FORCE_POSITIONAL_EVOLUTION.getAttribute -> forcePositionalEvolution.toString) {
+          withTempPath { f =>
+            val path = f.getCanonicalPath
+            Seq((1, 2), (3, 4), (5, 6)).toDF("c1", "c2").write.orc(path)
+            val correctAnswer = Seq(Row(1, 2), Row(3, 4), Row(5, 6))
+            checkAnswer(spark.read.orc(path), correctAnswer)
+
+            withSQLConf(HiveUtils.CONVERT_METASTORE_ORC.key -> "true") {
+              withTable("t") {
+                sql(s"CREATE EXTERNAL TABLE t(c3 INT, c4 INT) STORED AS ORC LOCATION '$path'")
+
+                val expected = if (forcePositionalEvolution) {
+                  correctAnswer
+                } else {
+                  Seq(Row(null, null), Row(null, null), Row(null, null))
+                }
+
+                checkAnswer(spark.table("t"), expected)
+              }
+            }
+          }
+        }
+      }
+    }
+  }
+
+  test("SPARK-32864: Support ORC forced positional evolution with partitioned table") {
+    Seq("native", "hive").foreach { orcImpl =>
+      Seq(true, false).foreach { forcePositionalEvolution =>
+        withSQLConf(SQLConf.ORC_IMPLEMENTATION.key -> orcImpl,
+          OrcConf.FORCE_POSITIONAL_EVOLUTION.getAttribute -> forcePositionalEvolution.toString) {
+          withTempPath { f =>
+            val path = f.getCanonicalPath
+            Seq((1, 2, 1), (3, 4, 2), (5, 6, 3)).toDF("c1", "c2", "p")
+              .write.partitionBy("p").orc(path)
+            val correctAnswer = Seq(Row(1, 2, 1), Row(3, 4, 2), Row(5, 6, 3))
+            checkAnswer(spark.read.orc(path), correctAnswer)
+
+            withSQLConf(HiveUtils.CONVERT_METASTORE_ORC.key -> "true") {
+              withTable("t") {
+                sql(
+                  s"""
+                     |CREATE EXTERNAL TABLE t(c3 INT, c4 INT)
+                     |PARTITIONED BY (p int)
+                     |STORED AS ORC
+                     |LOCATION '$path'
+                     |""".stripMargin)
+                sql(s"MSCK REPAIR TABLE t")

Review comment:
       nit. `s"` -> `"`.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #29737:
URL: https://github.com/apache/spark/pull/29737#discussion_r491080641



##########
File path: sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/HiveOrcQuerySuite.scala
##########
@@ -316,4 +317,33 @@ class HiveOrcQuerySuite extends OrcQueryTest with TestHiveSingleton {
       }
     }
   }
+
+  test("SPARK-32864: Support ORC forced positional evolution") {
+    Seq("native", "hive").foreach { orcImpl =>
+      Seq(true, false).foreach { forcePositionalEvolution =>
+        withSQLConf(SQLConf.ORC_IMPLEMENTATION.key -> orcImpl,
+          OrcConf.FORCE_POSITIONAL_EVOLUTION.getAttribute -> forcePositionalEvolution.toString) {

Review comment:
       We want to use this config as it is? Or append usual prefix `spark.hadoop`?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29737: [WIP][SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29737:
URL: https://github.com/apache/spark/pull/29737#issuecomment-691540269






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29737:
URL: https://github.com/apache/spark/pull/29737#issuecomment-759724004


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/134021/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29737: [WIP][SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29737:
URL: https://github.com/apache/spark/pull/29737#issuecomment-691509378






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] bersprockets commented on a change in pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
bersprockets commented on a change in pull request #29737:
URL: https://github.com/apache/spark/pull/29737#discussion_r489034283



##########
File path: sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/HiveOrcQuerySuite.scala
##########
@@ -316,4 +317,24 @@ class HiveOrcQuerySuite extends OrcQueryTest with TestHiveSingleton {
       }
     }
   }
+
+  test("SPARK-32864: Support ORC forced positional evolution") {
+    Seq("native", "hive").foreach { orcImpl =>
+      withSQLConf(SQLConf.ORC_IMPLEMENTATION.key -> orcImpl,
+        OrcConf.FORCE_POSITIONAL_EVOLUTION.getAttribute -> "true") {
+        withTempPath { f =>
+          val path = f.getCanonicalPath
+          Seq(1 -> 2).toDF("c1", "c2").write.orc(path)
+          checkAnswer(spark.read.orc(path), Row(1, 2))
+
+          withSQLConf(HiveUtils.CONVERT_METASTORE_ORC.key -> "true") { // default since 2.3.0
+            withTable("t") {
+              sql(s"CREATE EXTERNAL TABLE t(c3 INT, c4 INT) STORED AS ORC LOCATION '$path'")
+              checkAnswer(spark.table("t"), Row(1, 2))

Review comment:
       Should there also be a test with orc.force.positional.evolution=false to verify that the answer is Row(null, null)? Or does that test already incidentally exist elsewhere?

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala
##########
@@ -142,11 +142,12 @@ object OrcUtils extends Logging {
       reader: Reader,
       conf: Configuration): Option[(Array[Int], Boolean)] = {
     val orcFieldNames = reader.getSchema.getFieldNames.asScala
+    val forcePositionalEvolution = OrcConf.FORCE_POSITIONAL_EVOLUTION.getBoolean(conf)
     if (orcFieldNames.isEmpty) {
       // SPARK-8501: Some old empty ORC files always have an empty schema stored in their footer.
       None
     } else {
-      if (orcFieldNames.forall(_.startsWith("_col"))) {
+      if (forcePositionalEvolution || orcFieldNames.forall(_.startsWith("_col"))) {
         // This is a ORC file written by Hive, no field names in the physical schema, assume the

Review comment:
       >This is a ORC file written by Hive, no field names in the physical schema
   
   The comment should probably reflect your change.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #29737:
URL: https://github.com/apache/spark/pull/29737#discussion_r557520789



##########
File path: sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/HiveOrcQuerySuite.scala
##########
@@ -313,4 +314,71 @@ class HiveOrcQuerySuite extends OrcQueryTest with TestHiveSingleton {
       }
     }
   }
+
+  test("SPARK-32864: Support ORC forced positional evolution") {
+    Seq("native", "hive").foreach { orcImpl =>
+      Seq(true, false).foreach { forcePositionalEvolution =>
+        withSQLConf(SQLConf.ORC_IMPLEMENTATION.key -> orcImpl,
+          OrcConf.FORCE_POSITIONAL_EVOLUTION.getAttribute -> forcePositionalEvolution.toString) {
+          withTempPath { f =>
+            val path = f.getCanonicalPath
+            Seq((1, 2), (3, 4), (5, 6)).toDF("c1", "c2").write.orc(path)
+            val correctAnswer = Seq(Row(1, 2), Row(3, 4), Row(5, 6))
+            checkAnswer(spark.read.orc(path), correctAnswer)
+
+            withSQLConf(HiveUtils.CONVERT_METASTORE_ORC.key -> "true") {
+              withTable("t") {
+                sql(s"CREATE EXTERNAL TABLE t(c3 INT, c4 INT) STORED AS ORC LOCATION '$path'")
+
+                val expected = if (forcePositionalEvolution) {
+                  correctAnswer
+                } else {
+                  Seq(Row(null, null), Row(null, null), Row(null, null))
+                }
+
+                checkAnswer(spark.table("t"), expected)
+              }
+            }
+          }
+        }
+      }
+    }
+  }
+
+  test("SPARK-32864: Support ORC forced positional evolution with partitioned table") {
+    Seq("native", "hive").foreach { orcImpl =>
+      Seq(true, false).foreach { forcePositionalEvolution =>
+        withSQLConf(SQLConf.ORC_IMPLEMENTATION.key -> orcImpl,
+          OrcConf.FORCE_POSITIONAL_EVOLUTION.getAttribute -> forcePositionalEvolution.toString) {
+          withTempPath { f =>
+            val path = f.getCanonicalPath
+            Seq((1, 2, 1), (3, 4, 2), (5, 6, 3)).toDF("c1", "c2", "p")
+              .write.partitionBy("p").orc(path)
+            val correctAnswer = Seq(Row(1, 2, 1), Row(3, 4, 2), Row(5, 6, 3))
+            checkAnswer(spark.read.orc(path), correctAnswer)
+
+            withSQLConf(HiveUtils.CONVERT_METASTORE_ORC.key -> "true") {

Review comment:
       ditto.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29737:
URL: https://github.com/apache/spark/pull/29737#issuecomment-693587864






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29737:
URL: https://github.com/apache/spark/pull/29737#issuecomment-694890157






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29737: [WIP][SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29737:
URL: https://github.com/apache/spark/pull/29737#issuecomment-691509378






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] peter-toth commented on a change in pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
peter-toth commented on a change in pull request #29737:
URL: https://github.com/apache/spark/pull/29737#discussion_r490895572



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala
##########
@@ -142,13 +142,17 @@ object OrcUtils extends Logging {
       reader: Reader,
       conf: Configuration): Option[(Array[Int], Boolean)] = {
     val orcFieldNames = reader.getSchema.getFieldNames.asScala
+    val forcePositionalEvolution = OrcConf.FORCE_POSITIONAL_EVOLUTION.getBoolean(conf)

Review comment:
       Sorry, I'm not sure I get your point.
   In `requestedColumnIds()` we map `requiredSchema` to the schema in the ORC files (`orcFieldNames`) and actually Spark is prepared that the schema in HMS and in the file doesn't always match: https://github.com/apache/spark/pull/29737/files#diff-3fb8426b690ab771c4f67f9cad336498L149 (the file is written by an old version of Hive).
   It turned out that a schema mismatch can happen with newer version of Hive (columns in the file doesn't start with `_col`) too. Because simply setting `orc.force.positional.evolution=true` and then doing a column rename in Hive also results mismatch in Spark and in that case Spark returns `null`s now.
   It seemed a good idea to add support for this setting to our data source but if that is not the good way to deal with the issue please let me know.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29737:
URL: https://github.com/apache/spark/pull/29737#issuecomment-760208278


   **[Test build #134054 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134054/testReport)** for PR 29737 at commit [`b12dfef`](https://github.com/apache/spark/commit/b12dfeffc7158859738c2c04ae111d75626bfc33).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #29737:
URL: https://github.com/apache/spark/pull/29737#discussion_r555932085



##########
File path: sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/HiveOrcQuerySuite.scala
##########
@@ -316,4 +317,33 @@ class HiveOrcQuerySuite extends OrcQueryTest with TestHiveSingleton {
       }
     }
   }
+
+  test("SPARK-32864: Support ORC forced positional evolution") {

Review comment:
       Could you add a partitioned table test case additionally as a separate test case because it's the main use case for this option, @peter-toth ?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29737:
URL: https://github.com/apache/spark/pull/29737#issuecomment-759605593


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/38608/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29737:
URL: https://github.com/apache/spark/pull/29737#issuecomment-760392075






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29737:
URL: https://github.com/apache/spark/pull/29737#issuecomment-760324343


   **[Test build #134059 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134059/testReport)** for PR 29737 at commit [`51f503c`](https://github.com/apache/spark/commit/51f503c5738a714d6ea77467ac8f7dfba231d989).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29737:
URL: https://github.com/apache/spark/pull/29737#issuecomment-760249548


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38640/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #29737:
URL: https://github.com/apache/spark/pull/29737#discussion_r556682990



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala
##########
@@ -160,6 +160,9 @@ class OrcFileFormat
     val capacity = sqlConf.orcVectorizedReaderBatchSize
 
     OrcConf.IS_SCHEMA_EVOLUTION_CASE_SENSITIVE.setBoolean(hadoopConf, sqlConf.caseSensitiveAnalysis)
+    OrcConf.FORCE_POSITIONAL_EVOLUTION.setBoolean(hadoopConf,
+      hadoopConf.getBoolean("spark.hadoop." + OrcConf.FORCE_POSITIONAL_EVOLUTION.getAttribute,
+        false))

Review comment:
       At the last commit, this is added. But, I think this override the user's configuration silently. Is this correct?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] peter-toth commented on a change in pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
peter-toth commented on a change in pull request #29737:
URL: https://github.com/apache/spark/pull/29737#discussion_r490895572



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala
##########
@@ -142,13 +142,17 @@ object OrcUtils extends Logging {
       reader: Reader,
       conf: Configuration): Option[(Array[Int], Boolean)] = {
     val orcFieldNames = reader.getSchema.getFieldNames.asScala
+    val forcePositionalEvolution = OrcConf.FORCE_POSITIONAL_EVOLUTION.getBoolean(conf)

Review comment:
       Sorry, I'm not sure I get your point.
   In `requestedColumnIds()` we map `requiredSchema` to the schema in the ORC files (`orcFieldNames`) and actually Spark is prepared for that the schema in HMS and in the file doesn't always match: https://github.com/apache/spark/pull/29737/files#diff-3fb8426b690ab771c4f67f9cad336498L149 (the file is written by an old version of Hive).
   It turned out that a schema mismatch can happen with newer version of Hive (columns in the file doesn't start with `_col`) too. Because simply setting `orc.force.positional.evolution=true` and then doing a column rename in Hive also results mismatch in Spark and in that case Spark returns `null`s now.
   It seemed a good idea to add support for this setting to our data source but if that is not the good way to deal with the issue please let me know.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29737:
URL: https://github.com/apache/spark/pull/29737#issuecomment-759594590


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38608/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29737:
URL: https://github.com/apache/spark/pull/29737#issuecomment-694470713


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] peter-toth commented on a change in pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
peter-toth commented on a change in pull request #29737:
URL: https://github.com/apache/spark/pull/29737#discussion_r557400904



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala
##########
@@ -160,6 +160,9 @@ class OrcFileFormat
     val capacity = sqlConf.orcVectorizedReaderBatchSize
 
     OrcConf.IS_SCHEMA_EVOLUTION_CASE_SENSITIVE.setBoolean(hadoopConf, sqlConf.caseSensitiveAnalysis)
+    OrcConf.FORCE_POSITIONAL_EVOLUTION.setBoolean(hadoopConf,
+      hadoopConf.getBoolean("spark.hadoop." + OrcConf.FORCE_POSITIONAL_EVOLUTION.getAttribute,
+        false))

Review comment:
       Thanks @dongjoon-hyun, I see it now where the cleaning happens and reverted the last commit.
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29737:
URL: https://github.com/apache/spark/pull/29737#issuecomment-694470713






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29737:
URL: https://github.com/apache/spark/pull/29737#issuecomment-758742498


   **[Test build #133976 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/133976/testReport)** for PR 29737 at commit [`7143cde`](https://github.com/apache/spark/commit/7143cdecb30c29098fc6d4296fa3f2da2b5129ec).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #29737:
URL: https://github.com/apache/spark/pull/29737#discussion_r555961296



##########
File path: sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/HiveOrcQuerySuite.scala
##########
@@ -316,4 +317,33 @@ class HiveOrcQuerySuite extends OrcQueryTest with TestHiveSingleton {
       }
     }
   }
+
+  test("SPARK-32864: Support ORC forced positional evolution") {
+    Seq("native", "hive").foreach { orcImpl =>
+      Seq(true, false).foreach { forcePositionalEvolution =>
+        withSQLConf(SQLConf.ORC_IMPLEMENTATION.key -> orcImpl,
+          OrcConf.FORCE_POSITIONAL_EVOLUTION.getAttribute -> forcePositionalEvolution.toString) {
+          withTempPath { f =>
+            val path = f.getCanonicalPath
+            Seq(1 -> 2).toDF("c1", "c2").write.orc(path)

Review comment:
       Could you make it three rows instead of one row, please?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29737:
URL: https://github.com/apache/spark/pull/29737#issuecomment-758742498


   **[Test build #133976 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/133976/testReport)** for PR 29737 at commit [`7143cde`](https://github.com/apache/spark/commit/7143cdecb30c29098fc6d4296fa3f2da2b5129ec).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] peter-toth commented on a change in pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
peter-toth commented on a change in pull request #29737:
URL: https://github.com/apache/spark/pull/29737#discussion_r556617838



##########
File path: sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/HiveOrcQuerySuite.scala
##########
@@ -316,4 +317,33 @@ class HiveOrcQuerySuite extends OrcQueryTest with TestHiveSingleton {
       }
     }
   }
+
+  test("SPARK-32864: Support ORC forced positional evolution") {

Review comment:
       Sure, added in https://github.com/apache/spark/pull/29737/commits/1765c4e6bb1eabd9aa7767212b109f44b50794a7.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] peter-toth commented on pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
peter-toth commented on pull request #29737:
URL: https://github.com/apache/spark/pull/29737#issuecomment-696277091


   All right, thanks @dongjoon-hyun for the details. I guess I should close this PR for now.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] peter-toth commented on pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
peter-toth commented on pull request #29737:
URL: https://github.com/apache/spark/pull/29737#issuecomment-705340785


   @dongjoon-hyun, @cloud-fan those 2 issues has been solved in ORC 1.5.12. Do you think we can resume this PR?
   
   @cloud-fan, I understand that this PR doesn't give a full blown schema evolution support natively in Spark, but it would definitely help a few users who has been using this flag in Hive and would like to use the same in Spark too.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] peter-toth removed a comment on pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
peter-toth removed a comment on pull request #29737:
URL: https://github.com/apache/spark/pull/29737#issuecomment-705421777


   retest this please


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] peter-toth edited a comment on pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
peter-toth edited a comment on pull request #29737:
URL: https://github.com/apache/spark/pull/29737#issuecomment-705346151


   > which ORC version spark is using?
   
   We started to use 1.5.12 recently: https://github.com/apache/spark/commit/aa6657981aefae8067672d2c99ca560b6179b723


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] peter-toth commented on pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
peter-toth commented on pull request #29737:
URL: https://github.com/apache/spark/pull/29737#issuecomment-694741165


   cc @cloud-fan, @dongjoon-hyun, @maropu, @viirya 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29737:
URL: https://github.com/apache/spark/pull/29737#issuecomment-693363739






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29737:
URL: https://github.com/apache/spark/pull/29737#issuecomment-760351535


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/134059/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29737:
URL: https://github.com/apache/spark/pull/29737#issuecomment-758932756


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/133976/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29737: [WIP][SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29737:
URL: https://github.com/apache/spark/pull/29737#issuecomment-691509217


   **[Test build #128593 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128593/testReport)** for PR 29737 at commit [`663a469`](https://github.com/apache/spark/commit/663a469d6a27f91555d10134f67db1629238601f).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #29737:
URL: https://github.com/apache/spark/pull/29737#discussion_r557523160



##########
File path: sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/HiveOrcQuerySuite.scala
##########
@@ -313,4 +314,71 @@ class HiveOrcQuerySuite extends OrcQueryTest with TestHiveSingleton {
       }
     }
   }
+
+  test("SPARK-32864: Support ORC forced positional evolution") {
+    Seq("native", "hive").foreach { orcImpl =>
+      Seq(true, false).foreach { forcePositionalEvolution =>
+        withSQLConf(SQLConf.ORC_IMPLEMENTATION.key -> orcImpl,
+          OrcConf.FORCE_POSITIONAL_EVOLUTION.getAttribute -> forcePositionalEvolution.toString) {
+          withTempPath { f =>
+            val path = f.getCanonicalPath
+            Seq((1, 2), (3, 4), (5, 6)).toDF("c1", "c2").write.orc(path)
+            val correctAnswer = Seq(Row(1, 2), Row(3, 4), Row(5, 6))
+            checkAnswer(spark.read.orc(path), correctAnswer)
+
+            withSQLConf(HiveUtils.CONVERT_METASTORE_ORC.key -> "true") {
+              withTable("t") {
+                sql(s"CREATE EXTERNAL TABLE t(c3 INT, c4 INT) STORED AS ORC LOCATION '$path'")

Review comment:
       If you don't mind, can we use `c3 INT, c2 INT` instead of `c3 INT, c4 INT`?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #29737:
URL: https://github.com/apache/spark/pull/29737#discussion_r556690541



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala
##########
@@ -160,6 +160,9 @@ class OrcFileFormat
     val capacity = sqlConf.orcVectorizedReaderBatchSize
 
     OrcConf.IS_SCHEMA_EVOLUTION_CASE_SENSITIVE.setBoolean(hadoopConf, sqlConf.caseSensitiveAnalysis)
+    OrcConf.FORCE_POSITIONAL_EVOLUTION.setBoolean(hadoopConf,
+      hadoopConf.getBoolean("spark.hadoop." + OrcConf.FORCE_POSITIONAL_EVOLUTION.getAttribute,
+        false))

Review comment:
       Please see line 162. `hadoopConf` is a cleaned one.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29737:
URL: https://github.com/apache/spark/pull/29737#issuecomment-694320076


   **[Test build #128833 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128833/testReport)** for PR 29737 at commit [`7143cde`](https://github.com/apache/spark/commit/7143cdecb30c29098fc6d4296fa3f2da2b5129ec).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29737:
URL: https://github.com/apache/spark/pull/29737#issuecomment-760208278


   **[Test build #134054 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134054/testReport)** for PR 29737 at commit [`b12dfef`](https://github.com/apache/spark/commit/b12dfeffc7158859738c2c04ae111d75626bfc33).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #29737:
URL: https://github.com/apache/spark/pull/29737#issuecomment-696503430


   Thank you so much, @peter-toth and @cloud-fan .


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] peter-toth commented on a change in pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
peter-toth commented on a change in pull request #29737:
URL: https://github.com/apache/spark/pull/29737#discussion_r557539441



##########
File path: sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/HiveOrcQuerySuite.scala
##########
@@ -313,4 +314,71 @@ class HiveOrcQuerySuite extends OrcQueryTest with TestHiveSingleton {
       }
     }
   }
+
+  test("SPARK-32864: Support ORC forced positional evolution") {
+    Seq("native", "hive").foreach { orcImpl =>
+      Seq(true, false).foreach { forcePositionalEvolution =>
+        withSQLConf(SQLConf.ORC_IMPLEMENTATION.key -> orcImpl,
+          OrcConf.FORCE_POSITIONAL_EVOLUTION.getAttribute -> forcePositionalEvolution.toString) {
+          withTempPath { f =>
+            val path = f.getCanonicalPath
+            Seq((1, 2), (3, 4), (5, 6)).toDF("c1", "c2").write.orc(path)

Review comment:
       Ok, added in https://github.com/apache/spark/pull/29737/commits/51f503c5738a714d6ea77467ac8f7dfba231d989




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29737:
URL: https://github.com/apache/spark/pull/29737#issuecomment-693363739






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] peter-toth commented on a change in pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
peter-toth commented on a change in pull request #29737:
URL: https://github.com/apache/spark/pull/29737#discussion_r557539558



##########
File path: sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/HiveOrcQuerySuite.scala
##########
@@ -313,4 +314,71 @@ class HiveOrcQuerySuite extends OrcQueryTest with TestHiveSingleton {
       }
     }
   }
+
+  test("SPARK-32864: Support ORC forced positional evolution") {
+    Seq("native", "hive").foreach { orcImpl =>
+      Seq(true, false).foreach { forcePositionalEvolution =>
+        withSQLConf(SQLConf.ORC_IMPLEMENTATION.key -> orcImpl,
+          OrcConf.FORCE_POSITIONAL_EVOLUTION.getAttribute -> forcePositionalEvolution.toString) {
+          withTempPath { f =>
+            val path = f.getCanonicalPath
+            Seq((1, 2), (3, 4), (5, 6)).toDF("c1", "c2").write.orc(path)
+            val correctAnswer = Seq(Row(1, 2), Row(3, 4), Row(5, 6))
+            checkAnswer(spark.read.orc(path), correctAnswer)
+
+            withSQLConf(HiveUtils.CONVERT_METASTORE_ORC.key -> "true") {
+              withTable("t") {
+                sql(s"CREATE EXTERNAL TABLE t(c3 INT, c4 INT) STORED AS ORC LOCATION '$path'")
+
+                val expected = if (forcePositionalEvolution) {
+                  correctAnswer
+                } else {
+                  Seq(Row(null, null), Row(null, null), Row(null, null))
+                }
+
+                checkAnswer(spark.table("t"), expected)
+              }
+            }
+          }
+        }
+      }
+    }
+  }
+
+  test("SPARK-32864: Support ORC forced positional evolution with partitioned table") {
+    Seq("native", "hive").foreach { orcImpl =>
+      Seq(true, false).foreach { forcePositionalEvolution =>
+        withSQLConf(SQLConf.ORC_IMPLEMENTATION.key -> orcImpl,
+          OrcConf.FORCE_POSITIONAL_EVOLUTION.getAttribute -> forcePositionalEvolution.toString) {
+          withTempPath { f =>
+            val path = f.getCanonicalPath
+            Seq((1, 2, 1), (3, 4, 2), (5, 6, 3)).toDF("c1", "c2", "p")

Review comment:
       Ok, added in https://github.com/apache/spark/pull/29737/commits/51f503c5738a714d6ea77467ac8f7dfba231d989

##########
File path: sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/HiveOrcQuerySuite.scala
##########
@@ -313,4 +314,71 @@ class HiveOrcQuerySuite extends OrcQueryTest with TestHiveSingleton {
       }
     }
   }
+
+  test("SPARK-32864: Support ORC forced positional evolution") {
+    Seq("native", "hive").foreach { orcImpl =>
+      Seq(true, false).foreach { forcePositionalEvolution =>
+        withSQLConf(SQLConf.ORC_IMPLEMENTATION.key -> orcImpl,
+          OrcConf.FORCE_POSITIONAL_EVOLUTION.getAttribute -> forcePositionalEvolution.toString) {
+          withTempPath { f =>
+            val path = f.getCanonicalPath
+            Seq((1, 2), (3, 4), (5, 6)).toDF("c1", "c2").write.orc(path)
+            val correctAnswer = Seq(Row(1, 2), Row(3, 4), Row(5, 6))
+            checkAnswer(spark.read.orc(path), correctAnswer)
+
+            withSQLConf(HiveUtils.CONVERT_METASTORE_ORC.key -> "true") {
+              withTable("t") {
+                sql(s"CREATE EXTERNAL TABLE t(c3 INT, c4 INT) STORED AS ORC LOCATION '$path'")
+
+                val expected = if (forcePositionalEvolution) {
+                  correctAnswer
+                } else {
+                  Seq(Row(null, null), Row(null, null), Row(null, null))
+                }
+
+                checkAnswer(spark.table("t"), expected)
+              }
+            }
+          }
+        }
+      }
+    }
+  }
+
+  test("SPARK-32864: Support ORC forced positional evolution with partitioned table") {
+    Seq("native", "hive").foreach { orcImpl =>
+      Seq(true, false).foreach { forcePositionalEvolution =>
+        withSQLConf(SQLConf.ORC_IMPLEMENTATION.key -> orcImpl,
+          OrcConf.FORCE_POSITIONAL_EVOLUTION.getAttribute -> forcePositionalEvolution.toString) {
+          withTempPath { f =>
+            val path = f.getCanonicalPath
+            Seq((1, 2, 1), (3, 4, 2), (5, 6, 3)).toDF("c1", "c2", "p")
+              .write.partitionBy("p").orc(path)
+            val correctAnswer = Seq(Row(1, 2, 1), Row(3, 4, 2), Row(5, 6, 3))
+            checkAnswer(spark.read.orc(path), correctAnswer)
+
+            withSQLConf(HiveUtils.CONVERT_METASTORE_ORC.key -> "true") {
+              withTable("t") {
+                sql(
+                  s"""
+                     |CREATE EXTERNAL TABLE t(c3 INT, c4 INT)
+                     |PARTITIONED BY (p int)
+                     |STORED AS ORC
+                     |LOCATION '$path'
+                     |""".stripMargin)
+                sql(s"MSCK REPAIR TABLE t")

Review comment:
       Thanks, fixed.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] bersprockets commented on a change in pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
bersprockets commented on a change in pull request #29737:
URL: https://github.com/apache/spark/pull/29737#discussion_r490289663



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala
##########
@@ -142,13 +142,17 @@ object OrcUtils extends Logging {
       reader: Reader,
       conf: Configuration): Option[(Array[Int], Boolean)] = {
     val orcFieldNames = reader.getSchema.getFieldNames.asScala
+    val forcePositionalEvolution = OrcConf.FORCE_POSITIONAL_EVOLUTION.getBoolean(conf)
     if (orcFieldNames.isEmpty) {
       // SPARK-8501: Some old empty ORC files always have an empty schema stored in their footer.
       None
     } else {
-      if (orcFieldNames.forall(_.startsWith("_col"))) {
-        // This is a ORC file written by Hive, no field names in the physical schema, assume the
-        // physical schema maps to the data scheme by index.
+      if (forcePositionalEvolution || orcFieldNames.forall(_.startsWith("_col"))) {
+        // This is either an ORC file written by an old versions of Hive and there are no field
+        // names in the physical schema, or a file written by a newer versions of Hive where
+        // `orc.force.positional.evolution=true` and columns were renamed so the physical schema

Review comment:
       Super tiny nit (feel free to ignore if it doesn't make sense to you). I suggest:
   
   `orc.force.positional.evolution=true` (possibly because columns were renamed so the physical schema doesn't match the data schema).
   
   Your current version of the comment implies orc.force.positional.evolution=true will happen when the columns are renamed. But you can set orc.force.positional.evolution=true anytime, and you can rename the columns and neglect to set orc.force.positional.evolution, correct?

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala
##########
@@ -142,13 +142,17 @@ object OrcUtils extends Logging {
       reader: Reader,
       conf: Configuration): Option[(Array[Int], Boolean)] = {
     val orcFieldNames = reader.getSchema.getFieldNames.asScala
+    val forcePositionalEvolution = OrcConf.FORCE_POSITIONAL_EVOLUTION.getBoolean(conf)
     if (orcFieldNames.isEmpty) {
       // SPARK-8501: Some old empty ORC files always have an empty schema stored in their footer.
       None
     } else {
-      if (orcFieldNames.forall(_.startsWith("_col"))) {
-        // This is a ORC file written by Hive, no field names in the physical schema, assume the
-        // physical schema maps to the data scheme by index.
+      if (forcePositionalEvolution || orcFieldNames.forall(_.startsWith("_col"))) {
+        // This is either an ORC file written by an old versions of Hive and there are no field

Review comment:
       Super tiny nit: "written by an old versions of Hive" should be
   written by an old version of Hive
   or
   written by old versions of Hive




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29737:
URL: https://github.com/apache/spark/pull/29737#issuecomment-758798715


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38563/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #29737:
URL: https://github.com/apache/spark/pull/29737#discussion_r490808895



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala
##########
@@ -142,13 +142,17 @@ object OrcUtils extends Logging {
       reader: Reader,
       conf: Configuration): Option[(Array[Int], Boolean)] = {
     val orcFieldNames = reader.getSchema.getFieldNames.asScala
+    val forcePositionalEvolution = OrcConf.FORCE_POSITIONAL_EVOLUTION.getBoolean(conf)

Review comment:
       this doesn't look like a reasonable config for a data source, but more like a config for the upper level that maps metastore schema with the ORC files.
   
   Do we have a good story for the schema evolution of spark file sources?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] peter-toth edited a comment on pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
peter-toth edited a comment on pull request #29737:
URL: https://github.com/apache/spark/pull/29737#issuecomment-705340785


   @dongjoon-hyun, @cloud-fan those 2 issues has been solved in ORC 1.5.12. Do you think we can resume this PR?
   
   @cloud-fan, I understand that this PR doesn't give a full-blown schema evolution support natively in Spark, but it would definitely help a few users who has been using this flag in Hive and would like to use the same in Spark too.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29737:
URL: https://github.com/apache/spark/pull/29737#issuecomment-760351535


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/134059/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29737:
URL: https://github.com/apache/spark/pull/29737#issuecomment-694320789






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29737:
URL: https://github.com/apache/spark/pull/29737#issuecomment-694890157






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] peter-toth commented on a change in pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
peter-toth commented on a change in pull request #29737:
URL: https://github.com/apache/spark/pull/29737#discussion_r491077606



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala
##########
@@ -142,13 +142,17 @@ object OrcUtils extends Logging {
       reader: Reader,
       conf: Configuration): Option[(Array[Int], Boolean)] = {
     val orcFieldNames = reader.getSchema.getFieldNames.asScala
+    val forcePositionalEvolution = OrcConf.FORCE_POSITIONAL_EVOLUTION.getBoolean(conf)

Review comment:
       I see your point now, but this is an ORC specific setting in Hive. With parquet tables you get `NULL` on a renamed column:
   ```
   > set orc.force.positional.evolution;
   +--------------------------------------+
   |                 set                  |
   +--------------------------------------+
   | orc.force.positional.evolution=true  |
   +--------------------------------------+
   > create external table t2 (c1 string, c2 string) stored as parquet;
   > insert into t2 values ('foo', 'bar');
   > alter table t2 change c1 c3 string;
   > select * from t2;
   +--------+--------+
   | t2.c3  | t2.c2  |
   +--------+--------+
   | NULL   | bar    |
   +--------+--------+
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #29737:
URL: https://github.com/apache/spark/pull/29737#issuecomment-695725164


   cc @gatorsmile and @dbtsai 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29737:
URL: https://github.com/apache/spark/pull/29737#issuecomment-694469923


   **[Test build #128833 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128833/testReport)** for PR 29737 at commit [`7143cde`](https://github.com/apache/spark/commit/7143cdecb30c29098fc6d4296fa3f2da2b5129ec).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29737:
URL: https://github.com/apache/spark/pull/29737#issuecomment-694320789






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29737:
URL: https://github.com/apache/spark/pull/29737#issuecomment-694888785


   **[Test build #128871 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128871/testReport)** for PR 29737 at commit [`7143cde`](https://github.com/apache/spark/commit/7143cdecb30c29098fc6d4296fa3f2da2b5129ec).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29737: [WIP][SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29737:
URL: https://github.com/apache/spark/pull/29737#issuecomment-691540620






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] peter-toth commented on pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
peter-toth commented on pull request #29737:
URL: https://github.com/apache/spark/pull/29737#issuecomment-705346151


   > which ORC version spark is using?
   
   We started to use 1.5.12 recently:
   https://github.com/apache/spark/commit/aa6657981aefae8067672d2c99ca560b6179b723


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #29737:
URL: https://github.com/apache/spark/pull/29737#issuecomment-705342256


   which ORC version spark is using?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] peter-toth commented on a change in pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
peter-toth commented on a change in pull request #29737:
URL: https://github.com/apache/spark/pull/29737#discussion_r489386143



##########
File path: sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/HiveOrcQuerySuite.scala
##########
@@ -316,4 +317,24 @@ class HiveOrcQuerySuite extends OrcQueryTest with TestHiveSingleton {
       }
     }
   }
+
+  test("SPARK-32864: Support ORC forced positional evolution") {
+    Seq("native", "hive").foreach { orcImpl =>
+      withSQLConf(SQLConf.ORC_IMPLEMENTATION.key -> orcImpl,
+        OrcConf.FORCE_POSITIONAL_EVOLUTION.getAttribute -> "true") {
+        withTempPath { f =>
+          val path = f.getCanonicalPath
+          Seq(1 -> 2).toDF("c1", "c2").write.orc(path)
+          checkAnswer(spark.read.orc(path), Row(1, 2))
+
+          withSQLConf(HiveUtils.CONVERT_METASTORE_ORC.key -> "true") { // default since 2.3.0
+            withTable("t") {
+              sql(s"CREATE EXTERNAL TABLE t(c3 INT, c4 INT) STORED AS ORC LOCATION '$path'")
+              checkAnswer(spark.table("t"), Row(1, 2))

Review comment:
       Thanks. I've extended the test with that case.

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala
##########
@@ -142,11 +142,12 @@ object OrcUtils extends Logging {
       reader: Reader,
       conf: Configuration): Option[(Array[Int], Boolean)] = {
     val orcFieldNames = reader.getSchema.getFieldNames.asScala
+    val forcePositionalEvolution = OrcConf.FORCE_POSITIONAL_EVOLUTION.getBoolean(conf)
     if (orcFieldNames.isEmpty) {
       // SPARK-8501: Some old empty ORC files always have an empty schema stored in their footer.
       None
     } else {
-      if (orcFieldNames.forall(_.startsWith("_col"))) {
+      if (forcePositionalEvolution || orcFieldNames.forall(_.startsWith("_col"))) {
         // This is a ORC file written by Hive, no field names in the physical schema, assume the

Review comment:
       Done.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29737:
URL: https://github.com/apache/spark/pull/29737#issuecomment-760346940


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38645/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #29737:
URL: https://github.com/apache/spark/pull/29737#discussion_r556689276



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala
##########
@@ -160,6 +160,9 @@ class OrcFileFormat
     val capacity = sqlConf.orcVectorizedReaderBatchSize
 
     OrcConf.IS_SCHEMA_EVOLUTION_CASE_SENSITIVE.setBoolean(hadoopConf, sqlConf.caseSensitiveAnalysis)
+    OrcConf.FORCE_POSITIONAL_EVOLUTION.setBoolean(hadoopConf,
+      hadoopConf.getBoolean("spark.hadoop." + OrcConf.FORCE_POSITIONAL_EVOLUTION.getAttribute,
+        false))

Review comment:
       This is added at the last commit. However, we already remove `spark.hadoop.` prefix from Spark side already, don't we?
   - https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala#L96-L106




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] peter-toth commented on a change in pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
peter-toth commented on a change in pull request #29737:
URL: https://github.com/apache/spark/pull/29737#discussion_r556818252



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala
##########
@@ -160,6 +160,9 @@ class OrcFileFormat
     val capacity = sqlConf.orcVectorizedReaderBatchSize
 
     OrcConf.IS_SCHEMA_EVOLUTION_CASE_SENSITIVE.setBoolean(hadoopConf, sqlConf.caseSensitiveAnalysis)
+    OrcConf.FORCE_POSITIONAL_EVOLUTION.setBoolean(hadoopConf,
+      hadoopConf.getBoolean("spark.hadoop." + OrcConf.FORCE_POSITIONAL_EVOLUTION.getAttribute,
+        false))

Review comment:
       Hmm, I might get something wrong but I don't think that `hadoopConf` is cleaned here as if I remove this line the UTs fail. As far as I see the only reference to `SparkHadoopUtil.appendSparkHadoopConfigs` is from`spark-hive` [HiveUtils.newTemporaryConfiguration](https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala#L547).
   
   I can revert the prefix in https://github.com/apache/spark/pull/29737/files#diff-e14fd8725cf71eee7b34fa299c2f3abe5a0033f9abce9de4c7e081ba57991b0bR146 though.
   
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] peter-toth commented on pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
peter-toth commented on pull request #29737:
URL: https://github.com/apache/spark/pull/29737#issuecomment-705421777


   retest this please


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] peter-toth commented on a change in pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
peter-toth commented on a change in pull request #29737:
URL: https://github.com/apache/spark/pull/29737#discussion_r556619121



##########
File path: sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/HiveOrcQuerySuite.scala
##########
@@ -316,4 +317,33 @@ class HiveOrcQuerySuite extends OrcQueryTest with TestHiveSingleton {
       }
     }
   }
+
+  test("SPARK-32864: Support ORC forced positional evolution") {
+    Seq("native", "hive").foreach { orcImpl =>
+      Seq(true, false).foreach { forcePositionalEvolution =>
+        withSQLConf(SQLConf.ORC_IMPLEMENTATION.key -> orcImpl,
+          OrcConf.FORCE_POSITIONAL_EVOLUTION.getAttribute -> forcePositionalEvolution.toString) {

Review comment:
       Well, I think you might be right, I've prefixed the config in https://github.com/apache/spark/pull/29737/commits/fbab1e4c597ad2fede82892935e665832217ce1b. Let me know it is ok.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #29737:
URL: https://github.com/apache/spark/pull/29737#discussion_r556250822



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala
##########
@@ -142,13 +142,17 @@ object OrcUtils extends Logging {
       reader: Reader,
       conf: Configuration): Option[(Array[Int], Boolean)] = {
     val orcFieldNames = reader.getSchema.getFieldNames.asScala
+    val forcePositionalEvolution = OrcConf.FORCE_POSITIONAL_EVOLUTION.getBoolean(conf)
     if (orcFieldNames.isEmpty) {
       // SPARK-8501: Some old empty ORC files always have an empty schema stored in their footer.
       None
     } else {
-      if (orcFieldNames.forall(_.startsWith("_col"))) {
-        // This is a ORC file written by Hive, no field names in the physical schema, assume the
-        // physical schema maps to the data scheme by index.
+      if (forcePositionalEvolution || orcFieldNames.forall(_.startsWith("_col"))) {
+        // This is either an ORC file written by an old version of Hive and there are no field
+        // names in the physical schema, or a file written by a newer version of Hive where
+        // `orc.force.positional.evolution=true` (possibly because columns were renamed so the

Review comment:
       > a file written by a newer version of Hive where `orc.force.positional.evolution=true`
   
   Does it read correctly? It reaches here just because users enable the config, right? I don't see we check if the file is written `orc.force.positional.evolution=true`.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] peter-toth closed pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
peter-toth closed pull request #29737:
URL: https://github.com/apache/spark/pull/29737


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29737:
URL: https://github.com/apache/spark/pull/29737#issuecomment-760392075






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] peter-toth commented on a change in pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
peter-toth commented on a change in pull request #29737:
URL: https://github.com/apache/spark/pull/29737#discussion_r557539044



##########
File path: sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/HiveOrcQuerySuite.scala
##########
@@ -313,4 +314,71 @@ class HiveOrcQuerySuite extends OrcQueryTest with TestHiveSingleton {
       }
     }
   }
+
+  test("SPARK-32864: Support ORC forced positional evolution") {
+    Seq("native", "hive").foreach { orcImpl =>
+      Seq(true, false).foreach { forcePositionalEvolution =>
+        withSQLConf(SQLConf.ORC_IMPLEMENTATION.key -> orcImpl,
+          OrcConf.FORCE_POSITIONAL_EVOLUTION.getAttribute -> forcePositionalEvolution.toString) {
+          withTempPath { f =>
+            val path = f.getCanonicalPath
+            Seq((1, 2), (3, 4), (5, 6)).toDF("c1", "c2").write.orc(path)
+            val correctAnswer = Seq(Row(1, 2), Row(3, 4), Row(5, 6))
+            checkAnswer(spark.read.orc(path), correctAnswer)
+
+            withSQLConf(HiveUtils.CONVERT_METASTORE_ORC.key -> "true") {
+              withTable("t") {
+                sql(s"CREATE EXTERNAL TABLE t(c3 INT, c4 INT) STORED AS ORC LOCATION '$path'")
+
+                val expected = if (forcePositionalEvolution) {
+                  correctAnswer
+                } else {
+                  Seq(Row(null, null), Row(null, null), Row(null, null))
+                }
+
+                checkAnswer(spark.table("t"), expected)
+              }
+            }
+          }
+        }
+      }
+    }
+  }
+
+  test("SPARK-32864: Support ORC forced positional evolution with partitioned table") {
+    Seq("native", "hive").foreach { orcImpl =>
+      Seq(true, false).foreach { forcePositionalEvolution =>
+        withSQLConf(SQLConf.ORC_IMPLEMENTATION.key -> orcImpl,
+          OrcConf.FORCE_POSITIONAL_EVOLUTION.getAttribute -> forcePositionalEvolution.toString) {
+          withTempPath { f =>
+            val path = f.getCanonicalPath
+            Seq((1, 2, 1), (3, 4, 2), (5, 6, 3)).toDF("c1", "c2", "p")
+              .write.partitionBy("p").orc(path)
+            val correctAnswer = Seq(Row(1, 2, 1), Row(3, 4, 2), Row(5, 6, 3))
+            checkAnswer(spark.read.orc(path), correctAnswer)
+
+            withSQLConf(HiveUtils.CONVERT_METASTORE_ORC.key -> "true") {

Review comment:
       Added in https://github.com/apache/spark/pull/29737/commits/51f503c5738a714d6ea77467ac8f7dfba231d989




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #29737:
URL: https://github.com/apache/spark/pull/29737#discussion_r557524551



##########
File path: sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/HiveOrcQuerySuite.scala
##########
@@ -313,4 +314,71 @@ class HiveOrcQuerySuite extends OrcQueryTest with TestHiveSingleton {
       }
     }
   }
+
+  test("SPARK-32864: Support ORC forced positional evolution") {
+    Seq("native", "hive").foreach { orcImpl =>
+      Seq(true, false).foreach { forcePositionalEvolution =>
+        withSQLConf(SQLConf.ORC_IMPLEMENTATION.key -> orcImpl,
+          OrcConf.FORCE_POSITIONAL_EVOLUTION.getAttribute -> forcePositionalEvolution.toString) {
+          withTempPath { f =>
+            val path = f.getCanonicalPath
+            Seq((1, 2), (3, 4), (5, 6)).toDF("c1", "c2").write.orc(path)

Review comment:
       Can we add a null row like `(null, null)` additionally?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29737:
URL: https://github.com/apache/spark/pull/29737#issuecomment-758932756


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/133976/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29737:
URL: https://github.com/apache/spark/pull/29737#issuecomment-760700894


   **[Test build #134070 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134070/testReport)** for PR 29737 at commit [`51f503c`](https://github.com/apache/spark/commit/51f503c5738a714d6ea77467ac8f7dfba231d989).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #29737:
URL: https://github.com/apache/spark/pull/29737#discussion_r557520537



##########
File path: sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/HiveOrcQuerySuite.scala
##########
@@ -313,4 +314,71 @@ class HiveOrcQuerySuite extends OrcQueryTest with TestHiveSingleton {
       }
     }
   }
+
+  test("SPARK-32864: Support ORC forced positional evolution") {
+    Seq("native", "hive").foreach { orcImpl =>
+      Seq(true, false).foreach { forcePositionalEvolution =>
+        withSQLConf(SQLConf.ORC_IMPLEMENTATION.key -> orcImpl,
+          OrcConf.FORCE_POSITIONAL_EVOLUTION.getAttribute -> forcePositionalEvolution.toString) {
+          withTempPath { f =>
+            val path = f.getCanonicalPath
+            Seq((1, 2), (3, 4), (5, 6)).toDF("c1", "c2").write.orc(path)
+            val correctAnswer = Seq(Row(1, 2), Row(3, 4), Row(5, 6))
+            checkAnswer(spark.read.orc(path), correctAnswer)
+
+            withSQLConf(HiveUtils.CONVERT_METASTORE_ORC.key -> "true") {

Review comment:
       Can we a test coverage for `false`?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #29737:
URL: https://github.com/apache/spark/pull/29737#issuecomment-705342256


   which ORC version spark is using?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #29737:
URL: https://github.com/apache/spark/pull/29737#discussion_r556839642



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala
##########
@@ -160,6 +160,9 @@ class OrcFileFormat
     val capacity = sqlConf.orcVectorizedReaderBatchSize
 
     OrcConf.IS_SCHEMA_EVOLUTION_CASE_SENSITIVE.setBoolean(hadoopConf, sqlConf.caseSensitiveAnalysis)
+    OrcConf.FORCE_POSITIONAL_EVOLUTION.setBoolean(hadoopConf,
+      hadoopConf.getBoolean("spark.hadoop." + OrcConf.FORCE_POSITIONAL_EVOLUTION.getAttribute,
+        false))

Review comment:
       Parquet is the same. Please see `ParquetFileFormat.scala`. In this context, the Hadoop Conf should have the original conf key, not `spark.hadoop` prefixed ones.
   ```
         lazy val footerFileMetaData =
           ParquetFileReader.readFooter(sharedConf, filePath, SKIP_ROW_GROUPS).getFileMetaData
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29737:
URL: https://github.com/apache/spark/pull/29737#issuecomment-759533409


   **[Test build #134021 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134021/testReport)** for PR 29737 at commit [`fbab1e4`](https://github.com/apache/spark/commit/fbab1e4c597ad2fede82892935e665832217ce1b).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] peter-toth closed pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
peter-toth closed pull request #29737:
URL: https://github.com/apache/spark/pull/29737


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #29737:
URL: https://github.com/apache/spark/pull/29737#issuecomment-695963984


   @dongjoon-hyun Thanks for the references! It seems that the `orc.force.positional.evolution` config was designed for schema evolution, and I think it's better to design schema evolution natively at Spark side.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #29737:
URL: https://github.com/apache/spark/pull/29737#discussion_r557525279



##########
File path: sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/HiveOrcQuerySuite.scala
##########
@@ -313,4 +314,71 @@ class HiveOrcQuerySuite extends OrcQueryTest with TestHiveSingleton {
       }
     }
   }
+
+  test("SPARK-32864: Support ORC forced positional evolution") {
+    Seq("native", "hive").foreach { orcImpl =>
+      Seq(true, false).foreach { forcePositionalEvolution =>
+        withSQLConf(SQLConf.ORC_IMPLEMENTATION.key -> orcImpl,
+          OrcConf.FORCE_POSITIONAL_EVOLUTION.getAttribute -> forcePositionalEvolution.toString) {
+          withTempPath { f =>
+            val path = f.getCanonicalPath
+            Seq((1, 2), (3, 4), (5, 6)).toDF("c1", "c2").write.orc(path)
+            val correctAnswer = Seq(Row(1, 2), Row(3, 4), Row(5, 6))
+            checkAnswer(spark.read.orc(path), correctAnswer)
+
+            withSQLConf(HiveUtils.CONVERT_METASTORE_ORC.key -> "true") {
+              withTable("t") {
+                sql(s"CREATE EXTERNAL TABLE t(c3 INT, c4 INT) STORED AS ORC LOCATION '$path'")
+
+                val expected = if (forcePositionalEvolution) {
+                  correctAnswer
+                } else {
+                  Seq(Row(null, null), Row(null, null), Row(null, null))
+                }
+
+                checkAnswer(spark.table("t"), expected)
+              }
+            }
+          }
+        }
+      }
+    }
+  }
+
+  test("SPARK-32864: Support ORC forced positional evolution with partitioned table") {
+    Seq("native", "hive").foreach { orcImpl =>
+      Seq(true, false).foreach { forcePositionalEvolution =>
+        withSQLConf(SQLConf.ORC_IMPLEMENTATION.key -> orcImpl,
+          OrcConf.FORCE_POSITIONAL_EVOLUTION.getAttribute -> forcePositionalEvolution.toString) {
+          withTempPath { f =>
+            val path = f.getCanonicalPath
+            Seq((1, 2, 1), (3, 4, 2), (5, 6, 3)).toDF("c1", "c2", "p")

Review comment:
       Let's add one additional null row. Maybe, `(null, null, 4)`.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #29737:
URL: https://github.com/apache/spark/pull/29737#issuecomment-760655455


   Thank you, @peter-toth and all . Merged to master.
   Sorry for the delay on this contribution.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29737: [WIP][SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29737:
URL: https://github.com/apache/spark/pull/29737#issuecomment-691509378






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29737:
URL: https://github.com/apache/spark/pull/29737#issuecomment-760363631


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38645/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] peter-toth commented on pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
peter-toth commented on pull request #29737:
URL: https://github.com/apache/spark/pull/29737#issuecomment-705340785






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #29737:
URL: https://github.com/apache/spark/pull/29737#discussion_r556835783



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala
##########
@@ -160,6 +160,9 @@ class OrcFileFormat
     val capacity = sqlConf.orcVectorizedReaderBatchSize
 
     OrcConf.IS_SCHEMA_EVOLUTION_CASE_SENSITIVE.setBoolean(hadoopConf, sqlConf.caseSensitiveAnalysis)
+    OrcConf.FORCE_POSITIONAL_EVOLUTION.setBoolean(hadoopConf,
+      hadoopConf.getBoolean("spark.hadoop." + OrcConf.FORCE_POSITIONAL_EVOLUTION.getAttribute,
+        false))

Review comment:
       @peter-toth . You should not write a UT like `withSQLConf("spark.hadoop.xxx")`. `withSQLConf` is a testing framework only.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] redsanket commented on pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
redsanket commented on pull request #29737:
URL: https://github.com/apache/spark/pull/29737#issuecomment-758178062


   This looks like a useful addition, is there any update on this?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29737:
URL: https://github.com/apache/spark/pull/29737#issuecomment-694320076


   **[Test build #128833 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128833/testReport)** for PR 29737 at commit [`7143cde`](https://github.com/apache/spark/commit/7143cdecb30c29098fc6d4296fa3f2da2b5129ec).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29737:
URL: https://github.com/apache/spark/pull/29737#issuecomment-760641018


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/38656/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] peter-toth commented on pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
peter-toth commented on pull request #29737:
URL: https://github.com/apache/spark/pull/29737#issuecomment-694741458


   retest this please


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29737:
URL: https://github.com/apache/spark/pull/29737#issuecomment-760607924


   **[Test build #134070 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134070/testReport)** for PR 29737 at commit [`51f503c`](https://github.com/apache/spark/commit/51f503c5738a714d6ea77467ac8f7dfba231d989).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #29737:
URL: https://github.com/apache/spark/pull/29737#discussion_r556690938



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/orc/OrcPartitionReaderFactory.scala
##########
@@ -81,6 +81,8 @@ case class OrcPartitionReaderFactory(
     val conf = broadcastedConf.value.value
 
     OrcConf.IS_SCHEMA_EVOLUTION_CASE_SENSITIVE.setBoolean(conf, isCaseSensitive)
+    OrcConf.FORCE_POSITIONAL_EVOLUTION.setBoolean(conf,
+      conf.getBoolean("spark.hadoop." + OrcConf.FORCE_POSITIONAL_EVOLUTION.getAttribute, false))

Review comment:
       ditto. This looks wrong.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #29737:
URL: https://github.com/apache/spark/pull/29737#discussion_r555932085



##########
File path: sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/HiveOrcQuerySuite.scala
##########
@@ -316,4 +317,33 @@ class HiveOrcQuerySuite extends OrcQueryTest with TestHiveSingleton {
       }
     }
   }
+
+  test("SPARK-32864: Support ORC forced positional evolution") {

Review comment:
       Could you add a partitioned table also because it's the main use case for this option, @peter-toth ?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29737:
URL: https://github.com/apache/spark/pull/29737#issuecomment-760390045


   **[Test build #134054 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134054/testReport)** for PR 29737 at commit [`b12dfef`](https://github.com/apache/spark/commit/b12dfeffc7158859738c2c04ae111d75626bfc33).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] peter-toth commented on pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
peter-toth commented on pull request #29737:
URL: https://github.com/apache/spark/pull/29737#issuecomment-760807396


   Thanks @dongjoon-hyun and all for the review.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29737:
URL: https://github.com/apache/spark/pull/29737#issuecomment-760713228


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/134070/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org