You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/12/22 11:09:15 UTC

[GitHub] [spark] beliefer opened a new pull request #34984: [WIP][SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

beliefer opened a new pull request #34984:
URL: https://github.com/apache/spark/pull/34984

### What changes were proposed in this pull request?
#33588 (comment) show Spark cannot read/write timestamp ntz and ltz correctly. Based on the discussion https://github.com/apache/spark/pull/34712#issuecomment-981402675, we just to fix read/write timestamp ntz to Orc uses UTC timestamp.

The root cause is Orc write/read timestamp with local timezone in default. The local timezone will be changed.
If the Orc writer write timestamp with local timezone(e.g. America/Los_Angeles), when the Orc reader reading the timestamp with other local timezone(e.g. Europe/Amsterdam), the value of timestamp will be different.

If we let the Orc writer write timestamp with UTC timezone, when the Orc reader reading the timestamp with UTC timezone too, the value of timestamp will be correct.

The related Orc source:
https://github.com/apache/orc/blob/3f1e57cf1cebe58027c1bd48c09eef4e9717a9e3/java/core/src/java/org/apache/orc/impl/WriterImpl.java#L525

https://github.com/apache/orc/blob/1f68ac0c7f2ae804b374500dcf1b4d7abe30ffeb/java/core/src/java/org/apache/orc/impl/TreeReaderFactory.java#L1184

### Why are the changes needed?
Fix the bug about read/write timestamp ntz from/to Orc with different times zone.

### Does this PR introduce _any_ user-facing change?
No. Orc timestamp ntz is a new feature not release yet.

### How was this patch tested?
New tests.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34984:
URL: https://github.com/apache/spark/pull/34984#issuecomment-999666818


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50960/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan commented on a change in pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on a change in pull request #34984:
URL: https://github.com/apache/spark/pull/34984#discussion_r834036341



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcQuerySuite.scala
##########
@@ -803,6 +804,35 @@ abstract class OrcQuerySuite extends OrcQueryTest with SharedSparkSession {
       }
     }
   }
+
+  test("SPARK-37463: read/write Timestamp ntz to Orc with different time zone") {
+    val localTimeZone = TimeZone.getDefault
+    try {

Review comment:
       nit: we can use this help function `DateTimeTestUtils.withDefaultTimeZone(DateTimeTestUtils.LA)`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA removed a comment on pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #34984:
URL: https://github.com/apache/spark/pull/34984#issuecomment-999633177


   **[Test build #146486 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/146486/testReport)** for PR 34984 at commit [`e744dc0`](https://github.com/apache/spark/commit/e744dc086b093cc1487e3775a8b6753ee99cea1b).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #34984:
URL: https://github.com/apache/spark/pull/34984#issuecomment-1000319525


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/50994/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34984:
URL: https://github.com/apache/spark/pull/34984#issuecomment-1000605842


   **[Test build #146544 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/146544/testReport)** for PR 34984 at commit [`443ea93`](https://github.com/apache/spark/commit/443ea93874cd285a9c555b46ee8d47126684f3e3).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34984:
URL: https://github.com/apache/spark/pull/34984#issuecomment-1000651234


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/51024/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA removed a comment on pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #34984:
URL: https://github.com/apache/spark/pull/34984#issuecomment-1000253944


   **[Test build #146518 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/146518/testReport)** for PR 34984 at commit [`49cf2c9`](https://github.com/apache/spark/commit/49cf2c98c5bec21e8aeec0a577759ea51679658f).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #34984:
URL: https://github.com/apache/spark/pull/34984#issuecomment-1000422571


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/146518/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34984:
URL: https://github.com/apache/spark/pull/34984#issuecomment-1000651791


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/51022/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34984:
URL: https://github.com/apache/spark/pull/34984#issuecomment-999628297


   **[Test build #146480 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/146480/testReport)** for PR 34984 at commit [`cb80834`](https://github.com/apache/spark/commit/cb80834f5ab98261aa1948c184654d8792690453).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34984:
URL: https://github.com/apache/spark/pull/34984#issuecomment-999591573


   **[Test build #146484 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/146484/testReport)** for PR 34984 at commit [`c8f07f1`](https://github.com/apache/spark/commit/c8f07f13284bce88bf78c7525ec9c5e27266b7c6).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan commented on a change in pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on a change in pull request #34984:
URL: https://github.com/apache/spark/pull/34984#discussion_r778664524



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala
##########
@@ -372,12 +386,14 @@ object OrcUtils extends Logging {
       canPruneCols: Boolean,
       dataSchema: StructType,
       resultSchema: StructType,
+      orcCatalystSchema: StructType,
       partitionSchema: StructType,
       conf: Configuration): String = {
     val resultSchemaString = if (canPruneCols) {
-      OrcUtils.orcTypeDescriptionString(resultSchema)
+      OrcUtils.orcTypeDescriptionString(resultSchema, Some(orcCatalystSchema))

Review comment:
       It's quite hacky to put schema evolution code in a method named `orcTypeDescriptionString`. I'd prefer more explicit code similar to `ParquetReadSupport.clipParquetSchema`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] beliefer commented on a change in pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

beliefer commented on a change in pull request #34984:
URL: https://github.com/apache/spark/pull/34984#discussion_r818469815



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala
##########
@@ -372,12 +385,14 @@ object OrcUtils extends Logging {
       canPruneCols: Boolean,
       dataSchema: StructType,
       resultSchema: StructType,
+      orcCatalystSchema: StructType,
       partitionSchema: StructType,
       conf: Configuration): String = {
     val resultSchemaString = if (canPruneCols) {
-      OrcUtils.orcTypeDescriptionString(resultSchema)

Review comment:
       When reading ntz as ltz, `resultSchema` not contains timestamp NTZ.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34984:
URL: https://github.com/apache/spark/pull/34984#issuecomment-1000316239


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50994/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA removed a comment on pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #34984:
URL: https://github.com/apache/spark/pull/34984#issuecomment-1000626943


   **[Test build #146549 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/146549/testReport)** for PR 34984 at commit [`eab6532`](https://github.com/apache/spark/commit/eab6532b5313a3d8895ad9105c76c18be3da5f9c).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34984:
URL: https://github.com/apache/spark/pull/34984#issuecomment-1000141838


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50985/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34984:
URL: https://github.com/apache/spark/pull/34984#issuecomment-1000690807


   **[Test build #146544 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/146544/testReport)** for PR 34984 at commit [`443ea93`](https://github.com/apache/spark/commit/443ea93874cd285a9c555b46ee8d47126684f3e3).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA removed a comment on pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #34984:
URL: https://github.com/apache/spark/pull/34984#issuecomment-999517173


   **[Test build #146478 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/146478/testReport)** for PR 34984 at commit [`f393296`](https://github.com/apache/spark/commit/f393296fd7d93f2f047a9d0b9b415d7e92a79ea3).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] beliefer commented on a change in pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

beliefer commented on a change in pull request #34984:
URL: https://github.com/apache/spark/pull/34984#discussion_r834096015



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcQuerySuite.scala
##########
@@ -803,6 +804,35 @@ abstract class OrcQuerySuite extends OrcQueryTest with SharedSparkSession {
       }
     }
   }
+
+  test("SPARK-37463: read/write Timestamp ntz to Orc with different time zone") {
+    val localTimeZone = TimeZone.getDefault
+    try {
+      TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles"))
+
+      val sqlText = """
+                      |select
+                      | timestamp_ntz '2021-06-01 00:00:00' ts_ntz1,
+                      | timestamp_ntz '1883-11-16 00:00:00.0' as ts_ntz2,
+                      | timestamp_ntz '2021-03-14 02:15:00.0' as ts_ntz3
+                      |""".stripMargin
+
+      val df = sql(sqlText)
+
+      df.write.mode("overwrite").orc("ts_ntz_orc")
+
+      val query = "select * from `orc`.`ts_ntz_orc`"
+
+      Seq("America/Los_Angeles", "UTC", "Europe/Amsterdam").foreach { tz =>

Review comment:
       OK




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] beliefer commented on pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

beliefer commented on pull request #34984:
URL: https://github.com/apache/spark/pull/34984#issuecomment-1078548624


   @cloud-fan @bersprockets @HyukjinKwon Thank you for all your help!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan commented on a change in pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on a change in pull request #34984:
URL: https://github.com/apache/spark/pull/34984#discussion_r775758663



##########
File path: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/orc/OrcAtomicColumnVector.java
##########
@@ -111,10 +103,8 @@ public int getInt(int rowId) {
   @Override
   public long getLong(int rowId) {
     int index = getRowIndex(rowId);
-    if (isTimestamp) {
+    if (isTimestamp && timestampData != null) {

Review comment:
       how can it happen that `isTimestamp` is true but `timestampData` is null?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan commented on a change in pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on a change in pull request #34984:
URL: https://github.com/apache/spark/pull/34984#discussion_r778657315



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcDeserializer.scala
##########
@@ -127,10 +127,14 @@ class OrcDeserializer(
         updater.setInt(ordinal, OrcShimUtils.getGregorianDays(value))
 
       case TimestampType => (ordinal, value) =>
-        updater.setLong(ordinal, DateTimeUtils.fromJavaTimestamp(value.asInstanceOf[OrcTimestamp]))
-
-      case TimestampNTZType => (ordinal, value) =>
-        updater.setLong(ordinal, OrcUtils.fromOrcNTZ(value.asInstanceOf[OrcTimestamp]))
+        if (value.isInstanceOf[OrcTimestamp]) {

Review comment:
       We just need to add one single comment before the if-else
   ```
   // When reading ORC timestamp as Spark LTZ, the ORC timestamp can be either
   // `OrcTimestamp` (which maps to Spark LTZ) or `LongWritable` (which maps
   // to Spark NTZ)
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34984:
URL: https://github.com/apache/spark/pull/34984#issuecomment-999718303


   **[Test build #146486 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/146486/testReport)** for PR 34984 at commit [`e744dc0`](https://github.com/apache/spark/commit/e744dc086b093cc1487e3775a8b6753ee99cea1b).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan commented on a change in pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on a change in pull request #34984:
URL: https://github.com/apache/spark/pull/34984#discussion_r774631874



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcDeserializer.scala
##########
@@ -127,10 +127,12 @@ class OrcDeserializer(
         updater.setInt(ordinal, OrcShimUtils.getGregorianDays(value))
 
       case TimestampType => (ordinal, value) =>
-        updater.setLong(ordinal, DateTimeUtils.fromJavaTimestamp(value.asInstanceOf[OrcTimestamp]))
-
-      case TimestampNTZType => (ordinal, value) =>
-        updater.setLong(ordinal, OrcUtils.fromOrcNTZ(value.asInstanceOf[OrcTimestamp]))
+        if (value.isInstanceOf[OrcTimestamp]) {
+          updater.setLong(ordinal,
+            DateTimeUtils.fromJavaTimestamp(value.asInstanceOf[OrcTimestamp]))
+        } else {
+          updater.setLong(ordinal, value.asInstanceOf[LongWritable].get)

Review comment:
       when can we reach this branch?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34984:
URL: https://github.com/apache/spark/pull/34984#issuecomment-999590285


   **[Test build #146478 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/146478/testReport)** for PR 34984 at commit [`f393296`](https://github.com/apache/spark/commit/f393296fd7d93f2f047a9d0b9b415d7e92a79ea3).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] beliefer commented on a change in pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

beliefer commented on a change in pull request #34984:
URL: https://github.com/apache/spark/pull/34984#discussion_r774253383



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala
##########
@@ -535,11 +544,11 @@ object OrcUtils extends Logging {
       (ts.getNanos / NANOS_PER_MICROS) % MICROS_PER_MILLIS
   }
 
-  def toOrcNTZ(micros: Long): OrcTimestamp = {
-    val seconds = Math.floorDiv(micros, MICROS_PER_SECOND)
-    val nanos = (micros - seconds * MICROS_PER_SECOND) * NANOS_PER_MICROS
-    val result = new OrcTimestamp(seconds * MILLIS_PER_SECOND)
-    result.setNanos(nanos.toInt)
-    result
-  }
+//  def toOrcNTZ(micros: Long): OrcTimestamp = {

Review comment:
       Yeah. I forgot it.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA removed a comment on pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #34984:
URL: https://github.com/apache/spark/pull/34984#issuecomment-999553112


   **[Test build #146480 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/146480/testReport)** for PR 34984 at commit [`cb80834`](https://github.com/apache/spark/commit/cb80834f5ab98261aa1948c184654d8792690453).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34984:
URL: https://github.com/apache/spark/pull/34984#issuecomment-1000116963


   **[Test build #146509 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/146509/testReport)** for PR 34984 at commit [`49cf2c9`](https://github.com/apache/spark/commit/49cf2c98c5bec21e8aeec0a577759ea51679658f).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] bersprockets commented on a change in pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

bersprockets commented on a change in pull request #34984:
URL: https://github.com/apache/spark/pull/34984#discussion_r784249939



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala
##########
@@ -279,22 +275,40 @@ object OrcUtils extends Logging {
   }
 
   /**
-   * Given a `StructType` object, this methods converts it to corresponding string representation
-   * in ORC.
+   * Given two `StructType` object, this methods converts it to corresponding string representation
+   * in ORC. The second `StructType` used to change the `TimestampNTZType` as LongType in result
+   * schema string when reading `TimestampNTZ` as `TimestampLTZ`.
    */
-  def orcTypeDescriptionString(dt: DataType): String = dt match {
-    case s: StructType =>
+  def getOrcSchemaString(
+      dt: DataType, orcDt: Option[DataType] = None): String = (dt, orcDt) match {
+    case (s1: StructType, Some(s2: StructType)) =>
+      val fieldTypes = s1.fields.map { f =>
+        val idx = s2.fieldNames.indexWhere(caseSensitiveResolution(_, f.name))

Review comment:
       Thanks. That fixed the perf issue.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan commented on a change in pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on a change in pull request #34984:
URL: https://github.com/apache/spark/pull/34984#discussion_r775760615



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala
##########
@@ -161,8 +161,9 @@ class OrcFileFormat
         }
 
         val (requestedColIds, canPruneCols) = resultedColPruneInfo.get
+        val orcCatalystSchema = OrcUtils.toCatalystSchema(reader.getSchema)
         val resultSchemaString = OrcUtils.orcResultSchemaString(canPruneCols,
-          dataSchema, resultSchema, partitionSchema, conf)
+          dataSchema, resultSchema, orcCatalystSchema, partitionSchema, conf)

Review comment:
       can you explain more about this change?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] bersprockets commented on a change in pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

bersprockets commented on a change in pull request #34984:
URL: https://github.com/apache/spark/pull/34984#discussion_r783464768



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala
##########
@@ -279,22 +275,40 @@ object OrcUtils extends Logging {
   }
 
   /**
-   * Given a `StructType` object, this methods converts it to corresponding string representation
-   * in ORC.
+   * Given two `StructType` object, this methods converts it to corresponding string representation
+   * in ORC. The second `StructType` used to change the `TimestampNTZType` as LongType in result
+   * schema string when reading `TimestampNTZ` as `TimestampLTZ`.
    */
-  def orcTypeDescriptionString(dt: DataType): String = dt match {
-    case s: StructType =>
+  def getOrcSchemaString(
+      dt: DataType, orcDt: Option[DataType] = None): String = (dt, orcDt) match {
+    case (s1: StructType, Some(s2: StructType)) =>
+      val fieldTypes = s1.fields.map { f =>
+        val idx = s2.fieldNames.indexWhere(caseSensitiveResolution(_, f.name))

Review comment:
       >I think it could simplify the code
   
   True, but it can be impactful.
   
   Here are similar situations that bypass the O(n^2) complexity:
   
   - `ParquetReadSupport #clipParquetGroupFields` [here](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetReadSupport.scala#L330) and [here](https://github.com/apache/spark/blob/f4543a545cbc4c99e210302858128f30f95011a9/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetReadSupport.scala#L340)
   - `OrcUtils#requestedColumnIds` [here](https://github.com/apache/spark/blob/c5d7c3c17f140bd81ee886893d2e626644ebb6ac/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala#L253)
   
   In a contrived query (6000 bigint columns, 65 files, 2000 rows per file, single executor) I saw a 40-50% penalty between the PR and the PR's base (08dd010860).




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #34984:
URL: https://github.com/apache/spark/pull/34984#issuecomment-1000178133


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/50985/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34984:
URL: https://github.com/apache/spark/pull/34984#issuecomment-999581359


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50954/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34984:
URL: https://github.com/apache/spark/pull/34984#issuecomment-999672136


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50962/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #34984:
URL: https://github.com/apache/spark/pull/34984#issuecomment-999720464






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34984:
URL: https://github.com/apache/spark/pull/34984#issuecomment-1000711299


   **[Test build #146547 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/146547/testReport)** for PR 34984 at commit [`5d7eda6`](https://github.com/apache/spark/commit/5d7eda68b8b0bd7fd5f335ccec05afc91fd3abc8).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] beliefer commented on a change in pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

beliefer commented on a change in pull request #34984:
URL: https://github.com/apache/spark/pull/34984#discussion_r775849298



##########
File path: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/orc/OrcAtomicColumnVector.java
##########
@@ -111,10 +103,8 @@ public int getInt(int rowId) {
   @Override
   public long getLong(int rowId) {
     int index = getRowIndex(rowId);
-    if (isTimestamp) {
+    if (isTimestamp && timestampData != null) {

Review comment:
       This happens when read timestamp_ntz as timestamp_ltz.

##########
File path: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/orc/OrcAtomicColumnVector.java
##########
@@ -111,10 +103,8 @@ public int getInt(int rowId) {
   @Override
   public long getLong(int rowId) {
     int index = getRowIndex(rowId);
-    if (isTimestamp) {
+    if (isTimestamp && timestampData != null) {

Review comment:
       This happens when read `timestamp_ntz` as `timestamp_ltz`.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan commented on pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on pull request #34984:
URL: https://github.com/apache/spark/pull/34984#issuecomment-1077705235


   thanks, merging to master/3.3!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] beliefer commented on a change in pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

beliefer commented on a change in pull request #34984:
URL: https://github.com/apache/spark/pull/34984#discussion_r818458114



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala
##########
@@ -372,12 +385,14 @@ object OrcUtils extends Logging {
       canPruneCols: Boolean,
       dataSchema: StructType,
       resultSchema: StructType,

Review comment:
       Yes.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan commented on a change in pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on a change in pull request #34984:
URL: https://github.com/apache/spark/pull/34984#discussion_r818432106



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala
##########
@@ -372,12 +385,14 @@ object OrcUtils extends Logging {
       canPruneCols: Boolean,
       dataSchema: StructType,
       resultSchema: StructType,

Review comment:
       I'm thinking about this again. It's a long-standing issue that the schema evolution is completely defined by the ORC library, but it's non-trivial to fix this and let Spark take control.
   
   So my proposal is: when we generate `resultSchemaString` and set it to the ORC reader, we should replace `TimestmapNTZType` with `LongType` in the catalyst schema, as we know NTZ is stored as int64 in ORC file.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] beliefer commented on a change in pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

beliefer commented on a change in pull request #34984:
URL: https://github.com/apache/spark/pull/34984#discussion_r774639188



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcDeserializer.scala
##########
@@ -127,10 +127,12 @@ class OrcDeserializer(
         updater.setInt(ordinal, OrcShimUtils.getGregorianDays(value))
 
       case TimestampType => (ordinal, value) =>
-        updater.setLong(ordinal, DateTimeUtils.fromJavaTimestamp(value.asInstanceOf[OrcTimestamp]))
-
-      case TimestampNTZType => (ordinal, value) =>
-        updater.setLong(ordinal, OrcUtils.fromOrcNTZ(value.asInstanceOf[OrcTimestamp]))
+        if (value.isInstanceOf[OrcTimestamp]) {
+          updater.setLong(ordinal,
+            DateTimeUtils.fromJavaTimestamp(value.asInstanceOf[OrcTimestamp]))
+        } else {
+          updater.setLong(ordinal, value.asInstanceOf[LongWritable].get)

Review comment:
       > when we read NTZ as LTZ?
   
   Yes.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34984:
URL: https://github.com/apache/spark/pull/34984#issuecomment-1000419778


   **[Test build #146518 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/146518/testReport)** for PR 34984 at commit [`49cf2c9`](https://github.com/apache/spark/commit/49cf2c98c5bec21e8aeec0a577759ea51679658f).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34984:
URL: https://github.com/apache/spark/pull/34984#issuecomment-1000636314






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] beliefer commented on pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

beliefer commented on pull request #34984:
URL: https://github.com/apache/spark/pull/34984#issuecomment-1000237051


   retest this please


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #34984:
URL: https://github.com/apache/spark/pull/34984#issuecomment-1000319525


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/50994/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #34984: [WIP][SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34984:
URL: https://github.com/apache/spark/pull/34984#issuecomment-999545916


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50954/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #34984:
URL: https://github.com/apache/spark/pull/34984#issuecomment-999678348


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/50960/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34984:
URL: https://github.com/apache/spark/pull/34984#issuecomment-999707490


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50962/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #34984: [WIP][SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34984:
URL: https://github.com/apache/spark/pull/34984#issuecomment-999553112


   **[Test build #146480 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/146480/testReport)** for PR 34984 at commit [`cb80834`](https://github.com/apache/spark/commit/cb80834f5ab98261aa1948c184654d8792690453).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan commented on a change in pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on a change in pull request #34984:
URL: https://github.com/apache/spark/pull/34984#discussion_r775761535



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala
##########
@@ -279,22 +277,40 @@ object OrcUtils extends Logging {
   }
 
   /**
-   * Given a `StructType` object, this methods converts it to corresponding string representation
-   * in ORC.
+   * Given two `StructType` object, this methods converts it to corresponding string representation
+   * in ORC. The second `StructType` used to change the `TimestampNTZType` as LongType in result
+   * schema string when reading `TimestampNTZ` as `TimestampLTZ`.
    */
-  def orcTypeDescriptionString(dt: DataType): String = dt match {
-    case s: StructType =>
+  def orcTypeDescriptionString(
+      dt: DataType, orcDt: Option[DataType] = None): String = (dt, orcDt) match {
+    case (s1: StructType, Some(s2: StructType)) =>
+      val fieldTypes = s1.fields.map { f =>
+        val idx = s2.fieldNames.indexWhere(caseSensitiveResolution(_, f.name))
+        if (idx == -1) {
+          s"${quoteIdentifier(f.name)}:${orcTypeDescriptionString(f.dataType)}"
+        } else {
+          s"${quoteIdentifier(f.name)}:" +
+            s"${orcTypeDescriptionString(f.dataType, Some(s2(idx).dataType))}"
+        }
+      }
+      s"struct<${fieldTypes.mkString(",")}>"
+    case (s: StructType, None) =>
       val fieldTypes = s.fields.map { f =>
         s"${quoteIdentifier(f.name)}:${orcTypeDescriptionString(f.dataType)}"
       }
       s"struct<${fieldTypes.mkString(",")}>"
-    case a: ArrayType =>
+    case (a1: ArrayType, Some(a2: ArrayType)) =>
+      s"array<${orcTypeDescriptionString(a1.elementType, Some(a2.elementType))}>"
+    case (a: ArrayType, None) =>
       s"array<${orcTypeDescriptionString(a.elementType)}>"
-    case m: MapType =>
+    case (m1: MapType, Some(m2: MapType)) =>
+      s"map<${orcTypeDescriptionString(m1.keyType, Some(m2.keyType))}," +
+        s"${orcTypeDescriptionString(m1.valueType, Some(m2.valueType))}>"
+    case (m: MapType, None) =>
       s"map<${orcTypeDescriptionString(m.keyType)},${orcTypeDescriptionString(m.valueType)}>"
-    case TimestampNTZType => TypeDescription.Category.TIMESTAMP.getName
-    case _: DayTimeIntervalType => LongType.catalogString
-    case _: YearMonthIntervalType => IntegerType.catalogString
+    case (_: DayTimeIntervalType | _: TimestampNTZType, _) => LongType.catalogString
+    case (_: YearMonthIntervalType, _) => IntegerType.catalogString
+    case (TimestampType, Some(TimestampNTZType)) => LongType.catalogString

Review comment:
       When we read a ORC int32 as Spark long type, do we need to correct the type string as well?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] beliefer commented on a change in pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

beliefer commented on a change in pull request #34984:
URL: https://github.com/apache/spark/pull/34984#discussion_r782720716



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala
##########
@@ -279,22 +275,40 @@ object OrcUtils extends Logging {
   }
 
   /**
-   * Given a `StructType` object, this methods converts it to corresponding string representation
-   * in ORC.
+   * Given two `StructType` object, this methods converts it to corresponding string representation
+   * in ORC. The second `StructType` used to change the `TimestampNTZType` as LongType in result
+   * schema string when reading `TimestampNTZ` as `TimestampLTZ`.
    */
-  def orcTypeDescriptionString(dt: DataType): String = dt match {
-    case s: StructType =>
+  def getOrcSchemaString(
+      dt: DataType, orcDt: Option[DataType] = None): String = (dt, orcDt) match {
+    case (s1: StructType, Some(s2: StructType)) =>
+      val fieldTypes = s1.fields.map { f =>
+        val idx = s2.fieldNames.indexWhere(caseSensitiveResolution(_, f.name))

Review comment:
       `....map { f => ....indexWhere }` is widely used in Spark. I think it could simplify the code.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #34984:
URL: https://github.com/apache/spark/pull/34984#issuecomment-1000218333


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/146509/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34984:
URL: https://github.com/apache/spark/pull/34984#issuecomment-1000215638


   **[Test build #146509 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/146509/testReport)** for PR 34984 at commit [`49cf2c9`](https://github.com/apache/spark/commit/49cf2c98c5bec21e8aeec0a577759ea51679658f).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34984:
URL: https://github.com/apache/spark/pull/34984#issuecomment-1000622263


   **[Test build #146547 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/146547/testReport)** for PR 34984 at commit [`5d7eda6`](https://github.com/apache/spark/commit/5d7eda68b8b0bd7fd5f335ccec05afc91fd3abc8).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] beliefer commented on a change in pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

beliefer commented on a change in pull request #34984:
URL: https://github.com/apache/spark/pull/34984#discussion_r775850155



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala
##########
@@ -279,22 +277,40 @@ object OrcUtils extends Logging {
   }
 
   /**
-   * Given a `StructType` object, this methods converts it to corresponding string representation
-   * in ORC.
+   * Given two `StructType` object, this methods converts it to corresponding string representation
+   * in ORC. The second `StructType` used to change the `TimestampNTZType` as LongType in result
+   * schema string when reading `TimestampNTZ` as `TimestampLTZ`.
    */
-  def orcTypeDescriptionString(dt: DataType): String = dt match {
-    case s: StructType =>
+  def orcTypeDescriptionString(
+      dt: DataType, orcDt: Option[DataType] = None): String = (dt, orcDt) match {
+    case (s1: StructType, Some(s2: StructType)) =>
+      val fieldTypes = s1.fields.map { f =>
+        val idx = s2.fieldNames.indexWhere(caseSensitiveResolution(_, f.name))
+        if (idx == -1) {
+          s"${quoteIdentifier(f.name)}:${orcTypeDescriptionString(f.dataType)}"
+        } else {
+          s"${quoteIdentifier(f.name)}:" +
+            s"${orcTypeDescriptionString(f.dataType, Some(s2(idx).dataType))}"
+        }
+      }
+      s"struct<${fieldTypes.mkString(",")}>"
+    case (s: StructType, None) =>
       val fieldTypes = s.fields.map { f =>
         s"${quoteIdentifier(f.name)}:${orcTypeDescriptionString(f.dataType)}"
       }
       s"struct<${fieldTypes.mkString(",")}>"
-    case a: ArrayType =>
+    case (a1: ArrayType, Some(a2: ArrayType)) =>
+      s"array<${orcTypeDescriptionString(a1.elementType, Some(a2.elementType))}>"
+    case (a: ArrayType, None) =>
       s"array<${orcTypeDescriptionString(a.elementType)}>"
-    case m: MapType =>
+    case (m1: MapType, Some(m2: MapType)) =>
+      s"map<${orcTypeDescriptionString(m1.keyType, Some(m2.keyType))}," +
+        s"${orcTypeDescriptionString(m1.valueType, Some(m2.valueType))}>"
+    case (m: MapType, None) =>
       s"map<${orcTypeDescriptionString(m.keyType)},${orcTypeDescriptionString(m.valueType)}>"
-    case TimestampNTZType => TypeDescription.Category.TIMESTAMP.getName
-    case _: DayTimeIntervalType => LongType.catalogString
-    case _: YearMonthIntervalType => IntegerType.catalogString
+    case (_: DayTimeIntervalType | _: TimestampNTZType, _) => LongType.catalogString
+    case (_: YearMonthIntervalType, _) => IntegerType.catalogString
+    case (TimestampType, Some(TimestampNTZType)) => LongType.catalogString

Review comment:
       This should be OK. It converts narrow type to wide type is safe.
   Maybe we need to test it.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] beliefer commented on a change in pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

beliefer commented on a change in pull request #34984:
URL: https://github.com/apache/spark/pull/34984#discussion_r775850155



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala
##########
@@ -279,22 +277,40 @@ object OrcUtils extends Logging {
   }
 
   /**
-   * Given a `StructType` object, this methods converts it to corresponding string representation
-   * in ORC.
+   * Given two `StructType` object, this methods converts it to corresponding string representation
+   * in ORC. The second `StructType` used to change the `TimestampNTZType` as LongType in result
+   * schema string when reading `TimestampNTZ` as `TimestampLTZ`.
    */
-  def orcTypeDescriptionString(dt: DataType): String = dt match {
-    case s: StructType =>
+  def orcTypeDescriptionString(
+      dt: DataType, orcDt: Option[DataType] = None): String = (dt, orcDt) match {
+    case (s1: StructType, Some(s2: StructType)) =>
+      val fieldTypes = s1.fields.map { f =>
+        val idx = s2.fieldNames.indexWhere(caseSensitiveResolution(_, f.name))
+        if (idx == -1) {
+          s"${quoteIdentifier(f.name)}:${orcTypeDescriptionString(f.dataType)}"
+        } else {
+          s"${quoteIdentifier(f.name)}:" +
+            s"${orcTypeDescriptionString(f.dataType, Some(s2(idx).dataType))}"
+        }
+      }
+      s"struct<${fieldTypes.mkString(",")}>"
+    case (s: StructType, None) =>
       val fieldTypes = s.fields.map { f =>
         s"${quoteIdentifier(f.name)}:${orcTypeDescriptionString(f.dataType)}"
       }
       s"struct<${fieldTypes.mkString(",")}>"
-    case a: ArrayType =>
+    case (a1: ArrayType, Some(a2: ArrayType)) =>
+      s"array<${orcTypeDescriptionString(a1.elementType, Some(a2.elementType))}>"
+    case (a: ArrayType, None) =>
       s"array<${orcTypeDescriptionString(a.elementType)}>"
-    case m: MapType =>
+    case (m1: MapType, Some(m2: MapType)) =>
+      s"map<${orcTypeDescriptionString(m1.keyType, Some(m2.keyType))}," +
+        s"${orcTypeDescriptionString(m1.valueType, Some(m2.valueType))}>"
+    case (m: MapType, None) =>
       s"map<${orcTypeDescriptionString(m.keyType)},${orcTypeDescriptionString(m.valueType)}>"
-    case TimestampNTZType => TypeDescription.Category.TIMESTAMP.getName
-    case _: DayTimeIntervalType => LongType.catalogString
-    case _: YearMonthIntervalType => IntegerType.catalogString
+    case (_: DayTimeIntervalType | _: TimestampNTZType, _) => LongType.catalogString
+    case (_: YearMonthIntervalType, _) => IntegerType.catalogString
+    case (TimestampType, Some(TimestampNTZType)) => LongType.catalogString

Review comment:
       This should be OK. It converts narrow type to wide type is safe.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan commented on a change in pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on a change in pull request #34984:
URL: https://github.com/apache/spark/pull/34984#discussion_r774633892



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala
##########
@@ -372,12 +395,14 @@ object OrcUtils extends Logging {
       canPruneCols: Boolean,
       dataSchema: StructType,
       resultSchema: StructType,
+      orcCatalystSchema: StructType,
       partitionSchema: StructType,
       conf: Configuration): String = {
     val resultSchemaString = if (canPruneCols) {
-      OrcUtils.orcTypeDescriptionString(resultSchema)
+      OrcUtils.orcTypeDescriptionString(resultSchema, orcCatalystSchema)

Review comment:
       can we use `orcTypeDescription(resultSchema).toString`?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34984:
URL: https://github.com/apache/spark/pull/34984#issuecomment-1000720447


   **[Test build #146549 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/146549/testReport)** for PR 34984 at commit [`eab6532`](https://github.com/apache/spark/commit/eab6532b5313a3d8895ad9105c76c18be3da5f9c).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA removed a comment on pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #34984:
URL: https://github.com/apache/spark/pull/34984#issuecomment-1000622263


   **[Test build #146547 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/146547/testReport)** for PR 34984 at commit [`5d7eda6`](https://github.com/apache/spark/commit/5d7eda68b8b0bd7fd5f335ccec05afc91fd3abc8).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34984:
URL: https://github.com/apache/spark/pull/34984#issuecomment-999584114


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50956/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #34984:
URL: https://github.com/apache/spark/pull/34984#issuecomment-999601539


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/146484/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] HyukjinKwon commented on a change in pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on a change in pull request #34984:
URL: https://github.com/apache/spark/pull/34984#discussion_r774246633



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala
##########
@@ -535,11 +544,11 @@ object OrcUtils extends Logging {
       (ts.getNanos / NANOS_PER_MICROS) % MICROS_PER_MILLIS
   }
 
-  def toOrcNTZ(micros: Long): OrcTimestamp = {
-    val seconds = Math.floorDiv(micros, MICROS_PER_SECOND)
-    val nanos = (micros - seconds * MICROS_PER_SECOND) * NANOS_PER_MICROS
-    val result = new OrcTimestamp(seconds * MILLIS_PER_SECOND)
-    result.setNanos(nanos.toInt)
-    result
-  }
+//  def toOrcNTZ(micros: Long): OrcTimestamp = {

Review comment:
       Is this a mistake?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #34984:
URL: https://github.com/apache/spark/pull/34984#issuecomment-999678348


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/50960/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #34984:
URL: https://github.com/apache/spark/pull/34984#issuecomment-1000731943


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/146549/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan commented on a change in pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on a change in pull request #34984:
URL: https://github.com/apache/spark/pull/34984#discussion_r774632009



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcDeserializer.scala
##########
@@ -127,10 +127,12 @@ class OrcDeserializer(
         updater.setInt(ordinal, OrcShimUtils.getGregorianDays(value))
 
       case TimestampType => (ordinal, value) =>
-        updater.setLong(ordinal, DateTimeUtils.fromJavaTimestamp(value.asInstanceOf[OrcTimestamp]))
-
-      case TimestampNTZType => (ordinal, value) =>
-        updater.setLong(ordinal, OrcUtils.fromOrcNTZ(value.asInstanceOf[OrcTimestamp]))
+        if (value.isInstanceOf[OrcTimestamp]) {
+          updater.setLong(ordinal,
+            DateTimeUtils.fromJavaTimestamp(value.asInstanceOf[OrcTimestamp]))
+        } else {
+          updater.setLong(ordinal, value.asInstanceOf[LongWritable].get)

Review comment:
       when we read NTZ as LTZ?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34984:
URL: https://github.com/apache/spark/pull/34984#issuecomment-1000626943


   **[Test build #146549 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/146549/testReport)** for PR 34984 at commit [`eab6532`](https://github.com/apache/spark/commit/eab6532b5313a3d8895ad9105c76c18be3da5f9c).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #34984:
URL: https://github.com/apache/spark/pull/34984#issuecomment-1000669070


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/51024/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34984:
URL: https://github.com/apache/spark/pull/34984#issuecomment-1000666226


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/51024/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #34984:
URL: https://github.com/apache/spark/pull/34984#issuecomment-1000708385


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/146544/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA removed a comment on pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #34984:
URL: https://github.com/apache/spark/pull/34984#issuecomment-999591573


   **[Test build #146484 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/146484/testReport)** for PR 34984 at commit [`c8f07f1`](https://github.com/apache/spark/commit/c8f07f13284bce88bf78c7525ec9c5e27266b7c6).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA removed a comment on pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #34984:
URL: https://github.com/apache/spark/pull/34984#issuecomment-1000116963


   **[Test build #146509 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/146509/testReport)** for PR 34984 at commit [`49cf2c9`](https://github.com/apache/spark/commit/49cf2c98c5bec21e8aeec0a577759ea51679658f).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34984:
URL: https://github.com/apache/spark/pull/34984#issuecomment-999629963


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50956/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #34984:
URL: https://github.com/apache/spark/pull/34984#issuecomment-999630631






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34984:
URL: https://github.com/apache/spark/pull/34984#issuecomment-1000253944


   **[Test build #146518 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/146518/testReport)** for PR 34984 at commit [`49cf2c9`](https://github.com/apache/spark/commit/49cf2c98c5bec21e8aeec0a577759ea51679658f).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] beliefer commented on a change in pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

beliefer commented on a change in pull request #34984:
URL: https://github.com/apache/spark/pull/34984#discussion_r783737340



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala
##########
@@ -279,22 +275,40 @@ object OrcUtils extends Logging {
   }
 
   /**
-   * Given a `StructType` object, this methods converts it to corresponding string representation
-   * in ORC.
+   * Given two `StructType` object, this methods converts it to corresponding string representation
+   * in ORC. The second `StructType` used to change the `TimestampNTZType` as LongType in result
+   * schema string when reading `TimestampNTZ` as `TimestampLTZ`.
    */
-  def orcTypeDescriptionString(dt: DataType): String = dt match {
-    case s: StructType =>
+  def getOrcSchemaString(
+      dt: DataType, orcDt: Option[DataType] = None): String = (dt, orcDt) match {
+    case (s1: StructType, Some(s2: StructType)) =>
+      val fieldTypes = s1.fields.map { f =>
+        val idx = s2.fieldNames.indexWhere(caseSensitiveResolution(_, f.name))

Review comment:
       OK. Updated.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan commented on a change in pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on a change in pull request #34984:
URL: https://github.com/apache/spark/pull/34984#discussion_r778659565



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala
##########
@@ -142,10 +142,10 @@ class OrcFileFormat
 
       val fs = filePath.getFileSystem(conf)
       val readerOptions = OrcFile.readerOptions(conf).filesystem(fs)
-      val resultedColPruneInfo =
+      val (resultedColPruneInfo, reader) =
         Utils.tryWithResource(OrcFile.createReader(filePath, readerOptions)) { reader =>
-          OrcUtils.requestedColumnIds(
-            isCaseSensitive, dataSchema, requiredSchema, reader, conf)
+          (OrcUtils.requestedColumnIds(
+            isCaseSensitive, dataSchema, requiredSchema, reader, conf), reader)

Review comment:
       nit
   ```
   val orcSchema = Utils.tryWithResource(...) { reader.getSchema }
   val resultedColPruneInfo = OrcUtils.requestedColumnIds(..., orcSchema)
   ...
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan commented on a change in pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on a change in pull request #34984:
URL: https://github.com/apache/spark/pull/34984#discussion_r818433461



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala
##########
@@ -372,12 +385,14 @@ object OrcUtils extends Logging {
       canPruneCols: Boolean,
       dataSchema: StructType,
       resultSchema: StructType,
+      orcCatalystSchema: StructType,
       partitionSchema: StructType,
       conf: Configuration): String = {
     val resultSchemaString = if (canPruneCols) {
-      OrcUtils.orcTypeDescriptionString(resultSchema)

Review comment:
       so the code change can be
   ```
   val actualResultSchema = replaceNTZType(resultSchema)
   ... the original code
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan closed pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

cloud-fan closed pull request #34984:
URL: https://github.com/apache/spark/pull/34984


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #34984:
URL: https://github.com/apache/spark/pull/34984#issuecomment-999601539


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/146484/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34984:
URL: https://github.com/apache/spark/pull/34984#issuecomment-1000278948


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50994/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #34984:
URL: https://github.com/apache/spark/pull/34984#issuecomment-1000637377


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/51019/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA removed a comment on pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #34984:
URL: https://github.com/apache/spark/pull/34984#issuecomment-1000605842


   **[Test build #146544 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/146544/testReport)** for PR 34984 at commit [`443ea93`](https://github.com/apache/spark/commit/443ea93874cd285a9c555b46ee8d47126684f3e3).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #34984:
URL: https://github.com/apache/spark/pull/34984#issuecomment-1000708385


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/146544/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #34984:
URL: https://github.com/apache/spark/pull/34984#issuecomment-1000714415


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/146547/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #34984:
URL: https://github.com/apache/spark/pull/34984#issuecomment-1000218333


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/146509/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #34984:
URL: https://github.com/apache/spark/pull/34984#issuecomment-1000422571


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/146518/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #34984:
URL: https://github.com/apache/spark/pull/34984#issuecomment-999590550


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/146478/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34984:
URL: https://github.com/apache/spark/pull/34984#issuecomment-999633177


   **[Test build #146486 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/146486/testReport)** for PR 34984 at commit [`e744dc0`](https://github.com/apache/spark/commit/e744dc086b093cc1487e3775a8b6753ee99cea1b).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #34984:
URL: https://github.com/apache/spark/pull/34984#issuecomment-999630631






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #34984:
URL: https://github.com/apache/spark/pull/34984#issuecomment-1000178133


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/50985/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #34984:
URL: https://github.com/apache/spark/pull/34984#issuecomment-1000653793


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/51022/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #34984:
URL: https://github.com/apache/spark/pull/34984#issuecomment-1000731943


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/146549/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan commented on a change in pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on a change in pull request #34984:
URL: https://github.com/apache/spark/pull/34984#discussion_r774632302



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcDeserializer.scala
##########
@@ -127,10 +127,12 @@ class OrcDeserializer(
         updater.setInt(ordinal, OrcShimUtils.getGregorianDays(value))
 
       case TimestampType => (ordinal, value) =>
-        updater.setLong(ordinal, DateTimeUtils.fromJavaTimestamp(value.asInstanceOf[OrcTimestamp]))
-
-      case TimestampNTZType => (ordinal, value) =>
-        updater.setLong(ordinal, OrcUtils.fromOrcNTZ(value.asInstanceOf[OrcTimestamp]))
+        if (value.isInstanceOf[OrcTimestamp]) {
+          updater.setLong(ordinal,
+            DateTimeUtils.fromJavaTimestamp(value.asInstanceOf[OrcTimestamp]))
+        } else {
+          updater.setLong(ordinal, value.asInstanceOf[LongWritable].get)

Review comment:
       then shall we shift the long value w.r.t. the timezone?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34984:
URL: https://github.com/apache/spark/pull/34984#issuecomment-1000174249


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50985/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan commented on a change in pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on a change in pull request #34984:
URL: https://github.com/apache/spark/pull/34984#discussion_r778851718



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcDeserializer.scala
##########
@@ -127,10 +127,16 @@ class OrcDeserializer(
         updater.setInt(ordinal, OrcShimUtils.getGregorianDays(value))
 
       case TimestampType => (ordinal, value) =>
-        updater.setLong(ordinal, DateTimeUtils.fromJavaTimestamp(value.asInstanceOf[OrcTimestamp]))
-
-      case TimestampNTZType => (ordinal, value) =>
-        updater.setLong(ordinal, OrcUtils.fromOrcNTZ(value.asInstanceOf[OrcTimestamp]))
+        // When reading as Spark LTZ, the ORC value can be either `OrcTimestamp`
+        // (which maps to Spark LTZ) or `LongWritable` (which maps to Spark NTZ)
+        if (value.isInstanceOf[OrcTimestamp]) {
+          // Spark takes Orc timestamp as timestampLTZ, so read it as the Timestamp.

Review comment:
       we don't need the comments inside if-else now




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan commented on a change in pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on a change in pull request #34984:
URL: https://github.com/apache/spark/pull/34984#discussion_r823685909



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala
##########
@@ -280,22 +276,41 @@ object OrcUtils extends Logging {
   }
 
   /**
-   * Given a `StructType` object, this methods converts it to corresponding string representation
-   * in ORC.
+   * Given two `StructType` object, this methods converts it to corresponding string representation
+   * in ORC. The second `StructType` used to change the `TimestampNTZType` as LongType in result
+   * schema string when reading `TimestampNTZ` as `TimestampLTZ`.
    */
-  def orcTypeDescriptionString(dt: DataType): String = dt match {
-    case s: StructType =>
+  def getOrcSchemaString(
+      dt: DataType, orcDtOpt: Option[DataType] = None): String = (dt, orcDtOpt) match {
+    case (s1: StructType, Some(s2: StructType)) =>
+      val orcDataTypeMap = s2.groupBy(_.name)
+      val fieldTypes = s1.fields.map { f =>
+        if (orcDataTypeMap.contains(f.name)) {
+          val orcFields = orcDataTypeMap(f.name)

Review comment:
       This looks overly complicated, as we need to consider column reorder which is part of schema revolution that should be taken care of by ORC today.
   
   How about we don't allow reading NTZ as LTZ for now? We can support it later when Spark controls the schema evolution for ORC. Then the code can be very simply: just replace TimestampNTZType with LongType in the catalyst schema.

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala
##########
@@ -280,22 +276,41 @@ object OrcUtils extends Logging {
   }
 
   /**
-   * Given a `StructType` object, this methods converts it to corresponding string representation
-   * in ORC.
+   * Given two `StructType` object, this methods converts it to corresponding string representation
+   * in ORC. The second `StructType` used to change the `TimestampNTZType` as LongType in result
+   * schema string when reading `TimestampNTZ` as `TimestampLTZ`.
    */
-  def orcTypeDescriptionString(dt: DataType): String = dt match {
-    case s: StructType =>
+  def getOrcSchemaString(
+      dt: DataType, orcDtOpt: Option[DataType] = None): String = (dt, orcDtOpt) match {
+    case (s1: StructType, Some(s2: StructType)) =>
+      val orcDataTypeMap = s2.groupBy(_.name)
+      val fieldTypes = s1.fields.map { f =>
+        if (orcDataTypeMap.contains(f.name)) {
+          val orcFields = orcDataTypeMap(f.name)

Review comment:
       This looks overly complicated, as we need to consider column reorder which is part of schema revolution that should be taken care of by ORC today.
   
   How about we don't allow reading NTZ as LTZ for now? We can support it later when Spark controls the schema evolution for ORC. Then the code can be very simple: just replace TimestampNTZType with LongType in the catalyst schema.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan commented on a change in pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on a change in pull request #34984:
URL: https://github.com/apache/spark/pull/34984#discussion_r779599749



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala
##########
@@ -142,10 +142,11 @@ class OrcFileFormat
 
       val fs = filePath.getFileSystem(conf)
       val readerOptions = OrcFile.readerOptions(conf).filesystem(fs)
-      val resultedColPruneInfo =
+      val (resultedColPruneInfo, orcSchema) =

Review comment:
       nit: don't use a tuple here
   ```
   val orcSchema = Utils.tryWithResource...
   val resultedColPruneInfo = ...
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] bersprockets commented on a change in pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

bersprockets commented on a change in pull request #34984:
URL: https://github.com/apache/spark/pull/34984#discussion_r782613339



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala
##########
@@ -279,22 +275,40 @@ object OrcUtils extends Logging {
   }
 
   /**
-   * Given a `StructType` object, this methods converts it to corresponding string representation
-   * in ORC.
+   * Given two `StructType` object, this methods converts it to corresponding string representation
+   * in ORC. The second `StructType` used to change the `TimestampNTZType` as LongType in result
+   * schema string when reading `TimestampNTZ` as `TimestampLTZ`.
    */
-  def orcTypeDescriptionString(dt: DataType): String = dt match {
-    case s: StructType =>
+  def getOrcSchemaString(
+      dt: DataType, orcDt: Option[DataType] = None): String = (dt, orcDt) match {
+    case (s1: StructType, Some(s2: StructType)) =>
+      val fieldTypes = s1.fields.map { f =>
+        val idx = s2.fieldNames.indexWhere(caseSensitiveResolution(_, f.name))

Review comment:
       Is this a potential O(n^2) issue? If I am reading correctly, it says "for each field in s1, scan s2 to find the matching field".
   
   Probably not a noticeable issue under most circumstances. But for wide structs with many input files, it could add up.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan commented on a change in pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on a change in pull request #34984:
URL: https://github.com/apache/spark/pull/34984#discussion_r834037127



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcQuerySuite.scala
##########
@@ -803,6 +804,35 @@ abstract class OrcQuerySuite extends OrcQueryTest with SharedSparkSession {
       }
     }
   }
+
+  test("SPARK-37463: read/write Timestamp ntz to Orc with different time zone") {
+    val localTimeZone = TimeZone.getDefault
+    try {
+      TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles"))
+
+      val sqlText = """
+                      |select
+                      | timestamp_ntz '2021-06-01 00:00:00' ts_ntz1,
+                      | timestamp_ntz '1883-11-16 00:00:00.0' as ts_ntz2,
+                      | timestamp_ntz '2021-03-14 02:15:00.0' as ts_ntz3
+                      |""".stripMargin
+
+      val df = sql(sqlText)
+
+      df.write.mode("overwrite").orc("ts_ntz_orc")
+
+      val query = "select * from `orc`.`ts_ntz_orc`"
+
+      Seq("America/Los_Angeles", "UTC", "Europe/Amsterdam").foreach { tz =>

Review comment:
       Let's test `DateTimeUtils.outstandingZoneIds`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan commented on a change in pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on a change in pull request #34984:
URL: https://github.com/apache/spark/pull/34984#discussion_r834033725



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/orc/OrcPartitionReaderFactory.scala
##########
@@ -88,17 +88,16 @@ case class OrcPartitionReaderFactory(
     }
     val filePath = new Path(new URI(file.filePath))
 
-    val resultedColPruneInfo =
-      Utils.tryWithResource(createORCReader(filePath, conf)) { reader =>
-        OrcUtils.requestedColumnIds(
-          isCaseSensitive, dataSchema, readDataSchema, reader, conf)
-      }
+    val orcSchema = Utils.tryWithResource(createORCReader(filePath, conf))(_.getSchema)
+    val resultedColPruneInfo = OrcUtils.requestedColumnIds(
+      isCaseSensitive, dataSchema, readDataSchema, orcSchema, conf)
 
     if (resultedColPruneInfo.isEmpty) {
       new EmptyPartitionReader[InternalRow]
     } else {
       val (requestedColIds, canPruneCols) = resultedColPruneInfo.get
-      OrcUtils.orcResultSchemaString(canPruneCols, dataSchema, resultSchema, partitionSchema, conf)
+      OrcUtils.orcResultSchemaString(canPruneCols,
+        dataSchema, resultSchema, partitionSchema, conf)

Review comment:
       ```suggestion
         OrcUtils.orcResultSchemaString(canPruneCols, dataSchema, resultSchema, partitionSchema, conf)
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #34984:
URL: https://github.com/apache/spark/pull/34984#issuecomment-999590550


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/146478/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #34984:
URL: https://github.com/apache/spark/pull/34984#issuecomment-999589618


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/50954/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34984:
URL: https://github.com/apache/spark/pull/34984#issuecomment-999626358


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50960/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #34984:
URL: https://github.com/apache/spark/pull/34984#issuecomment-999720464






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan commented on a change in pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on a change in pull request #34984:
URL: https://github.com/apache/spark/pull/34984#discussion_r774634386



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala
##########
@@ -372,12 +395,14 @@ object OrcUtils extends Logging {
       canPruneCols: Boolean,
       dataSchema: StructType,
       resultSchema: StructType,
+      orcCatalystSchema: StructType,
       partitionSchema: StructType,
       conf: Configuration): String = {
     val resultSchemaString = if (canPruneCols) {
-      OrcUtils.orcTypeDescriptionString(resultSchema)
+      OrcUtils.orcTypeDescriptionString(resultSchema, orcCatalystSchema)

Review comment:
       I'm confused. Why not update the existing `orcTypeDescriptionString` to map NTZ to ORC int64, instead of adding a new overload?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #34984:
URL: https://github.com/apache/spark/pull/34984#issuecomment-1000669070


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/51024/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #34984:
URL: https://github.com/apache/spark/pull/34984#issuecomment-1000637377


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/51019/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #34984:
URL: https://github.com/apache/spark/pull/34984#issuecomment-999589618


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/50954/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34984:
URL: https://github.com/apache/spark/pull/34984#issuecomment-1000621055


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/51019/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #34984:
URL: https://github.com/apache/spark/pull/34984#issuecomment-1000714415


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/146547/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #34984: [WIP][SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34984:
URL: https://github.com/apache/spark/pull/34984#issuecomment-999517173


   **[Test build #146478 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/146478/testReport)** for PR 34984 at commit [`f393296`](https://github.com/apache/spark/commit/f393296fd7d93f2f047a9d0b9b415d7e92a79ea3).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34984:
URL: https://github.com/apache/spark/pull/34984#issuecomment-999601462


   **[Test build #146484 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/146484/testReport)** for PR 34984 at commit [`c8f07f1`](https://github.com/apache/spark/commit/c8f07f13284bce88bf78c7525ec9c5e27266b7c6).
    * This patch **fails Java style tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] beliefer commented on pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

beliefer commented on pull request #34984:
URL: https://github.com/apache/spark/pull/34984#issuecomment-1000248896


   ping @gengliangwang @MaxGekk cc @cloud-fan 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #34984: [SPARK-37463][SQL] Read/Write Timestamp ntz from/to Orc uses int64

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #34984:
URL: https://github.com/apache/spark/pull/34984#issuecomment-1000653793


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/51022/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org