Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/10/20 09:51:30 UTC

[GitHub] [spark] awdavidson opened a new pull request, #38312: [SPARK-40819][SQL] Timestamp nanos behaviour regression

awdavidson opened a new pull request, #38312:
URL: https://github.com/apache/spark/pull/38312

   
   ### What changes were proposed in this pull request?
   Handle `TimeUnit.NANOS` for Parquet `Timestamp` columns, addressing a behaviour regression introduced in 3.2.
   
   
   ### Why are the changes needed?
   Since version 3.2, reading Parquet files that contain attributes of type `TIMESTAMP(NANOS,true)` is not possible, because `ParquetSchemaConverter` throws:
   ```
   Caused by: org.apache.spark.sql.AnalysisException: Illegal Parquet type: INT64 (TIMESTAMP(NANOS,true))
   ```
   https://issues.apache.org/jira/browse/SPARK-34661 introduced a change that matches on the `LogicalTypeAnnotation`, which only covers the Timestamp cases for `TimeUnit.MILLIS` and `TimeUnit.MICROS`, so `TimeUnit.NANOS` falls through to `illegalType()`.
   
   Prior to 3.2 the matching used the `originalType`, which for `TIMESTAMP(NANOS,true)` returns `null` and therefore resulted in a `LongType`. The proposed change is to also handle `TimeUnit.NANOS` and return `LongType`, making the behaviour the same as before.
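
   A minimal sketch of the proposed handling in `ParquetToSparkSchemaConverter` (illustrative only; the merged change additionally guards this behind the `spark.sql.legacy.parquet.nanosAsLong` flag discussed later in the review):

   ```
   case timestamp: TimestampLogicalTypeAnnotation if timestamp.getUnit == TimeUnit.NANOS =>
     // SPARK-40819: nanos are not supported as a Spark timestamp type; fall back to the
     // pre-3.2 behaviour and expose the raw INT64 value as a LongType column.
     LongType
   ```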
   
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   Added a unit test covering this scenario.
   Also deployed internally to read Parquet files that contain `TIMESTAMP(NANOS,true)`.




[GitHub] [spark] awdavidson closed pull request #38312: [SPARK-40819][SQL] Timestamp nanos behaviour regression

Posted by GitBox <gi...@apache.org>.
awdavidson closed pull request #38312: [SPARK-40819][SQL] Timestamp nanos behaviour regression
URL: https://github.com/apache/spark/pull/38312




[GitHub] [spark] EnricoMi commented on pull request #38312: [SPARK-40819][SQL] Timestamp nanos behaviour regression

Posted by GitBox <gi...@apache.org>.
EnricoMi commented on PR #38312:
URL: https://github.com/apache/spark/pull/38312#issuecomment-1376923991

   @cloud-fan @LuciferYang @xinrong-meng any chance we can get this regression fix into Spark 3.4?




[GitHub] [spark] cloud-fan commented on pull request #38312: [SPARK-40819][SQL] Timestamp nanos behaviour regression

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on PR #38312:
URL: https://github.com/apache/spark/pull/38312#issuecomment-1288497689

   Shall we make it a general strategy? If we don't recognize the logical Parquet type, just ignore it and read the column as its underlying physical type. cc @sadikovi

   One problem I can see is poor backward compatibility. Once we support a new logical type in the future, we must make a breaking change, as the reader will return a different data type.
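
   A hedged sketch of that idea in Scala (illustrative only, not what this PR implements): if the logical type annotation is not recognised, fall back to a mapping based on the column's physical type.

   ```
   import org.apache.parquet.schema.PrimitiveType
   import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName._
   import org.apache.spark.sql.types._

   // Hypothetical fallback: when the logical type annotation is unknown,
   // expose the column using its underlying physical type instead of failing.
   def fallbackToPhysicalType(primitive: PrimitiveType): DataType =
     primitive.getPrimitiveTypeName match {
       case INT32  => IntegerType
       case INT64  => LongType   // e.g. TIMESTAMP(NANOS,*) would surface as a long
       case FLOAT  => FloatType
       case DOUBLE => DoubleType
       case BINARY => BinaryType
       case other  => throw new IllegalArgumentException(s"Unsupported physical type: $other")
     }
   ```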




[GitHub] [spark] awdavidson commented on a diff in pull request #38312: [SPARK-40819][SQL] Timestamp nanos behaviour regression

Posted by "awdavidson (via GitHub)" <gi...@apache.org>.
awdavidson commented on code in PR #38312:
URL: https://github.com/apache/spark/pull/38312#discussion_r1095663349


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -4946,6 +4953,8 @@ class SQLConf extends Serializable with Logging {
 
   def parquetInferTimestampNTZEnabled: Boolean = getConf(PARQUET_INFER_TIMESTAMP_NTZ_ENABLED)
 
+  def parquetTimestampNTZEnabled: Boolean = getConf(PARQUET_TIMESTAMP_NTZ_ENABLED)
+

Review Comment:
   Thank you, this has now been addressed.





[GitHub] [spark] EnricoMi commented on a diff in pull request #38312: [SPARK-40819][SQL] Timestamp nanos behaviour regression

Posted by GitBox <gi...@apache.org>.
EnricoMi commented on code in PR #38312:
URL: https://github.com/apache/spark/pull/38312#discussion_r1001531184


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala:
##########
@@ -271,6 +271,8 @@ class ParquetToSparkSchemaConverter(
             } else {
               TimestampNTZType
             }
+          case timestamp: TimestampLogicalTypeAnnotation if timestamp.getUnit == TimeUnit.NANOS =>

Review Comment:
   So this is not a valid workaround. @sunchao, how do you feel about restoring the earlier, useful behaviour?





[GitHub] [spark] cloud-fan commented on a diff in pull request #38312: [SPARK-40819][SQL] Timestamp nanos behaviour regression

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #38312:
URL: https://github.com/apache/spark/pull/38312#discussion_r1002884323


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala:
##########
@@ -271,6 +271,10 @@ class ParquetToSparkSchemaConverter(
             } else {
               TimestampNTZType
             }
+          // SPARK-40819: NANOS are not supported as a Timestamp, convert to LongType without
+          // timezone awareness to address behaviour regression introduced by SPARK-34661

Review Comment:
   Can we do a truncation and still read it as a timestamp type?





[GitHub] [spark] EnricoMi commented on pull request #38312: [SPARK-40819][SQL] Timestamp nanos behaviour regression

Posted by GitBox <gi...@apache.org>.
EnricoMi commented on PR #38312:
URL: https://github.com/apache/spark/pull/38312#issuecomment-1288553317

   > One problem I can see is bad backward compatibility.
   
   Meaning once we make Spark 3.x return longs here, supporting nano Timestamps must be moved to Spark 4.x?
   
   But with a configuration flag like `spark.sql.parquet.timestampNTZ.enabled`, the user can control the break and Spark 3.x is free to support this in the future.




[GitHub] [spark] awdavidson commented on a diff in pull request #38312: [SPARK-40819][SQL] Timestamp nanos behaviour regression

Posted by GitBox <gi...@apache.org>.
awdavidson commented on code in PR #38312:
URL: https://github.com/apache/spark/pull/38312#discussion_r1025412863


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaSuite.scala:
##########
@@ -1040,6 +1040,14 @@ class ParquetSchemaSuite extends ParquetSchemaTest {
     }
   }
 
+  test("SPARK-40819 - ability to read parquet file with TIMESTAMP(NANOS, true)") {
+    val testDataPath = getClass.getResource("/test-data/timestamp-nanos.parquet")
+    val data = spark.read.parquet(testDataPath.toString).select("birthday")
+
+    assert(data.schema.fields.head.dataType == LongType)
+    assert(data.take(1).head.getAs[Long](0) == 1668537129000000000L)

Review Comment:
   All rows have the same timestamp





[GitHub] [spark] awdavidson commented on a diff in pull request #38312: [SPARK-40819][SQL] Timestamp nanos behaviour regression

Posted by GitBox <gi...@apache.org>.
awdavidson commented on code in PR #38312:
URL: https://github.com/apache/spark/pull/38312#discussion_r1001477809


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala:
##########
@@ -271,6 +271,8 @@ class ParquetToSparkSchemaConverter(
             } else {
               TimestampNTZType
             }
+          case timestamp: TimestampLogicalTypeAnnotation if timestamp.getUnit == TimeUnit.NANOS =>

Review Comment:
   @EnricoMi so it is possible to use `spark.read.schema(..)` as a workaround; however, you lose functionality like `mergeSchema`, which automatically handles schema evolution, and you potentially need to know the entire schema up front if all columns are required. Also, for other consumers/users, especially in the exploratory analysis space, it requires them to have a better understanding of the underlying data structure before they can use it, and this gets more difficult when the file is extremely wide.
   
   I can imagine developers creating other ways to avoid the nuisance, which seems a bit crazy considering that the functionality already exists.
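
   For context, a minimal sketch of the `spark.read.schema(..)` workaround being discussed (column and path names are hypothetical):

   ```
   // Supplying the schema explicitly sidesteps schema inference, so the NANOS column
   // comes back as a plain long, but options such as mergeSchema no longer help with
   // schema evolution and the full schema has to be known up front.
   val df = spark.read
     .schema("birthday LONG, id STRING")
     .parquet("/data/events-nanos.parquet")
   ```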





[GitHub] [spark] EnricoMi commented on a diff in pull request #38312: [SPARK-40819][SQL] Timestamp nanos behaviour regression

Posted by GitBox <gi...@apache.org>.
EnricoMi commented on code in PR #38312:
URL: https://github.com/apache/spark/pull/38312#discussion_r1017736181


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaSuite.scala:
##########
@@ -198,6 +205,31 @@ abstract class ParquetSchemaTest extends ParquetTest with SharedSparkSession {
 }
 
 class ParquetSchemaInferenceSuite extends ParquetSchemaTest {
+  testSchemaInference[Tuple1[Long]](

Review Comment:
   @cloud-fan moving Parquet from `1.10.1` to `1.12.3` introduced this regression, where Spark 3.1 returned `LongType` and Spark 3.2 fails with an illegal type.





[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38312: [SPARK-40819][SQL] Timestamp nanos behaviour regression

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #38312:
URL: https://github.com/apache/spark/pull/38312#discussion_r1099564159


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -3775,6 +3775,13 @@ object SQLConf {
       .booleanConf
       .createWithDefault(false)
 
+  val LEGACY_PARQUET_NANOS_AS_LONG = buildConf("spark.sql.legacy.parquet.nanosAsLong")
+    .internal()
+    .doc("When true, the Parquet's nanos precision timestamps are converted to SQL long values.")
+    .version("3.2.3")

Review Comment:
   @awdavidson I realised that we have already released Spark 3.2.3.
   
   Can you make a PR to change this to 3.2.4?
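
   For reference, a hedged usage sketch of the legacy flag added in this diff (the path is hypothetical; the flag is internal and defaults to false):

   ```
   spark.conf.set("spark.sql.legacy.parquet.nanosAsLong", "true")
   // With the flag enabled, TIMESTAMP(NANOS,*) columns are converted to LongType
   // instead of failing schema conversion with an "Illegal Parquet type" error.
   val df = spark.read.parquet("/data/events-nanos.parquet")
   ```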





[GitHub] [spark] sunchao commented on a diff in pull request #38312: [SPARK-40819][SQL] Timestamp nanos behaviour regression

Posted by GitBox <gi...@apache.org>.
sunchao commented on code in PR #38312:
URL: https://github.com/apache/spark/pull/38312#discussion_r1002187497


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala:
##########
@@ -271,6 +271,8 @@ class ParquetToSparkSchemaConverter(
             } else {
               TimestampNTZType
             }
+          case timestamp: TimestampLogicalTypeAnnotation if timestamp.getUnit == TimeUnit.NANOS =>

Review Comment:
   Makes sense. I'm OK with it.





[GitHub] [spark] EnricoMi commented on a diff in pull request #38312: [SPARK-40819][SQL] Timestamp nanos behaviour regression

Posted by GitBox <gi...@apache.org>.
EnricoMi commented on code in PR #38312:
URL: https://github.com/apache/spark/pull/38312#discussion_r1025422089


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaSuite.scala:
##########
@@ -1040,6 +1040,14 @@ class ParquetSchemaSuite extends ParquetSchemaTest {
     }
   }
 
+  test("SPARK-40819 - ability to read parquet file with TIMESTAMP(NANOS, true)") {
+    val testDataPath = getClass.getResource("/test-data/timestamp-nanos.parquet")
+    val data = spark.read.parquet(testDataPath.toString).select("birthday")
+
+    assert(data.schema.fields.head.dataType == LongType)
+    assert(data.take(1).head.getAs[Long](0) == 1668537129000000000L)

Review Comment:
   Can we have true nanosecond values, to be sure they are not zeroed by the loader but loaded as in the file, e.g. `1665532812012345678L`?





[GitHub] [spark] EnricoMi commented on a diff in pull request #38312: [SPARK-40819][SQL] Timestamp nanos behaviour regression

Posted by GitBox <gi...@apache.org>.
EnricoMi commented on code in PR #38312:
URL: https://github.com/apache/spark/pull/38312#discussion_r1025413969


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaSuite.scala:
##########
@@ -1040,6 +1040,14 @@ class ParquetSchemaSuite extends ParquetSchemaTest {
     }
   }
 
+  test("SPARK-40819 - ability to read parquet file with TIMESTAMP(NANOS, true)") {

Review Comment:
   nit: other tests in this file follow the naming convention "SPARK-XXXXX: ..."
   
   ```suggestion
     test("SPARK-40819: ability to read parquet file with TIMESTAMP(NANOS, true)") {
   ```





[GitHub] [spark] awdavidson commented on a diff in pull request #38312: [SPARK-40819][SQL] Timestamp nanos behaviour regression

Posted by GitBox <gi...@apache.org>.
awdavidson commented on code in PR #38312:
URL: https://github.com/apache/spark/pull/38312#discussion_r1000260756


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala:
##########
@@ -271,6 +271,8 @@ class ParquetToSparkSchemaConverter(
             } else {
               TimestampNTZType
             }
+          case timestamp: TimestampLogicalTypeAnnotation if timestamp.getUnit == TimeUnit.NANOS =>

Review Comment:
   A comment has been added.
   
   I agree, full support is a breaking change, and my aim for now is to address the regression rather than introduce new functionality/support, as that would require further validation and testing of other components, e.g. time-based functions. However, it is definitely something that should be considered in future development.





[GitHub] [spark] sunchao commented on a diff in pull request #38312: [SPARK-40819][SQL] Timestamp nanos behaviour regression

Posted by GitBox <gi...@apache.org>.
sunchao commented on code in PR #38312:
URL: https://github.com/apache/spark/pull/38312#discussion_r1000910156


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala:
##########
@@ -271,6 +271,8 @@ class ParquetToSparkSchemaConverter(
             } else {
               TimestampNTZType
             }
+          case timestamp: TimestampLogicalTypeAnnotation if timestamp.getUnit == TimeUnit.NANOS =>

Review Comment:
   I see. It's a bit weird though that we read nanos as long, but not as timestamp. I'm not sure whether this should be considered a regression, or whether the previous behavior before 3.2 was merely unintended.





[GitHub] [spark] HyukjinKwon commented on pull request #38312: [SPARK-40819][SQL] Timestamp nanos behaviour regression

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on PR #38312:
URL: https://github.com/apache/spark/pull/38312#issuecomment-1415376864

   Mind rebasing this please? Let's get this in.




[GitHub] [spark] LuciferYang commented on a diff in pull request #38312: [SPARK-40819][SQL] Timestamp nanos behaviour regression

Posted by GitBox <gi...@apache.org>.
LuciferYang commented on code in PR #38312:
URL: https://github.com/apache/spark/pull/38312#discussion_r1025353875


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaSuite.scala:
##########
@@ -1040,6 +1040,14 @@ class ParquetSchemaSuite extends ParquetSchemaTest {
     }
   }
 
+  test("SPARK-40819 - ability to read parquet file with TIMESTAMP(NANOS, true)") {
+    val testDataPath = getClass.getResource("/test-data/timestamp-nanos.parquet")

Review Comment:
   Can we reuse 
   
   https://github.com/apache/spark/blob/f01a8db4bcf09c4975029e85722053ff82f8a355/sql/core/src/test/scala/org/apache/spark/sql/test/SQLTestUtils.scala#L462-L467





[GitHub] [spark] awdavidson commented on a diff in pull request #38312: [SPARK-40819][SQL] Timestamp nanos behaviour regression

Posted by GitBox <gi...@apache.org>.
awdavidson commented on code in PR #38312:
URL: https://github.com/apache/spark/pull/38312#discussion_r1000890475


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala:
##########
@@ -271,6 +271,8 @@ class ParquetToSparkSchemaConverter(
             } else {
               TimestampNTZType
             }
+          case timestamp: TimestampLogicalTypeAnnotation if timestamp.getUnit == TimeUnit.NANOS =>

Review Comment:
   @sunchao so for type such as
   
   ```
   message root {
     required int64 _1 (TIMESTAMP(NANOS,true));
   }
   ```
   
   the `descriptor.getPrimitiveType().getPrimitiveTypeName()` is `INT64`; as the `sparkType` is `LongType`, it returns `new LongUpdater()`
   
   https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/ParquetVectorUpdaterFactory.java#L105
   
   Therefore no change is required from this perspective, as it continues to be handled in the same way as prior to https://github.com/apache/spark/pull/31776.





[GitHub] [spark] EnricoMi commented on a diff in pull request #38312: [SPARK-40819][SQL] Timestamp nanos behaviour regression

Posted by "EnricoMi (via GitHub)" <gi...@apache.org>.
EnricoMi commented on code in PR #38312:
URL: https://github.com/apache/spark/pull/38312#discussion_r1095651650


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -4946,6 +4953,8 @@ class SQLConf extends Serializable with Logging {
 
   def parquetInferTimestampNTZEnabled: Boolean = getConf(PARQUET_INFER_TIMESTAMP_NTZ_ENABLED)
 
+  def parquetTimestampNTZEnabled: Boolean = getConf(PARQUET_TIMESTAMP_NTZ_ENABLED)
+

Review Comment:
   Looks like this sneaked in while rebasing...
   
   Renamed recently here: https://github.com/apache/spark/commit/ae24327e8984e7ee8ba948e3be0b6b2f28df68d0#diff-13c5b65678b327277c68d17910ae93629801af00117a0e3da007afd95b6c6764L4946-R4947





[GitHub] [spark] EnricoMi commented on a diff in pull request #38312: [SPARK-40819][SQL] Timestamp nanos behaviour regression

Posted by GitBox <gi...@apache.org>.
EnricoMi commented on code in PR #38312:
URL: https://github.com/apache/spark/pull/38312#discussion_r1002966819


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala:
##########
@@ -271,6 +271,10 @@ class ParquetToSparkSchemaConverter(
             } else {
               TimestampNTZType
             }
+          // SPARK-40819: NANOS are not supported as a Timestamp, convert to LongType without
+          // timezone awareness to address behaviour regression introduced by SPARK-34661

Review Comment:
   Yes, the main purpose of the nano timestamp is the nano precision.





[GitHub] [spark] EnricoMi commented on a diff in pull request #38312: [SPARK-40819][SQL] Timestamp nanos behaviour regression

Posted by GitBox <gi...@apache.org>.
EnricoMi commented on code in PR #38312:
URL: https://github.com/apache/spark/pull/38312#discussion_r1001088325


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala:
##########
@@ -271,6 +271,8 @@ class ParquetToSparkSchemaConverter(
             } else {
               TimestampNTZType
             }
+          case timestamp: TimestampLogicalTypeAnnotation if timestamp.getUnit == TimeUnit.NANOS =>

Review Comment:
   Yes, the (earlier) existing behaviour was useful and should be restored until nanos are properly supported as a typed nano timestamp, unless a workaround can be found that restores the earlier behaviour.
   
   Does providing a schema (`spark.read.schema(...)`) with long type override the parquet timestamp type from the parquet file? Would that be a workaround?





[GitHub] [spark] attilapiros commented on pull request #38312: [SPARK-40819][SQL] Timestamp nanos behaviour regression

Posted by GitBox <gi...@apache.org>.
attilapiros commented on PR #38312:
URL: https://github.com/apache/spark/pull/38312#issuecomment-1322543050

   @cloud-fan what about introducing a new SQL flag to keep the old behaviour (which could be false by default)?
   May I ask what version is targeted for the full support of nanosecond precision?




[GitHub] [spark] awdavidson commented on a diff in pull request #38312: [SPARK-40819][SQL] Timestamp nanos behaviour regression

Posted by GitBox <gi...@apache.org>.
awdavidson commented on code in PR #38312:
URL: https://github.com/apache/spark/pull/38312#discussion_r1025413256


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaSuite.scala:
##########
@@ -1040,6 +1040,14 @@ class ParquetSchemaSuite extends ParquetSchemaTest {
     }
   }
 
+  test("SPARK-40819 - ability to read parquet file with TIMESTAMP(NANOS, true)") {
+    val testDataPath = getClass.getResource("/test-data/timestamp-nanos.parquet")

Review Comment:
   👍 updated





[GitHub] [spark] EnricoMi commented on a diff in pull request #38312: [SPARK-40819][SQL] address timestamp nanos behaviour regression

Posted by GitBox <gi...@apache.org>.
EnricoMi commented on code in PR #38312:
URL: https://github.com/apache/spark/pull/38312#discussion_r1000224749


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala:
##########
@@ -271,6 +271,8 @@ class ParquetToSparkSchemaConverter(
             } else {
               TimestampNTZType
             }
+          case timestamp: TimestampLogicalTypeAnnotation if timestamp.getUnit == TimeUnit.NANOS =>

Review Comment:
   Ideally, this case would be merged with above case, but that would require `TimestampType` and `TimestampNTZType` to support nanos, which is a bigger change.
   
   This `case` deserves a comment that nanos are not supported as `TimestampType` but as `LongType`, without any timezone awareness.
   
   Supporting nanos as `TimestampType` in the future looks like a breaking change then (Spark 4.x?). Or another `TimestampNSType`, like `TimestampNTZType`, could be introduced.





[GitHub] [spark] awdavidson commented on pull request #38312: [SPARK-40819][SQL] Timestamp nanos behaviour regression

Posted by GitBox <gi...@apache.org>.
awdavidson commented on PR #38312:
URL: https://github.com/apache/spark/pull/38312#issuecomment-1293507848

   @cloud-fan @LuciferYang any update/response regarding this?




[GitHub] [spark] EnricoMi commented on pull request #38312: [SPARK-40819][SQL] address timestamp nanos behaviour regression

Posted by GitBox <gi...@apache.org>.
EnricoMi commented on PR #38312:
URL: https://github.com/apache/spark/pull/38312#issuecomment-1285062141

   @sunchao @HyukjinKwon @LuciferYang this is a regression introduced by #31776 since 3.2.0




[GitHub] [spark] awdavidson commented on a diff in pull request #38312: [SPARK-40819][SQL] Timestamp nanos behaviour regression

Posted by GitBox <gi...@apache.org>.
awdavidson commented on code in PR #38312:
URL: https://github.com/apache/spark/pull/38312#discussion_r1004137291


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaSuite.scala:
##########
@@ -198,6 +205,31 @@ abstract class ParquetSchemaTest extends ParquetTest with SharedSparkSession {
 }
 
 class ParquetSchemaInferenceSuite extends ParquetSchemaTest {
+  testSchemaInference[Tuple1[Long]](

Review Comment:
   So I've been looking further into it: the message is different between `1.10.1` and `1.12.3`, meaning the test would need to be different.
   
   In `1.10.1` the message is
   
   ```
   message schema {
     required int64 attribute;
   }
   ```
   
   whereas in `1.12.3` the message is the same as in the unit test
   
   ```
   message schema {
     required int64 attribute (TIMESTAMP(NANOS,true));
   }
   ```
   
   So in Spark 3.1.0 you end up hitting this block, which returns a `LongType`: https://github.com/apache/spark/blob/branch-3.1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala#L146
   
   whereas since 3.2 you hit https://github.com/apache/spark/blob/branch-3.2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala#L174 because a case for `TimeUnit.NANOS` is not covered.







[GitHub] [spark] EnricoMi commented on pull request #38312: [SPARK-40819][SQL] Timestamp nanos behaviour regression

Posted by GitBox <gi...@apache.org>.
EnricoMi commented on PR #38312:
URL: https://github.com/apache/spark/pull/38312#issuecomment-1346164923

   @LuciferYang @cloud-fan all comments have been answered




[GitHub] [spark] EnricoMi commented on a diff in pull request #38312: [SPARK-40819][SQL] Timestamp nanos behaviour regression

Posted by "EnricoMi (via GitHub)" <gi...@apache.org>.
EnricoMi commented on code in PR #38312:
URL: https://github.com/apache/spark/pull/38312#discussion_r1095704158


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala:
##########
@@ -49,24 +49,28 @@ import org.apache.spark.sql.types._
  * @param caseSensitive Whether use case sensitive analysis when comparing Spark catalyst read
  *                      schema with Parquet schema.
  * @param inferTimestampNTZ Whether TimestampNTZType type is enabled.
+ * @param nanosAsLong Whether timestamps with nanos are converted to long.
  */
 class ParquetToSparkSchemaConverter(
     assumeBinaryIsString: Boolean = SQLConf.PARQUET_BINARY_AS_STRING.defaultValue.get,
     assumeInt96IsTimestamp: Boolean = SQLConf.PARQUET_INT96_AS_TIMESTAMP.defaultValue.get,
     caseSensitive: Boolean = SQLConf.CASE_SENSITIVE.defaultValue.get,
     inferTimestampNTZ: Boolean = SQLConf.PARQUET_INFER_TIMESTAMP_NTZ_ENABLED.defaultValue.get) {
+    nanosAsLong: Boolean = SQLConf.LEGACY_PARQUET_NANOS_AS_LONG.defaultValue.get) {

Review Comment:
   ```suggestion
       inferTimestampNTZ: Boolean = SQLConf.PARQUET_INFER_TIMESTAMP_NTZ_ENABLED.defaultValue.get,
       nanosAsLong: Boolean = SQLConf.LEGACY_PARQUET_NANOS_AS_LONG.defaultValue.get) {
   ```



##########
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaSuite.scala:
##########
@@ -65,12 +68,14 @@ abstract class ParquetSchemaTest extends ParquetTest with SharedSparkSession {
       caseSensitive: Boolean = false,
       inferTimestampNTZ: Boolean = true,
       sparkReadSchema: Option[StructType] = None,
-      expectedParquetColumn: Option[ParquetColumn] = None): Unit = {
+      expectedParquetColumn: Option[ParquetColumn] = None,
+      nanosAsLong: Boolean = false): Unit = {
     val converter = new ParquetToSparkSchemaConverter(
       assumeBinaryIsString = binaryAsString,
       assumeInt96IsTimestamp = int96AsTimestamp,
       caseSensitive = caseSensitive,
       inferTimestampNTZ = inferTimestampNTZ)
+      nanosAsLong = nanosAsLong)

Review Comment:
   ```suggestion
         inferTimestampNTZ = inferTimestampNTZ,
         nanosAsLong = nanosAsLong)
   ```



##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala:
##########
@@ -454,13 +459,15 @@ object ParquetFileFormat extends Logging {
     val assumeBinaryIsString = sparkSession.sessionState.conf.isParquetBinaryAsString
     val assumeInt96IsTimestamp = sparkSession.sessionState.conf.isParquetINT96AsTimestamp
     val inferTimestampNTZ = sparkSession.sessionState.conf.parquetInferTimestampNTZEnabled
+    val nanosAsLong = sparkSession.sessionState.conf.legacyParquetNanosAsLong
 
     val reader = (files: Seq[FileStatus], conf: Configuration, ignoreCorruptFiles: Boolean) => {
       // Converter used to convert Parquet `MessageType` to Spark SQL `StructType`
       val converter = new ParquetToSparkSchemaConverter(
         assumeBinaryIsString = assumeBinaryIsString,
         assumeInt96IsTimestamp = assumeInt96IsTimestamp,
         inferTimestampNTZ = inferTimestampNTZ)
+        nanosAsLong = nanosAsLong)

Review Comment:
   ```suggestion
           inferTimestampNTZ = inferTimestampNTZ,
           nanosAsLong = nanosAsLong)
   ```





[GitHub] [spark] HyukjinKwon commented on pull request #38312: [SPARK-40819][SQL] Timestamp nanos behaviour regression

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on PR #38312:
URL: https://github.com/apache/spark/pull/38312#issuecomment-1418782646

   Merged to master and branch-3.4.
   
   It has a conflict with branch-3.3 and branch-3.2. Would you mind creating backport PRs?




[GitHub] [spark] HyukjinKwon closed pull request #38312: [SPARK-40819][SQL] Timestamp nanos behaviour regression

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon closed pull request #38312: [SPARK-40819][SQL] Timestamp nanos behaviour regression
URL: https://github.com/apache/spark/pull/38312




[GitHub] [spark] awdavidson commented on pull request #38312: [SPARK-40819][SQL] Timestamp nanos behaviour regression

Posted by "awdavidson (via GitHub)" <gi...@apache.org>.
awdavidson commented on PR #38312:
URL: https://github.com/apache/spark/pull/38312#issuecomment-1418788172

   > Merged to master and branch-3.4.
   > 
   > It has a conflict w/ branch-3.3 and branch-3.2. Would you mind creating backporting PRs?
   
   Sure, I’ll do this asap




[GitHub] [spark] awdavidson commented on a diff in pull request #38312: [SPARK-40819][SQL] Timestamp nanos behaviour regression

Posted by GitBox <gi...@apache.org>.
awdavidson commented on code in PR #38312:
URL: https://github.com/apache/spark/pull/38312#discussion_r1002902562


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala:
##########
@@ -271,6 +271,10 @@ class ParquetToSparkSchemaConverter(
             } else {
               TimestampNTZType
             }
+          // SPARK-40819: NANOS are not supported as a Timestamp, convert to LongType without
+          // timezone awareness to address behaviour regression introduced by SPARK-34661

Review Comment:
   I do not think this is a good idea, as the precision will be lost, which is extremely important for high-frequency time series.
   
   I haven't verified, but end users/developers would still be able to `.cast(Timestamp)`, which I believe would truncate the timestamp; allowing developers to make that decision makes more sense than forcing the loss of precision.
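
   As a hedged illustration of the kind of explicit conversion being discussed (assuming the column holds epoch nanoseconds; a plain cast from long is interpreted as seconds, so the value has to be scaled first, which drops sub-microsecond precision):

   ```
   import org.apache.spark.sql.functions.col

   // df is the DataFrame read from the nanos parquet file (hypothetical name).
   val withTs = df.withColumn(
     "birthday_ts",
     (col("birthday") / 1000000000L).cast("timestamp"))  // nanos -> seconds, then cast
   ```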





[GitHub] [spark] LuciferYang commented on a diff in pull request #38312: [SPARK-40819][SQL] Timestamp nanos behaviour regression

Posted by GitBox <gi...@apache.org>.
LuciferYang commented on code in PR #38312:
URL: https://github.com/apache/spark/pull/38312#discussion_r1003294743


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaSuite.scala:
##########
@@ -198,6 +205,31 @@ abstract class ParquetSchemaTest extends ParquetTest with SharedSparkSession {
 }
 
 class ParquetSchemaInferenceSuite extends ParquetSchemaTest {
+  testSchemaInference[Tuple1[Long]](

Review Comment:
   Could this case pass before Spark 3.2? My impression is that Parquet 1.10.1, used by Spark 3.1, does not support the nanos type, does it?





[GitHub] [spark] EnricoMi commented on a diff in pull request #38312: [SPARK-40819][SQL] Timestamp nanos behaviour regression

Posted by GitBox <gi...@apache.org>.
EnricoMi commented on code in PR #38312:
URL: https://github.com/apache/spark/pull/38312#discussion_r1025407302


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaSuite.scala:
##########
@@ -1040,6 +1040,14 @@ class ParquetSchemaSuite extends ParquetSchemaTest {
     }
   }
 
+  test("SPARK-40819 - ability to read parquet file with TIMESTAMP(NANOS, true)") {
+    val testDataPath = getClass.getResource("/test-data/timestamp-nanos.parquet")
+    val data = spark.read.parquet(testDataPath.toString).select("birthday")
+
+    assert(data.schema.fields.head.dataType == LongType)
+    assert(data.take(1).head.getAs[Long](0) == 1668537129000000000L)

Review Comment:
   Shall we sort the read DataFrame to guarantee we compare the first row (unless all rows have the same timestamp)?





[GitHub] [spark] awdavidson commented on pull request #38312: [SPARK-40819][SQL] Timestamp nanos behaviour regression

Posted by GitBox <gi...@apache.org>.
awdavidson commented on PR #38312:
URL: https://github.com/apache/spark/pull/38312#issuecomment-1315558808

   > @awdavidson I would like to understand the use case a bit better. Was the parquet file written by an earlier Spark (version < 3.2), and does the error come when that parquet file is read back with a later Spark? If yes, this is clearly a regression. Still, in this case can you please show us how we can reproduce it manually (a small example code for write/read)?
   > 
   > If it was written by another tool, can we get an example parquet file with sample data where the old version works and the new version fails?
   
   @attilapiros so the parquet file is written by another process. Spark uses this data to run aggregations and analysis over different time horizons where nanosecond precision is required. With earlier Spark versions (< 3.2) the `TIMESTAMP(NANOS,true)` in the parquet schema is automatically converted to a `LongType`; however, since the move from Parquet `1.10.1` to `1.12.3` and the changes to `ParquetSchemaConverter`, an `illegalType()` is thrown. As soon as I have access this evening I will provide an example parquet file.
   
   Whilst I understand that timestamps with nanosecond precision are not fully supported, this change in behaviour will prevent users from migrating to the latest Spark version.
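
   A hedged sketch of the reproduction described above (the file is produced by an external writer with a `TIMESTAMP(NANOS,true)` column; the path is hypothetical):

   ```
   // Spark <= 3.1: the column is inferred as LongType and the read succeeds.
   // Spark 3.2+:   schema conversion fails with
   //   org.apache.spark.sql.AnalysisException: Illegal Parquet type: INT64 (TIMESTAMP(NANOS,true))
   val df = spark.read.parquet("/data/high-frequency-ticks.parquet")
   df.printSchema()
   ```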




[GitHub] [spark] cloud-fan commented on a diff in pull request #38312: [SPARK-40819][SQL] Timestamp nanos behaviour regression

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #38312:
URL: https://github.com/apache/spark/pull/38312#discussion_r1004026112


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaSuite.scala:
##########
@@ -198,6 +205,31 @@ abstract class ParquetSchemaTest extends ParquetTest with SharedSparkSession {
 }
 
 class ParquetSchemaInferenceSuite extends ParquetSchemaTest {
+  testSchemaInference[Tuple1[Long]](

Review Comment:
   Then is this really a regression?





[GitHub] [spark] awdavidson commented on a diff in pull request #38312: [SPARK-40819][SQL] Timestamp nanos behaviour regression

Posted by GitBox <gi...@apache.org>.
awdavidson commented on code in PR #38312:
URL: https://github.com/apache/spark/pull/38312#discussion_r1004137291


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaSuite.scala:
##########
@@ -198,6 +205,31 @@ abstract class ParquetSchemaTest extends ParquetTest with SharedSparkSession {
 }
 
 class ParquetSchemaInferenceSuite extends ParquetSchemaTest {
+  testSchemaInference[Tuple1[Long]](

Review Comment:
   So I've been looking further into it; it's because the message is different between `1.10.1` and `1.12.3`.
   
   In `1.10.1` the message is
   
   ```
   message schema {
     required int64 attribute;
   }
   ```
   
   whereas in `1.12.3` the message is the same as in the unit test
   
   ```
   message schema {
     required int64 attribute (TIMESTAMP(NANOS,true));
   }
   ```
   
   So in Spark 3.1.0 you end up hitting this block, which returns a `LongType`: https://github.com/apache/spark/blob/branch-3.1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala#L146
   
   whereas since 3.2 you hit https://github.com/apache/spark/blob/branch-3.2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala#L174 because a case for `TimeUnit.NANOS` is not covered.
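   A condensed sketch of the two paths (simplified; the method names here are illustrative, not the actual converter code):
   
   ```
   import org.apache.parquet.schema.{LogicalTypeAnnotation, OriginalType}
   import org.apache.parquet.schema.LogicalTypeAnnotation.{TimeUnit, TimestampLogicalTypeAnnotation}
   import org.apache.spark.sql.types.{DataType, LongType}

   // Spark 3.1 / parquet 1.10.1: the converter matches on OriginalType.
   // TIMESTAMP(NANOS,true) has no OriginalType, so INT64 arrives with null
   // and falls through to LongType.
   def convertInt64Pre32(originalType: OriginalType): DataType = originalType match {
     case null  => LongType
     case other => sys.error(s"handling for $other elided (TIMESTAMP_MILLIS, TIMESTAMP_MICROS, ...)")
   }

   // Spark 3.2+ / parquet 1.12.3: the converter matches on the LogicalTypeAnnotation,
   // but only MILLIS and MICROS are covered, so NANOS reaches illegalType().
   // The PR's ParquetSchemaConverter change adds the NANOS-guarded case returning LongType.
   def convertInt64Since32(annotation: LogicalTypeAnnotation): DataType = annotation match {
     case t: TimestampLogicalTypeAnnotation if t.getUnit == TimeUnit.NANOS => LongType
     case _ => sys.error("Illegal Parquet type: INT64 (TIMESTAMP(NANOS,true))") // illegalType() in the real code
   }
   ```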





[GitHub] [spark] awdavidson commented on a diff in pull request #38312: [SPARK-40819][SQL] Timestamp nanos behaviour regression

Posted by GitBox <gi...@apache.org>.
awdavidson commented on code in PR #38312:
URL: https://github.com/apache/spark/pull/38312#discussion_r1003756142


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaSuite.scala:
##########
@@ -198,6 +205,31 @@ abstract class ParquetSchemaTest extends ParquetTest with SharedSparkSession {
 }
 
 class ParquetSchemaInferenceSuite extends ParquetSchemaTest {
+  testSchemaInference[Tuple1[Long]](

Review Comment:
   This particular case doesn't pass, and neither do similar tests for `TIMESTAMP(MILLIS,true)` etc. (https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaSuite.scala#L2234), due to `No enum constant org.apache.parquet.schema.OriginalType.TIMESTAMP` when trying to convert the message.
   
   https://github.com/apache/spark/blob/branch-3.1/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaSuite.scala#L65
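   For what it's worth, a guess at the failure mode (a sketch only, assuming the expected schema string is parsed with parquet 1.10.1's `MessageTypeParser`, as the branch-3.1 helper linked above appears to do):
   
   ```
   import org.apache.parquet.schema.MessageTypeParser

   // Against parquet 1.10.1 this parse appears to be the step that throws
   //   No enum constant org.apache.parquet.schema.OriginalType.TIMESTAMP
   // since the old parser only knows OriginalType spellings such as
   // TIMESTAMP_MILLIS / TIMESTAMP_MICROS, not the newer logical-type syntax.
   val parsed = MessageTypeParser.parseMessageType(
     """message schema {
       |  required int64 attribute (TIMESTAMP(NANOS,true));
       |}""".stripMargin)
   ```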





[GitHub] [spark] EnricoMi commented on pull request #38312: [SPARK-40819][SQL] Timestamp nanos behaviour regression

Posted by GitBox <gi...@apache.org>.
EnricoMi commented on PR #38312:
URL: https://github.com/apache/spark/pull/38312#issuecomment-1305155414

   @cloud-fan where do we stand with this? Is this a regression? How do we proceed?




[GitHub] [spark] sunchao commented on a diff in pull request #38312: [SPARK-40819][SQL] Timestamp nanos behaviour regression

Posted by GitBox <gi...@apache.org>.
sunchao commented on code in PR #38312:
URL: https://github.com/apache/spark/pull/38312#discussion_r1000804099


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala:
##########
@@ -271,6 +271,8 @@ class ParquetToSparkSchemaConverter(
             } else {
               TimestampNTZType
             }
+          case timestamp: TimestampLogicalTypeAnnotation if timestamp.getUnit == TimeUnit.NANOS =>

Review Comment:
   Hmm, how is this sufficient? We should also handle the long (nanos) accordingly in the Parquet reader, is that right? E.g. `ParquetVectorUpdaterFactory`.





[GitHub] [spark] awdavidson commented on a diff in pull request #38312: [SPARK-40819][SQL] Timestamp nanos behaviour regression

Posted by GitBox <gi...@apache.org>.
awdavidson commented on code in PR #38312:
URL: https://github.com/apache/spark/pull/38312#discussion_r1000927574


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala:
##########
@@ -271,6 +271,8 @@ class ParquetToSparkSchemaConverter(
             } else {
               TimestampNTZType
             }
+          case timestamp: TimestampLogicalTypeAnnotation if timestamp.getUnit == TimeUnit.NANOS =>

Review Comment:
   I agree it's weird, and if it is unintended behaviour, from what I know it's been unintended since Spark 2.3, so quite a long time (I have not checked versions before 2.3). However, I will let you guys take the call on whether it should be considered a regression.
   
   I assume the change to support nanos will be a breaking change; I am completely in favour of supporting nanos, but it would be nice if the original behaviour still existed until nanos support is developed and released.
   
   
   
   





[GitHub] [spark] AmplabJenkins commented on pull request #38312: [SPARK-40819][SQL] Timestamp nanos behaviour regression

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on PR #38312:
URL: https://github.com/apache/spark/pull/38312#issuecomment-1286017469

   Can one of the admins verify this patch?




[GitHub] [spark] attilapiros commented on pull request #38312: [SPARK-40819][SQL] Timestamp nanos behaviour regression

Posted by GitBox <gi...@apache.org>.
attilapiros commented on PR #38312:
URL: https://github.com/apache/spark/pull/38312#issuecomment-1314564209

   @awdavidson I would like to understand the use case a bit better. Was the parquet file written by an earlier Spark (version < 3.2), and does the error come when that parquet file is read back with a later Spark? If yes, this is clearly a regression. Still, in this case can you please show us how we can reproduce it manually (a small code example for write/read)?
   
   If it was written by another tool, can we get an example parquet file with sample data where the old version works and the new version fails?
   




[GitHub] [spark] awdavidson commented on a diff in pull request #38312: [SPARK-40819][SQL] Timestamp nanos behaviour regression

Posted by "awdavidson (via GitHub)" <gi...@apache.org>.
awdavidson commented on code in PR #38312:
URL: https://github.com/apache/spark/pull/38312#discussion_r1099727909


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -3775,6 +3775,13 @@ object SQLConf {
       .booleanConf
       .createWithDefault(false)
 
+  val LEGACY_PARQUET_NANOS_AS_LONG = buildConf("spark.sql.legacy.parquet.nanosAsLong")
+    .internal()
+    .doc("When true, the Parquet's nanos precision timestamps are converted to SQL long values.")
+    .version("3.2.3")

Review Comment:
   Sure
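   As an aside, a minimal sketch of how the legacy flag introduced in the diff above would be enabled (illustrative only; the config is marked internal and the path is hypothetical):
   
   ```
   // Opt back into the pre-3.2 behaviour of reading TIMESTAMP(NANOS,*) parquet
   // columns as raw LongType nanosecond values.
   spark.conf.set("spark.sql.legacy.parquet.nanosAsLong", "true")
   val df = spark.read.parquet("/path/to/timestamp-nanos.parquet")
   df.printSchema()  // the nanos column is reported as long
   ```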


