Posted to issues@spark.apache.org by "Simon (Jira)" <ji...@apache.org> on 2020/12/08 16:14:00 UTC

[jira] [Comment Edited] (SPARK-33571) Handling of hybrid to proleptic calendar when reading and writing Parquet data not working correctly

    [ https://issues.apache.org/jira/browse/SPARK-33571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17245969#comment-17245969 ] 

Simon edited comment on SPARK-33571 at 12/8/20, 4:13 PM:
---------------------------------------------------------

[~maxgekk] Thanks for taking the time to look into this, for the updates to the documentation and for the explanation!
The actual data I ran into this issue with used the year 220, which is why I picked it; of course that's the one century with a 0-day diff :P The table with the diffs between the two calendars cleared things up a lot. I tried some different dates and can now also see the differences between the two read modes.

If you don't mind I have two additional questions:
> Spark 2.4.5 writes timestamps as parquet INT96 type. The SQL config `datetimeRebaseModeInRead` does not influence on reading such types in Spark 3.0.1, so, Spark performs rebasing always (LEGACY mode). We recently added separate configs for INT96...


Is the behavior of the `spark.sql.legacy.parquet.int96RebaseModeIn*` configs to be introduced in Spark 3.1 the same as for `datetimeRebaseModeIn*`? That is, Spark will check the parquet metadata for the Spark version and the `datetimeRebaseModeInRead` metadata key and use the correct behavior, and if those are not set it will raise an exception asking the user to define the mode. Is that correct?
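For reference, this is how I understand the new configs would be used once 3.1 is out (a minimal sketch, assuming a Spark 3.1 session; the path is just an example):

{noformat}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Explicitly choose how INT96 timestamps written by old Spark versions are read:
# LEGACY rebases from the hybrid calendar, CORRECTED reads the values as stored.
spark.conf.set("spark.sql.legacy.parquet.int96RebaseModeInRead", "LEGACY")

spark.read.parquet("/data/written-by-spark-2.4.5").show()  # example path
{noformat}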

(P.S. You explicitly mention that Spark 2.4.5 writes timestamps as INT96, but from my testing Spark 3 does the same by default; I'm not sure if that aligns with your findings?)
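For what it's worth, this is how I checked the physical type of the written timestamps (a sketch using pyarrow; the filename is just an example):

{noformat}
import pyarrow.parquet as pq

# Inspect one of the part files Spark wrote:
meta = pq.ParquetFile("part-00000.snappy.parquet").metadata  # example file

# With Spark 3's default spark.sql.parquet.outputTimestampType (INT96),
# timestamp columns show up with physical type INT96 here as well.
for i in range(meta.num_columns):
    col = meta.row_group(0).column(i)
    print(col.path_in_schema, col.physical_type)
{noformat}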

> For INT96, it seems it is correct behavior. We should observe different results for TIMESTAMP_MICROS and TIMESTAMP_MILLIS types, see the SQL config spark.sql.parquet.outputTimestampType.

What is the expected behavior for TIMESTAMP_MICROS and TIMESTAMP_MILLIS with regard to this?
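In case it helps, this is the kind of write I have in mind for checking that (a minimal sketch; the path is just an example):

{noformat}
# Write timestamps as int64 TIMESTAMP_MICROS instead of the default INT96:
spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")
df.write.mode("overwrite").parquet("/data/ts-micros")  # example path

# My assumption: for TIMESTAMP_MICROS/TIMESTAMP_MILLIS the
# datetimeRebaseModeInWrite config applies, so writing pre-1900
# timestamps should trigger the rebase handling (exception/LEGACY/CORRECTED).
{noformat}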



> Handling of hybrid to proleptic calendar when reading and writing Parquet data not working correctly
> ----------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-33571
>                 URL: https://issues.apache.org/jira/browse/SPARK-33571
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, Spark Core
>    Affects Versions: 3.0.0, 3.0.1
>            Reporter: Simon
>            Priority: Major
>             Fix For: 3.1.0
>
>
> The handling in Spark 3.0.0 and 3.0.1 of old dates that were written with older Spark versions (< 2.4.6) using the hybrid calendar seems to be broken/not working correctly.
> From what I understand, it should work like this:
>  * Only relevant for `DateType` before 1582-10-15 or `TimestampType` before 1900-01-01T00:00:00Z
>  * Only applies when reading or writing parquet files
>  * When reading parquet files written with Spark < 2.4.6 which contain dates or timestamps before the above-mentioned moments in time, a `SparkUpgradeException` should be raised informing the user to choose either `LEGACY` or `CORRECTED` for `datetimeRebaseModeInRead`
>  * When reading parquet files written with Spark < 2.4.6 which contain dates or timestamps before the above-mentioned moments in time and `datetimeRebaseModeInRead` is set to `LEGACY`, the dates and timestamps should show the same values in Spark 3.0.1 (with, for example, `df.show()`) as they did in Spark 2.4.5
>  * When reading parquet files written with Spark < 2.4.6 which contain dates or timestamps before the above-mentioned moments in time and `datetimeRebaseModeInRead` is set to `CORRECTED`, the dates and timestamps should show different values in Spark 3.0.1 (with, for example, `df.show()`) than they did in Spark 2.4.5
>  * When writing parquet files with Spark >= 3.0.0 which contain dates or timestamps before the above-mentioned moments in time, a `SparkUpgradeException` should be raised informing the user to choose either `LEGACY` or `CORRECTED` for `datetimeRebaseModeInWrite` (a short sketch of these modes follows below this list)
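>
> To make the rebase modes concrete, here's a minimal sketch of the read path I mean (the path and session setup are just examples):
> {noformat}
> # Without datetimeRebaseModeInRead set, reading pre-1582 dates or pre-1900
> # timestamps written by Spark 2.4.5 should raise a SparkUpgradeException:
> spark.read.parquet("/data/written-by-spark-2.4.5").show()
>
> # With the mode set explicitly, the read should succeed:
> spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "LEGACY")
> spark.read.parquet("/data/written-by-spark-2.4.5").show()  # rebased, 2.4.5-compatible values
>
> spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "CORRECTED")
> spark.read.parquet("/data/written-by-spark-2.4.5").show()  # values as stored, no rebasing
> {noformat}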
> First of all, I'm not 100% sure all of this is correct; I've been unable to find any clear documentation of the expected behavior. My understanding was pieced together from the mailing list ([http://apache-spark-user-list.1001560.n3.nabble.com/Spark-3-0-1-new-Proleptic-Gregorian-calendar-td38914.html]), the blog post linked there, and the Spark code.
> From our testing we're seeing several issues:
>  * Reading parquet data with Spark 3.0.1 that was written with Spark 2.4.5 and contains fields of type `TimestampType` holding timestamps before the above-mentioned moments in time, without `datetimeRebaseModeInRead` set, doesn't raise the `SparkUpgradeException`; it succeeds without any changes to the resulting dataframe compared to that dataframe in Spark 2.4.5
>  * Reading parquet data with Spark 3.0.1 that was written with Spark 2.4.5 and contains fields of type `TimestampType` or `DateType` holding dates or timestamps before the above-mentioned moments in time, with `datetimeRebaseModeInRead` set to `LEGACY`, results in the same values in the dataframe as when using `CORRECTED`, so it seems like no rebasing is happening.
> I've made some scripts to help with testing and to show the behavior; they use pyspark 2.4.5, 2.4.6 and 3.0.1. You can find them here: [https://github.com/simonvanderveldt/spark3-rebasemode-issue]. I'll post the outputs in a comment below as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org