Posted to issues@spark.apache.org by "Maxim Gekk (Jira)" <ji...@apache.org> on 2020/12/01 08:28:00 UTC

[jira] [Commented] (SPARK-33571) Handling of hybrid to proleptic calendar when reading and writing Parquet data not working correctly

    [ https://issues.apache.org/jira/browse/SPARK-33571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17241339#comment-17241339 ] 

Maxim Gekk commented on SPARK-33571:
------------------------------------

[~simonvanderveldt] Thank you for the detailed description and your investigation. Let me clarify a few things:

> From our testing we're seeing several issues:
> Reading parquet data with Spark 3.0.1 that was written with Spark 2.4.5 and contains fields of type `TimestampType` with timestamps before the above mentioned moments in time, without `datetimeRebaseModeInRead` set, doesn't raise the `SparkUpgradeException`; it succeeds without any changes to the resulting dataframe compared to that dataframe in Spark 2.4.5

Spark 2.4.5 writes timestamps as the Parquet INT96 type. The SQL config `datetimeRebaseModeInRead` has no effect on reading such types in Spark 3.0.1, so Spark always performs rebasing (LEGACY mode). We recently added separate configs for INT96:
* https://github.com/apache/spark/pull/30056
* https://github.com/apache/spark/pull/30121

The changes will be released with Spark 3.1.0.
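
For reference, a minimal PySpark sketch of how the new INT96 configs from those PRs can be used once Spark 3.1.0 is out (the path is hypothetical):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Spark 3.1.0+: control INT96 rebasing explicitly instead of relying on the
# implicit LEGACY behavior of 3.0.x. Valid modes: EXCEPTION, CORRECTED, LEGACY.
spark.conf.set("spark.sql.legacy.parquet.int96RebaseModeInRead", "CORRECTED")

df = spark.read.parquet("/path/to/spark-2.4.5-output")  # hypothetical path
df.show(truncate=False)
{code}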

> Reading parquet data with Spark 3.0.1 that was written with Spark 2.4.5 and contains fields of type `TimestampType` or `DateType` with dates or timestamps before the above mentioned moments in time, with `datetimeRebaseModeInRead` set to `LEGACY`, results in the same values in the dataframe as when using `CORRECTED`, so it seems like no rebasing is happening.

For INT96, this seems to be correct behavior. We should observe different results for the TIMESTAMP_MICROS and TIMESTAMP_MILLIS types; see the SQL config `spark.sql.parquet.outputTimestampType`.
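
To illustrate (paths are hypothetical): if the timestamps are written with one of the annotated types instead of INT96, the read-side rebase mode should make a visible difference:

{code:python}
# On Spark 2.4.x: write timestamps as TIMESTAMP_MICROS instead of INT96.
spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")
df.write.parquet("/tmp/ts-micros")  # df holds timestamps before 1900-01-01T00:00:00Z

# On Spark 3.0.1: LEGACY and CORRECTED should now give different values.
spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "LEGACY")
spark.read.parquet("/tmp/ts-micros").show(truncate=False)

spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "CORRECTED")
spark.read.parquet("/tmp/ts-micros").show(truncate=False)
{code}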

The DATE case is more interesting, since we should see a difference in results for ancient dates. I will investigate this case.
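
The kind of check I have in mind, as a sketch (paths hypothetical, expected offset approximate):

{code:python}
import datetime

# On Spark 2.4.x (hybrid Julian/Gregorian calendar): write a date
# from before the 1582-10-15 calendar switch.
old = spark.createDataFrame([(datetime.date(1500, 1, 1),)], ["d"])
old.write.parquet("/tmp/ancient-dates")

# On Spark 3.0.1 (proleptic Gregorian calendar): the modes should disagree.
spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "LEGACY")
spark.read.parquet("/tmp/ancient-dates").show()  # expected: 1500-01-01, as in 2.4.x

spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "CORRECTED")
spark.read.parquet("/tmp/ancient-dates").show()  # expected: shifted by several days
{code}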

 

> Handling of hybrid to proleptic calendar when reading and writing Parquet data not working correctly
> ----------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-33571
>                 URL: https://issues.apache.org/jira/browse/SPARK-33571
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, Spark Core
>    Affects Versions: 3.0.0, 3.0.1
>            Reporter: Simon
>            Priority: Major
>
> Spark 3.0.0 and 3.0.1 do not seem to handle old dates correctly when the data was written with older Spark versions (< 2.4.6) using the hybrid calendar.
> From what I understand it should work like this:
>  * Only relevant for `DateType` before 1582-10-15 or `TimestampType` before 1900-01-01T00:00:00Z
>  * Only applies when reading or writing parquet files
>  * When reading parquet files written with Spark < 2.4.6 which contain dates or timestamps before the above mentioned moments in time, a `SparkUpgradeException` should be raised informing the user to choose either `LEGACY` or `CORRECTED` for `datetimeRebaseModeInRead`
>  * When reading parquet files written with Spark < 2.4.6 which contain dates or timestamps before the above mentioned moments in time and `datetimeRebaseModeInRead` is set to `LEGACY`, the dates and timestamps should show the same values in Spark 3.0.1 (with, for example, `df.show()`) as they did in Spark 2.4.5
>  * When reading parquet files written with Spark < 2.4.6 which contain dates or timestamps before the above mentioned moments in time and `datetimeRebaseModeInRead` is set to `CORRECTED`, the dates and timestamps should show different values in Spark 3.0.1 (with, for example, `df.show()`) than they did in Spark 2.4.5
>  * When writing parquet files with Spark > 3.0.0 which contain dates or timestamps before the above mentioned moments in time, a `SparkUpgradeException` should be raised informing the user to choose either `LEGACY` or `CORRECTED` for `datetimeRebaseModeInWrite` (a write-side sketch follows this description)
> First of all I'm not 100% sure all of this is correct. I've been unable to find any clear documentation on the expected behavior. The understanding I have was pieced together from the mailing list ([http://apache-spark-user-list.1001560.n3.nabble.com/Spark-3-0-1-new-Proleptic-Gregorian-calendar-td38914.html]), the blog post linked there, and looking at the Spark code.
> From our testing we're seeing several issues:
>  * Reading parquet data with Spark 3.0.1 that was written with Spark 2.4.5 and contains fields of type `TimestampType` with timestamps before the above mentioned moments in time, without `datetimeRebaseModeInRead` set, doesn't raise the `SparkUpgradeException`; it succeeds without any changes to the resulting dataframe compared to that dataframe in Spark 2.4.5
>  * Reading parquet data with Spark 3.0.1 that was written with Spark 2.4.5 and contains fields of type `TimestampType` or `DateType` with dates or timestamps before the above mentioned moments in time, with `datetimeRebaseModeInRead` set to `LEGACY`, results in the same values in the dataframe as when using `CORRECTED`, so it seems like no rebasing is happening.
> I've made some scripts to help with testing and to show the behavior; they use pyspark 2.4.5, 2.4.6, and 3.0.1. You can find them here: [https://github.com/simonvanderveldt/spark3-rebasemode-issue]. I'll post the outputs in a comment below as well.
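
Regarding the last bullet in the description (the expected exception on write): a minimal write-side sketch for Spark 3.0.x, assuming a dataframe `df` holding ancient dates or timestamps and a hypothetical output path:

{code:python}
# Spark 3.0.x: without a rebase mode set, writing dates/timestamps from
# before the calendar switch should fail with a SparkUpgradeException
# asking for datetimeRebaseModeInWrite. Setting it resolves the write:
spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "LEGACY")
df.write.parquet("/tmp/rebase-out")  # hypothetical path
{code}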


