Posted to issues@spark.apache.org by "Wenchen Fan (Jira)" <ji...@apache.org> on 2020/03/16 08:33:00 UTC

[jira] [Comment Edited] (SPARK-30951) Potential data loss for legacy applications after switch to proleptic Gregorian calendar

    [ https://issues.apache.org/jira/browse/SPARK-30951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17060024#comment-17060024 ] 

Wenchen Fan edited comment on SPARK-30951 at 3/16/20, 8:32 AM:
---------------------------------------------------------------

[~dongjoon] A different query result doesn't always mean a correctness issue. In this case, it's well documented (in the migration guide) that datetime operations before 1582 will produce slightly different results due to the calendar switch.

This ticket reports missing support for legacy data files, but it's not a correctness issue. It's a general problem of file formats like Parquet/Avro, where the date/timestamp types are not fully defined (the calendar information is missing). For example, if we use Spark 2.4 to read Parquet files written by Hive 3.x, we will also get unexpected results because the calendars are different.

On the other hand, the proleptic Gregorian calendar is the de facto calendar of our world, so it's reasonable to assume the calendar is proleptic Gregorian when file formats don't define it explicitly. From that perspective, the Parquet files written by Spark 2.x were wrong, rather than Spark 3.0 having a correctness issue.

Consider the case where we fix a correctness bug in a SQL function in 3.0, but users already have data files, written by Spark 2.4, that contain the old results of that function. Users would get unexpected results, but that doesn't mean 3.0 has a correctness issue.


was (Author: cloud_fan):
[~dongjoon] different query result doesn't always mean correctness issue. In this case, it's well documented (in the migration guide) that datetime operations before 1580 will have slightly different results due to the calendar switch.

This PR reports the missing support of legacy data files, but it's not a correctness issue. It's a general problem of file formats like Parquet/Avro where the date/timestamp type is not well defined (missing calendar information). For example, if we use Spark 2.4 to read parquet files written by Hive 3.x, we will also get unexpected results as the calendar is different.

On the other hand, the Proleptic Gregorian calendar is the de-facto calendar of our world, so it's reasonable to assume the calendar should be Proleptic Gregorian if file formats don't define it explicitly. That said, the parquet files written Spark 2.x was wrong, instead of Spark 3.0 having a correctness issue.

Think about if we fix a correctness issue of a SQL function in 3.0, but users already have a data file containing the result of this SQL function, written by Spark 2.4. Users would get unexpected result but it doesn't mean 3.0 has a correctness issue.
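
For illustration, here is a minimal spark-shell sketch of the calendar difference discussed above. It uses only JDK classes (no Spark API) and the illustrative date 1200-01-01; the exact day counts are incidental, only the mismatch matters.

{noformat}
import java.time.LocalDate
import java.util.{Calendar, GregorianCalendar, TimeZone}

// Proleptic Gregorian calendar (java.time), the calendar Spark 3.0 uses internally.
val prolepticDays = LocalDate.of(1200, 1, 1).toEpochDay

// Hybrid Julian/Gregorian calendar (java.util.GregorianCalendar), the calendar
// Spark 2.x uses internally. Midnight UTC keeps the division below exact.
val cal = new GregorianCalendar(TimeZone.getTimeZone("UTC"))
cal.clear()
cal.set(1200, Calendar.JANUARY, 1)
val hybridDays = cal.getTimeInMillis / 86400000L

// The two day counts differ (by about a week for dates around the year 1200),
// so a file that stores only the day count cannot be decoded correctly without
// knowing which calendar produced it.
println(s"proleptic Gregorian: $prolepticDays, hybrid: $hybridDays")
{noformat}

This mismatch is why the same stored value can read back as different dates in Spark 2.4 and Spark 3.0.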

> Potential data loss for legacy applications after switch to proleptic Gregorian calendar
> ----------------------------------------------------------------------------------------
>
>                 Key: SPARK-30951
>                 URL: https://issues.apache.org/jira/browse/SPARK-30951
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Bruce Robbins
>            Priority: Blocker
>              Labels: correctness
>
> tl;dr: We recently discovered some Spark 2.x sites that have lots of data containing dates before October 15, 1582. This could be an issue when such sites try to upgrade to Spark 3.0.
> From SPARK-26651:
> {quote}"The changes might impact on the results for dates and timestamps before October 15, 1582 (Gregorian)
> {quote}
> We recently discovered that some large scale Spark 2.x applications rely on dates before October 15, 1582.
> Two cases came up recently:
>  * An application that uses a commercial third-party library to encode sensitive dates. On insert, the library encodes the actual date as some other date. On select, the library decodes the date back to the original date. The encoded value could be any date, including one before October 15, 1582 (e.g., "0602-04-04").
>  * An application that uses a specific unlikely date (e.g., "1200-01-01") as a marker to indicate "unknown date" (in lieu of null)
> Both sites ran into problems after another component in their system was upgraded to use the proleptic Gregorian calendar. Spark applications that read files created by the upgraded component were interpreting encoded or marker dates incorrectly, and vice versa. Also, their data now had a mix of calendars (hybrid and proleptic Gregorian) with no metadata to indicate which file used which calendar.
> Both sites had enormous amounts of existing data, so re-encoding the dates using some other scheme was not a feasible solution.
> This is relevant to Spark 3:
> Any Spark 2 application that uses such date-encoding schemes may run into trouble when run on Spark 3. The application may not properly interpret the dates previously written by Spark 2. Also, once the Spark 3 version of the application writes data, the tables will have a mix of calendars (hybrid and proleptic Gregorian) with no metadata to indicate which file uses which calendar.
> Similarly, sites might run with mixed Spark versions, resulting in data written by one version that cannot be interpreted by the other. And as above, the tables will now have a mix of calendars with no way to detect which file uses which calendar.
> As with the two real-life example cases, these applications may have enormous amounts of legacy data, so re-encoding the dates using some other scheme may not be feasible.
> We might want to consider a configuration setting to allow the user to specify the calendar for storing and retrieving date and timestamp values (not sure how such a flag would affect other date and timestamp-related functions). I realize the change is far bigger than just adding a configuration setting.
> Here's a quick example of where trouble may happen, using the real-life case of the marker date.
> In Spark 2.4:
> {noformat}
> scala> spark.read.orc(s"$home/data/datefile").filter("dt == '1200-01-01'").count
> res0: Long = 1
> scala>
> {noformat}
> In Spark 3.0 (reading from the same legacy file):
> {noformat}
> scala> spark.read.orc(s"$home/data/datefile").filter("dt == '1200-01-01'").count
> res0: Long = 0
> scala> 
> {noformat}
> By the way, Hive had a similar problem. Hive switched from the hybrid calendar to the proleptic Gregorian calendar between 2.x and 3.x. After some upgrade headaches related to dates before 1582, the Hive community made the following changes:
>  * When writing date or timestamp data to ORC, Parquet, and Avro files, Hive checks a configuration setting to determine which calendar to use.
>  * When writing date or timestamp data to ORC, Parquet, and Avro files, Hive stores the calendar type in the metadata.
>  * When reading date or timestamp data from ORC, Parquet, and Avro files, Hive checks the metadata for the calendar type.
>  * When reading date or timestamp data from ORC, Parquet, and Avro files that lack calendar metadata, Hive's behavior is determined by a configuration setting. This allows Hive to read legacy data (note: if the data already consists of a mix of calendar types with no metadata, there is no good solution).
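
As a rough sketch of what the last bullet implies for a reader: interpreting a legacy day count means recovering the year/month/day under the writer's hybrid calendar and re-encoding that local date under the proleptic Gregorian calendar. The helper below is hypothetical (plain JDK Scala, not a Spark or Hive API):

{noformat}
import java.time.LocalDate
import java.util.{Calendar, GregorianCalendar, TimeZone}

// Hypothetical helper, not a Spark or Hive API: reinterpret a day count written
// under the hybrid Julian/Gregorian calendar as the same local date in the
// proleptic Gregorian calendar, and return the corrected day count.
def rebaseHybridToProleptic(hybridEpochDays: Long): Long = {
  val cal = new GregorianCalendar(TimeZone.getTimeZone("UTC"))
  cal.clear()
  cal.setTimeInMillis(hybridEpochDays * 86400000L)
  // Recover the year/month/day the legacy writer intended (hybrid calendar)...
  val year  = cal.get(Calendar.YEAR)
  val month = cal.get(Calendar.MONTH) + 1   // Calendar months are 0-based
  val day   = cal.get(Calendar.DAY_OF_MONTH)
  // ...and re-encode the same local date under the proleptic Gregorian calendar.
  LocalDate.of(year, month, day).toEpochDay
}
{noformat}

The hard part, as both real-life cases above show, is not the per-value conversion but knowing, per file, which calendar produced the data; that is exactly what the calendar metadata and configuration settings in Hive's approach provide.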


