Posted to issues@spark.apache.org by "Stephen Brennan (JIRA)" <ji...@apache.org> on 2018/07/27 17:54:00 UTC

[jira] [Commented] (SPARK-18381) Wrong date conversion between spark and python for dates before 1583

    [ https://issues.apache.org/jira/browse/SPARK-18381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16560068#comment-16560068 ] 

Stephen Brennan commented on SPARK-18381:
-----------------------------------------

Just encountered this issue myself and came here to report it. The root cause, as far as I can tell, is that PySpark uses Python day ordinals to translate between the internal and external forms ([see code in PySpark docs|https://spark.apache.org/docs/latest/api/python/_modules/pyspark/sql/types.html#DateType]). Python's day ordinals are relative to day 1: January 1st of year 1 in the [proleptic Gregorian calendar|https://en.wikipedia.org/wiki/Proleptic_Gregorian_calendar], an idealized calendar that extends the Gregorian rules indefinitely into the past and future.
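
Roughly, that conversion looks like the sketch below (a simplified illustration of the idea, not the exact PySpark source; the {{EPOCH_ORDINAL}} arithmetic also shows up in the traceback quoted in the issue description below):
{code:python}
import datetime

# Ordinal of the Unix epoch in Python's proleptic Gregorian calendar.
EPOCH_ORDINAL = datetime.date(1970, 1, 1).toordinal()  # 719163

def to_internal(d):
    # external datetime.date -> internal "days since 1970-01-01"
    return d.toordinal() - EPOCH_ORDINAL

def from_internal(v):
    # internal "days since 1970-01-01" -> external datetime.date
    return datetime.date.fromordinal(v + EPOCH_ORDINAL)

print(from_internal(0))                     # 1970-01-01
print(to_internal(datetime.date(1, 1, 1)))  # -719162
{code}
Both directions assume the internal day count is expressed in the same proleptic Gregorian calendar that Python uses. Note that the traceback below shows the JVM side handing back -719164 for '0001-01-01', two days earlier than the -719162 computed above (exactly the Julian/Gregorian offset near year 1), so {{fromordinal}} is called with -1 and fails with "ordinal must be >= 1".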

In reality, the Gregorian calendar switch began in October 1582, when 10 days were skipped: the calendar jumped directly from October 4th to October 15th.
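
Python's datetime module, by contrast, simply pretends those ten days existed, so there is no gap in its day ordinals (a quick illustration using only the standard library):
{code:python}
from datetime import date, timedelta

# In the proleptic Gregorian calendar the "skipped" days are ordinary dates.
print(date(1582, 10, 4) + timedelta(days=1))           # 1582-10-05, not 1582-10-15
print((date(1582, 10, 15) - date(1582, 10, 4)).days)   # 11, not 1
{code}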

It seems that whatever internal date library Spark uses does not follow this proleptic Gregorian calendar, but instead switches to the Julian calendar, which was in use before the Gregorian. The dates are correct, just expressed in different calendars! The following PySpark code demonstrates this:
{code:python}
>>> from datetime import date
>>> df = spark.createDataFrame([
    {'date': date(1, 1, 1)},
    {'date': date(1582, 10, 3)},
    {'date': date(1582, 10, 4)},
    {'date': date(1582, 10, 5)},
    {'date': date(1582, 10, 6)},
    {'date': date(1582, 10, 7)},
    {'date': date(1582, 10, 8)},
    {'date': date(1582, 10, 9)},
    {'date': date(1582, 10, 10)},
    {'date': date(1582, 10, 11)},
    {'date': date(1582, 10, 12)},
    {'date': date(1582, 10, 13)},
    {'date': date(1582, 10, 14)},
    {'date': date(1582, 10, 15)},
    {'date': date(2016, 6, 6)},
])
>>> df.show()
+----------+
|      date|
+----------+
|0001-01-03|
|1582-09-23|
|1582-09-24|
|1582-09-25|
|1582-09-26|
|1582-09-27|
|1582-09-28|
|1582-09-29|
|1582-09-30|
|1582-10-01|
|1582-10-02|
|1582-10-03|
|1582-10-04|
|1582-10-15|
|2016-06-06|
+----------+{code}
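
The offsets in that output line up with the well-known drift between the two calendars. As a rough back-of-the-envelope check (not Spark code, and only approximate near century leap-day boundaries), the Gregorian date runs ahead of the Julian date by about floor(year/100) - floor(year/400) - 2 days:
{code:python}
# Rough secular difference between the Gregorian and Julian calendars,
# valid away from the century leap-day boundaries.
def gregorian_minus_julian_days(year):
    return year // 100 - year // 400 - 2

print(gregorian_minus_julian_days(1582))  # 10 -> input 1582-10-03 shown as 1582-09-23
print(gregorian_minus_julian_days(1))     # -2 -> input 0001-01-01 shown as 0001-01-03
{code}
So every row before the 1582 cut-over is simply the same day relabelled in the Julian calendar.
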
Perhaps the correct fix is simply to add a note to the documentation that Spark uses the Julian calendar before October 15, 1582, in contrast to Python's idealized proleptic Gregorian calendar.

> Wrong date conversion between spark and python for dates before 1583
> --------------------------------------------------------------------
>
>                 Key: SPARK-18381
>                 URL: https://issues.apache.org/jira/browse/SPARK-18381
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.0.0
>            Reporter: Luca Caniparoli
>            Priority: Major
>
> Dates before 1583 (the Julian/Gregorian calendar transition) are processed incorrectly.
> * With a Python UDF (datetime.strptime), .show() returns wrong dates but .collect() returns correct dates
> * With pyspark.sql.functions.to_date, .show() shows correct dates but .collect() returns wrong dates. Additionally, collecting '0001-01-01' raises an error.
> {code:none}
> from pyspark.sql.types import DateType
> from pyspark.sql.functions import to_date, udf
> from datetime import datetime
> strToDate = udf(lambda x: datetime.strptime(x, '%Y-%m-%d'), DateType())
> l = [('0002-01-01', 1), ('1581-01-01', 2), ('1582-01-01', 3), ('1583-01-01', 4), ('1584-01-01', 5), ('2012-01-21', 6)]
> l_older = [('0001-01-01', 1)]
> test_df = spark.createDataFrame(l, ["date_string", "number"])
> test_df_older = spark.createDataFrame(l_older, ["date_string", "number"])
> test_df_strptime = test_df.withColumn("date_cast", strToDate(test_df["date_string"]))
> test_df_todate = test_df.withColumn("date_cast", to_date(test_df["date_string"]))
> test_df_older_todate = test_df_older.withColumn("date_cast", to_date(test_df_older["date_string"]))
> test_df_strptime.show()
> test_df_todate.show()
> print test_df_strptime.collect()
> print test_df_todate.collect()
> print test_df_older_todate.collect()
> {code}
> {noformat}
> +-----------+------+----------+
> |date_string|number| date_cast|
> +-----------+------+----------+
> | 0002-01-01|     1|0002-01-03|
> | 1581-01-01|     2|1580-12-22|
> | 1582-01-01|     3|1581-12-22|
> | 1583-01-01|     4|1583-01-01|
> | 1584-01-01|     5|1584-01-01|
> | 2012-01-21|     6|2012-01-21|
> +-----------+------+----------+
> +-----------+------+----------+
> |date_string|number| date_cast|
> +-----------+------+----------+
> | 0002-01-01|     1|0002-01-01|
> | 1581-01-01|     2|1581-01-01|
> | 1582-01-01|     3|1582-01-01|
> | 1583-01-01|     4|1583-01-01|
> | 1584-01-01|     5|1584-01-01|
> | 2012-01-21|     6|2012-01-21|
> +-----------+------+----------+
> [Row(date_string=u'0002-01-01', number=1, date_cast=datetime.date(2, 1, 1)), Row(date_string=u'1581-01-01', number=2, date_cast=datetime.date(1581, 1, 1)), Row(date_string=u'1582-01-01', number=3, date_cast=datetime.date(1582, 1, 1)), Row(date_string=u'1583-01-01', number=4, date_cast=datetime.date(1583, 1, 1)), Row(date_string=u'1584-01-01', number=5, date_cast=datetime.date(1584, 1, 1)), Row(date_string=u'2012-01-21', number=6, date_cast=datetime.date(2012, 1, 21))]
> [Row(date_string=u'0002-01-01', number=1, date_cast=datetime.date(1, 12, 30)), Row(date_string=u'1581-01-01', number=2, date_cast=datetime.date(1581, 1, 11)), Row(date_string=u'1582-01-01', number=3, date_cast=datetime.date(1582, 1, 11)), Row(date_string=u'1583-01-01', number=4, date_cast=datetime.date(1583, 1, 1)), Row(date_string=u'1584-01-01', number=5, date_cast=datetime.date(1584, 1, 1)), Row(date_string=u'2012-01-21', number=6, date_cast=datetime.date(2012, 1, 21))]
> Traceback (most recent call last):
>   File "/tmp/zeppelin_pyspark-6043517212596195478.py", line 267, in <module>
>     raise Exception(traceback.format_exc())
> Exception: Traceback (most recent call last):
>   File "/tmp/zeppelin_pyspark-6043517212596195478.py", line 265, in <module>
>     exec(code)
>   File "<stdin>", line 15, in <module>
>   File "/usr/local/spark/python/pyspark/sql/dataframe.py", line 311, in collect
>     return list(_load_from_socket(port, BatchedSerializer(PickleSerializer())))
>   File "/usr/local/spark/python/pyspark/rdd.py", line 142, in _load_from_socket
>     for item in serializer.load_stream(rf):
>   File "/usr/local/spark/python/pyspark/serializers.py", line 139, in load_stream
>     yield self._read_with_length(stream)
>   File "/usr/local/spark/python/pyspark/serializers.py", line 164, in _read_with_length
>     return self.loads(obj)
>   File "/usr/local/spark/python/pyspark/serializers.py", line 422, in loads
>     return pickle.loads(obj)
>   File "/usr/local/spark/python/pyspark/sql/types.py", line 1354, in <lambda>
>     return lambda *a: dataType.fromInternal(a)
>   File "/usr/local/spark/python/pyspark/sql/types.py", line 600, in fromInternal
>     values = [f.fromInternal(v) for f, v in zip(self.fields, obj)]
>   File "/usr/local/spark/python/pyspark/sql/types.py", line 439, in fromInternal
>     return self.dataType.fromInternal(obj)
>   File "/usr/local/spark/python/pyspark/sql/types.py", line 176, in fromInternal
>     return datetime.date.fromordinal(v + self.EPOCH_ORDINAL)
> ValueError: ('ordinal must be >= 1', <function <lambda> at 0x7fa21bf7baa0>, (u'0001-01-01', 1, -719164))
> {noformat}


