Posted to issues@spark.apache.org by "Davies Liu (JIRA)" <ji...@apache.org> on 2016/04/27 23:48:12 UTC
[jira] [Commented] (SPARK-12683) SQL timestamp is wrong when accessed as Python datetime
[ https://issues.apache.org/jira/browse/SPARK-12683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15261004#comment-15261004 ]
Davies Liu commented on SPARK-12683:
------------------------------------
Did some debugging on this; it seems that the Java library supports timezones and daylight saving time beyond 2038, see:
{code}
>>> sqlContext.sql("""select cast(cast('2038-03-14 3:00:00' as timestamp) as bigint) as ts""").collect()
[Row(ts=2152173600)]
>>> sqlContext.sql("""select cast(cast('2038-03-14 2:00:00' as timestamp) as bigint) as ts""").collect()
[Row(ts=2152173600)]
{code}
But the Python library does not. Presumably, once we get close to 2038, the Python library will support that as well.
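To illustrate the difference (a minimal sketch, not part of the original report; it assumes the session above ran in US Pacific time, which is consistent with 2152173600 being 10:00 UTC, i.e. 03:00 PDT, and with 02:00 local not existing on 2038-03-14 due to the spring-forward):

{code}
from datetime import datetime, timezone

# 2152173600 is the bigint Spark returned above for BOTH
# '2038-03-14 2:00:00' and '2038-03-14 3:00:00': 02:00 local does not
# exist on that date, so the JVM normalizes it to 03:00 PDT = 10:00 UTC.
ts = 2152173600

# Interpreting the epoch seconds in UTC is safe on any platform:
print(datetime.fromtimestamp(ts, tz=timezone.utc))
# 2038-03-14 10:00:00+00:00

# PySpark, however, hands timestamps back as *naive local* datetimes via
# datetime.fromtimestamp(seconds), which delegates to the C library's
# local-time conversion. On systems whose time routines do not carry DST
# rules past 2038 (or are limited to 32-bit time_t), this conversion can
# come back with a wrong UTC offset -- the kind of skew reported below.
print(datetime.fromtimestamp(ts))
# (platform/timezone dependent)
{code}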
> SQL timestamp is wrong when accessed as Python datetime
> -------------------------------------------------------
>
> Key: SPARK-12683
> URL: https://issues.apache.org/jira/browse/SPARK-12683
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 1.5.1, 1.5.2, 1.6.0
> Environment: Windows 7 Pro x64
> Python 3.4.3
> py4j 0.9
> Reporter: Gerhard Fiedler
> Attachments: spark_bug_date.py
>
>
> When accessing SQL timestamp data through {{.show()}}, it looks correct, but when accessing it (as Python {{datetime}}) through {{.collect()}}, it is wrong.
> {code}
> from datetime import datetime
> from pyspark import SparkContext
> from pyspark.sql import SQLContext
> if __name__ == "__main__":
>     spark_context = SparkContext(appName='SparkBugTimestampHour')
>     sql_context = SQLContext(spark_context)
>     sql_text = """select cast('2100-09-09 12:11:10.09' as timestamp) as ts"""
>     data_frame = sql_context.sql(sql_text)
>     data_frame.show(truncate=False)
>     # Result from .show() (as expected, looks correct):
>     # +----------------------+
>     # |ts                    |
>     # +----------------------+
>     # |2100-09-09 12:11:10.09|
>     # +----------------------+
>     rows = data_frame.collect()
>     row = rows[0]
>     ts = row[0]
>     print('ts={ts}'.format(ts=ts))
>     # Expected result from this print statement:
>     # ts=2100-09-09 12:11:10.090000
>     #
>     # Actual, wrong result (note the hours being 18 instead of 12):
>     # ts=2100-09-09 18:11:10.090000
>     #
>     # This error seems to be dependent on some characteristic of the system. We couldn't reproduce
>     # this on all of our systems, but it is not clear what the differences are. One difference is
>     # the processor: it failed on Intel Xeon E5-2687W v2.
>     assert isinstance(ts, datetime)
>     assert ts.year == 2100 and ts.month == 9 and ts.day == 9
>     assert ts.minute == 11 and ts.second == 10 and ts.microsecond == 90000
>     if ts.hour != 12:
>         print('hour is not correct; should be 12, is actually {hour}'.format(hour=ts.hour))
>     spark_context.stop()
> {code}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org