Posted to issues@spark.apache.org by "Davies Liu (JIRA)" <ji...@apache.org> on 2016/04/27 23:48:12 UTC
[jira] [Commented] (SPARK-12683) SQL timestamp is wrong when accessed as Python datetime
[ https://issues.apache.org/jira/browse/SPARK-12683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15261004#comment-15261004 ]
Davies Liu commented on SPARK-12683:
------------------------------------
Did some debugging on this; it seems that the Java library supports timezones and daylight saving time beyond 2038, see:
{code}
>>> sqlContext.sql("""select cast(cast('2038-03-14 3:00:00' as timestamp) as bigint) as ts""").collect()
[Row(ts=2152173600)]
>>> sqlContext.sql("""select cast(cast('2038-03-14 2:00:00' as timestamp) as bigint) as ts""").collect()
[Row(ts=2152173600)]
{code}
But the Python library does not. Presumably, once we get close to 2038, the Python library will support that as well.
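To illustrate the difference (a minimal sketch, not part of the original report; it assumes the session above ran in US Pacific time, which is consistent with 2152173600 being 10:00 UTC, i.e. 03:00 PDT, and with 02:00 local not existing on 2038-03-14 due to the spring-forward):

{code}
from datetime import datetime, timezone

# 2152173600 is the bigint Spark returned above for BOTH
# '2038-03-14 2:00:00' and '2038-03-14 3:00:00': 02:00 local does not
# exist on that date, so the JVM normalizes it to 03:00 PDT = 10:00 UTC.
ts = 2152173600

# Interpreting the epoch seconds in UTC is safe on any platform:
print(datetime.fromtimestamp(ts, tz=timezone.utc))
# 2038-03-14 10:00:00+00:00

# PySpark, however, hands timestamps back as *naive local* datetimes via
# datetime.fromtimestamp(seconds), which delegates to the C library's
# local-time conversion. On systems whose time routines do not carry DST
# rules past 2038 (or are limited to 32-bit time_t), this conversion can
# come back with a wrong UTC offset -- the kind of skew reported below.
print(datetime.fromtimestamp(ts))
# (platform/timezone dependent)
{code}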
> SQL timestamp is wrong when accessed as Python datetime
> -------------------------------------------------------
>
> Key: SPARK-12683
> URL: https://issues.apache.org/jira/browse/SPARK-12683
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 1.5.1, 1.5.2, 1.6.0
> Environment: Windows 7 Pro x64
> Python 3.4.3
> py4j 0.9
> Reporter: Gerhard Fiedler
> Attachments: spark_bug_date.py
>
>
> When accessing SQL timestamp data through {{.show()}}, it looks correct, but when accessing it (as Python {{datetime}}) through {{.collect()}}, it is wrong.
> {code}
> from datetime import datetime
> from pyspark import SparkContext
> from pyspark.sql import SQLContext
> if __name__ == "__main__":
>     spark_context = SparkContext(appName='SparkBugTimestampHour')
>     sql_context = SQLContext(spark_context)
>     sql_text = """select cast('2100-09-09 12:11:10.09' as timestamp) as ts"""
>     data_frame = sql_context.sql(sql_text)
>     data_frame.show(truncate=False)
>     # Result from .show() (as expected, looks correct):
>     # +----------------------+
>     # |ts                    |
>     # +----------------------+
>     # |2100-09-09 12:11:10.09|
>     # +----------------------+
>     rows = data_frame.collect()
>     row = rows[0]
>     ts = row[0]
>     print('ts={ts}'.format(ts=ts))
>     # Expected result from this print statement:
>     # ts=2100-09-09 12:11:10.090000
>     #
>     # Actual, wrong result (note the hours being 18 instead of 12):
>     # ts=2100-09-09 18:11:10.090000
>     #
>     # This error seems to be dependent on some characteristic of the system. We couldn't reproduce
>     # this on all of our systems, but it is not clear what the differences are. One difference is
>     # the processor: it failed on Intel Xeon E5-2687W v2.
>     assert isinstance(ts, datetime)
>     assert ts.year == 2100 and ts.month == 9 and ts.day == 9
>     assert ts.minute == 11 and ts.second == 10 and ts.microsecond == 90000
>     if ts.hour != 12:
>         print('hour is not correct; should be 12, is actually {hour}'.format(hour=ts.hour))
>     spark_context.stop()
> {code}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org