You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Takeshi Yamamuro (JIRA)" <ji...@apache.org> on 2018/07/09 03:44:00 UTC
[jira] [Resolved] (SPARK-24752) date_format provides incorrect year after a timezone conversation changes the year on a timestamp

     [ https://issues.apache.org/jira/browse/SPARK-24752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Takeshi Yamamuro resolved SPARK-24752.
--------------------------------------
    Resolution: Not A Problem

> date_format provides incorrect year after a timezone conversation changes the year on a timestamp
> -------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-24752
>                 URL: https://issues.apache.org/jira/browse/SPARK-24752
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, Spark Core, SQL
>    Affects Versions: 2.1.2, 2.4.0
>            Reporter: Noah Lidell
>            Priority: Minor
>
> Initially I have a dataframe with a column for date and a column for time in UTC time. I combine these two columns into a single string that represents a UTC timestamp. I then convert that UTC timestamp into a Central Time timestamp using pyspark.sql.functions.from_utc_timestamp(). In the example below I have timestamps that are early in the day on Janurary 1st UTC, so that when they are converted to Central Time the resulting timestamp is on December 31st of the previous year. These timestamps are produced correctly. However, if I then use pyspark.sql.functions.date_format() to format a date from the one of these December 31st Central Time timestamps then the year portion of the output will be wrong. 
>  
>  
> {code:java}
> import pyspark.sql.functions as F
> columns = ["observation_date","observation_time"]
> vals = [
> ('2015-01-01', '02:22'),
> ('2016-01-01', '02:22'),
> ('2016-01-01', '08:22'),
> ('2016-01-02', '02:22'),
> ('2017-01-01', '02:22'),
> ('2018-01-01', '02:22'),
> ('2019-01-01', '02:22'),
> ]
> test_df = spark.createDataFrame(vals, columns)
> test_df.show()
> # OUTPUT
> # +----------------+----------------+
> # |observation_date|observation_time|
> # +----------------+----------------+
> # |      2015-01-01|           02:22|
> # |      2016-01-01|           02:22|
> # |      2016-01-01|           08:22|
> # |      2016-01-02|           02:22|
> # |      2017-01-01|           02:22|
> # |      2018-01-01|           02:22|
> # |      2019-01-01|           02:22|
> # +----------------+----------------+
> renamed = test_df.withColumnRenamed("observation_date","UTC_observation_date").withColumnRenamed("observation_time","UTC_observation_time")
> utc_ts = renamed.withColumn("UTC", F.concat(F.col("UTC_observation_date"), F.lit(" "), F.col("UTC_observation_time")))
> utc_ts.show()
> # OUTPUT
> # +--------------------+--------------------+----------------+
> # |UTC_observation_date|UTC_observation_time|             UTC|
> # +--------------------+--------------------+----------------+
> # |          2015-01-01|               02:22|2015-01-01 02:22|
> # |          2016-01-01|               02:22|2016-01-01 02:22|
> # |          2016-01-01|               08:22|2016-01-01 08:22|
> # |          2016-01-02|               02:22|2016-01-02 02:22|
> # |          2017-01-01|               02:22|2017-01-01 02:22|
> # |          2018-01-01|               02:22|2018-01-01 02:22|
> # |          2019-01-01|               02:22|2019-01-01 02:22|
> # +--------------------+--------------------+----------------+
> central_ts = utc_ts.withColumn("central_timestamp", F.from_utc_timestamp(utc_ts.UTC, 'US/Central'))
> final = central_ts.withColumn('observation_date', F.date_format(central_ts.central_timestamp, "YYYY-MM-dd"))\
> .withColumn('observation_time', F.date_format(central_ts.central_timestamp, "HH:mm"))
> final.select('observation_date','observation_time','central_timestamp','UTC').show()
> # OUTPUT
> # +----------------+----------------+--------------------+----------------+
> # |observation_date|observation_time|   central_timestamp|             UTC|
> # +----------------+----------------+--------------------+----------------+
> # |      2015-12-31|           20:22|2014-12-31 20:22:...|2015-01-01 02:22|
> # |      2016-12-31|           20:22|2015-12-31 20:22:...|2016-01-01 02:22|
> # |      2016-01-01|           02:22|2016-01-01 02:22:...|2016-01-01 08:22|
> # |      2016-01-01|           20:22|2016-01-01 20:22:...|2016-01-02 02:22|
> # |      2016-12-31|           20:22|2016-12-31 20:22:...|2017-01-01 02:22|
> # |      2018-12-31|           20:22|2017-12-31 20:22:...|2018-01-01 02:22|
> # |      2019-12-31|           20:22|2018-12-31 20:22:...|2019-01-01 02:22|
> # +----------------+----------------+--------------------+----------------+{code}
> The Central Time timestamp is generated correctly by the from_utc_timestamp function but when formatted by the date_format function _and if_ the timezone conversation caused a change in year then the formatted string produced by the date_format function will have the wrong year.
> In summary, this is the incorrect behavior:
> {code:java}
> UTC timestamp="2016-01-01 02:22" 
> --(apply from_utc_timestamp)--> Central timestamp="2015-12-31 20:22" 
> --(apply date_format "YYYY-MM-dd")--> date="2016-12-31"{code}
> output instead should be "2015-12-31"
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org