You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Noah Lidell (JIRA)" <ji...@apache.org> on 2018/07/06 18:53:00 UTC

[jira] [Created] (SPARK-24752) date_format provides incorrect year after a timezone conversation changes the year on a timestamp

Noah Lidell created SPARK-24752:
-----------------------------------

             Summary: date_format provides incorrect year after a timezone conversation changes the year on a timestamp
                 Key: SPARK-24752
                 URL: https://issues.apache.org/jira/browse/SPARK-24752
             Project: Spark
          Issue Type: Bug
          Components: PySpark, Spark Core, SQL
    Affects Versions: 2.1.2
            Reporter: Noah Lidell


Initially I have a dataframe with a column for date and a column for time in UTC time. I combine these two columns into a single string that represents a UTC timestamp. I then convert that UTC timestamp into a Central Time timestamp using pyspark.sql.functions.from_utc_timestamp(). In the example below I have timestamps that are early in the day on Janurary 1st UTC, so that when they are converted to Central Time the resulting timestamp is on December 31st of the previous year. These timestamps are produced correctly. However, if I then use pypspark.sql.functions.date_format() to format a date from the one of these December 31st Central Time timestamps then the year portion of the output will be wrong. 

 

 
{code:java}
import pyspark.sql.functions as F

columns = ["observation_date","observation_time"]
vals = [
('2015-01-01', '02:22'),
('2016-01-01', '02:22'),
('2016-01-01', '08:22'),
('2016-01-02', '02:22'),
('2017-01-01', '02:22'),
('2018-01-01', '02:22'),
('2019-01-01', '02:22'),
]

test_df = spark.createDataFrame(vals, columns)

test_df.show()
# OUTPUT
# +----------------+----------------+
# |observation_date|observation_time|
# +----------------+----------------+
# |      2015-01-01|           02:22|
# |      2016-01-01|           02:22|
# |      2016-01-01|           08:22|
# |      2016-01-02|           02:22|
# |      2017-01-01|           02:22|
# |      2018-01-01|           02:22|
# |      2019-01-01|           02:22|
# +----------------+----------------+

renamed = test_df.withColumnRenamed("observation_date","UTC_observation_date").withColumnRenamed("observation_time","UTC_observation_time")

utc_ts = renamed.withColumn("UTC", F.concat(F.col("UTC_observation_date"), F.lit(" "), F.col("UTC_observation_time")))

utc_ts.show()
# OUTPUT
# +--------------------+--------------------+----------------+
# |UTC_observation_date|UTC_observation_time|             UTC|
# +--------------------+--------------------+----------------+
# |          2015-01-01|               02:22|2015-01-01 02:22|
# |          2016-01-01|               02:22|2016-01-01 02:22|
# |          2016-01-01|               08:22|2016-01-01 08:22|
# |          2016-01-02|               02:22|2016-01-02 02:22|
# |          2017-01-01|               02:22|2017-01-01 02:22|
# |          2018-01-01|               02:22|2018-01-01 02:22|
# |          2019-01-01|               02:22|2019-01-01 02:22|
# +--------------------+--------------------+----------------+

central_ts = utc_ts.withColumn("central_timestamp", F.from_utc_timestamp(utc_ts.UTC, 'US/Central'))

final = central_ts.withColumn('observation_date', F.date_format(central_ts.central_timestamp, "YYYY-MM-dd"))\
.withColumn('observation_time', F.date_format(central_ts.central_timestamp, "HH:mm"))

final.select('observation_date','observation_time','central_timestamp','UTC').show()
# OUTPUT
# +----------------+----------------+--------------------+----------------+
# |observation_date|observation_time|   central_timestamp|             UTC|
# +----------------+----------------+--------------------+----------------+
# |      2015-12-31|           20:22|2014-12-31 20:22:...|2015-01-01 02:22|
# |      2016-12-31|           20:22|2015-12-31 20:22:...|2016-01-01 02:22|
# |      2016-01-01|           02:22|2016-01-01 02:22:...|2016-01-01 08:22|
# |      2016-01-01|           20:22|2016-01-01 20:22:...|2016-01-02 02:22|
# |      2016-12-31|           20:22|2016-12-31 20:22:...|2017-01-01 02:22|
# |      2018-12-31|           20:22|2017-12-31 20:22:...|2018-01-01 02:22|
# |      2019-12-31|           20:22|2018-12-31 20:22:...|2019-01-01 02:22|
# +----------------+----------------+--------------------+----------------+{code}
The Central Time timestamp is generated correctly by the from_utc_timestamp function but when formatted by the date_format function _and if_ the timezone conversation caused a change in year then the formatted string produced by the date_format function will have the wrong year.

In summary, this is the incorrect behavior:

UTC timestamp="2016-01-01 02:22" -(apply from_utc_timestamp)-> Central timestamp="2015-12-31 20:22" --(apply date_format "YYYY-MM-dd")-> date="2016-12-31"

output instead should be "2015-12-31"

 

 

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org