Posted to issues@spark.apache.org by "Jim Huang (Jira)" <ji...@apache.org> on 2020/02/18 01:57:00 UTC
[jira] [Created] (SPARK-30862) [Scala] date_format() incorrectly computes time when applied against more than one column
Jim Huang created SPARK-30862:
---------------------------------
Summary: [Scala] date_format() incorrectly computes time when applied against more than one column
Key: SPARK-30862
URL: https://issues.apache.org/jira/browse/SPARK-30862
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.4.4
Environment: $ /opt/spark2/bin/spark-shell --master yarn --deploy-mode client --executor-memory 2G --executor-cores 2 --num-executors 3 --driver-memory 6G --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.4
Spark context Web UI available at http://localhost:4048
Spark context available as 'sc' (master = yarn, app id = application_123_456).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.4
      /_/
Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_77)
Type in expressions to have them evaluated.
Type :help for more information.
Reporter: Jim Huang
Here is the sequence of code that demonstrates the inconsistent behavior, which looks like a bug to me.
{code:java}
scala> spark.conf.set("spark.sql.session.timeZone", "UTC")
scala> case class Event(event_time: String)
defined class Event
scala> val eventDS = Seq(Event("2020-02-18T01:14:14.945Z")).toDS()
eventDS: org.apache.spark.sql.Dataset[Event] = [event_time: string]
scala> eventDS.show(false)
+------------------------+
|event_time |
+------------------------+
|2020-02-18T01:14:14.945Z|
+------------------------+
scala> val tsOperationsDF = eventDS.withColumn("add_4_hours", lit($"event_time" + expr("INTERVAL 4 HOURS"))).withColumn("sub_2_hours", lit($"event_time" - expr("INTERVAL 2 HOURS")))
tsOperationsDF: org.apache.spark.sql.DataFrame = [event_time: string, add_4_hours: string ... 1 more field]
scala> tsOperationsDF.show(false)
+------------------------+-----------------------+-----------------------+
|event_time |add_4_hours |sub_2_hours |
+------------------------+-----------------------+-----------------------+
|2020-02-18T01:14:14.945Z|2020-02-18 05:14:14.945|2020-02-17 23:14:14.945|
+------------------------+-----------------------+-----------------------+
scala> val tsOperationsFormattedDF = eventDS.withColumn("add_4_hours", lit(date_format(($"event_time" + expr("INTERVAL 4 HOURS")),"yyyy-MM-dd hh:mm:ss"))).withColumn("sub_2_hours", lit(date_format(($"event_time" - expr("INTERVAL 2 HOURS")),"yyyy-MM-dd hh:mm:ss")))
tsOperationsFormattedDF: org.apache.spark.sql.DataFrame = [event_time: string, add_4_hours: string ... 1 more field]
scala> tsOperationsFormattedDF.show(false)
+------------------------+-------------------+-------------------+
|event_time |add_4_hours |sub_2_hours |
+------------------------+-------------------+-------------------+
|2020-02-18T01:14:14.945Z|2020-02-18 05:14:14|2020-02-17 11:14:14|
+------------------------+-------------------+-------------------+
{code}
The timestamp arithmetic is identical between the two data frames, yet the results differ for the "sub_2_hours" column: 23:14:14 without date_format() versus 11:14:14 with it, while the "add_4_hours" column agrees in both. Any further confirmation and validation of this bug would be much appreciated.
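One thing worth checking is the pattern string passed to date_format(): in Java-style date patterns, "hh" is the 12-hour clock field (01-12) while "HH" is the 24-hour field (00-23), which would render 23:14 as 11:14 exactly as seen above. A minimal stand-alone sketch with plain java.text.SimpleDateFormat (outside Spark, assuming Spark 2.4's date_format() follows the same pattern-letter semantics):

```scala
import java.text.SimpleDateFormat
import java.util.TimeZone

val utc = TimeZone.getTimeZone("UTC")

// Parse the subtracted timestamp from the example above (23:14 UTC).
val parser = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'")
parser.setTimeZone(utc)
val ts = parser.parse("2020-02-17T23:14:14Z")

// "hh" is the 12-hour clock field (01-12); "HH" is the 24-hour field (00-23).
val twelveHour = new SimpleDateFormat("yyyy-MM-dd hh:mm:ss")
twelveHour.setTimeZone(utc)
val twentyFourHour = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
twentyFourHour.setTimeZone(utc)

println(twelveHour.format(ts))     // 2020-02-17 11:14:14
println(twentyFourHour.format(ts)) // 2020-02-17 23:14:14
```

If that is what is happening here, using "yyyy-MM-dd HH:mm:ss" in the date_format() calls should make both columns agree with the unformatted output.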
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org