Posted to issues@spark.apache.org by "Jim Huang (Jira)" <ji...@apache.org> on 2020/02/18 01:57:00 UTC
[jira] [Created] (SPARK-30862) [Scala] date_format() incorrectly computes time when applied against more than one column
Jim Huang created SPARK-30862:
---------------------------------
Summary: [Scala] date_format() incorrectly computes time when applied against more than one column
Key: SPARK-30862
URL: https://issues.apache.org/jira/browse/SPARK-30862
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.4.4
Environment: $ /opt/spark2/bin/spark-shell --master yarn --deploy-mode client --executor-memory 2G --executor-cores 2 --num-executors 3 --driver-memory 6G --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.4
Spark context Web UI available at http://localhost:4048
Spark context available as 'sc' (master = yarn, app id = application_123_456).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.4
      /_/
Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_77)
Type in expressions to have them evaluated.
Type :help for more information.
Reporter: Jim Huang
Here is the sequence of code that demonstrates the inconsistent behavior, which looks like a bug to me.
{code:java}
scala> spark.conf.set("spark.sql.session.timeZone", "UTC")
scala> case class Event(event_time: String)
defined class Event
scala> val eventDS = Seq(Event("2020-02-18T01:14:14.945Z")).toDS()
eventDS: org.apache.spark.sql.Dataset[Event] = [event_time: string]
scala> eventDS.show(false)
+------------------------+
|event_time |
+------------------------+
|2020-02-18T01:14:14.945Z|
+------------------------+
scala> val tsOperationsDF = eventDS.withColumn("add_4_hours", lit($"event_time" + expr("INTERVAL 4 HOURS"))).withColumn("sub_2_hours", lit($"event_time" - expr("INTERVAL 2 HOURS")))
tsOperationsDF: org.apache.spark.sql.DataFrame = [event_time: string, add_4_hours: string ... 1 more field]
scala> tsOperationsDF.show(false)
+------------------------+-----------------------+-----------------------+
|event_time |add_4_hours |sub_2_hours |
+------------------------+-----------------------+-----------------------+
|2020-02-18T01:14:14.945Z|2020-02-18 05:14:14.945|2020-02-17 23:14:14.945|
+------------------------+-----------------------+-----------------------+
scala> val tsOperationsFormattedDF = eventDS.withColumn("add_4_hours", lit(date_format(($"event_time" + expr("INTERVAL 4 HOURS")),"yyyy-MM-dd hh:mm:ss"))).withColumn("sub_2_hours", lit(date_format(($"event_time" - expr("INTERVAL 2 HOURS")),"yyyy-MM-dd hh:mm:ss")))
tsOperationsFormattedDF: org.apache.spark.sql.DataFrame = [event_time: string, add_4_hours: string ... 1 more field]
scala> tsOperationsFormattedDF.show(false)
+------------------------+-------------------+-------------------+
|event_time |add_4_hours |sub_2_hours |
+------------------------+-------------------+-------------------+
|2020-02-18T01:14:14.945Z|2020-02-18 05:14:14|2020-02-17 11:14:14|
+------------------------+-------------------+-------------------+
{code}
The timestamp arithmetic is identical between the two data frames, yet the results differ for the "sub_2_hours" column: 23:14:14 without date_format() versus 11:14:14 with it, while the "add_4_hours" column agrees in both. Any further confirmation and validation of this bug would be much appreciated.
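One thing worth checking is the pattern string passed to date_format(): in Java-style date patterns, "hh" is the 12-hour clock field (01-12) while "HH" is the 24-hour field (00-23), which would render 23:14 as 11:14 exactly as seen above. A minimal stand-alone sketch with plain java.text.SimpleDateFormat (outside Spark, assuming Spark 2.4's date_format() follows the same pattern-letter semantics):

```scala
import java.text.SimpleDateFormat
import java.util.TimeZone

val utc = TimeZone.getTimeZone("UTC")

// Parse the subtracted timestamp from the example above (23:14 UTC).
val parser = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'")
parser.setTimeZone(utc)
val ts = parser.parse("2020-02-17T23:14:14Z")

// "hh" is the 12-hour clock field (01-12); "HH" is the 24-hour field (00-23).
val twelveHour = new SimpleDateFormat("yyyy-MM-dd hh:mm:ss")
twelveHour.setTimeZone(utc)
val twentyFourHour = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
twentyFourHour.setTimeZone(utc)

println(twelveHour.format(ts))     // 2020-02-17 11:14:14
println(twentyFourHour.format(ts)) // 2020-02-17 23:14:14
```

If that is what is happening here, using "yyyy-MM-dd HH:mm:ss" in the date_format() calls should make both columns agree with the unformatted output.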
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org