Posted to issues@spark.apache.org by "Jim Huang (Jira)" <ji...@apache.org> on 2020/02/18 02:10:00 UTC

[jira] [Updated] (SPARK-30862) [Scala] date_format() incorrectly computes time when applied against more than 1 column

     [ https://issues.apache.org/jira/browse/SPARK-30862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jim Huang updated SPARK-30862:
------------------------------
    Description: 
Here is the sequence of spark-shell commands that demonstrates the inconsistent behavior, which looks like a bug to me.

 
{code:java}
scala> spark.conf.set("spark.sql.session.timeZone", "UTC")

scala> case class Event(event_time: String)
defined class Event

scala> val eventDS = Seq(Event("2020-02-18T01:14:14.945Z"), Event("2020-02-18T02:02:02.975Z")).toDS()
eventDS: org.apache.spark.sql.Dataset[Event] = [event_time: string]

scala> eventDS.show(false)
+------------------------+
|event_time              |
+------------------------+
|2020-02-18T01:14:14.945Z|
|2020-02-18T02:02:02.975Z|
+------------------------+

scala> val tsOperationsDF = eventDS.withColumn("add_4_hours", lit($"event_time" + expr("INTERVAL 4 HOURS"))).withColumn("sub_2_hours", lit($"event_time" - expr("INTERVAL 2 HOURS")))
tsOperationsDF: org.apache.spark.sql.DataFrame = [event_time: string, add_4_hours: string ... 1 more field]

scala> tsOperationsDF.show(false)
+------------------------+-----------------------+-----------------------+
|event_time              |add_4_hours            |sub_2_hours            |
+------------------------+-----------------------+-----------------------+
|2020-02-18T01:14:14.945Z|2020-02-18 05:14:14.945|2020-02-17 23:14:14.945|
|2020-02-18T02:02:02.975Z|2020-02-18 06:02:02.975|2020-02-18 00:02:02.975|
+------------------------+-----------------------+-----------------------+

scala> val tsOperationsFormattedDF = eventDS.withColumn("add_4_hours", lit(date_format(($"event_time" + expr("INTERVAL 4 HOURS")),"yyyy-MM-dd hh:mm:ss"))).withColumn("sub_2_hours", lit(date_format(($"event_time" - expr("INTERVAL 2 HOURS")),"yyyy-MM-dd hh:mm:ss")))
tsOperationsFormattedDF: org.apache.spark.sql.DataFrame = [event_time: string, add_4_hours: string ... 1 more field]

scala> tsOperationsFormattedDF.show(false)
+------------------------+-------------------+-------------------+
|event_time              |add_4_hours        |sub_2_hours        |
+------------------------+-------------------+-------------------+
|2020-02-18T01:14:14.945Z|2020-02-18 05:14:14|2020-02-17 11:14:14|
|2020-02-18T02:02:02.975Z|2020-02-18 06:02:02|2020-02-18 12:02:02|
+------------------------+-------------------+-------------------+

{code}
The timestamp operations are identical between the two data frames, yet the results differ in the "sub_2_hours" column: the first data frame computes the correct times, but the second does not once date_format() is applied.

Any further confirmation and validation of this bug would be much appreciated.
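
One detail that might help narrow this down: in the Java date-time patterns that date_format() accepts, lower-case "hh" is the clock hour of AM/PM (1-12), while upper-case "HH" is the hour of day (0-23). That difference only shows up for hours outside 01-12, which matches "23:14:14.945" rendering as "11:14:14" and "00:02:02.975" as "12:02:02", while the "add_4_hours" values (05 and 06) are unaffected. Below is a minimal sketch (my own addition, not part of the original session; it reuses the eventDS defined above and the column names are just illustrative) that formats the same subtraction with both pattern letters so the outputs can be compared side by side:

{code:java}
// Minimal comparison sketch (not from the original report).
// Reuses eventDS from the session above; assumes the spark-shell implicits are in scope.
import org.apache.spark.sql.functions.{date_format, expr}

val patternComparisonDF = eventDS
  // original pattern: "hh" = clock hour of AM/PM (1-12)
  .withColumn("sub_2_hours_hh",
    date_format($"event_time" - expr("INTERVAL 2 HOURS"), "yyyy-MM-dd hh:mm:ss"))
  // alternative pattern: "HH" = hour of day (0-23)
  .withColumn("sub_2_hours_HH",
    date_format($"event_time" - expr("INTERVAL 2 HOURS"), "yyyy-MM-dd HH:mm:ss"))

patternComparisonDF.show(false)
{code}

If the two formatted columns differ only in the hour field, the discrepancy would point at the pattern letter rather than the interval arithmetic; if they still disagree with the unformatted result, that would strengthen the case for a bug in date_format().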

 

 

  was:
Here is the sequence of code that demonstrates the weird inconsistent behavior that looks like a bug to me.  

 
{code:java}
scala> spark.conf.set("spark.sql.session.timeZone", "UTC")

scala> case class Event(event_time: String)
defined class Event

scala> val eventDS = Seq(Event("2020-02-18T01:14:14.945Z")).toDS()
eventDS: org.apache.spark.sql.Dataset[Event] = [event_time: string]

scala> eventDS.show(false)
+------------------------+
|event_time              |
+------------------------+
|2020-02-18T01:14:14.945Z|
+------------------------+

scala> val tsOperationsDF = eventDS.withColumn("add_4_hours", lit($"event_time" + expr("INTERVAL 4 HOURS"))).withColumn("sub_2_hours", lit($"event_time" - expr("INTERVAL 2 HOURS")))
tsOperationsDF: org.apache.spark.sql.DataFrame = [event_time: string, add_4_hours: string ... 1 more field]

scala> tsOperationsDF.show(false)
+------------------------+-----------------------+-----------------------+
|event_time              |add_4_hours            |sub_2_hours            |
+------------------------+-----------------------+-----------------------+
|2020-02-18T01:14:14.945Z|2020-02-18 05:14:14.945|2020-02-17 23:14:14.945|
+------------------------+-----------------------+-----------------------+

scala> val tsOperationsFormattedDF = eventDS.withColumn("add_4_hours", lit(date_format(($"event_time" + expr("INTERVAL 4 HOURS")),"yyyy-MM-dd hh:mm:ss"))).withColumn("sub_2_hours", lit(date_format(($"event_time" - expr("INTERVAL 2 HOURS")),"yyyy-MM-dd hh:mm:ss")))
tsOperationsFormattedDF: org.apache.spark.sql.DataFrame = [event_time: string, add_4_hours: string ... 1 more field]

scala> tsOperationsFormattedDF.show(false)
+------------------------+-------------------+-------------------+
|event_time              |add_4_hours        |sub_2_hours        |
+------------------------+-------------------+-------------------+
|2020-02-18T01:14:14.945Z|2020-02-18 05:14:14|2020-02-17 11:14:14|
+------------------------+-------------------+-------------------+

{code}
The timestamp operation is identical between the two data frames.  However, the time results are different for the "sub_2_hours" column.  Any further confirmation and validation of this bug is much appreciated.  

 

 


> [Scala] date_format() incorrectly computes time when applied against more than 1 column
> --------------------------------------------------------------------------------------
>
>                 Key: SPARK-30862
>                 URL: https://issues.apache.org/jira/browse/SPARK-30862
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.4
>         Environment: $ /opt/spark2/bin/spark-shell --master yarn --deploy-mode client --executor-memory 2G --executor-cores 2 --num-executors 3 --driver-memory 6G --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.4
>  
> Spark context Web UI available at http://localhost:4048
> Spark context available as 'sc' (master = yarn, app id = application_123_456).
> Spark session available as 'spark'.
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 2.4.4
>       /_/
> Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_77)
> Type in expressions to have them evaluated.
> Type :help for more information.
>            Reporter: Jim Huang
>            Priority: Major
>
> Here is the sequence of spark-shell commands that demonstrates the inconsistent behavior, which looks like a bug to me.
>  
> {code:java}
> scala> spark.conf.set("spark.sql.session.timeZone", "UTC")
> scala> case class Event(event_time: String)
> defined class Event
> scala> val eventDS = Seq(Event("2020-02-18T01:14:14.945Z"), Event("2020-02-18T02:02:02.975Z")).toDS()
> eventDS: org.apache.spark.sql.Dataset[Event] = [event_time: string]
> scala> eventDS.show(false)
> +------------------------+
> |event_time              |
> +------------------------+
> |2020-02-18T01:14:14.945Z|
> |2020-02-18T02:02:02.975Z|
> +------------------------+
> scala> val tsOperationsDF = eventDS.withColumn("add_4_hours", lit($"event_time" + expr("INTERVAL 4 HOURS"))).withColumn("sub_2_hours", lit($"event_time" - expr("INTERVAL 2 HOURS")))
> tsOperationsDF: org.apache.spark.sql.DataFrame = [event_time: string, add_4_hours: string ... 1 more field]
> scala> tsOperationsDF.show(false)
> +------------------------+-----------------------+-----------------------+
> |event_time              |add_4_hours            |sub_2_hours            |
> +------------------------+-----------------------+-----------------------+
> |2020-02-18T01:14:14.945Z|2020-02-18 05:14:14.945|2020-02-17 23:14:14.945|
> |2020-02-18T02:02:02.975Z|2020-02-18 06:02:02.975|2020-02-18 00:02:02.975|
> +------------------------+-----------------------+-----------------------+
> scala> val tsOperationsFormattedDF = eventDS.withColumn("add_4_hours", lit(date_format(($"event_time" + expr("INTERVAL 4 HOURS")),"yyyy-MM-dd hh:mm:ss"))).withColumn("sub_2_hours", lit(date_format(($"event_time" - expr("INTERVAL 2 HOURS")),"yyyy-MM-dd hh:mm:ss")))
> tsOperationsFormattedDF: org.apache.spark.sql.DataFrame = [event_time: string, add_4_hours: string ... 1 more field]
> scala> tsOperationsFormattedDF.show(false)
> +------------------------+-------------------+-------------------+
> |event_time              |add_4_hours        |sub_2_hours        |
> +------------------------+-------------------+-------------------+
> |2020-02-18T01:14:14.945Z|2020-02-18 05:14:14|2020-02-17 11:14:14|
> |2020-02-18T02:02:02.975Z|2020-02-18 06:02:02|2020-02-18 12:02:02|
> +------------------------+-------------------+-------------------+
> {code}
> The timestamp operations are identical between the two data frames, yet the results differ in the "sub_2_hours" column: the first data frame computes the correct times, but the second does not once date_format() is applied.
> Any further confirmation and validation of this bug would be much appreciated.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org