Posted to issues@spark.apache.org by "Dean Wampler (JIRA)" <ji...@apache.org> on 2016/09/09 22:20:20 UTC

[jira] [Commented] (SPARK-16239) SQL issues with cast from date to string around daylight savings time

    [ https://issues.apache.org/jira/browse/SPARK-16239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15478438#comment-15478438 ] 

Dean Wampler commented on SPARK-16239:
--------------------------------------

I investigated this a bit today for a customer. I could not reproduce the bug on Mac OS X, Ubuntu, or RedHat releases with kernels 3.10.0-327.el7.x86_64 and 2.6.32-504.8.1.el6.x86_64, using Amazon AMIs. My customer has a private cloud environment with kernel 2.6.32-504.50.1.el6.x86_64 where he does see the bug. I suspect it's something very specific to his cloud VM configuration, such as a buggy library. In all cases we used this JVM:

{code}
$ java -version
java version "1.8.0_101"
Java(TM) SE Runtime Environment (build 1.8.0_101-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.101-b13, mixed mode)
{code}

My point is that we should determine whether this is really a Spark bug or a bug in the underlying platform.
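To compare a failing environment against a working one, it may help to record the JVM and OS settings that most plausibly affect date handling. The following is only a sketch: the helper name platformSummary is made up for illustration, and the choice of properties is an assumption about what is relevant, not a confirmed list of causes.

```scala
import java.util.TimeZone

// Hypothetical helper: collects settings that could plausibly influence
// date parsing and formatting, so two machines can be compared line by line.
def platformSummary(): Map[String, String] = {
  def prop(name: String): String =
    Option(System.getProperty(name)).getOrElse("(unset)")
  Map(
    "java.version"     -> prop("java.version"),
    "os.name"          -> prop("os.name"),
    "os.version"       -> prop("os.version"),
    "user.timezone"    -> prop("user.timezone"),
    // The zone the JVM actually resolved, which may differ from user.timezone.
    "default timezone" -> TimeZone.getDefault.getID
  )
}

platformSummary().toSeq.sortBy(_._1).foreach { case (k, v) => println(s"$k = $v") }
```

Running this in both the Spark shell and a plain Scala interpreter on each machine would show whether the two environments resolve a different default time zone.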

For reference, here's the code example we used in his environment and my test environments (some output suppressed):

{code}
scala> sqlContext.udf.register("to_date", (s: String) =>
  new java.sql.Date(
    new java.text.SimpleDateFormat("yyyy-MM-dd").parse(s).getTime())
)

scala> val dates = (0 to 5).map(i => s"1949-11-${25+i}")

scala> val df = sc.parallelize(dates).toDF("date")
scala> df.show
+----------+
|      date|
+----------+
|1949-11-25|
|1949-11-26|
|1949-11-27|
|1949-11-28|
|1949-11-29|
|1949-11-30|
+----------+

scala> val df2 = df.select(to_date($"date"))
scala> df2.show

+------------+
|todate(date)|
+------------+
|  1949-11-25|
|  1949-11-26|
|  1949-11-27|  //  <--- my customer sees 1949-11-26
|  1949-11-28|
|  1949-11-29|
|  1949-11-30|
+------------+
{code}
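Since the issue is reported around daylight saving time changes, one check that needs no Spark at all is to ask the JVM's own time zone rules whether an offset transition falls on the affected date. This is a sketch using java.time (Java 8); the function name transitionOn is hypothetical, and it only reports what the JVM's bundled zone data says, which may or may not be where the customer's problem lies.

```scala
import java.time.{LocalDate, ZoneId, ZoneOffset}
import java.time.zone.ZoneOffsetTransition

// Hypothetical helper: returns the offset transition (if any) whose local
// date, before or after the change, equals `date` in the given zone.
def transitionOn(date: LocalDate, zone: ZoneId): Option[ZoneOffsetTransition] = {
  val rules = zone.getRules
  // Scan a window wide enough to cover any UTC offset around the date.
  val from  = date.minusDays(1).atStartOfDay(ZoneOffset.UTC).toInstant
  val until = date.plusDays(2).atStartOfDay(ZoneOffset.UTC).toInstant
  Iterator
    .iterate(rules.nextTransition(from))(t => rules.nextTransition(t.getInstant))
    // nextTransition returns null when the zone has no further transitions.
    .takeWhile(t => t != null && t.getInstant.isBefore(until))
    .find { t =>
      t.getDateTimeBefore.toLocalDate == date || t.getDateTimeAfter.toLocalDate == date
    }
}

// Usage: check the date from the example above in the JVM's default zone.
println(transitionOn(LocalDate.parse("1949-11-27"), ZoneId.systemDefault))
```

If this prints a transition in the failing environment but None in the others, the environments differ in their configured zone, which would be worth knowing before blaming either Spark or the platform.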

If I'm right that this isn't really a Spark bug, then the following should be sufficient to demonstrate it in the Spark shell or a Scala interpreter of the same version:

{code}
scala> val f = (s: String) =>
  new java.sql.Date(
    new java.text.SimpleDateFormat("yyyy-MM-dd").parse(s).getTime())

scala> val d = f("1949-11-27")
d: java.sql.Date = 1949-11-27
{code}
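A variation that takes the JVM's default time zone out of the picture is to pin the zone on the formatter and round-trip the string. This is a sketch under assumptions: the name roundTrip is hypothetical, and unlike the snippet above it does not go through java.sql.Date, whose toString uses the default zone, so it isolates SimpleDateFormat only.

```scala
import java.text.SimpleDateFormat
import java.util.TimeZone

// Hypothetical helper: parses and re-formats a date string with an explicit
// time zone. On a healthy platform the result should equal the input.
def roundTrip(zoneId: String, s: String): String = {
  val fmt = new SimpleDateFormat("yyyy-MM-dd")
  // Note: TimeZone.getTimeZone silently falls back to GMT for unknown IDs.
  fmt.setTimeZone(TimeZone.getTimeZone(zoneId))
  fmt.format(fmt.parse(s))
}

// Usage: compare the failing machine's zone against UTC for the same input.
Seq(TimeZone.getDefault.getID, "UTC").foreach { z =>
  println(s"$z: ${roundTrip(z, "1949-11-27")}")
}
```

If the round trip returns a different date only on the customer's machine, that would point at the platform rather than at Spark.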


> SQL issues with cast from date to string around daylight savings time
> ---------------------------------------------------------------------
>
>                 Key: SPARK-16239
>                 URL: https://issues.apache.org/jira/browse/SPARK-16239
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.1
>            Reporter: Glen Maisey
>            Priority: Critical
>
> Hi all,
> I have a dataframe with a date column. When I cast to a string using the spark sql cast function it converts it to the wrong date on certain days. Looking into it, it occurs once a year when summer daylight savings starts.
> I've tried to show this issue in the code below. The toString() function works correctly whereas the cast does not.
> Unfortunately my users are using SQL code rather than scala dataframes and therefore this workaround does not apply. This was actually picked up where a user was writing something like "SELECT date1 UNION ALL select date2" where date1 was a string and date2 was a date type. It must be implicitly converting the date to a string which gives this error.
> I'm in the Australia/Sydney timezone (see the time changes here http://www.timeanddate.com/time/zone/australia/sydney) 
> val dates = Array("2014-10-03","2014-10-04","2014-10-05","2014-10-06","2015-10-02","2015-10-03", "2015-10-04", "2015-10-05")
> val df = sc.parallelize(dates)
>             .toDF("txn_date")
>             .select(col("txn_date").cast("Date"))
> df.select(
>         col("txn_date"),
>         col("txn_date").cast("Timestamp").alias("txn_date_timestamp"),
>         col("txn_date").cast("String").alias("txn_date_str_cast"),
>         col("txn_date".toString()).alias("txn_date_str_toString")
>         )
>     .show()
> +----------+--------------------+-----------------+---------------------+
> |  txn_date|  txn_date_timestamp|txn_date_str_cast|txn_date_str_toString|
> +----------+--------------------+-----------------+---------------------+
> |2014-10-03|2014-10-02 14:00:...|       2014-10-03|           2014-10-03|
> |2014-10-04|2014-10-03 14:00:...|       2014-10-04|           2014-10-04|
> |2014-10-05|2014-10-04 13:00:...|       2014-10-04|           2014-10-05|
> |2014-10-06|2014-10-05 13:00:...|       2014-10-06|           2014-10-06|
> |2015-10-02|2015-10-01 14:00:...|       2015-10-02|           2015-10-02|
> |2015-10-03|2015-10-02 14:00:...|       2015-10-03|           2015-10-03|
> |2015-10-04|2015-10-03 13:00:...|       2015-10-03|           2015-10-04|
> |2015-10-05|2015-10-04 13:00:...|       2015-10-05|           2015-10-05|
> +----------+--------------------+-----------------+---------------------+


