Posted to user@spark.apache.org by yohann jardin <yo...@hotmail.com> on 2017/06/02 17:10:51 UTC

Spark SQL, formatting timezone in UTC

Hello everyone,


I'm having a hard time with time zones.

I have a Long representing a timestamp in milliseconds, 1496361600000, and I want the formatted output to be 2017-06-02 00:00:00 (UTC).


Based on https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/functions.html

The only function there that helps with formatting a timestamp is from_unixtime, but it renders the result in the system time zone, and neither my own time zone nor that of the server I'm working on is UTC.
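
To illustrate, a minimal sketch of what I see in the spark-shell (assuming Spark 2.x; from_unixtime expects epoch seconds, hence the division by 1000):

    spark.sql("SELECT from_unixtime(CAST(1496361600000 / 1000 AS BIGINT))").show()
    // On a server running in UTC+2, for example, this prints 2017-06-02 02:00:00
    // instead of the 2017-06-02 00:00:00 I want.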

I couldn't find a solution among the dozens of JIRA tickets, Stack Overflow posts and blog articles that Google turned up.


In the end I decided to write a udf:

def formatMs(ms: Long): String = {
    // Format epoch milliseconds in UTC (GMT) instead of the JVM's default time zone
    val formatter = new java.text.SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
    formatter.setTimeZone(java.util.TimeZone.getTimeZone("GMT"))
    formatter.format(new java.util.Date(ms))
}

spark.udf.register("formatMs", formatMs)
spark.sql("SELECT formatMs(1496361600000)").show


But if I really go for that, it will decrease the performance of my application, right?

For example, I need to aggregate some data based on such a column. Since my function is a black box to the optimizer, Spark will apply it first and aggregate on the String output of the UDF, even though aggregating on the initial Long value and only then applying the UDF would give the same result. A rough sketch of the plan I would like is below.
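
(df and the Long column ts_ms are hypothetical names here, just to show the shape of it:)

    import org.apache.spark.sql.functions.{col, count, udf}

    // The same formatter wrapped as a DataFrame UDF
    val formatMsUdf = udf(formatMs _)

    // Group on the raw Long first, then format once per group
    // instead of once per input row.
    val aggregated = df
      .groupBy(col("ts_ms"))
      .agg(count("*").as("cnt"))
      .withColumn("ts_utc", formatMsUdf(col("ts_ms")))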


I know I could also drop the UDF from the SQL query and instead apply withColumn() and withColumnRenamed() on the resulting DataFrame, roughly as sketched below, but that is something I would have to benchmark.
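
    // DataFrame-only variant (again with the hypothetical df / ts_ms names
    // and the formatMsUdf wrapper from the sketch above):
    val formatted = df
      .withColumn("ts_ms_formatted", formatMsUdf(col("ts_ms")))
      .withColumnRenamed("ts_ms_formatted", "ts_ms_utc")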


Did I miss a way to do this within a Spark SQL query using the standard functions, or an approach that is much more performant than what I have come up with?


Regards,

Yohann