Posted to user@spark.apache.org by Chetan Khatri <ch...@gmail.com> on 2020/03/31 16:14:31 UTC

Unable to get to_timestamp with Timezone Information

Hi Spark Users,

I am losing the timezone value with the format below; I tried a couple of
formats but was not able to make it work. Can someone shed some light?

scala> val sampleDF = Seq("2020-04-11T20:40:00-0500").toDF("value")
sampleDF: org.apache.spark.sql.DataFrame = [value: string]

scala> sampleDF.select('value, to_timestamp('value,
"yyyy-MM-dd\'T\'HH:mm:ss")).show(false)
+------------------------+------------------------------------------------+
|value                   |to_timestamp(`value`, 'yyyy-MM-dd\'T\'HH:mm:ss')|
+------------------------+------------------------------------------------+
|2020-04-11T20:40:00-0500|2020-04-11 20:40:00                             |
+------------------------+------------------------------------------------+

Thanks
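[Editor's note: the pattern "yyyy-MM-dd'T'HH:mm:ss" has no offset field, so the parser simply drops the -0500. A minimal java.time sketch (plain Java, not the Spark API; Spark 3's to_timestamp is built on the same DateTimeFormatter machinery) showing a pattern that does consume a basic-form offset like -0500:]

```java
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;

public class ParseWithOffset {
    public static void main(String[] args) {
        // "xx" matches a numeric offset WITHOUT a colon, e.g. -0500
        DateTimeFormatter fmt =
            DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ssxx");

        OffsetDateTime odt =
            OffsetDateTime.parse("2020-04-11T20:40:00-0500", fmt);

        System.out.println(odt.getOffset()); // -05:00
    }
}
```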

Re: Unable to get to_timestamp with Timezone Information

Posted by Chetan Khatri <ch...@gmail.com>.
Thanks Enrico, Magnus


Re: Unable to get to_timestamp with Timezone Information

Posted by Enrico Minack <ma...@Enrico.Minack.dev>.
Once parsed into a Timestamp, the timestamp is stored internally as UTC
and printed in your local timezone (e.g. as defined by
spark.sql.session.timeZone). Spark is good at hiding timezone
information from you.

You can get the timezone information via date_format(column, format):

import org.apache.spark.sql.types.TimestampType
import org.apache.spark.sql.functions._

val sampleDF = Seq("2020-04-11T20:40:00-05:00").toDF("value")
val timestampDF = sampleDF.select($"value".cast(TimestampType))
timestampDF.select(date_format($"value", 
"yyyy-MM-dd'T'HH:mm:ssZZZZ")).show(false)
+---------------------------------------------+
|date_format(value, yyyy-MM-dd'T'HH:mm:ssZZZZ)|
+---------------------------------------------+
|2020-04-12T03:40:00+0200                     |
+---------------------------------------------+

If you want the timezone only, use 
timestampDF.select(date_format($"value", "ZZZZ")).show.
+------------------------+
|date_format(value, ZZZZ)|
+------------------------+
|                   +0200|
+------------------------+

It all depends on how you get the data "downstream". If you go through
parquet or csv files, they will retain the timezone information. If you
go through strings, you should format them as above. If you use
Dataset.map you can access the timestamps as java.sql.Timestamp objects
(but that might not be necessary):

import java.sql.Timestamp
case class Times(value: Timestamp)
timestampDF.as[Times].map(t => t.value.getTimezoneOffset).show
+-----+
|value|
+-----+
| -120|
+-----+
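[Editor's note: the -120 above comes from java.util.Date.getTimezoneOffset, which is deprecated and depends on the JVM's default timezone, not on anything stored in the timestamp. A small plain-Java sketch reproducing the value (assuming a Europe/Berlin default zone, where April is CEST, UTC+2):]

```java
import java.sql.Timestamp;
import java.util.TimeZone;

public class OffsetDemo {
    public static void main(String[] args) {
        // Pin the JVM default zone so the result is reproducible
        TimeZone.setDefault(TimeZone.getTimeZone("Europe/Berlin"));

        // Local wall-clock time in the default zone (CEST, UTC+2)
        Timestamp ts = Timestamp.valueOf("2020-04-12 03:40:00");

        // Deprecated API: returns minutes to ADD to local time to reach
        // UTC, hence -120 for a UTC+2 zone
        System.out.println(ts.getTimezoneOffset()); // -120
    }
}
```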


Enrico




Re: Unable to get to_timestamp with Timezone Information

Posted by Chetan Khatri <ch...@gmail.com>.
Sorry, I misrepresented the question. Thanks for your great help.

What I want is the time zone information kept as-is, i.e. 2020-04-11T20:40:00-05:00,
in the timestamp datatype, so I can write it to the downstream application
unchanged. I can fix the missing colon in the UTC offset myself.



Re: Unable to get to_timestamp with Timezone Information

Posted by Magnus Nilsson <ma...@kth.se>.
And to answer your question (sorry, I read too fast): the string is not
proper ISO 8601. The extended form must be used throughout, i.e.
2020-04-11T20:40:00-05:00; a colon (:) is missing in the UTC offset.
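[Editor's note: java.time makes this distinction concrete. The default ISO parser accepts the extended form with the colon but rejects the basic-form offset; a small plain-Java sketch:]

```java
import java.time.OffsetDateTime;
import java.time.format.DateTimeParseException;

public class Iso8601Forms {
    public static void main(String[] args) {
        // Extended form (colon in the offset) parses with the default
        // ISO_OFFSET_DATE_TIME parser
        OffsetDateTime ok = OffsetDateTime.parse("2020-04-11T20:40:00-05:00");
        System.out.println(ok.getOffset()); // -05:00

        // Basic-form offset (no colon) is rejected by the same parser
        try {
            OffsetDateTime.parse("2020-04-11T20:40:00-0500");
        } catch (DateTimeParseException e) {
            System.out.println("basic-form offset rejected");
        }
    }
}
```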

br,

Magnus


Re: Unable to get to_timestamp with Timezone Information

Posted by Magnus Nilsson <ma...@kth.se>.
Timestamps aren't timezoned. If you parse ISO 8601 strings they will be
converted to UTC automatically.

If you parse timestamps without a timezone they will be converted using the
timezone of the server Spark is running on. You can change the timezone
Spark uses with spark.conf.set("spark.sql.session.timeZone", "UTC").
Timestamps represent a point in time; the clock representation of that
instant depends on Spark's timezone settings, both for parsing
(non-ISO 8601) strings and for showing timestamps.
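[Editor's note: the "one instant, many clock representations" point can be shown with plain java.time (not Spark; America/Chicago stands in for the -05:00 offset in the original post, which is in daylight time on that date):]

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

public class SameInstantTwoClocks {
    public static void main(String[] args) {
        // One point in time, fixed in UTC
        Instant instant = Instant.parse("2020-04-12T01:40:00Z");
        DateTimeFormatter fmt =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ssXXX");

        // Two clock representations of the same instant, depending on
        // the zone used for display
        System.out.println(fmt.format(instant.atZone(ZoneId.of("UTC"))));
        System.out.println(fmt.format(instant.atZone(ZoneId.of("America/Chicago"))));
        // 2020-04-12 01:40:00Z
        // 2020-04-11 20:40:00-05:00
    }
}
```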

br,

Magnus
