Posted to user@spark.apache.org by karan alang <ka...@gmail.com> on 2023/06/08 22:49:31 UTC

Apache Spark not reading UTC timestamp from MongoDB correctly

ref :
https://stackoverflow.com/questions/76436159/apache-spark-not-reading-utc-timestamp-from-mongodb-correctly

Hello All,
I've data stored in a MongoDB collection, and the timestamp column is not
being read by Apache Spark correctly. I'm running Apache Spark on GCP
Dataproc.

Here is sample data :

-----

In Mongo:

timeslot_date:

+----------+----------------------+
|timeslot  |timeslot_date         |
+----------+----------------------+
|1683527400|{2023-05-08T06:30:00Z}|
+----------+----------------------+


When I use pyspark to read this:

+----------+-------------------+
|timeslot  |timeslot_date      |
+----------+-------------------+
|1683527400|2023-05-07 23:30:00|
+----------+-------------------+

-----
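For context, the collection is read roughly like this (a simplified sketch - the connection URI, database and collection names are placeholders, and the exact option names depend on the MongoDB Spark connector version):

# Simplified sketch - placeholder connection details, not the real ones
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mongo-read").getOrCreate()

df = (spark.read
      .format("mongodb")                                # connector 10.x format name
      .option("connection.uri", "mongodb://<host>:27017")
      .option("database", "<db>")
      .option("collection", "<collection>")
      .load())

df.select("timeslot", "timeslot_date").show(truncate=False)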

My understanding is that the data in Mongo is in UTC, i.e.
2023-05-08T06:30:00Z is a UTC timestamp. I'm in the PST timezone. I'm not
clear why Spark is reading it in a different timezone (neither PST
nor UTC). Note - it is not reading it as PST; if it were, it would
advance the time by 7 hours, but instead it is doing the opposite.

Where is the default timezone taken from when Spark reads
data from MongoDB?

Any ideas on this?

tia!

Re: Apache Spark not reading UTC timestamp from MongoDB correctly

Posted by Enrico Minack <in...@enrico.minack.dev>.
Sean is right: casting timestamps to strings (which is what show() does)
uses the local timezone - either the Java default zone `user.timezone`,
the Spark session zone `spark.sql.session.timeZone`, or the
DataFrameWriter option `timeZone` (when writing to file).
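You can inspect and override these from PySpark, e.g. (a quick sketch - the write path is just a placeholder):

# Where the rendering zone comes from (sketch)
print(spark.conf.get("spark.sql.session.timeZone"))  # session zone; defaults to the JVM zone (user.timezone)

# Override it for the current session:
spark.conf.set("spark.sql.session.timeZone", "UTC")

# Text-based writers (csv/json) accept their own timeZone option:
# df.write.option("timeZone", "UTC").json("/tmp/out")  # placeholder path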

You say you are in PST, which is UTC - 8 hours. But that zone is
currently observing daylight saving time, so it is effectively PDT, which is UTC - 7 hours.

So your UTC timestamp is correctly displayed in local PDT time. Try
changing the above settings to display it in different timezones. Inspecting
the underlying long value, as Sean suggested, is the best way to get
hold of the true timestamp.
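
For example (a sketch, assuming `df` is the DataFrame read from Mongo and `timeslot_date` is a TimestampType column):

from pyspark.sql import functions as F

spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
df.select("timeslot_date").show(truncate=False)   # 2023-05-07 23:30:00

spark.conf.set("spark.sql.session.timeZone", "UTC")
df.select("timeslot_date").show(truncate=False)   # 2023-05-08 06:30:00

# The stored instant never changes - only its string rendering does:
df.select(F.col("timeslot_date").cast("long")).show()   # 1683527400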

Cheers,
Enrico


On 09.06.23 at 00:53, Sean Owen wrote:
> You sure it is not just that it's displaying in your local TZ? Check 
> the actual value as a long for example. That is likely the same time.

Re: Apache Spark not reading UTC timestamp from MongoDB correctly

Posted by Sean Owen <sr...@gmail.com>.
You sure it is not just that it's displaying in your local TZ? Check the
actual value as a long for example. That is likely the same time.
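
Something like this, for example (a sketch - `df` is whatever you loaded from Mongo, column name taken from your post):

from pyspark.sql import functions as F

# If this prints 1683527400, the stored instant really is 2023-05-08T06:30:00Z;
# only the string that show() renders depends on the session timezone.
df.select(F.col("timeslot_date").cast("long").alias("epoch_seconds")).show()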
