Posted to reviews@spark.apache.org by "MaxGekk (via GitHub)" <gi...@apache.org> on 2023/03/12 10:11:37 UTC

[GitHub] [spark] MaxGekk commented on pull request #39239: [SPARK-41730][PYTHON] Set tz to UTC while converting of timestamps to python's datetime

MaxGekk commented on PR #39239:
URL: https://github.com/apache/spark/pull/39239#issuecomment-1465148669

   @HyukjinKwon @cloud-fan The problem with PySpark's timestamp_ltz is that it comes back as a local timestamp, not the physical timestamp that timestamp_ltz is supposed to be. Compare with even the old Java 7 timestamp (it has its own issues, but it is still a physical point in time, not a local timestamp):
   ```
   ➜  ~ TZ=America/Los_Angeles ./spark-3.3/bin/spark-shell
   Welcome to
         ____              __
        / __/__  ___ _____/ /__
       _\ \/ _ \/ _ `/ __/  '_/
      /___/ .__/\_,_/_/ /_/\_\   version 3.3.2
         /_/
   scala> spark.conf.set("spark.sql.session.timeZone", "Europe/Moscow")
   
   scala> val df = sql("select timestamp'1970-01-01T00:00:00+0000'")
   df: org.apache.spark.sql.DataFrame = [TIMESTAMP '1970-01-01 03:00:00': timestamp]
   scala> df.collect()(0).getTimestamp(0).toGMTString
   res1: String = 1 Jan 1970 00:00:00 GMT
   ```
   The timestamp still knows how to show itself in the UTC time zone because it stores an offset from the epoch, so it can render itself in any time zone.
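   The same property is easy to see in plain Python when the datetime carries a time zone. The sketch below is not from Spark, just a standalone stdlib example (it assumes Python 3.9+ for `zoneinfo`): an epoch offset plus a tzinfo can be rendered in any zone without losing the instant.
   ```python
   from datetime import datetime, timezone
   from zoneinfo import ZoneInfo  # stdlib since Python 3.9

   # A physical point in time: the epoch, kept as a UTC-aware datetime.
   epoch = datetime.fromtimestamp(0, tz=timezone.utc)

   # The same instant rendered in different time zones.
   print(epoch)                                               # 1970-01-01 00:00:00+00:00
   print(epoch.astimezone(ZoneInfo("Europe/Moscow")))         # 1970-01-01 03:00:00+03:00
   print(epoch.astimezone(ZoneInfo("America/Los_Angeles")))   # 1969-12-31 16:00:00-08:00
   ```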
   
   Now let's look at the PySpark timestamp:
   ```
   $ TZ=America/Los_Angeles ./spark-3.3/bin/pyspark
   Welcome to
         ____              __
        / __/__  ___ _____/ /__
       _\ \/ _ \/ _ `/ __/  '_/
      /__ / .__/\_,_/_/ /_/\_\   version 3.3.2
         /_/
   >>> spark.conf.set("spark.sql.session.timeZone", "Europe/Moscow")
   >>>
   >>> df = sql("select timestamp'1970-01-01T00:00:00+0000'")
   >>> df.collect()[0][0].utctimetuple()
   time.struct_time(tm_year=1969, tm_mon=12, tm_mday=31, tm_hour=16, tm_min=0, tm_sec=0, tm_wday=2, tm_yday=365, tm_isdst=0)
   ```
   Since Python's datetime becomes a naive local timestamp during the conversion from Spark's internal microseconds-since-the-epoch value:
   ```python
               return datetime.datetime.fromtimestamp(ts // 1000000).replace(microsecond=ts % 1000000)
   ```
   it cannot render itself in the UTC time zone correctly.
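   For illustration, here is a minimal standalone sketch of the direction the PR title describes (setting tz to UTC in `fromtimestamp`); it is not the exact patch, just plain Python showing why the naive conversion breaks `utctimetuple()` while the UTC-aware one does not (the first printed result assumes the machine's local zone is America/Los_Angeles, as in the session above):
   ```python
   import datetime

   ts = 0  # Spark's internal value: microseconds since the epoch (1970-01-01T00:00:00+0000)

   # Current conversion: a naive datetime in the machine's local time zone,
   # so utctimetuple() misreads the local wall-clock time as if it were UTC.
   naive = datetime.datetime.fromtimestamp(ts // 1000000).replace(microsecond=ts % 1000000)
   print(naive.utctimetuple())   # 1969-12-31 16:00:00 with TZ=America/Los_Angeles

   # With tz set to UTC the physical instant is preserved and utctimetuple() is correct.
   aware = datetime.datetime.fromtimestamp(
       ts // 1000000, tz=datetime.timezone.utc
   ).replace(microsecond=ts % 1000000)
   print(aware.utctimetuple())   # 1970-01-01 00:00:00
   ```
   With the tz=UTC conversion, the returned datetime carries the epoch offset just like the Java timestamp above, so it can be rendered correctly in any time zone.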



