Posted to issues@spark.apache.org by "Toby Harradine (Jira)" <ji...@apache.org> on 2020/06/28 22:41:00 UTC

[jira] [Updated] (SPARK-32123) [Python] Setting `spark.sql.session.timeZone` only partially respected

     [ https://issues.apache.org/jira/browse/SPARK-32123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Toby Harradine updated SPARK-32123:
-----------------------------------
    Affects Version/s:     (was: 2.3.1)
                       3.0.0

> [Python] Setting `spark.sql.session.timeZone` only partially respected
> ----------------------------------------------------------------------
>
>                 Key: SPARK-32123
>                 URL: https://issues.apache.org/jira/browse/SPARK-32123
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.0.0
>            Reporter: Toby Harradine
>            Priority: Major
>              Labels: bulk-closed
>
> The setting `spark.sql.session.timeZone` is respected by PySpark when converting from and to Pandas, as described [here|http://spark.apache.org/docs/latest/sql-programming-guide.html#timestamp-with-time-zone-semantics]. However, when timestamps are converted directly to Python's `datetime` objects, the setting is ignored and the system's timezone is used.
> This can be checked with the following code snippet:
> {code:python}
> import pyspark.sql
> spark = (pyspark
>          .sql
>          .SparkSession
>          .builder
>          .master('local[1]')
>          .config("spark.sql.session.timeZone", "UTC")
>          .getOrCreate()
>         )
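> # Build a one-row DataFrame and cast the string column to a timestamp
> df = spark.createDataFrame([("2018-06-01 01:00:00",)], ["ts"])
> df = df.withColumn("ts", df["ts"].astype("timestamp"))
> # toPandas() goes through the Pandas conversion path
> print(df.toPandas().iloc[0, 0])
> # collect() returns the value as a plain Python datetime
> print(df.collect()[0][0])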
> {code}
> For me this prints the following (the exact result depends on your system's timezone; mine is Europe/Berlin):
> {code:java}
> 2018-06-01 01:00:00
> 2018-06-01 03:00:00
> {code}
> Hence, the method `toPandas` respected the timezone setting (UTC), but the method `collect` ignored it and converted the timestamp to my system's timezone.
> The cause of this behaviour is that the methods `toInternal` and `fromInternal` of PySpark's `TimestampType` class don't take the setting `spark.sql.session.timeZone` into account and use the system timezone instead.
> If the maintainers agree that this should be fixed, I would try to come up with a patch. 
>  
>  
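
The difference described above can be reproduced without Spark. The following is a minimal standalone sketch, not PySpark's actual code and not a proposed patch. It assumes timestamps are held internally as microseconds since the Unix epoch, and the function names `from_internal_system_tz` and `from_internal_session_tz` are hypothetical, chosen only for illustration. A fixed +02:00 offset stands in for Europe/Berlin in June.

```python
import datetime

MICROS_PER_SECOND = 1_000_000


def from_internal_system_tz(micros):
    """Mimic the reported behaviour: build a naive datetime in the
    system's local timezone, ignoring any session setting."""
    seconds, remainder = divmod(micros, MICROS_PER_SECOND)
    return datetime.datetime.fromtimestamp(seconds).replace(microsecond=remainder)


def from_internal_session_tz(micros, session_tz):
    """Hypothetical alternative: build a naive wall-clock datetime
    in an explicitly supplied session timezone."""
    seconds, remainder = divmod(micros, MICROS_PER_SECOND)
    aware = datetime.datetime.fromtimestamp(seconds, tz=session_tz)
    return aware.replace(microsecond=remainder, tzinfo=None)


utc = datetime.timezone.utc
# Assumption: a fixed offset is enough to stand in for Europe/Berlin in summer.
berlin_summer = datetime.timezone(datetime.timedelta(hours=2))

# 2018-06-01 01:00:00 UTC expressed as microseconds since the epoch.
instant = datetime.datetime(2018, 6, 1, 1, 0, 0, tzinfo=utc)
micros = int(instant.timestamp()) * MICROS_PER_SECOND

print(from_internal_session_tz(micros, utc))            # 2018-06-01 01:00:00
print(from_internal_session_tz(micros, berlin_summer))  # 2018-06-01 03:00:00
print(from_internal_system_tz(micros))                  # varies with the machine's timezone
```

Only the last call depends on the machine it runs on, which matches the symptom in the report: the same stored instant yields a different `datetime` depending on the system timezone rather than on `spark.sql.session.timeZone`.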


