Posted to issues@spark.apache.org by "Apache Spark (Jira)" <ji...@apache.org> on 2020/06/29 12:10:01 UTC

[jira] [Assigned] (SPARK-32123) [Python] Setting `spark.sql.session.timeZone` only partially respected

     [ https://issues.apache.org/jira/browse/SPARK-32123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-32123:
------------------------------------

    Assignee: Apache Spark

> [Python] Setting `spark.sql.session.timeZone` only partially respected
> ----------------------------------------------------------------------
>
>                 Key: SPARK-32123
>                 URL: https://issues.apache.org/jira/browse/SPARK-32123
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.0.0
>            Reporter: Toby Harradine
>            Assignee: Apache Spark
>            Priority: Major
>
> Reopening SPARK-25244 as it is unresolved as of versions 2.4.6 and 3.0.0.
> The setting {{spark.sql.session.timeZone}} is respected by PySpark when converting from and to Pandas, as described [here|http://spark.apache.org/docs/latest/sql-programming-guide.html#timestamp-with-time-zone-semantics]. However, when timestamps are converted directly to Python's {{datetime}} objects, the setting is ignored and the system's timezone is used.
> This can be checked with the following code snippet:
> {code:python}
> import pyspark.sql
> spark = (pyspark
>          .sql
>          .SparkSession
>          .builder
>          .master('local[1]')
>          .config("spark.sql.session.timeZone", "UTC")
>          .getOrCreate()
>         )
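> # Build a one-row DataFrame from a string, then cast the column to a timestamp.
> df = spark.createDataFrame([("2018-06-01 01:00:00",)], ["ts"])
> df = df.withColumn("ts", df["ts"].cast("timestamp"))
> # toPandas() honours spark.sql.session.timeZone.
> print(df.toPandas().iloc[0,0])
> # collect() converts to a Python datetime using the system timezone instead.
> print(df.collect()[0][0])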
> {code}
> For me this prints the following (the exact result depends on your system's timezone; mine is Europe/Berlin):
> {code:java}
> 2018-06-01 01:00:00
> 2018-06-01 03:00:00
> {code}
> Hence, the method {{toPandas}} respected the timezone setting (UTC), but the method {{collect}} ignored it and converted the timestamp to my system's timezone.
> The cause of this behaviour is that the methods {{toInternal}} and {{fromInternal}} of PySpark's {{TimestampType}} class do not take the setting {{spark.sql.session.timeZone}} into account and use the system timezone instead.
>  
>  
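The difference described above can be reproduced without Spark. The following is a minimal sketch in plain Python, not PySpark's actual source: it assumes the internal representation is microseconds since the epoch, and both function names are hypothetical. The first function converts using the system timezone, which is the behaviour reported for {{collect}}. The second takes an explicit timezone, as a conversion honouring {{spark.sql.session.timeZone}} would have to. A fixed UTC+2 offset stands in for Europe/Berlin in June so that the example does not depend on a timezone database.

```python
from datetime import datetime, timedelta, timezone

# 2018-06-01 01:00:00 UTC as microseconds since the epoch
# (assumed internal representation for this sketch).
INTERNAL_TS = int(datetime(2018, 6, 1, 1, tzinfo=timezone.utc).timestamp()) * 1_000_000


def from_internal_system_tz(ts):
    """Hypothetical helper: naive datetime in the *system* timezone."""
    return datetime.fromtimestamp(ts // 1_000_000).replace(microsecond=ts % 1_000_000)


def from_internal_session_tz(ts, session_tz):
    """Hypothetical helper: naive wall-clock datetime in an explicit timezone."""
    aware = datetime.fromtimestamp(ts // 1_000_000, tz=session_tz)
    return aware.replace(microsecond=ts % 1_000_000, tzinfo=None)


# Result depends on the machine's timezone setting.
print(from_internal_system_tz(INTERNAL_TS))

# Results are the same on every machine.
print(from_internal_session_tz(INTERNAL_TS, timezone.utc))                  # 2018-06-01 01:00:00
print(from_internal_session_tz(INTERNAL_TS, timezone(timedelta(hours=2))))  # 2018-06-01 03:00:00
```

On a machine set to Europe/Berlin the first print matches the UTC+2 result, which corresponds to the 03:00:00 value seen from {{collect}} in the report.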



--
This message was sent by Atlassian Jira
(v8.3.4#803005)