Posted to issues@spark.apache.org by "Toby Harradine (Jira)" <ji...@apache.org> on 2020/06/23 02:10:00 UTC

[jira] [Comment Edited] (SPARK-25244) [Python] Setting `spark.sql.session.timeZone` only partially respected

    [ https://issues.apache.org/jira/browse/SPARK-25244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17142533#comment-17142533 ] 

Toby Harradine edited comment on SPARK-25244 at 6/23/20, 2:09 AM:
------------------------------------------------------------------

Hi,

I've just come across this issue in PySpark 2.4.6 (Spark 2.4.4). It's quite a difficult bug to work around when validating datetimes in unit tests that run on machines with different timezones (and I'd prefer not to require Pandas just to run the tests).
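
In case it helps anyone else, the workaround I've settled on for tests is to pin the whole process to UTC before the session is created. This is only a sketch: it assumes a POSIX system (for `time.tzset`) and a local-mode session where the driver JVM is spawned by, and inherits TZ from, the Python process:

{code:java}
import os
import time

# Must run before the SparkSession (and the driver JVM it spawns)
# is created: the child JVM derives its default timezone from TZ.
os.environ["TZ"] = "UTC"
time.tzset()  # POSIX-only; makes the Python process itself use UTC

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[1]")
         .config("spark.sql.session.timeZone", "UTC")
         .getOrCreate())
{code}

With both the Python process and the JVM on UTC, `collect()` and `toPandas()` agree regardless of the host machine's timezone.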

Was this issue closed without resolution?

_Edit: Just tested on PySpark 3.0.0 with the same outcome._

Regards,
 Toby


was (Author: toby.harradine):
Hi,

I've just come across this issue in PySpark 2.4.6 (Spark 2.4.4). It's quite a difficult bug to work around when validating datetimes in unit tests that run on machines with different timezones (and I'd prefer not to require Pandas just to run the tests).

Was this issue closed without resolution?

Regards,
Toby

> [Python] Setting `spark.sql.session.timeZone` only partially respected
> ----------------------------------------------------------------------
>
>                 Key: SPARK-25244
>                 URL: https://issues.apache.org/jira/browse/SPARK-25244
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.3.1
>            Reporter: Anton Daitche
>            Priority: Major
>              Labels: bulk-closed
>
> The setting `spark.sql.session.timeZone` is respected by PySpark when converting to and from Pandas, as described [here|http://spark.apache.org/docs/latest/sql-programming-guide.html#timestamp-with-time-zone-semantics]. However, when timestamps are converted directly to Python `datetime` objects, it is ignored and the system's timezone is used instead.
> This can be checked with the following code snippet:
> {code:java}
> from pyspark.sql import SparkSession
>
> spark = (SparkSession.builder
>          .master("local[1]")
>          .config("spark.sql.session.timeZone", "UTC")
>          .getOrCreate())
>
> df = spark.createDataFrame([("2018-06-01 01:00:00",)], ["ts"])
> df = df.withColumn("ts", df["ts"].cast("timestamp"))
>
> print(df.toPandas().iloc[0, 0])  # converted via Pandas: respects the setting
> print(df.collect()[0][0])        # converted via fromInternal: uses system tz
> {code}
> For me this prints the following (the exact result depends on your system's timezone; mine is Europe/Berlin):
> {code:java}
> 2018-06-01 01:00:00
> 2018-06-01 03:00:00
> {code}
> Hence, the method `toPandas` respected the timezone setting (UTC), but the method `collect` ignored it and converted the timestamp to my system's timezone.
> The cause of this behaviour is that the methods `toInternal` and `fromInternal` of PySpark's `TimestampType` class don't take the setting `spark.sql.session.timeZone` into account and instead use the system timezone.
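> For illustration, the conversion in `fromInternal` is roughly equivalent to the following (a simplified sketch of the logic, not a copy of the Spark source):
> {code:java}
> import datetime
>
> def from_internal(ts_micros):
>     # Spark stores timestamps internally as microseconds since the epoch.
>     # fromtimestamp() resolves them against the *system* timezone; the
>     # session setting spark.sql.session.timeZone is never consulted.
>     return (datetime.datetime
>             .fromtimestamp(ts_micros // 1000000)
>             .replace(microsecond=ts_micros % 1000000))
> {code}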
> If the maintainers agree that this should be fixed, I would try to come up with a patch. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org