Posted to issues@spark.apache.org by "Hyukjin Kwon (Jira)" <ji...@apache.org> on 2020/12/23 07:38:00 UTC
[jira] [Commented] (SPARK-33863) Pyspark UDF changes timestamps to UTC
[ https://issues.apache.org/jira/browse/SPARK-33863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17253937#comment-17253937 ]
Hyukjin Kwon commented on SPARK-33863:
--------------------------------------
[~nasirali] Can you show expected results and actual results of {{df.show()}}?
> Pyspark UDF changes timestamps to UTC
> -------------------------------------
>
> Key: SPARK-33863
> URL: https://issues.apache.org/jira/browse/SPARK-33863
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 3.0.1
> Environment: MAC/Linux
> Standalone cluster / local machine
> Reporter: Nasir Ali
> Priority: Major
>
> *Problem*:
> If I create a new column using a UDF, the PySpark UDF converts the timestamps to UTC. I have used the following configs to tell Spark that the timestamps are in UTC:
>
> {code:java}
> --conf spark.driver.extraJavaOptions=-Duser.timezone=UTC
> --conf spark.executor.extraJavaOptions=-Duser.timezone=UTC
> --conf spark.sql.session.timeZone=UTC
> {code}
> Below is a code snippet to reproduce the error:
>
> {code:java}
> from pyspark.sql import SparkSession
> from pyspark.sql import functions as F
> from pyspark.sql.types import StringType
> import datetime
> spark = SparkSession.builder.config("spark.sql.session.timeZone", "UTC").getOrCreate()
> df = spark.createDataFrame([("usr1", 17.00, "2018-03-10T15:27:18+00:00"),
>                             ("usr1", 13.00, "2018-03-11T12:27:18+00:00"),
>                             ("usr1", 25.00, "2018-03-12T11:27:18+00:00"),
>                             ("usr1", 20.00, "2018-03-13T15:27:18+00:00"),
>                             ("usr1", 17.00, "2018-03-14T12:27:18+00:00"),
>                             ("usr2", 99.00, "2018-03-15T11:27:18+00:00"),
>                             ("usr2", 156.00, "2018-03-22T11:27:18+00:00"),
>                             ("usr2", 17.00, "2018-03-31T11:27:18+00:00"),
>                             ("usr2", 25.00, "2018-03-15T11:27:18+00:00"),
>                             ("usr2", 25.00, "2018-03-16T11:27:18+00:00")
>                             ],
>                            ["user", "id", "ts"])
> df = df.withColumn('ts', df.ts.cast('timestamp'))
> df.show(truncate=False)
> def some_time_udf(i):
>     # Label the part of the day and append the timestamp exactly as the UDF received it.
>     t = i.time()
>     if datetime.time(5, 0) <= t < datetime.time(12, 0):
>         tmp = "Morning -> " + str(i)
>     elif datetime.time(12, 0) <= t < datetime.time(17, 0):
>         tmp = "Afternoon -> " + str(i)
>     elif datetime.time(17, 0) <= t < datetime.time(21, 0):
>         tmp = "Evening -> " + str(i)
>     else:
>         # 21:00-04:59: the range wraps past midnight, so it is handled as the fallback.
>         tmp = "Night -> " + str(i)
>     return tmp
>
> sometimeudf = F.udf(some_time_udf, StringType())
> df.withColumn("day_part", sometimeudf("ts")).show(truncate=False)
> {code}
> I have concatenated the timestamps with the string to show that PySpark passes the timestamps as UTC.
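
One way to check whether a shifted label comes from timezone conversion rather than from the classification logic is to reproduce the comparison outside Spark. The following is a minimal sketch using only the Python standard library. It assumes that the UDF receives a naive datetime (no tzinfo), so the wall-clock time it sees depends on which timezone was used when the value was converted. The helper name `day_part` and the fixed UTC-6 offset are hypothetical and do not come from the report.

```python
import datetime


def day_part(ts: datetime.datetime) -> str:
    """Classify a naive datetime by its wall-clock time of day."""
    t = ts.time()
    if datetime.time(5, 0) <= t < datetime.time(12, 0):
        return "Morning"
    if datetime.time(12, 0) <= t < datetime.time(17, 0):
        return "Afternoon"
    if datetime.time(17, 0) <= t < datetime.time(21, 0):
        return "Evening"
    return "Night"  # 21:00-04:59, wrapping past midnight


# First timestamp from the report, as an unambiguous instant in UTC.
instant = datetime.datetime(2018, 3, 10, 15, 27, 18, tzinfo=datetime.timezone.utc)

# Hypothetical local zone for illustration only: a fixed UTC-6 offset.
local_zone = datetime.timezone(datetime.timedelta(hours=-6))

# Two naive renderings of the same instant, as a UDF might receive them.
utc_view = instant.replace(tzinfo=None)
local_view = instant.astimezone(local_zone).replace(tzinfo=None)

print(day_part(utc_view), "->", utc_view)      # Afternoon -> 2018-03-10 15:27:18
print(day_part(local_view), "->", local_view)  # Morning -> 2018-03-10 09:27:18
```

If the string produced by the UDF matches one rendering and the output of {{df.show()}} matches the other, the two code paths are likely applying different timezones to the same stored instant.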