You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Sebastian Eckweiler (Jira)" <ji...@apache.org> on 2021/05/11 06:56:00 UTC
[jira] [Created] (SPARK-35367) Window group-key and pandas
inconsistent
Sebastian Eckweiler created SPARK-35367:
-------------------------------------------
Summary: Window group-key and pandas inconsistent
Key: SPARK-35367
URL: https://issues.apache.org/jira/browse/SPARK-35367
Project: Spark
Issue Type: Bug
Components: PySpark
Affects Versions: 3.0.1
Reporter: Sebastian Eckweiler
Not completely sure whether this is a bug or a configuration/usage issue:
We are seeing inconsistent timezone-treatment when using a windowed group by aggregation in combiation with a pandas udf.
A minimal example:
{code:java}
// code placeholder
def a_udf(group_key, pdf: pd.DataFrame) -> pd.DataFrame:
w_start = group_key[0]["start"]
w_end = group_key[0]["end"]
print(f"Pandas : {pdf['window_start'].iloc[0]} to {pdf['window_end'].iloc[0]}")
print(f"Group key: {w_start} to {w_end}")
print(f"Data : {pdf['time'].min()} to {pdf['time'].max()}")
assert (pdf["time"] >= w_start).all()
assert (pdf["time"] < w_end).all()
# some result
return pd.DataFrame.from_records([{"result": 1}])
df = spark.createDataFrame([(datetime.datetime(2020, 1, 1, 12, 30, 0),)], schema=["time"])
w = window("time", "60 minutes")
df.withColumn("window_start", w.start).withColumn("window_end", w.end).groupby(w).applyInPandas(a_udf, schema="result int").show()
{code}
Produces:
{code:java}
Pandas : 2020-01-01 12:00:00 to 2020-01-01 13:00:00
Group key: 2020-01-01 11:00:00 to 2020-01-01 12:00:00
Data : 2020-01-01 12:30:00 to 2020-01-01 12:30:00
{code}
And the assertions fail. It seems the group-key goes through same timezone- (and dst?) conversion that ends up being one hour off.
This is without any specific timezone configuration and with CEST as the local timezone.
Is this working as expected?
Setting
{code:java}
"spark.sql.session.timeZone": "UTC"
"spark.driver.extraJavaOptions": "-Duser.timezone=UTC"
"spark.executor.extraJavaOptions": "-Duser.timezone=UTC"
{code}
seems to be workaround.
Using "Europe/Berlin" for all three timezone settings however reproduces the inconsistent behaviour.
I would assume though, that running this in a non-UTC timezone should generally be possible?
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org