Posted to issues@spark.apache.org by "Sebastian Eckweiler (Jira)" <ji...@apache.org> on 2021/05/11 06:56:00 UTC

[jira] [Created] (SPARK-35367) Window group-key and pandas inconsistent

Sebastian Eckweiler created SPARK-35367:
-------------------------------------------

             Summary: Window group-key and pandas inconsistent
                 Key: SPARK-35367
                 URL: https://issues.apache.org/jira/browse/SPARK-35367
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 3.0.1
            Reporter: Sebastian Eckweiler


Not completely sure whether this is a bug or a configuration/usage issue:

We are seeing inconsistent timezone treatment when using a windowed group-by aggregation in combination with a pandas UDF.

A minimal example:

 
{code:python}
import datetime

import pandas as pd
from pyspark.sql.functions import window


def a_udf(group_key, pdf: pd.DataFrame) -> pd.DataFrame:
    w_start = group_key[0]["start"]
    w_end = group_key[0]["end"]

    print(f"Pandas   : {pdf['window_start'].iloc[0]} to {pdf['window_end'].iloc[0]}")
    print(f"Group key: {w_start} to {w_end}")
    print(f"Data     : {pdf['time'].min()} to {pdf['time'].max()}")

    assert (pdf["time"] >= w_start).all()
    assert (pdf["time"] < w_end).all()

    # some result
    return pd.DataFrame.from_records([{"result": 1}])


df = spark.createDataFrame([(datetime.datetime(2020, 1, 1, 12, 30, 0),)], schema=["time"])

w = window("time", "60 minutes")
df.withColumn("window_start", w.start).withColumn("window_end", w.end).groupby(w).applyInPandas(a_udf, schema="result int").show()
{code}
 

Produces:
{code:java}
Pandas   : 2020-01-01 12:00:00 to 2020-01-01 13:00:00
Group key: 2020-01-01 11:00:00 to 2020-01-01 12:00:00
Data     : 2020-01-01 12:30:00 to 2020-01-01 12:30:00

{code}
And the assertions fail. It seems the group key goes through a timezone (and DST?) conversion that leaves it one hour off.
 This is without any specific timezone configuration and with CEST as the local timezone.
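The one-hour offset would be consistent with the group key exposing the window boundary as its internal UTC instant, without the conversion back to the session timezone that the {{window_start}}/{{window_end}} columns receive. A minimal pandas sketch of that hypothesis (Europe/Berlin is UTC+1 on 2020-01-01; the timestamps below are illustrative, not taken from Spark internals):

{code:python}
import pandas as pd

# The window start the data columns report, in the local session timezone.
local_start = pd.Timestamp("2020-01-01 12:00:00", tz="Europe/Berlin")

# The same instant expressed in UTC and then stripped of its timezone info,
# i.e. what an unconverted internal value would look like.
naive_utc = local_start.tz_convert("UTC").tz_localize(None)

print(naive_utc)  # 2020-01-01 11:00:00 -- matches the group key in the output above
{code}

This is only a sketch of why the observed values differ by exactly one hour; it does not pin down where in PySpark the conversion is skipped.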

Is this working as expected?

Setting
{code:java}
"spark.sql.session.timeZone": "UTC"
"spark.driver.extraJavaOptions": "-Duser.timezone=UTC"
"spark.executor.extraJavaOptions": "-Duser.timezone=UTC"
{code}
seems to be a workaround.
 Using "Europe/Berlin" for all three timezone settings, however, reproduces the inconsistent behaviour.
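For reference, a minimal sketch of applying the three settings above when building the session (configuration fragment only; {{extraJavaOptions}} must be set before the JVM starts, so in practice they belong in {{spark-defaults.conf}} or on {{spark-submit}} rather than on an already-running session):

{code:python}
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Timezone used by Spark SQL for timestamp rendering and conversion.
    .config("spark.sql.session.timeZone", "UTC")
    # Force the driver and executor JVMs to UTC as well.
    .config("spark.driver.extraJavaOptions", "-Duser.timezone=UTC")
    .config("spark.executor.extraJavaOptions", "-Duser.timezone=UTC")
    .getOrCreate()
)
{code}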

I would assume though, that running this in a non-UTC timezone should generally be possible?

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org