Posted to user@spark.apache.org by Michael Bílý <mi...@gmail.com> on 2022/11/25 10:49:28 UTC

[PySpark] [applyInPandas] Regression Bug: Cogroup in pandas drops columns from the first dataframe

Hello there,

I ran into this problem in PySpark: when groupby.cogroup is used to cogroup a
dataframe with itself, columns are silently dropped from the first instance.
Minimal example:
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .getOrCreate()
)

df = spark.createDataFrame([["2017-08-17", 1]], schema=["day", "value"]).cache()

def in_pandas(df1, df2):
    assert "value" in df1.columns
    return df1

df = (
    df
    .groupby("day")
    .cogroup(df.groupby("day"))
    .applyInPandas(
        in_pandas,
        schema=df.schema,
    )
)

df.show(20, False)

This fails with an AssertionError: the "value" column is missing from df1.
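
To separate the function from Spark's behavior, the same function can be called directly on two pandas DataFrames shaped like the group that Spark should pass in. This is only a sketch for narrowing the problem down: it uses plain pandas, assumes each side of the cogroup should receive both "day" and "value", and does not exercise Spark at all. The `group` frame is a hypothetical stand-in, not something taken from the failing run.

```python
import pandas as pd


def in_pandas(df1, df2):
    # Same function as in the minimal example above.
    assert "value" in df1.columns
    return df1


# Hypothetical stand-in for the single group ("2017-08-17") that each
# side of the cogroup is expected to receive.
group = pd.DataFrame({"day": ["2017-08-17"], "value": [1]})

# Call the function the way applyInPandas is expected to: one pandas
# DataFrame per side of the cogroup.
result = in_pandas(group.copy(), group.copy())
print(list(result.columns))
```

Called this way the assertion passes, which suggests the column is lost before the function is invoked rather than inside it.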

My versions:
import sys
import pyspark.version
import pandas as pd
import pyarrow

print(sys.version)
# 3.8.10 (default, Jun 22 2022, 20:18:18)
# [GCC 9.4.0]
print(pyspark.version.__version__)
# 3.3.1
print(pd.__version__)
# 1.5.2
print(pyarrow.__version__)
# 10.0.1

It works in an AWS Glue session with these versions:
[image: image.png]
It prints:
+----------+-----+
|day       |value|
+----------+-----+
|2017-08-17|1    |
+----------+-----+

as expected.

Thank you,
Michael