Posted to user@spark.apache.org by Michael Bílý <mi...@gmail.com> on 2022/11/25 10:49:28 UTC
[PySpark] [applyInPandas] Regression Bug: Cogroup in pandas drops columns from the first dataframe
Hello there,
I ran into this problem in PySpark:
when using groupby(...).cogroup(...) to cogroup a DataFrame with itself, it
silently drops columns from the first DataFrame. Minimal example:
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .getOrCreate()
)

df = spark.createDataFrame(
    [["2017-08-17", 1]], schema=["day", "value"]
).cache()

def in_pandas(df1, df2):
    # df1 is the group from the first DataFrame; it should keep all columns
    assert "value" in df1.columns
    return df1

# Cogroup the DataFrame with itself on "day"
df = (
    df
    .groupby("day")
    .cogroup(df.groupby("day"))
    .applyInPandas(
        in_pandas,
        schema=df.schema,
    )
)
df.show(20, False)
This fails with an AssertionError: the "value" column is missing from df1 inside in_pandas.
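For reference, the contract the example relies on can be modelled in plain Python, without Spark or pandas. This is only a sketch of the expected semantics, not Spark's implementation: `cogroup_apply` and `in_rows` are hypothetical names invented for illustration, and rows are represented as dicts rather than pandas DataFrames.

```python
from collections import defaultdict

def cogroup_apply(left_rows, right_rows, key, func):
    """Hypothetical helper: a plain-Python model of cogroup + apply."""
    left_groups = defaultdict(list)
    right_groups = defaultdict(list)
    for row in left_rows:
        left_groups[row[key]].append(row)
    for row in right_rows:
        right_groups[row[key]].append(row)
    result = []
    # Visit every key that appears on either side, in sorted order.
    for k in sorted(set(left_groups) | set(right_groups)):
        # Each side is passed with all of its columns intact.
        result.extend(func(left_groups.get(k, []), right_groups.get(k, [])))
    return result

def in_rows(rows1, rows2):
    # Mirrors in_pandas: the first side must still carry "value".
    assert all("value" in row for row in rows1)
    return rows1

rows = [{"day": "2017-08-17", "value": 1}]
out = cogroup_apply(rows, rows, "day", in_rows)
print(out)  # [{'day': '2017-08-17', 'value': 1}]
```

In this model the assertion holds even when both sides are the same object, which is the behaviour expected from the Spark example as well.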
My versions:
import sys
import pyspark.version
import pandas as pd
import pyarrow
print(sys.version)
# 3.8.10 (default, Jun 22 2022, 20:18:18)
# [GCC 9.4.0]
print(pyspark.version.__version__)
# 3.3.1
print(pd.__version__)
# 1.5.2
print(pyarrow.__version__)
# 10.0.1
It works in an AWS Glue session with these versions:
[image: image.png]
It prints:
+----------+-----+
|day       |value|
+----------+-----+
|2017-08-17|1    |
+----------+-----+
as expected.
Thank you,
Michael