You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Ted Chester Jenks (Jira)" <ji...@apache.org> on 2023/02/10 10:27:00 UTC
[jira] [Created] (SPARK-42397) Inconsistent data produced by `FlatMapCoGroupsInPandas`
Ted Chester Jenks created SPARK-42397:
-----------------------------------------
Summary: Inconsistent data produced by `FlatMapCoGroupsInPandas`
Key: SPARK-42397
URL: https://issues.apache.org/jira/browse/SPARK-42397
Project: Spark
Issue Type: Bug
Components: Pandas API on Spark, SQL
Affects Versions: 3.3.1, 3.3.0
Reporter: Ted Chester Jenks
We are seeing inconsistent data returned when using `FlatMapCoGroupsInPandas`. In the PySpark example:
```
test_df = spark.createDataFrame(
[
["1", "23", "abc", "blah", "def", "1"],
["1", "23", "abc", "blah", "def", "1"],
["1", "23", "abc", "blah", "def", "2"],
["1", "23", "abc", "blah", "def", "2"],
],
["cluster", "partition", "event", "abc", "def", "one_or_two"]
)
df1 = test_df.filter(
F.col("one_or_two") == "1"
).select(
"cluster", "event", "abc"
)
df2 = test_df.filter(
F.col("one_or_two") == "2"
).select(
"cluster", "event", "def"
)
def get_schema(l, r):
return pd.DataFrame(
[(str(l.columns), str(r.columns))],
columns=["left_colms", "right_colms"]
)
grouped_df = df1.groupBy("cluster").cogroup(df2.groupBy("cluster")).applyInPandas(
get_schema, "left_colms string, right_colms string"
)
grouped_df_1 = grouped_df.withColumn(
"xyz", F.lit("1234")
)
```
When we call `grouped_df.collect()` we get:
```
[Row(left_colms="Index(['cluster', 'event', 'abc'], dtype='object')", right_colms="Index(['cluster', 'event', 'def'], dtype='object')")]
```
When we call `grouped_df.show(5, truncate=False)` we get:
```
+-----------------------------------------+--------------------------------------------------+
|left_colms |right_colms |
+-----------------------------------------+--------------------------------------------------+
|Index(['cluster', 'abc'], dtype='object')|Index(['cluster', 'event', 'def'], dtype='object')|
+-----------------------------------------+--------------------------------------------------+
```
When we call `grouped_df_1.collect()` we get:
```
[Row(left_colms="Index(['cluster', 'abc'], dtype='object')", right_colms="Index(['cluster', 'event', 'def'], dtype='object')", xyz='1234')]
```
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org