You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Ted Chester Jenks (Jira)" <ji...@apache.org> on 2023/02/10 10:27:00 UTC

[jira] [Created] (SPARK-42397) Inconsistent data produced by `FlatMapCoGroupsInPandas`

Ted Chester Jenks created SPARK-42397:
-----------------------------------------

             Summary: Inconsistent data produced by `FlatMapCoGroupsInPandas`
                 Key: SPARK-42397
                 URL: https://issues.apache.org/jira/browse/SPARK-42397
             Project: Spark
          Issue Type: Bug
          Components: Pandas API on Spark, SQL
    Affects Versions: 3.3.1, 3.3.0
            Reporter: Ted Chester Jenks


We are seeing inconsistent data returned when using `FlatMapCoGroupsInPandas`. In the PySpark example:

```

    test_df = spark.createDataFrame(
        [
            ["1", "23", "abc", "blah", "def", "1"],
            ["1", "23", "abc", "blah", "def", "1"],
            ["1", "23", "abc", "blah", "def", "2"],
            ["1", "23", "abc", "blah", "def", "2"],
        ],
        ["cluster", "partition", "event", "abc", "def", "one_or_two"]
    )
    df1 = test_df.filter(
        F.col("one_or_two") == "1"
    ).select(
        "cluster", "event", "abc"
    )

    df2 = test_df.filter(
        F.col("one_or_two") == "2"
    ).select(
        "cluster", "event", "def"
    )
    def get_schema(l, r):
            return pd.DataFrame(
                [(str(l.columns), str(r.columns))],
                columns=["left_colms", "right_colms"]
            )


   grouped_df = df1.groupBy("cluster").cogroup(df2.groupBy("cluster")).applyInPandas(
        get_schema, "left_colms string, right_colms string"
    )
    grouped_df_1 = grouped_df.withColumn(
       "xyz", F.lit("1234")
     )

```

When we call `grouped_df.collect()` we get:

```

[Row(left_colms="Index(['cluster', 'event', 'abc'], dtype='object')", right_colms="Index(['cluster', 'event', 'def'], dtype='object')")] 

```

When we call `grouped_df.show(5, truncate=False)` we get:

```

+-----------------------------------------+--------------------------------------------------+
|left_colms                               |right_colms                                       |
+-----------------------------------------+--------------------------------------------------+
|Index(['cluster', 'abc'], dtype='object')|Index(['cluster', 'event', 'def'], dtype='object')|
+-----------------------------------------+--------------------------------------------------+

```

When we call `grouped_df_1.collect()` we get:

```

[Row(left_colms="Index(['cluster', 'abc'], dtype='object')", right_colms="Index(['cluster', 'event', 'def'], dtype='object')", xyz='1234')] 

```

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org