Posted to issues@spark.apache.org by "Apache Spark (Jira)" <ji...@apache.org> on 2023/01/20 11:22:00 UTC

[jira] [Assigned] (SPARK-42132) DeduplicateRelations rule breaks plan when co-grouping the same DataFrame

     [ https://issues.apache.org/jira/browse/SPARK-42132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-42132:
------------------------------------

    Assignee:     (was: Apache Spark)

> DeduplicateRelations rule breaks plan when co-grouping the same DataFrame
> -------------------------------------------------------------------------
>
>                 Key: SPARK-42132
>                 URL: https://issues.apache.org/jira/browse/SPARK-42132
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.3, 3.1.3, 3.3.0, 3.3.1, 3.2.3, 3.4.0
>            Reporter: Enrico Minack
>            Priority: Major
>              Labels: correctness
>
> Co-grouping two grouped Datasets that are derived from the same DataFrame, and therefore share attribute references, yields a broken plan once the DeduplicateRelations rule has run:
> {code:java}
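> // Provides the Long encoders (already in scope in spark-shell)
> import spark.implicits._
>
> val df = spark.range(3)
> // Both sides are built from the same df, so both reference the same id attribute
> val left_grouped_df = df.groupBy("id").as[Long, Long]
> val right_grouped_df = df.groupBy("id").as[Long, Long]
> val cogroup_df = left_grouped_df.cogroup(right_grouped_df) {
>   case (key, left, right) => left
> }
> cogroup_df.explain()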
> {code}
> {code:java}
> == Physical Plan ==
> AdaptiveSparkPlan isFinalPlan=false
> +- SerializeFromObject [input[0, bigint, false] AS value#12L]
>    +- CoGroup, id#0: bigint, id#0: bigint, id#0: bigint, [id#13L], [id#13L], [id#13L], [id#13L], obj#11: bigint
>       :- !Sort [id#13L ASC NULLS FIRST], false, 0
>       :  +- !Exchange hashpartitioning(id#13L, 200), ENSURE_REQUIREMENTS, [plan_id=16]
>       :     +- Range (0, 3, step=1, splits=16)
>       +- Sort [id#13L ASC NULLS FIRST], false, 0
>          +- Exchange hashpartitioning(id#13L, 200), ENSURE_REQUIREMENTS, [plan_id=17]
>             +- Range (0, 3, step=1, splits=16)
> {code}
> The DataFrame cannot be computed:
> {code:java}
> cogroup_df.show()
> {code}
> {code:java}
> java.lang.IllegalStateException: Couldn't find id#13L in [id#0L]
> {code}
> The rule replaces `id#0L` on the right side with `id#13L`, but in doing so it replaces all occurrences in `CoGroup`. Some occurrences of `id#0L` in `CoGroup` refer to the left side and should not be replaced. Further, the `id#0L` of the right deserializer is not replaced.
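
The failure mode described above can be illustrated with a small, Spark-free toy model. This is only a sketch: `Attr`, `CoGroupNode`, `rewriteAll` and `rewriteRightOnly` are hypothetical names invented for illustration and are not Catalyst classes or the actual fix. It models only the grouping attributes, not the deserializers. The point is that a node with two children must rewrite references per side, because the same attribute can legitimately appear on both sides.

```scala
// Toy model only: hypothetical stand-ins, not Spark's Catalyst classes.
final case class Attr(name: String, exprId: Long) {
  override def toString: String = s"$name#$exprId"
}

// A two-child node holding separate references into each child's output,
// loosely mirroring the left/right grouping attributes of CoGroup.
final case class CoGroupNode(
    leftGroup: Seq[Attr],
    rightGroup: Seq[Attr],
    leftOutput: Seq[Attr],
    rightOutput: Seq[Attr]) {

  // References that cannot be found in the output of the child they point at.
  def unresolved: Seq[Attr] =
    leftGroup.filterNot(a => leftOutput.contains(a)) ++
      rightGroup.filterNot(a => rightOutput.contains(a))
}

object DedupSketch {
  // Blanket rewrite: the right child's renaming is applied to every
  // reference in the parent, including those that point at the left child.
  def rewriteAll(node: CoGroupNode, mapping: Map[Attr, Attr]): CoGroupNode =
    node.copy(
      leftGroup = node.leftGroup.map(a => mapping.getOrElse(a, a)),
      rightGroup = node.rightGroup.map(a => mapping.getOrElse(a, a)),
      rightOutput = node.rightOutput.map(a => mapping.getOrElse(a, a)))

  // Side-aware rewrite: only references into the right child are renamed.
  def rewriteRightOnly(node: CoGroupNode, mapping: Map[Attr, Attr]): CoGroupNode =
    node.copy(
      rightGroup = node.rightGroup.map(a => mapping.getOrElse(a, a)),
      rightOutput = node.rightOutput.map(a => mapping.getOrElse(a, a)))

  val id0: Attr = Attr("id", 0L)
  val id13: Attr = Attr("id", 13L)

  // Both sides start out referencing the same attribute, as in the report.
  val original: CoGroupNode = CoGroupNode(Seq(id0), Seq(id0), Seq(id0), Seq(id0))
  val mapping: Map[Attr, Attr] = Map(id0 -> id13)

  def main(args: Array[String]): Unit = {
    println(rewriteAll(original, mapping).unresolved)       // List(id#13)
    println(rewriteRightOnly(original, mapping).unresolved) // List()
  }
}
```

With the blanket rewrite, the left grouping reference becomes `id#13` while the left child still outputs `id#0`, which is the same shape of mismatch as the `Couldn't find id#13L in [id#0L]` error in the report. The side-aware variant leaves the left references untouched.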


