Posted to issues@spark.apache.org by "Enrico Minack (Jira)" <ji...@apache.org> on 2023/01/20 10:54:00 UTC

[jira] [Created] (SPARK-42132) DeduplicateRelations rule breaks plan when co-grouping the same DataFrame

Enrico Minack created SPARK-42132:
-------------------------------------

             Summary: DeduplicateRelations rule breaks plan when co-grouping the same DataFrame
                 Key: SPARK-42132
                 URL: https://issues.apache.org/jira/browse/SPARK-42132
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.2.3, 3.3.1, 3.3.0, 3.1.3, 3.0.3, 3.4.0
            Reporter: Enrico Minack


Co-grouping two grouped Datasets that derive from the same DataFrame, and therefore share attribute references, produces a broken plan once the DeduplicateRelations rule has run:
{code:java}
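import spark.implicits._ // encoders for as[Long, Long]; already in scope in spark-shell

val df = spark.range(3)

// Both sides are built from the same df, so both reference the same id attribute
val left_grouped_df = df.groupBy("id").as[Long, Long]
val right_grouped_df = df.groupBy("id").as[Long, Long]

val cogroup_df = left_grouped_df.cogroup(right_grouped_df) {
  case (key, left, right) => left
}

cogroup_df.explain()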
{code}
{code:java}
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- SerializeFromObject [input[0, bigint, false] AS value#12L]
   +- CoGroup, id#0: bigint, id#0: bigint, id#0: bigint, [id#13L], [id#13L], [id#13L], [id#13L], obj#11: bigint
      :- !Sort [id#13L ASC NULLS FIRST], false, 0
      :  +- !Exchange hashpartitioning(id#13L, 200), ENSURE_REQUIREMENTS, [plan_id=16]
      :     +- Range (0, 3, step=1, splits=16)
      +- Sort [id#13L ASC NULLS FIRST], false, 0
         +- Exchange hashpartitioning(id#13L, 200), ENSURE_REQUIREMENTS, [plan_id=17]
            +- Range (0, 3, step=1, splits=16)
{code}

The DataFrame cannot be computed:
{code:java}
cogroup_df.show()
{code}
{code:java}
java.lang.IllegalStateException: Couldn't find id#13L in [id#0L]
{code}

The rule gives the right side a new attribute, `id#13L`, in place of `id#0L`, but then replaces every matching occurrence of `id#0L` in `CoGroup` rather than only those that belong to the right side. Some occurrences of `id#0L` in `CoGroup` refer to the left side and should not be replaced. Further, `id#0L` in the right deserializer is not replaced at all.
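
To make the mismatch easier to see, here is a minimal sketch in plain Scala with no Spark dependency. It is a toy model only: {{Attr}}, {{Side}}, {{CoGroupModel}}, {{rewriteAsReported}} and {{rewriteRightOnly}} are hypothetical names invented for this illustration and are not Catalyst classes or the actual rule implementation. Each side of the co-group is reduced to one grouping attribute, one deserializer input and the output of its child plan, and a side counts as resolved only if its child produces everything it references.
{code:scala}
object DedupSketch {
  // Hypothetical stand-in for an attribute reference such as id#0L.
  final case class Attr(name: String, exprId: Int) {
    override def toString: String = s"${name}#${exprId}"
  }

  // One side of the co-group: what CoGroup references, and what the child produces.
  final case class Side(group: Attr, deserializer: Attr, childOutput: Attr) {
    // Attributes referenced by CoGroup that the child plan does not produce.
    def missing: Seq[Attr] =
      Seq(group, deserializer).filterNot(_ == childOutput).distinct
  }

  final case class CoGroupModel(left: Side, right: Side) {
    def resolved: Boolean = left.missing.isEmpty && right.missing.isEmpty
  }

  val oldId: Attr = Attr("id", 0)
  val newId: Attr = Attr("id", 13)

  // State after the right child has been given a fresh output attribute,
  // but before any reference inside CoGroup has been rewritten.
  val deduplicated: CoGroupModel = CoGroupModel(
    left = Side(oldId, oldId, childOutput = oldId),
    right = Side(oldId, oldId, childOutput = newId))

  // Models the reported outcome: grouping attributes are rewritten on BOTH
  // sides, while the deserializers are left untouched.
  def rewriteAsReported(p: CoGroupModel): CoGroupModel =
    p.copy(left = p.left.copy(group = newId), right = p.right.copy(group = newId))

  // Models the expected outcome: only the right side is rewritten,
  // including its deserializer.
  def rewriteRightOnly(p: CoGroupModel): CoGroupModel =
    p.copy(right = p.right.copy(group = newId, deserializer = newId))

  def main(args: Array[String]): Unit = {
    val broken = rewriteAsReported(deduplicated)
    val scoped = rewriteRightOnly(deduplicated)
    println(s"reported: resolved=${broken.resolved}, " +
      s"left missing=${broken.left.missing}, right missing=${broken.right.missing}")
    println(s"scoped:   resolved=${scoped.resolved}")
  }
}
{code}
In this model the reported rewrite leaves the left side looking for {{id#13}} in a child that only produces {{id#0}}, which mirrors the {{Couldn't find id#13L in [id#0L]}} error above, and leaves the right deserializer pointing at {{id#0}} although its child now produces {{id#13}}. Rewriting only the right side, deserializer included, resolves both sides.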



--
This message was sent by Atlassian Jira
(v8.20.10#820010)