You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Tanel Kiis (Jira)" <ji...@apache.org> on 2022/10/05 07:59:00 UTC
[jira] [Commented] (SPARK-40664) Union in query can remove cache from the plan
[ https://issues.apache.org/jira/browse/SPARK-40664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17612881#comment-17612881 ]
Tanel Kiis commented on SPARK-40664:
------------------------------------
I do not think that https://github.com/apache/spark/pull/35214 has a bug, but it instead revealed an existing bug in cache management.
> Union in query can remove cache from the plan
> ---------------------------------------------
>
> Key: SPARK-40664
> URL: https://issues.apache.org/jira/browse/SPARK-40664
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.3.0
> Reporter: Tanel Kiis
> Priority: Major
>
> Failing unitest:
> {code}
> test("SPARK-40664: Cache with join, union and renames") {
> val df1 = Seq("1", "2").toDF("a")
> val df2 = Seq("2", "3").toDF("a")
> .withColumn("b", lit("b"))
> val joined = df1.join(broadcast(df2), "a")
> // Messing around the column can cause some problems with cache manager
> .withColumn("tmp_b", $"b")
> .drop("b")
> .withColumnRenamed("tmp_b", "b")
> .cache()
> val unioned = joined.union(joined)
> assertCached(unioned, 2)
> }
> {code}
> After this PR the test started failing: https://github.com/apache/spark/pull/35214
> Plan before:
> {code}
> Union
> :- InMemoryTableScan [a#4, b#23]
> : +- InMemoryRelation [a#4, b#23], StorageLevel(disk, memory, deserialized, 1 replicas)
> : +- *(2) Project [a#4, b AS b#23]
> : +- *(2) BroadcastHashJoin [a#4], [a#10], Inner, BuildRight, false
> : :- *(2) Project [value#1 AS a#4]
> : : +- *(2) Filter isnotnull(value#1)
> : : +- *(2) LocalTableScan [value#1]
> : +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, true]),false), [id=#35]
> : +- *(1) Project [value#7 AS a#10]
> : +- *(1) Filter isnotnull(value#7)
> : +- *(1) LocalTableScan [value#7]
> +- InMemoryTableScan [a#4, b#23]
> +- InMemoryRelation [a#4, b#23], StorageLevel(disk, memory, deserialized, 1 replicas)
> +- *(2) Project [a#4, b AS b#23]
> +- *(2) BroadcastHashJoin [a#4], [a#10], Inner, BuildRight, false
> :- *(2) Project [value#1 AS a#4]
> : +- *(2) Filter isnotnull(value#1)
> : +- *(2) LocalTableScan [value#1]
> +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, true]),false), [id=#35]
> +- *(1) Project [value#7 AS a#10]
> +- *(1) Filter isnotnull(value#7)
> +- *(1) LocalTableScan [value#7]
> {code}
> Plan after:
> {code}
> AdaptiveSparkPlan isFinalPlan=false
> +- Union
> :- Project [a#4, b AS b#23]
> : +- BroadcastHashJoin [a#4], [a#10], Inner, BuildRight, false
> : :- Project [value#1 AS a#4]
> : : +- Filter isnotnull(value#1)
> : : +- LocalTableScan [value#1]
> : +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, true]),false), [id=#115]
> : +- Project [value#7 AS a#10]
> : +- Filter isnotnull(value#7)
> : +- LocalTableScan [value#7]
> +- Project [a#4, b AS b#39]
> +- BroadcastHashJoin [a#4], [a#10], Inner, BuildRight, false
> :- Project [value#36 AS a#4]
> : +- Filter isnotnull(value#36)
> : +- LocalTableScan [value#36]
> +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, true]),false), [id=#118]
> +- Project [value#37 AS a#10]
> +- Filter isnotnull(value#37)
> +- LocalTableScan [value#37]
> {code}
> (The InMemoryTableScan is missing)
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org