Posted to issues@spark.apache.org by "Tanel Kiis (Jira)" <ji...@apache.org> on 2022/10/05 07:59:00 UTC

[jira] [Commented] (SPARK-40664) Union in query can remove cache from the plan

    [ https://issues.apache.org/jira/browse/SPARK-40664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17612881#comment-17612881 ] 

Tanel Kiis commented on SPARK-40664:
------------------------------------

I do not think that https://github.com/apache/spark/pull/35214 itself has a bug; rather, it exposed an existing bug in cache management.
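
For anyone who wants to poke at this outside the Spark test suite, here is a rough standalone sketch of the same scenario. It assumes that counting InMemoryRelation nodes in queryExecution.withCachedData is a fair stand-in for what assertCached checks; countCachedRelations and the CacheUnionRepro object are hypothetical names used only for illustration, and the local[1] master is just a convenient choice.

{code}
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.execution.columnar.InMemoryRelation
import org.apache.spark.sql.functions.{broadcast, lit}

object CacheUnionRepro {
  // Hypothetical helper (not part of Spark): counts the cached relations that
  // the cache manager substituted into the logical plan of the given Dataset.
  def countCachedRelations(ds: Dataset[_]): Int =
    ds.queryExecution.withCachedData.collect {
      case relation: InMemoryRelation => relation
    }.size

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("SPARK-40664")
      .getOrCreate()
    import spark.implicits._

    // Same DataFrames as in the failing unit test from the issue description.
    val df1 = Seq("1", "2").toDF("a")
    val df2 = Seq("2", "3").toDF("a").withColumn("b", lit("b"))
    val joined = df1.join(broadcast(df2), "a")
      .withColumn("tmp_b", $"b")
      .drop("b")
      .withColumnRenamed("tmp_b", "b")
      .cache()
    val unioned = joined.union(joined)

    // The unit test expects both branches of the union to use the cache.
    println(s"cached relations in plan: ${countCachedRelations(unioned)}")
    spark.stop()
  }
}
{code}

If the cache is applied to both sides of the union, the printed count should match the 2 that the unit test asserts; a lower count would correspond to the missing InMemoryTableScan in the plan below.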

> Union in query can remove cache from the plan
> ---------------------------------------------
>
>                 Key: SPARK-40664
>                 URL: https://issues.apache.org/jira/browse/SPARK-40664
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.3.0
>            Reporter: Tanel Kiis
>            Priority: Major
>
> Failing unit test:
> {code}
>   test("SPARK-40664: Cache with join, union and renames") {
>     val df1 = Seq("1", "2").toDF("a")
>     val df2 = Seq("2", "3").toDF("a")
>       .withColumn("b", lit("b"))
>     val joined = df1.join(broadcast(df2), "a")
>       // Messing around with the columns can cause problems with the cache manager
>       .withColumn("tmp_b", $"b")
>       .drop("b")
>       .withColumnRenamed("tmp_b", "b")
>       .cache()
>     val unioned = joined.union(joined)
>     assertCached(unioned, 2)
>   }
> {code}
> After this PR the test started failing: https://github.com/apache/spark/pull/35214
> Plan before:
> {code}
> Union
> :- InMemoryTableScan [a#4, b#23]
> :     +- InMemoryRelation [a#4, b#23], StorageLevel(disk, memory, deserialized, 1 replicas)
> :           +- *(2) Project [a#4, b AS b#23]
> :              +- *(2) BroadcastHashJoin [a#4], [a#10], Inner, BuildRight, false
> :                 :- *(2) Project [value#1 AS a#4]
> :                 :  +- *(2) Filter isnotnull(value#1)
> :                 :     +- *(2) LocalTableScan [value#1]
> :                 +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, true]),false), [id=#35]
> :                    +- *(1) Project [value#7 AS a#10]
> :                       +- *(1) Filter isnotnull(value#7)
> :                          +- *(1) LocalTableScan [value#7]
> +- InMemoryTableScan [a#4, b#23]
>       +- InMemoryRelation [a#4, b#23], StorageLevel(disk, memory, deserialized, 1 replicas)
>             +- *(2) Project [a#4, b AS b#23]
>                +- *(2) BroadcastHashJoin [a#4], [a#10], Inner, BuildRight, false
>                   :- *(2) Project [value#1 AS a#4]
>                   :  +- *(2) Filter isnotnull(value#1)
>                   :     +- *(2) LocalTableScan [value#1]
>                   +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, true]),false), [id=#35]
>                      +- *(1) Project [value#7 AS a#10]
>                         +- *(1) Filter isnotnull(value#7)
>                            +- *(1) LocalTableScan [value#7]
> {code}
> Plan after:
> {code}
> AdaptiveSparkPlan isFinalPlan=false
> +- Union
>    :- Project [a#4, b AS b#23]
>    :  +- BroadcastHashJoin [a#4], [a#10], Inner, BuildRight, false
>    :     :- Project [value#1 AS a#4]
>    :     :  +- Filter isnotnull(value#1)
>    :     :     +- LocalTableScan [value#1]
>    :     +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, true]),false), [id=#115]
>    :        +- Project [value#7 AS a#10]
>    :           +- Filter isnotnull(value#7)
>    :              +- LocalTableScan [value#7]
>    +- Project [a#4, b AS b#39]
>       +- BroadcastHashJoin [a#4], [a#10], Inner, BuildRight, false
>          :- Project [value#36 AS a#4]
>          :  +- Filter isnotnull(value#36)
>          :     +- LocalTableScan [value#36]
>          +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, true]),false), [id=#118]
>             +- Project [value#37 AS a#10]
>                +- Filter isnotnull(value#37)
>                   +- LocalTableScan [value#37]
> {code}
> (The InMemoryTableScan is missing from both branches of the Union.)


