You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@spark.apache.org by "Franck Tago (JIRA)" <ji...@apache.org> on 2016/10/10 22:54:20 UTC

[jira] [Created] (SPARK-17859) persist should not impede with spark's ability to perform a broadcast join.

Franck Tago created SPARK-17859:
-----------------------------------

             Summary: persist should not impede with spark's ability to perform a broadcast join.
                 Key: SPARK-17859
                 URL: https://issues.apache.org/jira/browse/SPARK-17859
             Project: Spark
          Issue Type: Bug
          Components: Optimizer
    Affects Versions: 2.0.0
         Environment: spark 2.0.0 , Linux RedHat
            Reporter: Franck Tago


I am using Spark 2.0.0 

My investigation leads me to conclude that calling persist could prevent broadcast join  from happening .

Example
Case1: No persist call 
var  df1 =spark.range(1000000).select($"id".as("id1"))
df1: org.apache.spark.sql.DataFrame = [id1: bigint]

 var df2 =spark.range(1000).select($"id".as("id2"))
df2: org.apache.spark.sql.DataFrame = [id2: bigint]

 df1.join(df2 , $"id1" === $"id2" ).explain 
== Physical Plan ==
*BroadcastHashJoin [id1#117L], [id2#123L], Inner, BuildRight
:- *Project [id#114L AS id1#117L]
:  +- *Range (0, 1000000, splits=2)
+- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false]))
   +- *Project [id#120L AS id2#123L]
      +- *Range (0, 1000, splits=2)

Case 2:  persist call 

 df1.persist.join(df2 , $"id1" === $"id2" ).explain 
16/10/10 15:50:21 WARN CacheManager: Asked to cache already cached data.
== Physical Plan ==
*SortMergeJoin [id1#3L], [id2#9L], Inner
:- *Sort [id1#3L ASC], false, 0
:  +- Exchange hashpartitioning(id1#3L, 10)
:     +- InMemoryTableScan [id1#3L]
:        :  +- InMemoryRelation [id1#3L], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
:        :     :  +- *Project [id#0L AS id1#3L]
:        :     :     +- *Range (0, 1000000, splits=2)
+- *Sort [id2#9L ASC], false, 0
   +- Exchange hashpartitioning(id2#9L, 10)
      +- InMemoryTableScan [id2#9L]
         :  +- InMemoryRelation [id2#9L], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
         :     :  +- *Project [id#6L AS id2#9L]
         :     :     +- *Range (0, 1000, splits=2)


Why does the persist call prevent the broadcast join . 

My opinion is that it should not .

I was made aware that the persist call is  lazy and that might have something to do with it , but I still contend that it should not . 

Losing broadcast joins is really costly.





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org