Posted to user@spark.apache.org by tc...@gmail.com on 2020/06/11 00:09:10 UTC

Broadcast join data reuse

We have a case where data that is small enough to be broadcast is joined
with multiple tables in a single plan. Looking at the physical plan, I do
not see anything that indicates whether the broadcast is done only once,
i.e., whether the BroadcastExchange is reused and the data is not
redistributed from scratch. Could someone with insight into the physical
plan strategy for such a case confirm whether previously broadcast data is
reused, or whether subsequent BroadcastExchange steps are done from scratch?
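
One way to check this is to print the executed plan and look for a
ReusedExchange node, which Spark inserts when it replaces a duplicate
exchange with a reference to an earlier one. The sketch below is only an
illustration under assumptions: the table contents, column names, and the
object name are hypothetical, and it assumes the `spark.sql.exchange.reuse`
setting is left at its default. With adaptive query execution enabled, the
reuse may only show up in the final plan after the query has run, so the
printed result is not guaranteed either way.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

object BroadcastReuseCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("broadcast-reuse-check")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical stand-ins for the real small and large tables.
    val small = Seq((1, "a"), (2, "b")).toDF("id", "label")
    val big1 = spark.range(0, 1000).selectExpr("cast(id as int) as id", "id * 2 as v")
    val big2 = spark.range(0, 1000).selectExpr("cast(id as int) as id", "id * 3 as v")

    // The same small table is broadcast-joined twice within a single plan.
    val joined = big1.join(broadcast(small), Seq("id"))
      .union(big2.join(broadcast(small), Seq("id")))

    // Print the physical plan, then search its text for a reuse marker.
    joined.explain()
    val planText = joined.queryExecution.executedPlan.toString
    println(s"ReusedExchange present: ${planText.contains("ReusedExchange")}")

    spark.stop()
  }
}
```

If the second join shows its own BroadcastExchange instead of a
ReusedExchange, the two exchanges were not considered equivalent by the
planner (for example because different columns were pruned on each side).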

 

Thanks and best regards,

Tyson


Re: Broadcast join data reuse

Posted by gypsysunny <su...@126.com>.
The broadcast table does not seem to be reused across multiple actions,
e.g.:
import org.apache.spark.sql.functions.broadcast
val small_df_bc = broadcast(small_df)  // only a hint; nothing is broadcast yet
big_df1.join(small_df_bc, Seq("id")).write.parquet("/test1")
big_df2.join(small_df_bc, Seq("id")).write.parquet("/test2")

We can tell from the Spark web UI that the small df has been distributed
twice.

So how can we make it happen only once?

Thanks a million.





Re: Broadcast join data reuse

Posted by Ankur Srivastava <an...@gmail.com>.
Hi Tyson,

The broadcast variable should remain in the executors' memory and be reused
unless you unpersist it, destroy it, or it goes out of scope.
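
As a minimal sketch of that lifecycle, the example below uses an explicit
broadcast variable created with `SparkContext.broadcast`, which is the
RDD-level API and is distinct from the `broadcast()` join hint on
DataFrames. The lookup data and the object name are hypothetical. Two
separate actions read the same broadcast value before it is released.

```scala
import org.apache.spark.sql.SparkSession

object BroadcastVariableLifecycle {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("broadcast-variable-lifecycle")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical small lookup table, shipped to executors as a broadcast variable.
    val lookup = sc.broadcast(Map(1 -> "a", 2 -> "b"))
    val ids = sc.parallelize(1 to 4)

    // Two independent actions that both read the same broadcast variable.
    val labels = ids.map(id => lookup.value.getOrElse(id, "?")).collect()
    val hits = ids.filter(id => lookup.value.contains(id)).count()
    println(labels.mkString(","))
    println(hits)

    // unpersist() drops the cached copies on the executors; the value would
    // have to be re-sent if the variable were used again.
    lookup.unpersist()
    // destroy() releases all data and metadata; the variable cannot be used afterwards.
    lookup.destroy()

    spark.stop()
  }
}
```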

Hope this helps.

Thanks
Ankur
