You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@spark.apache.org by "Mendelson, Assaf" <As...@rsa.com> on 2017/05/10 16:28:36 UTC

incremental broadcast join

Hi,

It seems as if when doing broadcast join, the entire dataframe is resent even if part of it has already been broadcasted.

Consider the following case:

val df1 = ???
val df2 = ???
val df3 = ???

df3.join(broadcast(df1), on=cond, "left_outer")
followed by
df4.join(broadcast(df1.union(df2), on=cond, "left_outer")

I would expect the second broadcast to only broadcast the difference. However, if I do explain(true) I see the entire union is broadcast.

My use case is that I have a series of dataframes on which I need to do some enrichment, joining them with a small dataframe. The small dataframe gets additional information (as the result of each aggregation).

Is there an efficient way of doing this?

Thanks,
              Assaf.