Posted to user@spark.apache.org by Lu Liu <li...@gmail.com> on 2019/03/18 09:28:16 UTC
Reuse broadcasted data frame in multiple query
Hi there,
I have a question about reusing a broadcast DataFrame across multiple queries. I am not sure whether this is possible and hope someone can shed some light on it. Pseudocode below:
import org.apache.spark.sql.functions.broadcast

// build the small table once and mark it for broadcast
val df_schedule = spark.sql("select …. from A")
val df_schedule_broadcast = broadcast(df_schedule)
df_schedule_broadcast.createOrReplaceTempView("tbl_schedule")
// query1
val res_df1 = spark.sql("select …. from B, tbl_schedule S where B.col1 = S.col1")
…..
// query2
val res_df2 = spark.sql("select …. from C, tbl_schedule S where C.col2 = S.col2")
In the Spark UI, I can see 2 jobs labeled ThreadPoolExecutor that have the same DAG and take a similar time to complete, so it seems that df_schedule_broadcast is broadcast twice. My question: is there any way to avoid broadcasting it again and simply reuse the same data in both queries?
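To make the situation easier to reproduce, here is a minimal sketch of how the two plans could be inspected. It is only an illustration under assumptions: the tiny tables A, B and C, their columns (col1, col2, payload), the object name and the local master are all hypothetical stand-ins, not my real data. The cache() call is also an assumption on my side: as far as I understand it can keep the source query from being recomputed, but it may not stop each query from building its own broadcast.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

object BroadcastReuseSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("broadcast-reuse-sketch")
      .master("local[*]") // assumption: local run, for illustration only
      .getOrCreate()
    import spark.implicits._

    // Hypothetical stand-ins for the real tables A, B and C.
    Seq((1, 10), (2, 20)).toDF("col1", "col2").createOrReplaceTempView("A")
    Seq((1, "b1"), (2, "b2")).toDF("col1", "payload").createOrReplaceTempView("B")
    Seq((10, "c1"), (20, "c2")).toDF("col2", "payload").createOrReplaceTempView("C")

    // Assumption: cache the small table so its source query is evaluated once,
    // then attach the broadcast hint and register it as a temp view.
    val dfSchedule = spark.sql("select col1, col2 from A").cache()
    broadcast(dfSchedule).createOrReplaceTempView("tbl_schedule")

    val res1 = spark.sql(
      "select B.payload from B, tbl_schedule S where B.col1 = S.col1")
    val res2 = spark.sql(
      "select C.payload from C, tbl_schedule S where C.col2 = S.col2")

    // Print the physical plan of each query, so one can check whether
    // each plan contains its own BroadcastExchange node.
    res1.explain()
    res2.explain()

    spark.stop()
  }
}
```

If both plans print a separate BroadcastExchange, that would match the two similar jobs I see in the Spark UI.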
Many thanks,
LL