Posted to user@spark.apache.org by Lu Liu <li...@gmail.com> on 2019/03/18 09:28:16 UTC

Reuse a broadcast DataFrame in multiple queries

Hi there,
    I have a question about reusing a broadcast DataFrame in multiple queries. I am not sure whether it is possible, and I hope someone can shed some light on it. Pseudocode below:

    import org.apache.spark.sql.functions.broadcast

    // Small lookup table, marked with a broadcast hint and exposed as a temp view
    val df_schedule = spark.sql("select …. from A")
    val df_schedule_broadcast = broadcast(df_schedule)
    df_schedule_broadcast.createOrReplaceTempView("tbl_schedule")
    // query1
    val res_df1 = spark.sql("select …. from B, tbl_schedule S where B.col1 = S.col1")

    …..
    // query2
    val res_df2 = spark.sql("select …. from C, tbl_schedule S where C.col2 = S.col2")
    
   In the Spark UI, I can see two ThreadPoolExecutor jobs that have the same DAG and take a similar time to complete, so it seems that df_schedule_broadcast is broadcast twice. Is there any way to avoid broadcasting it again and simply reuse the same data in both queries?
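
   To make the setup concrete, here is a minimal self-contained sketch of the same pattern. The tiny tables, the columns `label`, `b_val` and `c_val`, the object name `BroadcastReuseRepro`, and the local master are all hypothetical stand-ins for my real tables A, B and C, not the actual job. Calling explain() on each result prints the physical plan, which should make it possible to see whether each query carries its own broadcast step for tbl_schedule.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.broadcast

object BroadcastReuseRepro {
  // Builds the two queries that both join against the broadcast-hinted view.
  def buildQueries(spark: SparkSession): (DataFrame, DataFrame) = {
    import spark.implicits._

    // Hypothetical stand-ins for the real tables A, B and C.
    Seq((1, 10, "s1"), (2, 20, "s2")).toDF("col1", "col2", "label")
      .createOrReplaceTempView("A")
    Seq((1, "b1"), (2, "b2")).toDF("col1", "b_val")
      .createOrReplaceTempView("B")
    Seq((10, "c1"), (20, "c2")).toDF("col2", "c_val")
      .createOrReplaceTempView("C")

    // Same pattern as in the pseudocode: hint, then register as a temp view.
    val dfSchedule = spark.sql("select col1, col2, label from A")
    broadcast(dfSchedule).createOrReplaceTempView("tbl_schedule")

    val res1 = spark.sql(
      "select B.b_val, S.label from B, tbl_schedule S where B.col1 = S.col1")
    val res2 = spark.sql(
      "select C.c_val, S.label from C, tbl_schedule S where C.col2 = S.col2")
    (res1, res2)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("broadcast-reuse-repro")
      .master("local[*]") // assumption: local run, only for the repro
      .getOrCreate()

    val (res1, res2) = buildQueries(spark)

    // Print the physical plan of each query to compare their broadcast steps.
    res1.explain()
    res2.explain()

    // Trigger both queries so the corresponding jobs show up in the Spark UI.
    res1.show()
    res2.show()

    spark.stop()
  }
}
```

   Running this and comparing the two printed plans with the jobs listed in the Spark UI is how I would check whether the small table is exchanged once per query.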




Many thanks,

LL