You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@spark.apache.org by zhangliyun <ke...@126.com> on 2019/05/26 09:38:04 UTC

A question about broadcast join in spark2.2.0 and spark2.3.1

Hi all:
  i want to ask a question about how to change common join to broadcast join.
  I have a query like in spark 2.2.0 and spark 2.3.1 seperately
```
select * from Aleft join B
on A.mth_id - 12 =  B.mth_id  and A.email = B.email.
```
 The value of spark.sql.autoBroadcastJoinThreshold is same ( spark.sql.autoBroadcastJoinThreshold=700000000 700M), but for some reason, it became 
broadcast join in 2.2.0 while  a sort merged join in spark 2.3.1.  It is strange that in spark 2.2.0 the broadcast data size is 0 bytes but actually i think the value is not 0.  The problem here this query should use broadcast join but i don't know why the broadcast data size is 0 in spark2.2 and why this  is not a broadcast join in 2.3.1. Is there any difference about the threshold of broadcast join in spark 2.2.0 and 2.3.1? Is the parameter "spark.sql.autoBroadcastJoinThreshold" only factor which decide the broadcast or sort merge join? Appreciate that if you can give me some suggestion.