You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@spark.apache.org by zhangliyun <ke...@126.com> on 2019/10/23 10:02:25 UTC

A question about broadcast nest loop join

Hi all: 
i want to ask a question about broadcast nestloop join? from google i know, that 
 left outer/semi join and right outer/semi join will use broadcast nestloop.
  and in some cases, when the input data is very small, it is suitable to use. so here
  how to define the input data very small? what parameter decides the threshold?  I just want to disable it ( i found that   set spark.sql.autoBroadcastJoinThreshold= -1 is no work for sql:select a.key1  from testdata1 as a where a.key1 not in (select key3 from testdata3) )




```


explain cost select a.key1  from testdata1 as a where a.key1 not in (select key3 from testdata3);


== Physical Plan ==
*(1) Project [key1#90]
+- BroadcastNestedLoopJoin BuildRight, LeftAnti, ((key1#90 = key3#92) || isnull((key1#90 = key3#92)))
   :- HiveTableScan [key1#90], HiveTableRelation `default`.`testdata1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [key1#90, value1#91]
   +- BroadcastExchange IdentityBroadcastMode
      +- HiveTableScan [key3#92], HiveTableRelation `default`.`testdata3`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [key3#92, value3#93]


```


  my question is 
  1. why in not in subquery , BroadcastNestedLoopJoin is still used even i set spark.sql.autoBroadcastJoinThreshold= -1 
  2. which spark parameter  decides enable/disable BroadcastNestedLoopJoin.






Appreciate if you have suggestion




Best Regards


Kelly Zhang

Re:Re: A question about broadcast nest loop join

Posted by zhangliyun <ke...@126.com>.

Hi all:


    From google , I know that:
   ````


    Spark can only pick BroadcastNestedLoopJoin to implement left/right join. 
   ````
   
    but why I use  following case , broascastnestedLoopJoin became Sortmerged join when set spark.sql.autoBroadcastJoinThreshold=-1;


   {code}
set spark.sql.autoBroadcastJoinThreshold=-1;
  
 explain select a.key1, b.key3 from testdata1 as a  left join testdata3  as b where a.value1=b.value3;
== Physical Plan ==
*(5) Project [key1#18, key3#20]
+- *(5) SortMergeJoin [value1#19], [value3#21], Inner
   :- *(2) Sort [value1#19 ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(value1#19, 200)
   :     +- *(1) Filter isnotnull(value1#19)
   :        +- HiveTableScan [key1#18, value1#19], HiveTableRelation `default`.`testdata1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [key1#18, value1#19]
   +- *(4) Sort [value3#21 ASC NULLS FIRST], false, 0
      +- Exchange hashpartitioning(value3#21, 200)
         +- *(3) Filter isnotnull(value3#21)
            +- HiveTableScan [key3#20, value3#21], HiveTableRelation `default`.`testdata3`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [key3#20, value3#21]




    {code}






Appreciate to know your suggestion


Best Regards


Kelly Zhang














在 2019-10-23 22:12:35,"angers.zhu" <an...@gmail.com> 写道:



where not in ( query block)
condition will been change to LeftSemi join in optimizer rule RewritePredicateSubquery. 
Then as cloud-fan said,  it will be change to a BroadCastNestLoopJoin
| |
angers.zhu
|
|
angers.zhu@gmail.com
|
签名由网易邮箱大师定制


On 10/23/2019 20:55,Wenchen Fan<cl...@gmail.com> wrote:
I haven't looked into your query yet, just want to let you know that: Spark can only pick BroadcastNestedLoopJoin to implement left/right join. If the table is very big, then OOM happens.


Maybe there is an algorithm to implement left/right join in a distributed environment without broadcast, but currently Spark is only able to deal with it using broadcast. 


On Wed, Oct 23, 2019 at 6:02 PM zhangliyun <ke...@126.com> wrote:

Hi all: 
i want to ask a question about broadcast nestloop join? from google i know, that 
 left outer/semi join and right outer/semi join will use broadcast nestloop.
  and in some cases, when the input data is very small, it is suitable to use. so here
  how to define the input data very small? what parameter decides the threshold?  I just want to disable it ( i found that   set spark.sql.autoBroadcastJoinThreshold= -1 is no work for sql:select a.key1  from testdata1 as a where a.key1 not in (select key3 from testdata3) )




```


explain cost select a.key1  from testdata1 as a where a.key1 not in (select key3 from testdata3);


== Physical Plan ==
*(1) Project [key1#90]
+- BroadcastNestedLoopJoin BuildRight, LeftAnti, ((key1#90 = key3#92) || isnull((key1#90 = key3#92)))
   :- HiveTableScan [key1#90], HiveTableRelation `default`.`testdata1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [key1#90, value1#91]
   +- BroadcastExchange IdentityBroadcastMode
      +- HiveTableScan [key3#92], HiveTableRelation `default`.`testdata3`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [key3#92, value3#93]


```


  my question is 
  1. why in not in subquery , BroadcastNestedLoopJoin is still used even i set spark.sql.autoBroadcastJoinThreshold= -1 
  2. which spark parameter  decides enable/disable BroadcastNestedLoopJoin.






Appreciate if you have suggestion




Best Regards


Kelly Zhang




 

Re:Re: A question about broadcast nest loop join

Posted by zhangliyun <ke...@126.com>.

Hi all:


    From google , I know that:
   ````


    Spark can only pick BroadcastNestedLoopJoin to implement left/right join. 
   ````
   
    but why I use  following case , broascastnestedLoopJoin became Sortmerged join when set spark.sql.autoBroadcastJoinThreshold=-1;


   {code}
set spark.sql.autoBroadcastJoinThreshold=-1;
  
 explain select a.key1, b.key3 from testdata1 as a  left join testdata3  as b where a.value1=b.value3;
== Physical Plan ==
*(5) Project [key1#18, key3#20]
+- *(5) SortMergeJoin [value1#19], [value3#21], Inner
   :- *(2) Sort [value1#19 ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(value1#19, 200)
   :     +- *(1) Filter isnotnull(value1#19)
   :        +- HiveTableScan [key1#18, value1#19], HiveTableRelation `default`.`testdata1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [key1#18, value1#19]
   +- *(4) Sort [value3#21 ASC NULLS FIRST], false, 0
      +- Exchange hashpartitioning(value3#21, 200)
         +- *(3) Filter isnotnull(value3#21)
            +- HiveTableScan [key3#20, value3#21], HiveTableRelation `default`.`testdata3`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [key3#20, value3#21]




    {code}






Appreciate to know your suggestion


Best Regards


Kelly Zhang














在 2019-10-23 22:12:35,"angers.zhu" <an...@gmail.com> 写道:



where not in ( query block)
condition will been change to LeftSemi join in optimizer rule RewritePredicateSubquery. 
Then as cloud-fan said,  it will be change to a BroadCastNestLoopJoin
| |
angers.zhu
|
|
angers.zhu@gmail.com
|
签名由网易邮箱大师定制


On 10/23/2019 20:55,Wenchen Fan<cl...@gmail.com> wrote:
I haven't looked into your query yet, just want to let you know that: Spark can only pick BroadcastNestedLoopJoin to implement left/right join. If the table is very big, then OOM happens.


Maybe there is an algorithm to implement left/right join in a distributed environment without broadcast, but currently Spark is only able to deal with it using broadcast. 


On Wed, Oct 23, 2019 at 6:02 PM zhangliyun <ke...@126.com> wrote:

Hi all: 
i want to ask a question about broadcast nestloop join? from google i know, that 
 left outer/semi join and right outer/semi join will use broadcast nestloop.
  and in some cases, when the input data is very small, it is suitable to use. so here
  how to define the input data very small? what parameter decides the threshold?  I just want to disable it ( i found that   set spark.sql.autoBroadcastJoinThreshold= -1 is no work for sql:select a.key1  from testdata1 as a where a.key1 not in (select key3 from testdata3) )




```


explain cost select a.key1  from testdata1 as a where a.key1 not in (select key3 from testdata3);


== Physical Plan ==
*(1) Project [key1#90]
+- BroadcastNestedLoopJoin BuildRight, LeftAnti, ((key1#90 = key3#92) || isnull((key1#90 = key3#92)))
   :- HiveTableScan [key1#90], HiveTableRelation `default`.`testdata1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [key1#90, value1#91]
   +- BroadcastExchange IdentityBroadcastMode
      +- HiveTableScan [key3#92], HiveTableRelation `default`.`testdata3`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [key3#92, value3#93]


```


  my question is 
  1. why in not in subquery , BroadcastNestedLoopJoin is still used even i set spark.sql.autoBroadcastJoinThreshold= -1 
  2. which spark parameter  decides enable/disable BroadcastNestedLoopJoin.






Appreciate if you have suggestion




Best Regards


Kelly Zhang




 

Re: A question about broadcast nest loop join

Posted by "angers.zhu" <an...@gmail.com>.
  

**where not in ( query block)**

condition will been change to LeftSemi join in optimizer rule
**RewritePredicateSubquery**.  

Then as cloud-fan said,  it will be change to a BroadCastNestLoopJoin

[ ![](https://mail-online.nosdn.127.net/qiyelogo/defaultAvatar.png) |
angers.zhu  
---|---  
angers.zhu@gmail.com  
](https://maas.mail.163.com/dashi-web-
extend/html/proSignature.html?ftlId=1&name=angers.zhu&uid=angers.zhu%40gmail.com&iconUrl=https%3A%2F%2Fmail-
online.nosdn.127.net%2Fqiyelogo%2FdefaultAvatar.png&items=%5B%22angers.zhu%40gmail.com%22%5D)

签名由 [网易邮箱大师](https://mail.163.com/dashi/dlpro.html?from=mail81) 定制

  

On 10/23/2019 20:55,[Wenchen
Fan<cl...@gmail.com>](mailto:cloud0fan@gmail.com) wrote:

> I haven't looked into your query yet, just want to let you know that: Spark
can only pick BroadcastNestedLoopJoin to implement left/right join. If the
table is very big, then OOM happens.

>

>  
>

>

> Maybe there is an algorithm to implement left/right join in a distributed
environment without broadcast, but currently Spark is only able to deal with
it using broadcast.

>

>  
>

>

> On Wed, Oct 23, 2019 at 6:02 PM zhangliyun
<[kellyzly@126.com](mailto:kellyzly@126.com)> wrote:  
>

>

>> Hi all:

>>

>> i want to ask a question about broadcast nestloop join? from google i know,
that

>>

>>  left outer/semi join and right outer/semi join will use broadcast
nestloop.

>>

>>   and in some cases, when the input data is very small, it is suitable to
use. so here

>>

>>   how to define the input data very small? what parameter decides the
threshold?  I just want to disable it ( i found that   set
spark.sql.autoBroadcastJoinThreshold= -1 is no work for sql:select a.key1
from testdata1 as a where a.key1 not in (select key3 from testdata3) )

>>

>>  
>

>>

>>  
>

>>

>> ```

>>

>>  
>

>>

>> explain cost select a.key1  from testdata1 as a where a.key1 not in (select
key3 from testdata3);

>>

>>  
>

>>

>> == Physical Plan ==

>>

>> *(1) Project [key1#90]

>>

>> +- BroadcastNestedLoopJoin BuildRight, LeftAnti, ((key1#90 = key3#92) ||
isnull((key1#90 = key3#92)))

>>

>>    :- HiveTableScan [key1#90], HiveTableRelation `default`.`testdata1`,
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [key1#90, value1#91]

>>

>>    +- BroadcastExchange IdentityBroadcastMode

>>

>>       +- HiveTableScan [key3#92], HiveTableRelation `default`.`testdata3`,
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [key3#92, value3#93]

>>

>>  
>

>>

>> ```

>>

>>  
>

>>

>>   my question is

>>

>>   1\. why in not in subquery , BroadcastNestedLoopJoin is still used even i
set spark.sql.autoBroadcastJoinThreshold= -1

>>

>>   2\. which spark parameter  decides enable/disable
BroadcastNestedLoopJoin.

>>

>>  
>

>>

>>  
>

>>

>>  
>

>>

>> Appreciate if you have suggestion

>>

>>  
>

>>

>>  
>

>>

>> Best Regards

>>

>>  
>

>>

>> Kelly Zhang

>>

>>  
>  
>

>>

>>  


Re: A question about broadcast nest loop join

Posted by Wenchen Fan <cl...@gmail.com>.
I haven't looked into your query yet, just want to let you know that: Spark
can only pick BroadcastNestedLoopJoin to implement left/right join. If the
table is very big, then OOM happens.

Maybe there is an algorithm to implement left/right join in a distributed
environment without broadcast, but currently Spark is only able to deal
with it using broadcast.

On Wed, Oct 23, 2019 at 6:02 PM zhangliyun <ke...@126.com> wrote:

> Hi all:
> i want to ask a question about broadcast nestloop join? from google i
> know, that
>  left outer/semi join and right outer/semi join will use broadcast
> nestloop.
>   and in some cases, when the input data is very small, it is suitable to
> use. so here
>   how to define the input data very small? what parameter decides the
> threshold?  I just want to disable it ( i found that   set
> spark.sql.autoBroadcastJoinThreshold= -1 is no work for sql:select a.key1
>  from testdata1 as a where a.key1 not in (select key3 from testdata3) )
>
>
> ```
>
> explain cost select a.key1  from testdata1 as a where a.key1 not in
> (select key3 from testdata3);
>
> == Physical Plan ==
> *(1) Project [key1#90]
> +- BroadcastNestedLoopJoin BuildRight, LeftAnti, ((key1#90 = key3#92) ||
> isnull((key1#90 = key3#92)))
>    :- HiveTableScan [key1#90], HiveTableRelation `default`.`testdata1`,
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [key1#90, value1#91]
>    +- BroadcastExchange IdentityBroadcastMode
>       +- HiveTableScan [key3#92], HiveTableRelation `default`.`testdata3`,
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [key3#92, value3#93]
>
> ```
>
>   my question is
>   1. why in not in subquery , BroadcastNestedLoopJoin is still used even i
> set spark.sql.autoBroadcastJoinThreshold= -1
>   2. which spark parameter  decides enable/disable BroadcastNestedLoopJoin.
>
>
>
> Appreciate if you have suggestion
>
>
> Best Regards
>
> Kelly Zhang
>
>
>
>

Re: A question about broadcast nest loop join

Posted by Wenchen Fan <cl...@gmail.com>.
I haven't looked into your query yet, just want to let you know that: Spark
can only pick BroadcastNestedLoopJoin to implement left/right join. If the
table is very big, then OOM happens.

Maybe there is an algorithm to implement left/right join in a distributed
environment without broadcast, but currently Spark is only able to deal
with it using broadcast.

On Wed, Oct 23, 2019 at 6:02 PM zhangliyun <ke...@126.com> wrote:

> Hi all:
> i want to ask a question about broadcast nestloop join? from google i
> know, that
>  left outer/semi join and right outer/semi join will use broadcast
> nestloop.
>   and in some cases, when the input data is very small, it is suitable to
> use. so here
>   how to define the input data very small? what parameter decides the
> threshold?  I just want to disable it ( i found that   set
> spark.sql.autoBroadcastJoinThreshold= -1 is no work for sql:select a.key1
>  from testdata1 as a where a.key1 not in (select key3 from testdata3) )
>
>
> ```
>
> explain cost select a.key1  from testdata1 as a where a.key1 not in
> (select key3 from testdata3);
>
> == Physical Plan ==
> *(1) Project [key1#90]
> +- BroadcastNestedLoopJoin BuildRight, LeftAnti, ((key1#90 = key3#92) ||
> isnull((key1#90 = key3#92)))
>    :- HiveTableScan [key1#90], HiveTableRelation `default`.`testdata1`,
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [key1#90, value1#91]
>    +- BroadcastExchange IdentityBroadcastMode
>       +- HiveTableScan [key3#92], HiveTableRelation `default`.`testdata3`,
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [key3#92, value3#93]
>
> ```
>
>   my question is
>   1. why in not in subquery , BroadcastNestedLoopJoin is still used even i
> set spark.sql.autoBroadcastJoinThreshold= -1
>   2. which spark parameter  decides enable/disable BroadcastNestedLoopJoin.
>
>
>
> Appreciate if you have suggestion
>
>
> Best Regards
>
> Kelly Zhang
>
>
>
>

Re: Re: A question about broadcast nest loop join

Posted by Wenchen Fan <cl...@gmail.com>.
Ah sorry I made a mistake. "Spark can only pick BroadcastNestedLoopJoin to
implement left/right join" this should be "left/right non-equal join"

On Thu, Oct 24, 2019 at 6:32 AM zhangliyun <ke...@126.com> wrote:

>
> Hi Herman:
>    I guess what you mentioned before
> ```
> if you are OK with slightly different NULL semantics then you could use NOT
> EXISTS(subquery). The latter should perform a lot better.
>
> ```
> is the NULL key1 of  left table will be retained if NULL key2 is not found
> in the right table  ( join condition :  left.key1 = right.key2)  in exists
> semantics while this will not happen in
> "in semantics". If my understanding wrong, tell me.
>
>
>
> Best Regards.
>
> Kelly Zhang
>
>
>
>
>
>
>
>
> 在 2019-10-23 19:16:34,"Herman van Hovell" <he...@databricks.com> 写道:
>
> In some cases BroadcastNestedLoopJoin is the only viable join method. In
> your example for instance you are using a non-equi join condition and BNLJ
> is the only method that works in that case. This is also the reason why you
> can't disable it using the spark.sql.autoBroadcastJoinThreshold
> configuration.
>
> Such a plan is generally generated by using a NOT IN (subquery), if you
> are OK with slightly different NULL semantics then you could use NOT
> EXISTS(subquery). The latter should perform a lot better.
>
> On Wed, Oct 23, 2019 at 12:02 PM zhangliyun <ke...@126.com> wrote:
>
>> Hi all:
>> i want to ask a question about broadcast nestloop join? from google i
>> know, that
>>  left outer/semi join and right outer/semi join will use broadcast
>> nestloop.
>>   and in some cases, when the input data is very small, it is suitable to
>> use. so here
>>   how to define the input data very small? what parameter decides the
>> threshold?  I just want to disable it ( i found that   set
>> spark.sql.autoBroadcastJoinThreshold= -1 is no work for sql:select a.key1
>>  from testdata1 as a where a.key1 not in (select key3 from testdata3) )
>>
>>
>> ```
>>
>> explain cost select a.key1  from testdata1 as a where a.key1 not in
>> (select key3 from testdata3);
>>
>> == Physical Plan ==
>> *(1) Project [key1#90]
>> +- BroadcastNestedLoopJoin BuildRight, LeftAnti, ((key1#90 = key3#92) ||
>> isnull((key1#90 = key3#92)))
>>    :- HiveTableScan [key1#90], HiveTableRelation `default`.`testdata1`,
>> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [key1#90, value1#91]
>>    +- BroadcastExchange IdentityBroadcastMode
>>       +- HiveTableScan [key3#92], HiveTableRelation
>> `default`.`testdata3`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe,
>> [key3#92, value3#93]
>>
>> ```
>>
>>   my question is
>>   1. why in not in subquery , BroadcastNestedLoopJoin is still used even
>> i set spark.sql.autoBroadcastJoinThreshold= -1
>>   2. which spark parameter  decides enable/disable
>> BroadcastNestedLoopJoin.
>>
>>
>>
>> Appreciate if you have suggestion
>>
>>
>> Best Regards
>>
>> Kelly Zhang
>>
>>
>>
>>
>
>
>
>

Re: Re: A question about broadcast nest loop join

Posted by Wenchen Fan <cl...@gmail.com>.
Ah sorry I made a mistake. "Spark can only pick BroadcastNestedLoopJoin to
implement left/right join" this should be "left/right non-equal join"

On Thu, Oct 24, 2019 at 6:32 AM zhangliyun <ke...@126.com> wrote:

>
> Hi Herman:
>    I guess what you mentioned before
> ```
> if you are OK with slightly different NULL semantics then you could use NOT
> EXISTS(subquery). The latter should perform a lot better.
>
> ```
> is the NULL key1 of  left table will be retained if NULL key2 is not found
> in the right table  ( join condition :  left.key1 = right.key2)  in exists
> semantics while this will not happen in
> "in semantics". If my understanding wrong, tell me.
>
>
>
> Best Regards.
>
> Kelly Zhang
>
>
>
>
>
>
>
>
> 在 2019-10-23 19:16:34,"Herman van Hovell" <he...@databricks.com> 写道:
>
> In some cases BroadcastNestedLoopJoin is the only viable join method. In
> your example for instance you are using a non-equi join condition and BNLJ
> is the only method that works in that case. This is also the reason why you
> can't disable it using the spark.sql.autoBroadcastJoinThreshold
> configuration.
>
> Such a plan is generally generated by using a NOT IN (subquery), if you
> are OK with slightly different NULL semantics then you could use NOT
> EXISTS(subquery). The latter should perform a lot better.
>
> On Wed, Oct 23, 2019 at 12:02 PM zhangliyun <ke...@126.com> wrote:
>
>> Hi all:
>> i want to ask a question about broadcast nestloop join? from google i
>> know, that
>>  left outer/semi join and right outer/semi join will use broadcast
>> nestloop.
>>   and in some cases, when the input data is very small, it is suitable to
>> use. so here
>>   how to define the input data very small? what parameter decides the
>> threshold?  I just want to disable it ( i found that   set
>> spark.sql.autoBroadcastJoinThreshold= -1 is no work for sql:select a.key1
>>  from testdata1 as a where a.key1 not in (select key3 from testdata3) )
>>
>>
>> ```
>>
>> explain cost select a.key1  from testdata1 as a where a.key1 not in
>> (select key3 from testdata3);
>>
>> == Physical Plan ==
>> *(1) Project [key1#90]
>> +- BroadcastNestedLoopJoin BuildRight, LeftAnti, ((key1#90 = key3#92) ||
>> isnull((key1#90 = key3#92)))
>>    :- HiveTableScan [key1#90], HiveTableRelation `default`.`testdata1`,
>> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [key1#90, value1#91]
>>    +- BroadcastExchange IdentityBroadcastMode
>>       +- HiveTableScan [key3#92], HiveTableRelation
>> `default`.`testdata3`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe,
>> [key3#92, value3#93]
>>
>> ```
>>
>>   my question is
>>   1. why in not in subquery , BroadcastNestedLoopJoin is still used even
>> i set spark.sql.autoBroadcastJoinThreshold= -1
>>   2. which spark parameter  decides enable/disable
>> BroadcastNestedLoopJoin.
>>
>>
>>
>> Appreciate if you have suggestion
>>
>>
>> Best Regards
>>
>> Kelly Zhang
>>
>>
>>
>>
>
>
>
>

Re:Re: A question about broadcast nest loop join

Posted by zhangliyun <ke...@126.com>.

Hi Herman:
   I guess what you mentioned before 
```
if you are OK with slightly different NULL semantics then you could use NOT EXISTS(subquery). The latter should perform a lot better.


```
is the NULL key1 of  left table will be retained if NULL key2 is not found in the right table  ( join condition :  left.key1 = right.key2)  in exists semantics while this will not happen in 
"in semantics". If my understanding wrong, tell me.






Best Regards.


Kelly Zhang
















在 2019-10-23 19:16:34,"Herman van Hovell" <he...@databricks.com> 写道:

In some cases BroadcastNestedLoopJoin is the only viable join method. In your example for instance you are using a non-equi join condition and BNLJ is the only method that works in that case. This is also the reason why you can't disable it using the spark.sql.autoBroadcastJoinThreshold configuration.


Such a plan is generally generated by using a NOT IN (subquery), if you are OK with slightly different NULL semantics then you could use NOT EXISTS(subquery). The latter should perform a lot better.


On Wed, Oct 23, 2019 at 12:02 PM zhangliyun <ke...@126.com> wrote:

Hi all: 
i want to ask a question about broadcast nestloop join? from google i know, that 
 left outer/semi join and right outer/semi join will use broadcast nestloop.
  and in some cases, when the input data is very small, it is suitable to use. so here
  how to define the input data very small? what parameter decides the threshold?  I just want to disable it ( i found that   set spark.sql.autoBroadcastJoinThreshold= -1 is no work for sql:select a.key1  from testdata1 as a where a.key1 not in (select key3 from testdata3) )




```


explain cost select a.key1  from testdata1 as a where a.key1 not in (select key3 from testdata3);


== Physical Plan ==
*(1) Project [key1#90]
+- BroadcastNestedLoopJoin BuildRight, LeftAnti, ((key1#90 = key3#92) || isnull((key1#90 = key3#92)))
   :- HiveTableScan [key1#90], HiveTableRelation `default`.`testdata1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [key1#90, value1#91]
   +- BroadcastExchange IdentityBroadcastMode
      +- HiveTableScan [key3#92], HiveTableRelation `default`.`testdata3`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [key3#92, value3#93]


```


  my question is 
  1. why in not in subquery , BroadcastNestedLoopJoin is still used even i set spark.sql.autoBroadcastJoinThreshold= -1 
  2. which spark parameter  decides enable/disable BroadcastNestedLoopJoin.






Appreciate if you have suggestion




Best Regards


Kelly Zhang




 

Re:Re: A question about broadcast nest loop join

Posted by zhangliyun <ke...@126.com>.

Hi Herman:
   I guess what you mentioned before 
```
if you are OK with slightly different NULL semantics then you could use NOT EXISTS(subquery). The latter should perform a lot better.


```
is the NULL key1 of  left table will be retained if NULL key2 is not found in the right table  ( join condition :  left.key1 = right.key2)  in exists semantics while this will not happen in 
"in semantics". If my understanding wrong, tell me.






Best Regards.


Kelly Zhang
















在 2019-10-23 19:16:34,"Herman van Hovell" <he...@databricks.com> 写道:

In some cases BroadcastNestedLoopJoin is the only viable join method. In your example for instance you are using a non-equi join condition and BNLJ is the only method that works in that case. This is also the reason why you can't disable it using the spark.sql.autoBroadcastJoinThreshold configuration.


Such a plan is generally generated by using a NOT IN (subquery), if you are OK with slightly different NULL semantics then you could use NOT EXISTS(subquery). The latter should perform a lot better.


On Wed, Oct 23, 2019 at 12:02 PM zhangliyun <ke...@126.com> wrote:

Hi all: 
i want to ask a question about broadcast nestloop join? from google i know, that 
 left outer/semi join and right outer/semi join will use broadcast nestloop.
  and in some cases, when the input data is very small, it is suitable to use. so here
  how to define the input data very small? what parameter decides the threshold?  I just want to disable it ( i found that   set spark.sql.autoBroadcastJoinThreshold= -1 is no work for sql:select a.key1  from testdata1 as a where a.key1 not in (select key3 from testdata3) )




```


explain cost select a.key1  from testdata1 as a where a.key1 not in (select key3 from testdata3);


== Physical Plan ==
*(1) Project [key1#90]
+- BroadcastNestedLoopJoin BuildRight, LeftAnti, ((key1#90 = key3#92) || isnull((key1#90 = key3#92)))
   :- HiveTableScan [key1#90], HiveTableRelation `default`.`testdata1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [key1#90, value1#91]
   +- BroadcastExchange IdentityBroadcastMode
      +- HiveTableScan [key3#92], HiveTableRelation `default`.`testdata3`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [key3#92, value3#93]


```


  my question is 
  1. why in not in subquery , BroadcastNestedLoopJoin is still used even i set spark.sql.autoBroadcastJoinThreshold= -1 
  2. which spark parameter  decides enable/disable BroadcastNestedLoopJoin.






Appreciate if you have suggestion




Best Regards


Kelly Zhang