You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@spark.apache.org by paleyl <pa...@gmail.com> on 2017/06/29 02:42:43 UTC

about broadcast join of base table in spark sql

Hi All,

Recently I meet a problem in broadcast join: I want to left join table A
and B, A is the smaller one and the left table, so I wrote
A = A.join(B,A("key1") === B("key2"),"left")
but I found that A is not broadcast out, as the shuffle size is still very
large.
I guess this is a designed mechanism in spark, so could anyone please tell
me why it is designed like this? I am just very curious.

Best,

Paley

Re: about broadcast join of base table in spark sql

Posted by Paley Louie <pa...@gmail.com>.
Thank you for your reply, I have tried to set parameter spark.sql.autoBroadcastJoinThreshold to high enough value, however it does not work,  I think broadcast of base table is disabled in spark.


> On Jun 30, 2017, at 6:57 PM, Bryan Jeffrey <br...@gmail.com> wrote:
> 
> Hello. 
> 
> If you want to allow broadcast join with larger broadcasts you can set spark.sql.autoBroadcastJoinThreshold to a higher value. This will cause the plan to allow join despite 'A' being larger than the default threshold. 
> 
> Get Outlook for Android <https://aka.ms/ghei36>
> 
> 
> From: paleyl
> Sent: Wednesday, June 28, 10:42 PM
> Subject: about broadcast join of base table in spark sql
> To: dev@spark.org, user@spark.apache.org
> 
> 
> Hi All,
> 
> 
> Recently I meet a problem in broadcast join: I want to left join table A and B, A is the smaller one and the left table, so I wrote 
> 
> A = A.join(B,A("key1") === B("key2"),"left")
> 
> but I found that A is not broadcast out, as the shuffle size is still very large.
> 
> I guess this is a designed mechanism in spark, so could anyone please tell me why it is designed like this? I am just very curious.
> 
> 
> Best,
> 
> 
> Paley 
> 
> 
> 


Re: about broadcast join of base table in spark sql

Posted by Yong Zhang <ja...@hotmail.com>.
Then you need to tell us the spark version, and post the execution plan here, so we can help you better.


Yong


________________________________
From: Paley Louie <pa...@gmail.com>
Sent: Sunday, July 2, 2017 12:36 AM
To: Yong Zhang
Cc: Bryan Jeffrey; dev@spark.org; user@spark.apache.org
Subject: Re: about broadcast join of base table in spark sql

Thank you for your reply, I have tried to add broadcast hint to the base table, but it just cannot be broadcast out.
On Jun 30, 2017, at 9:13 PM, Yong Zhang <ja...@hotmail.com>> wrote:

Or since you already use the DataFrame API, instead of SQL, you can add the broadcast function to force it.

https://spark.apache.org/docs/1.6.2/api/java/org/apache/spark/sql/functions.html#broadcast(org.apache.spark.sql.DataFrame)

Yong
functions - Apache Spark<https://spark.apache.org/docs/1.6.2/api/java/org/apache/spark/sql/functions.html#broadcast(org.apache.spark.sql.DataFrame)>
spark.apache.org<http://spark.apache.org/>
Computes the numeric value of the first character of the string column, and returns the result as a int column.





________________________________
From: Bryan Jeffrey <br...@gmail.com>>
Sent: Friday, June 30, 2017 6:57 AM
To: dev@spark.org<ma...@spark.org>; user@spark.apache.org<ma...@spark.apache.org>; paleyl
Subject: Re: about broadcast join of base table in spark sql

Hello.

If you want to allow broadcast join with larger broadcasts you can set spark.sql.autoBroadcastJoinThreshold to a higher value. This will cause the plan to allow join despite 'A' being larger than the default threshold.

Get Outlook for Android<https://aka.ms/ghei36>



From: paleyl
Sent: Wednesday, June 28, 10:42 PM
Subject: about broadcast join of base table in spark sql
To: dev@spark.org<ma...@spark.org>, user@spark.apache.org<ma...@spark.apache.org>


Hi All,


Recently I meet a problem in broadcast join: I want to left join table A and B, A is the smaller one and the left table, so I wrote

A = A.join(B,A("key1") === B("key2"),"left")

but I found that A is not broadcast out, as the shuffle size is still very large.

I guess this is a designed mechanism in spark, so could anyone please tell me why it is designed like this? I am just very curious.


Best,


Paley


Re: about broadcast join of base table in spark sql

Posted by paleyl <pa...@gmail.com>.
Thank you for your reply, but when I remove the left join option(like A =
A.join(B,A("key1") === B("key2"))), it can be broadcast out. there is no
reason spark cannot get table size when left join option is chosen on.


On Sun, Jul 2, 2017 at 1:55 PM, Xiaoye Sun <su...@gmail.com> wrote:

> you may need to check if spark can get the size of your table. If spark
> cannot get the table size, it won't do broadcast.
>
> On Sat, Jul 1, 2017 at 11:37 PM Paley Louie <pa...@gmail.com> wrote:
>
>> Thank you for your reply, I have tried to add broadcast hint to the base
>> table, but it just cannot be broadcast out.
>>
>> On Jun 30, 2017, at 9:13 PM, Yong Zhang <ja...@hotmail.com> wrote:
>>
>> Or since you already use the DataFrame API, instead of SQL, you can add
>> the broadcast function to force it.
>>
>> https://spark.apache.org/docs/1.6.2/api/java/org/apache/
>> spark/sql/functions.html#broadcast(org.apache.spark.sql.DataFrame)
>>
>> Yong
>> functions - Apache Spark
>> <https://spark.apache.org/docs/1.6.2/api/java/org/apache/spark/sql/functions.html#broadcast(org.apache.spark.sql.DataFrame)>
>> spark.apache.org
>> Computes the numeric value of the first character of the string column,
>> and returns the result as a int column.
>>
>>
>>
>>
>> ------------------------------
>> *From:* Bryan Jeffrey <br...@gmail.com>
>> *Sent:* Friday, June 30, 2017 6:57 AM
>> *To:* dev@spark.org; user@spark.apache.org; paleyl
>> *Subject:* Re: about broadcast join of base table in spark sql
>>
>> Hello.
>>
>> If you want to allow broadcast join with larger broadcasts you can set
>> spark.sql.autoBroadcastJoinThreshold to a higher value. This will cause
>> the plan to allow join despite 'A' being larger than the default threshold.
>>
>>
>> Get Outlook for Android <https://aka.ms/ghei36>
>>
>>
>>
>> From: paleyl
>> Sent: Wednesday, June 28, 10:42 PM
>> Subject: about broadcast join of base table in spark sql
>> To: dev@spark.org, user@spark.apache.org
>>
>>
>> Hi All,
>>
>>
>> Recently I meet a problem in broadcast join: I want to left join table A
>> and B, A is the smaller one and the left table, so I wrote
>>
>> A = A.join(B,A("key1") === B("key2"),"left")
>>
>> but I found that A is not broadcast out, as the shuffle size is still
>> very large.
>>
>> I guess this is a designed mechanism in spark, so could anyone please
>> tell me why it is designed like this? I am just very curious.
>>
>>
>> Best,
>>
>>
>> Paley
>>
>>
>>


-- 
Peili Lv

Department of Pattern Recognition and Intelligent System
School of Automation
Northwestern Polytechnical University
NPU Chang'an Campus
Building of Automation #130
http://paley.mydiscussion.net/

Re: about broadcast join of base table in spark sql

Posted by Xiaoye Sun <su...@gmail.com>.
you may need to check if spark can get the size of your table. If spark
cannot get the table size, it won't do broadcast.

On Sat, Jul 1, 2017 at 11:37 PM Paley Louie <pa...@gmail.com> wrote:

> Thank you for your reply, I have tried to add broadcast hint to the base
> table, but it just cannot be broadcast out.
>
> On Jun 30, 2017, at 9:13 PM, Yong Zhang <ja...@hotmail.com> wrote:
>
> Or since you already use the DataFrame API, instead of SQL, you can add
> the broadcast function to force it.
>
>
> https://spark.apache.org/docs/1.6.2/api/java/org/apache/spark/sql/functions.html#broadcast(org.apache.spark.sql.DataFrame)
>
> Yong
> functions - Apache Spark
> <https://spark.apache.org/docs/1.6.2/api/java/org/apache/spark/sql/functions.html#broadcast(org.apache.spark.sql.DataFrame)>
> spark.apache.org
> Computes the numeric value of the first character of the string column,
> and returns the result as a int column.
>
>
>
>
> ------------------------------
> *From:* Bryan Jeffrey <br...@gmail.com>
> *Sent:* Friday, June 30, 2017 6:57 AM
> *To:* dev@spark.org; user@spark.apache.org; paleyl
> *Subject:* Re: about broadcast join of base table in spark sql
>
> Hello.
>
> If you want to allow broadcast join with larger broadcasts you can set
> spark.sql.autoBroadcastJoinThreshold to a higher value. This will cause the
> plan to allow join despite 'A' being larger than the default threshold.
>
> Get Outlook for Android <https://aka.ms/ghei36>
>
>
>
> From: paleyl
> Sent: Wednesday, June 28, 10:42 PM
> Subject: about broadcast join of base table in spark sql
> To: dev@spark.org, user@spark.apache.org
>
>
> Hi All,
>
>
> Recently I meet a problem in broadcast join: I want to left join table A
> and B, A is the smaller one and the left table, so I wrote
>
> A = A.join(B,A("key1") === B("key2"),"left")
>
> but I found that A is not broadcast out, as the shuffle size is still very
> large.
>
> I guess this is a designed mechanism in spark, so could anyone please tell
> me why it is designed like this? I am just very curious.
>
>
> Best,
>
>
> Paley
>
>
>

Re: about broadcast join of base table in spark sql

Posted by Paley Louie <pa...@gmail.com>.
Thank you for your reply, I have tried to add broadcast hint to the base table, but it just cannot be broadcast out.
> On Jun 30, 2017, at 9:13 PM, Yong Zhang <ja...@hotmail.com> wrote:
> 
> Or since you already use the DataFrame API, instead of SQL, you can add the broadcast function to force it.
> 
> https://spark.apache.org/docs/1.6.2/api/java/org/apache/spark/sql/functions.html#broadcast(org.apache.spark.sql.DataFrame) <https://spark.apache.org/docs/1.6.2/api/java/org/apache/spark/sql/functions.html#broadcast(org.apache.spark.sql.DataFrame)>
> 
> Yong
> functions - Apache Spark <https://spark.apache.org/docs/1.6.2/api/java/org/apache/spark/sql/functions.html#broadcast(org.apache.spark.sql.DataFrame)>
> spark.apache.org <http://spark.apache.org/>
> Computes the numeric value of the first character of the string column, and returns the result as a int column.
> 
> 
> 
> 
> From: Bryan Jeffrey <br...@gmail.com>
> Sent: Friday, June 30, 2017 6:57 AM
> To: dev@spark.org; user@spark.apache.org; paleyl
> Subject: Re: about broadcast join of base table in spark sql
>  
> Hello. 
> 
> If you want to allow broadcast join with larger broadcasts you can set spark.sql.autoBroadcastJoinThreshold to a higher value. This will cause the plan to allow join despite 'A' being larger than the default threshold. 
> 
> Get Outlook for Android <https://aka.ms/ghei36>
> 
> 
> From: paleyl
> Sent: Wednesday, June 28, 10:42 PM
> Subject: about broadcast join of base table in spark sql
> To: dev@spark.org, user@spark.apache.org
> 
> 
> Hi All,
> 
> 
> Recently I meet a problem in broadcast join: I want to left join table A and B, A is the smaller one and the left table, so I wrote 
> 
> A = A.join(B,A("key1") === B("key2"),"left")
> 
> but I found that A is not broadcast out, as the shuffle size is still very large.
> 
> I guess this is a designed mechanism in spark, so could anyone please tell me why it is designed like this? I am just very curious.
> 
> 
> Best,
> 
> 
> Paley 


Re: about broadcast join of base table in spark sql

Posted by Yong Zhang <ja...@hotmail.com>.
Or since you already use the DataFrame API, instead of SQL, you can add the broadcast function to force it.


https://spark.apache.org/docs/1.6.2/api/java/org/apache/spark/sql/functions.html#broadcast(org.apache.spark.sql.DataFrame)


Yong

functions - Apache Spark<https://spark.apache.org/docs/1.6.2/api/java/org/apache/spark/sql/functions.html#broadcast(org.apache.spark.sql.DataFrame)>
spark.apache.org
Computes the numeric value of the first character of the string column, and returns the result as a int column.





________________________________
From: Bryan Jeffrey <br...@gmail.com>
Sent: Friday, June 30, 2017 6:57 AM
To: dev@spark.org; user@spark.apache.org; paleyl
Subject: Re: about broadcast join of base table in spark sql

Hello.

If you want to allow broadcast join with larger broadcasts you can set spark.sql.autoBroadcastJoinThreshold to a higher value. This will cause the plan to allow join despite 'A' being larger than the default threshold.

Get Outlook for Android<https://aka.ms/ghei36>



From: paleyl
Sent: Wednesday, June 28, 10:42 PM
Subject: about broadcast join of base table in spark sql
To: dev@spark.org, user@spark.apache.org


Hi All,


Recently I meet a problem in broadcast join: I want to left join table A and B, A is the smaller one and the left table, so I wrote

A = A.join(B,A("key1") === B("key2"),"left")

but I found that A is not broadcast out, as the shuffle size is still very large.

I guess this is a designed mechanism in spark, so could anyone please tell me why it is designed like this? I am just very curious.


Best,


Paley




Re: about broadcast join of base table in spark sql

Posted by Bryan Jeffrey <br...@gmail.com>.
Hello. 




If you want to allow broadcast join with larger broadcasts you can set spark.sql.autoBroadcastJoinThreshold to a higher value. This will cause the plan to allow join despite 'A' being larger than the default threshold. 




Get Outlook for Android







From: paleyl


Sent: Wednesday, June 28, 10:42 PM


Subject: about broadcast join of base table in spark sql


To: dev@spark.org, user@spark.apache.org






Hi All,






Recently I meet a problem in broadcast join: I want to left join table A and B, A is the smaller one and the left table, so I wrote 




A = A.join(B,A("key1") === B("key2"),"left")




but I found that A is not broadcast out, as the shuffle size is still very large.




I guess this is a designed mechanism in spark, so could anyone please tell me why it is designed like this? I am just very curious.






Best,






Paley 









Fwd: about broadcast join of base table in spark sql

Posted by paleyl <pa...@gmail.com>.
Hi All,

Recently I meet a problem in broadcast join: I want to left join table A
and B, A is the smaller one and the left table, so I wrote
A = A.join(B,A("key1") === B("key2"),"left")
but I found that A is not broadcast out, as the shuffle size is still very
large.
I guess this is a designed mechanism in spark, so could anyone please tell
me why it is designed like this? I am just very curious.

Best,

Paley