Posted to user@spark.apache.org by Hongdi Ren <ry...@gmail.com> on 2017/02/22 06:32:04 UTC

spark sql: full outer join optimization

Hi all,

 

On Spark 1.6, I’ve written a full outer join with no join key, which results in a Cartesian product. The left side is about 10 thousand rows, while the right side is about 80 million.

 

My question is: why doesn’t Spark SQL optimize this into a map-side join (broadcasting the small table) automatically? Is there any parameter or hint I can set, rather than writing broadcast code manually?

 

SQL example:

  select * from (select id from a) user full outer join (select id from b) item

  (As the names imply, I’m doing collaborative filtering with implicit feedback, so the whole user-item matrix is needed.)

 

Thanks for any help.

 

 

Attached: the query plan.

 



Re: spark sql: full outer join optimization

Posted by 任弘迪 (Hongdi Ren) <ry...@gmail.com>.
Thanks, Yong.


Re: spark sql: full outer join optimization

Posted by Yong Zhang <ja...@hotmail.com>.
In Spark SQL, the broadcast join is triggered by "spark.sql.autoBroadcastJoinThreshold"; check what it is set to in your Spark session.
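For example, a minimal sketch of changing the threshold on Spark 1.6 via the SQLContext (the 100 MB value is only an illustration; the default is 10 MB):

  // Raise the auto-broadcast threshold so a small table qualifies for broadcasting.
  sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", (100 * 1024 * 1024).toString)

  // The same setting can also be changed from SQL:
  sqlContext.sql("SET spark.sql.autoBroadcastJoinThreshold=104857600")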


The more important point is that Spark normally relies on the table statistics: if the estimated table size is below the setting above, a broadcast join will be considered. So make sure the statistics for your table are up to date.
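A sketch of refreshing the statistics, assuming table "a" from your query is a Hive metastore table and sqlContext is a HiveContext (on 1.6 the NOSCAN variant is the one documented to update the size estimate):

  // Recompute the table-level size statistic so Catalyst knows how small "a" is.
  sqlContext.sql("ANALYZE TABLE a COMPUTE STATISTICS NOSCAN")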


Another limitation is subqueries. If you use a subquery, it is very hard for Catalyst to estimate its output size. Spark 2.0 has just started to implement some CBO logic, but it is still at a very early stage.


In your case, if you are confident that a broadcast join is the way to go, you can call the broadcast function on the DataFrame to force a broadcast join.
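A minimal sketch, assuming DataFrames built from the tables "a" and "b" in your mail (the names users and items are just for illustration):

  import org.apache.spark.sql.functions.broadcast

  val users = sqlContext.table("a").select("id")   // ~10 thousand rows: the side to broadcast
  val items = sqlContext.table("b").select("id")   // ~80 million rows

  // broadcast() hints the planner to ship the small side to every executor;
  // with no join condition the plan should become a BroadcastNestedLoopJoin
  // rather than a shuffled Cartesian product.
  val joined = broadcast(users).join(items)
  joined.explain()   // verify that the physical plan uses a broadcast join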


Yong
