Posted to user@spark.apache.org by Jone Zhang <jo...@gmail.com> on 2017/05/10 11:10:32 UTC

Why spark.sql.autoBroadcastJoinThreshold not available

I am using Spark 1.6.0 with Java.
I want the following SQL to be executed as a broadcast join:
*select * from sample join feature*

These are my steps:
1. set spark.sql.autoBroadcastJoinThreshold=100M
2. HiveContext.sql("cache lazy table feature as select * from src where
..."), whose result size is only 100K
3. HiveContext.sql("select * from sample join feature")
Why is the join a SortMergeJoin?

Grateful for any ideas!
Thanks.
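One detail worth checking in step 1: to my understanding, Spark 1.6 parses this threshold as a plain byte count, so a "100M" suffix may not be honored (suffix support came later). The equivalent byte value can be computed as below; this is a plain Python sketch, not part of the original thread:

```python
# Sketch: assumes spark.sql.autoBroadcastJoinThreshold in Spark 1.6
# takes a plain byte count, so 100 MB must be written out in bytes.
def mib_to_bytes(mib):
    """Convert mebibytes to bytes."""
    return mib * 1024 * 1024

print(mib_to_bytes(100))  # 104857600 bytes, i.e. the "100M" from step 1
```

So step 1 would become `set spark.sql.autoBroadcastJoinThreshold=104857600`.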

Fwd: Why spark.sql.autoBroadcastJoinThreshold not available

Posted by Jone Zhang <jo...@gmail.com>.
Solved it by removing the lazy keyword:
2. HiveContext.sql("cache table feature as select * from src where ..."),
whose result size is only 100K
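The likely mechanism (my reading, not verified against the Spark 1.6 source): with "cache lazy table", the table is not materialized until first use, so the optimizer has no size statistics at planning time and falls back to a sort-merge join; removing "lazy" materializes the cache eagerly, making the ~100K size visible and under the threshold. A toy Python sketch of that size-based decision (operator names are Spark's, the logic is heavily simplified and is not Spark's planner):

```python
# Toy model of a size-based join choice, illustrating why missing
# statistics can block a broadcast join. NOT Spark's actual planner.
AUTO_BROADCAST_THRESHOLD = 100 * 1024 * 1024  # 100 MB, as in the thread

def choose_join(size_in_bytes):
    # size_in_bytes is None when the relation has no statistics yet,
    # e.g. a lazily cached table that has not been materialized.
    if size_in_bytes is not None and size_in_bytes <= AUTO_BROADCAST_THRESHOLD:
        return "BroadcastHashJoin"
    return "SortMergeJoin"

print(choose_join(None))        # lazy cache, no stats -> SortMergeJoin
print(choose_join(100 * 1024))  # eager 100K cache -> BroadcastHashJoin
```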


Re: Why spark.sql.autoBroadcastJoinThreshold not available

Posted by Jone Zhang <jo...@gmail.com>.
Solved it by removing the lazy keyword:
2. HiveContext.sql("cache table feature as select * from src where ..."),
whose result size is only 100K

Thanks!

2017-05-15 21:26 GMT+08:00 Yong Zhang <ja...@hotmail.com>:

> You should post the execution plan here, so we can provide more accurate
> support.
>
>
> Since you are building your feature table with a projection ("where
> ...."), my guess is that the following JIRA (SPARK-13383
> <https://issues.apache.org/jira/browse/SPARK-13383>) prevents the broadcast
> join. This is fixed in Spark 2.x. Can you try it on Spark 2.0?
>
> Yong

Re: Why spark.sql.autoBroadcastJoinThreshold not available

Posted by Yong Zhang <ja...@hotmail.com>.
You should post the execution plan here, so we can provide more accurate support.


Since you are building your feature table with a projection ("where ...."), my guess is that the following JIRA (SPARK-13383<https://issues.apache.org/jira/browse/SPARK-13383>) prevents the broadcast join. This is fixed in Spark 2.x. Can you try it on Spark 2.0?

Yong
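A quick way to act on Yong's suggestion is to run `EXPLAIN select * from sample join feature` (or call explain() on the DataFrame) and look at which physical join operator appears. A small helper sketch, in plain Python rather than Spark, that just scans an EXPLAIN dump; the operator names are ones Spark uses in physical plans, and the sample plan text below is made up for illustration:

```python
def physical_join(plan_text):
    """Return a physical join operator mentioned in EXPLAIN output, if any."""
    for op in ("BroadcastHashJoin", "SortMergeJoin", "ShuffledHashJoin"):
        if op in plan_text:
            return op
    return None

# Made-up EXPLAIN fragment for illustration only.
sample_plan = """== Physical Plan ==
SortMergeJoin [key#1], [key#2]
:- Sort [key#1 ASC] ...
+- Sort [key#2 ASC] ..."""
print(physical_join(sample_plan))  # SortMergeJoin
```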
