You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@spark.apache.org by Wenchen Fan <cl...@gmail.com> on 2019/11/10 09:57:32 UTC

Re: Why not implement CodegenSupport in class ShuffledHashJoinExec?

By default sort merge join is preferred over shuffle hash join, that's why
we haven't spend resources to implement codegen for it.

On Sun, Nov 10, 2019 at 3:15 PM Wang, Gang <gw...@ebay.com.invalid> wrote:

> There are some cases, shuffle hash join performs even better than sort
> merge join.
>
> While, I noticed that ShuffledHashJoinExec does not implement
> CodegenSupport, is there any concern? And if there is any chance to improve
> the performance of ShuffledHashJoinExec?
>
>
>

Re: Why not implement CodegenSupport in class ShuffledHashJoinExec?

Posted by Wenchen Fan <cl...@gmail.com>.

Yea codegen can be a good improvement, PRs are welcome!

On Sun, Nov 10, 2019 at 6:28 PM Wang, Gang <gw...@ebay.com> wrote:

> That’s right. By default, Spark prefers sort merge join.
>
> While, in our product environment, there are many huge bucket tables. We
> can leverage the bucketing to avoid shuffle when join with other small
> tables (the small tables are not small enough to leverage broad cast join).
> Problem is that, although shuffle can be avoid, sort is still necessary to
> leverage sort merge join (we cannot pre-sort since there are different join
> patterns). For a huge table, sort may take even tens of seconds.
>
> That’s why I’m trying to enable shuffle hash join, and for such cases,
> there were almost 10% ~ 20% improvement when apply shuffle hash join
> instead of sort merge join. I wonder if there is still some space to
> improve shuffle hash join? Like code generation for ShuffledHashJoinExec or
> something….
>
>
>
> *From: *Wenchen Fan <cl...@gmail.com>
> *Date: *Sunday, November 10, 2019 at 5:57 PM
> *To: *"Wang, Gang" <gw...@ebay.com.invalid>
> *Cc: *"dev@spark.apache.org" <de...@spark.apache.org>
> *Subject: *Re: Why not implement CodegenSupport in class
> ShuffledHashJoinExec?
>
>
>
> By default sort merge join is preferred over shuffle hash join, that's why
> we haven't spend resources to implement codegen for it.
>
>
>
> On Sun, Nov 10, 2019 at 3:15 PM Wang, Gang <gw...@ebay.com.invalid>
> wrote:
>
> There are some cases, shuffle hash join performs even better than sort
> merge join.
>
> While, I noticed that ShuffledHashJoinExec does not implement
> CodegenSupport, is there any concern? And if there is any chance to improve
> the performance of ShuffledHashJoinExec?
>
>
>
>

Re: Why not implement CodegenSupport in class ShuffledHashJoinExec?

Posted by "Wang, Gang" <gw...@ebay.com.INVALID>.

That’s right. By default, Spark prefers sort merge join.
While, in our product environment, there are many huge bucket tables. We can leverage the bucketing to avoid shuffle when join with other small tables (the small tables are not small enough to leverage broad cast join). Problem is that, although shuffle can be avoid, sort is still necessary to leverage sort merge join (we cannot pre-sort since there are different join patterns). For a huge table, sort may take even tens of seconds.
That’s why I’m trying to enable shuffle hash join, and for such cases, there were almost 10% ~ 20% improvement when apply shuffle hash join instead of sort merge join. I wonder if there is still some space to improve shuffle hash join? Like code generation for ShuffledHashJoinExec or something….

From: Wenchen Fan <cl...@gmail.com>
Date: Sunday, November 10, 2019 at 5:57 PM
To: "Wang, Gang" <gw...@ebay.com.invalid>
Cc: "dev@spark.apache.org" <de...@spark.apache.org>
Subject: Re: Why not implement CodegenSupport in class ShuffledHashJoinExec?

By default sort merge join is preferred over shuffle hash join, that's why we haven't spend resources to implement codegen for it.

On Sun, Nov 10, 2019 at 3:15 PM Wang, Gang <gw...@ebay.com.invalid> wrote:

There are some cases, shuffle hash join performs even better than sort merge join.

While, I noticed that ShuffledHashJoinExec does not implement CodegenSupport, is there any concern? And if there is any chance to improve the performance of ShuffledHashJoinExec?