Posted to dev@spark.apache.org by Haopu Wang <HW...@qilinsoft.com> on 2014/10/08 08:04:55 UTC

RE: Spark SQL question: why build hashtable for both sides in HashOuterJoin?

Liquan, yes, for a full outer join, a hash table on each side is more efficient.

 

For the left/right outer join, it looks like one hash table should be enough.

 

________________________________

From: Liquan Pei [mailto:liquanpei@gmail.com] 
Sent: September 30, 2014 18:34
To: Haopu Wang
Cc: dev@spark.apache.org; user
Subject: Re: Spark SQL question: why build hashtable for both sides in HashOuterJoin?

 

Hi Haopu,

 

How about full outer join? One hash table may not be efficient for this case. 

 

Liquan

 

On Mon, Sep 29, 2014 at 11:47 PM, Haopu Wang <HW...@qilinsoft.com> wrote:

Hi, Liquan, thanks for the response.

 

In your example, I think the hash table should be built on the "right" side, so Spark can iterate through the left side and find matches in the right side from the hash table efficiently. Please comment and suggest, thanks again!
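That single-table scheme can be sketched roughly like this (plain Python with made-up tuple rows, not Spark's actual code): build a hash table on the right side only, then stream the left side and probe it, null-padding any left row that has no match.

```python
from collections import defaultdict

def left_outer_join(left, right, left_key, right_key):
    # Build phase: hash table on the right (build) side only.
    table = defaultdict(list)
    for row in right:
        table[right_key(row)].append(row)
    # Probe phase: a single pass over the streamed left side.
    result = []
    for row in left:
        matches = table.get(left_key(row))
        if matches:
            result.extend((row, m) for m in matches)
        else:
            result.append((row, None))  # unmatched left row, null-padded
    return result
```

A right outer join is symmetric, with the roles of the two sides swapped; only a full outer join has to report unmatched rows from both relations.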

 

________________________________

From: Liquan Pei [mailto:liquanpei@gmail.com] 
Sent: September 30, 2014 12:31
To: Haopu Wang
Cc: dev@spark.apache.org; user
Subject: Re: Spark SQL question: why build hashtable for both sides in HashOuterJoin?

 

Hi Haopu,

 

My understanding is that the hash tables on both the left and right side are used to include null values in the result in an efficient manner. If a hash table is built on only one side, say the left side, and we perform a left outer join, then for each row on the left side a scan over the right side would be needed to make sure there are no matching tuples for that row.
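As a rough illustration of the two-table scheme described above (plain Python with made-up tuple rows, not the HashOuterJoin code itself): with hash tables on both sides, unmatched rows on either side are found by key lookups instead of rescans of the other relation.

```python
from collections import defaultdict

def full_outer_join(left, right, key):
    # Hash tables on BOTH sides, as in the scheme under discussion.
    left_table, right_table = defaultdict(list), defaultdict(list)
    for row in left:
        left_table[key(row)].append(row)
    for row in right:
        right_table[key(row)].append(row)
    result = []
    # Matches plus null-padded left rows, via lookups in the right table.
    for k, lrows in left_table.items():
        rrows = right_table.get(k)
        for l in lrows:
            if rrows:
                result.extend((l, r) for r in rrows)
            else:
                result.append((l, None))
    # Null-padded right rows: keys never seen on the left side.
    for k, rrows in right_table.items():
        if k not in left_table:
            result.extend((None, r) for r in rrows)
    return result
```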

 

Hope this helps!

Liquan

 

On Mon, Sep 29, 2014 at 8:36 PM, Haopu Wang <HW...@qilinsoft.com> wrote:

I took a look at HashOuterJoin and it's building a hash table for both
sides.

This consumes quite a lot of memory when the partition is big. And it
doesn't reduce the iteration over the streamed relation, right?

Thanks!

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org





 

-- 
Liquan Pei 
Department of Physics 
University of Massachusetts Amherst 





 

-- 
Liquan Pei 
Department of Physics 
University of Massachusetts Amherst 


Re: Spark SQL question: why build hashtable for both sides in HashOuterJoin?

Posted by Liquan Pei <li...@gmail.com>.
I am working on a PR to leverage the HashJoin trait code to optimize the
left/right outer join. It's already been tested locally, and I will send
out the PR soon after some cleanup.

Thanks,
Liquan

On Wed, Oct 8, 2014 at 12:09 AM, Matei Zaharia <ma...@gmail.com>
wrote:

> I'm pretty sure inner joins on Spark SQL already build only one of the
> sides. Take a look at ShuffledHashJoin, which calls HashJoin.joinIterators.
> Only outer joins do both, and it seems like we could optimize it for those
> that are not full.
>
> Matei
>
>
>
> On Oct 7, 2014, at 11:04 PM, Haopu Wang <HW...@qilinsoft.com> wrote:
>
> Liquan, yes, for a full outer join, a hash table on each side is more
> efficient.
>
> For the left/right outer join, it looks like one hash table should be
> enough.
>
> ------------------------------
> *From:* Liquan Pei [mailto:liquanpei@gmail.com <li...@gmail.com>]
> *Sent:* September 30, 2014 18:34
> *To:* Haopu Wang
> *Cc:* dev@spark.apache.org; user
> *Subject:* Re: Spark SQL question: why build hashtable for both sides in
> HashOuterJoin?
>
> Hi Haopu,
>
> How about full outer join? One hash table may not be efficient for this
> case.
>
> Liquan
>
> On Mon, Sep 29, 2014 at 11:47 PM, Haopu Wang <HW...@qilinsoft.com> wrote:
> Hi, Liquan, thanks for the response.
>
> In your example, I think the hash table should be built on the "right"
> side, so Spark can iterate through the left side and find matches in the
> right side from the hash table efficiently. Please comment and suggest,
> thanks again!
>
> ------------------------------
> *From:* Liquan Pei [mailto:liquanpei@gmail.com]
> *Sent:* September 30, 2014 12:31
> *To:* Haopu Wang
> *Cc:* dev@spark.apache.org; user
> *Subject:* Re: Spark SQL question: why build hashtable for both sides in
> HashOuterJoin?
>
> Hi Haopu,
>
> My understanding is that the hash tables on both the left and right side
> are used to include null values in the result in an efficient manner. If a
> hash table is built on only one side, say the left side, and we perform a
> left outer join, then for each row on the left side a scan over the right
> side would be needed to make sure there are no matching tuples for that row.
>
> Hope this helps!
> Liquan
>
> On Mon, Sep 29, 2014 at 8:36 PM, Haopu Wang <HW...@qilinsoft.com> wrote:
>
> I took a look at HashOuterJoin and it's building a hash table for both
> sides.
>
> This consumes quite a lot of memory when the partition is big. And it
> doesn't reduce the iteration over the streamed relation, right?
>
> Thanks!
>
>
>
>
> --
> Liquan Pei
> Department of Physics
> University of Massachusetts Amherst
>
>
>
> --
> Liquan Pei
> Department of Physics
> University of Massachusetts Amherst
>
>
>


-- 
Liquan Pei
Department of Physics
University of Massachusetts Amherst

Re: Spark SQL question: why build hashtable for both sides in HashOuterJoin?

Posted by Matei Zaharia <ma...@gmail.com>.
I'm pretty sure inner joins on Spark SQL already build only one of the sides. Take a look at ShuffledHashJoin, which calls HashJoin.joinIterators. Only outer joins do both, and it seems like we could optimize it for those that are not full.
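For contrast, the one-sided build of an inner hash join might be sketched like this (plain Python with made-up tuple rows, not HashJoin.joinIterators itself): unmatched streamed rows are simply dropped, so no state has to be kept for the streamed side.

```python
from collections import defaultdict

def inner_hash_join(build, streamed, key):
    # Build a hash table on one side only.
    table = defaultdict(list)
    for row in build:
        table[key(row)].append(row)
    # Stream the other side; rows without a match are simply dropped,
    # which is why a second hash table is never needed for an inner join.
    return [(s, b)
            for s in streamed
            for b in table.get(key(s), [])]
```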

Matei


On Oct 7, 2014, at 11:04 PM, Haopu Wang <HW...@qilinsoft.com> wrote:

> Liquan, yes, for a full outer join, a hash table on each side is more efficient.
>  
> For the left/right outer join, it looks like one hash table should be enough.
>  
> From: Liquan Pei [mailto:liquanpei@gmail.com] 
> Sent: September 30, 2014 18:34
> To: Haopu Wang
> Cc: dev@spark.apache.org; user
> Subject: Re: Spark SQL question: why build hashtable for both sides in HashOuterJoin?
>  
> Hi Haopu,
>  
> How about full outer join? One hash table may not be efficient for this case. 
>  
> Liquan
>  
> On Mon, Sep 29, 2014 at 11:47 PM, Haopu Wang <HW...@qilinsoft.com> wrote:
> Hi, Liquan, thanks for the response.
>  
> In your example, I think the hash table should be built on the "right" side, so Spark can iterate through the left side and find matches in the right side from the hash table efficiently. Please comment and suggest, thanks again!
>  
> From: Liquan Pei [mailto:liquanpei@gmail.com] 
> Sent: September 30, 2014 12:31
> To: Haopu Wang
> Cc: dev@spark.apache.org; user
> Subject: Re: Spark SQL question: why build hashtable for both sides in HashOuterJoin?
>  
> Hi Haopu,
>  
> My understanding is that the hash tables on both the left and right side are used to include null values in the result in an efficient manner. If a hash table is built on only one side, say the left side, and we perform a left outer join, then for each row on the left side a scan over the right side would be needed to make sure there are no matching tuples for that row. 
>  
> Hope this helps!
> Liquan
>  
> On Mon, Sep 29, 2014 at 8:36 PM, Haopu Wang <HW...@qilinsoft.com> wrote:
> I took a look at HashOuterJoin and it's building a hash table for both
> sides.
> 
> This consumes quite a lot of memory when the partition is big. And it
> doesn't reduce the iteration over the streamed relation, right?
> 
> Thanks!
> 
> 
> 
> 
>  
> -- 
> Liquan Pei 
> Department of Physics 
> University of Massachusetts Amherst
> 
> 
>  
> -- 
> Liquan Pei 
> Department of Physics 
> University of Massachusetts Amherst

