You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@hive.apache.org by "Gopal V (JIRA)" <ji...@apache.org> on 2018/11/05 19:42:00 UTC

[jira] [Comment Edited] (HIVE-20867) Rewrite INTERSECT into LEFT SEMI JOIN instead of UNION + Group by

    [ https://issues.apache.org/jira/browse/HIVE-20867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675678#comment-16675678 ] 

Gopal V edited comment on HIVE-20867 at 11/5/18 7:41 PM:
---------------------------------------------------------

bq. Comparing with left-semi join one, we need to do the join one by one.

That is true at the logical level, but the physical runtime uses the bloom filter semi-join before data-gets shuffled & the join operations actually do not force a full sort (while the group-by does).

The current implementation's achilles heel is when the two branches are unequal in size - if you run something like

{code}
select ss_item_sk as k from store_sales
intersects 
select i_item_sk  as k from item where i_category = 'Sports'
{code}

vs

{code}
select distinct ss_item_sk as k from store_sales where ss_item_sk IN (select i_item_sk  as k from item where i_category = 'Sports').
{code}

you can see the vast difference between the two approaches.

However, I think this might need a null-safe semi-join, because within intersects null == null.


was (Author: gopalv):
bq. Comparing with left-semi join one, we need to do the join one by one.

That is true at the logical level, but the physical runtime uses the bloom filter semi-join before data-gets shuffled & the join operations actually do not force a full sort (while the group-by does).

The current implementation's achilles heel is when the two branches are unequal in size - if you run something like

{code}
select ss_item_sk as k from store_sales
intersects 
select i_item_sk  as k from item where i_category = 'Sports'
{code}

vs

{code}
select ss_item_sk as k from store_sales where ss_item_sk IN (select i_item_sk  as k from item where i_category = 'Sports').
{code}

you can see the vast difference between the two approaches.

However, I think this might need a null-safe semi-join, because within intersects null == null.

> Rewrite INTERSECT into LEFT SEMI JOIN instead of UNION + Group by
> -----------------------------------------------------------------
>
>                 Key: HIVE-20867
>                 URL: https://issues.apache.org/jira/browse/HIVE-20867
>             Project: Hive
>          Issue Type: Improvement
>          Components: Query Planning
>    Affects Versions: 4.0.0
>            Reporter: Vineet Garg
>            Assignee: Vineet Garg
>            Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)