You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@hive.apache.org by "Gopal V (JIRA)" <ji...@apache.org> on 2018/11/05 19:42:00 UTC
[jira] [Comment Edited] (HIVE-20867) Rewrite INTERSECT into LEFT
SEMI JOIN instead of UNION + Group by
[ https://issues.apache.org/jira/browse/HIVE-20867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675678#comment-16675678 ]
Gopal V edited comment on HIVE-20867 at 11/5/18 7:41 PM:
---------------------------------------------------------
bq. Comparing with left-semi join one, we need to do the join one by one.
That is true at the logical level, but the physical runtime uses the bloom filter semi-join before data-gets shuffled & the join operations actually do not force a full sort (while the group-by does).
The current implementation's achilles heel is when the two branches are unequal in size - if you run something like
{code}
select ss_item_sk as k from store_sales
intersects
select i_item_sk as k from item where i_category = 'Sports'
{code}
vs
{code}
select distinct ss_item_sk as k from store_sales where ss_item_sk IN (select i_item_sk as k from item where i_category = 'Sports').
{code}
you can see the vast difference between the two approaches.
However, I think this might need a null-safe semi-join, because within intersects null == null.
was (Author: gopalv):
bq. Comparing with left-semi join one, we need to do the join one by one.
That is true at the logical level, but the physical runtime uses the bloom filter semi-join before data-gets shuffled & the join operations actually do not force a full sort (while the group-by does).
The current implementation's achilles heel is when the two branches are unequal in size - if you run something like
{code}
select ss_item_sk as k from store_sales
intersects
select i_item_sk as k from item where i_category = 'Sports'
{code}
vs
{code}
select ss_item_sk as k from store_sales where ss_item_sk IN (select i_item_sk as k from item where i_category = 'Sports').
{code}
you can see the vast difference between the two approaches.
However, I think this might need a null-safe semi-join, because within intersects null == null.
> Rewrite INTERSECT into LEFT SEMI JOIN instead of UNION + Group by
> -----------------------------------------------------------------
>
> Key: HIVE-20867
> URL: https://issues.apache.org/jira/browse/HIVE-20867
> Project: Hive
> Issue Type: Improvement
> Components: Query Planning
> Affects Versions: 4.0.0
> Reporter: Vineet Garg
> Assignee: Vineet Garg
> Priority: Major
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)