Posted to dev@hive.apache.org by "Ning Zhang (JIRA)" <ji...@apache.org> on 2009/10/06 01:51:31 UTC

[jira] Created: (HIVE-870) semi joins

semi joins
----------

                 Key: HIVE-870
                 URL: https://issues.apache.org/jira/browse/HIVE-870
             Project: Hadoop Hive
          Issue Type: New Feature
            Reporter: Ning Zhang
            Assignee: Ning Zhang


Semi-join is an efficient way to unnest an IN/EXISTS subquery. For example,

select * 
from A
where A.id IN 
   (select id
    from B
    where B.date> '2009-10-01');

returns the rows of A whose ID is in the set of IDs found in B with a date later than the given date. This query can be unnested using an INNER JOIN or LEFT OUTER JOIN, but then we need to deduplicate the IDs returned by the subquery on table B. The semantics of LEFT SEMI JOIN are that as long as there is ANY row in the right-hand table that matches the join key, the left-hand table row is emitted as a result, without necessarily looking further in the right-hand table for more matches. This is exactly the semantics of the IN subquery.
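
To make the deduplication point concrete, here is a small sketch that runs the example query in an in-memory SQLite database through Python's sqlite3 module. It is an illustration only and does not involve Hive: the sample rows, the extra "name" column, and the "ids" alias are assumptions made up for the demonstration. It compares the IN subquery with a plain inner join and with an inner join over deduplicated IDs.

```python
import sqlite3

# Hypothetical sample data; table and column names follow the example query,
# the "name" column is added only so that the rows of A are distinguishable.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (id INTEGER, name TEXT);
    CREATE TABLE B (id INTEGER, date TEXT);
    INSERT INTO A VALUES (1, 'a1'), (2, 'a2'), (3, 'a3');
    -- id 1 passes the date filter twice, id 2 does not pass it at all
    INSERT INTO B VALUES (1, '2009-10-02'), (1, '2009-10-05'),
                         (2, '2009-09-30'), (4, '2009-10-03');
""")

# The IN subquery from the description: each matching row of A appears once.
in_rows = conn.execute("""
    SELECT A.id, A.name FROM A
    WHERE A.id IN (SELECT id FROM B WHERE B.date > '2009-10-01')
    ORDER BY A.id
""").fetchall()

# Naive unnesting with an inner join: the A row is repeated once per match in B.
join_rows = conn.execute("""
    SELECT A.id, A.name FROM A
    JOIN B ON A.id = B.id
    WHERE B.date > '2009-10-01'
    ORDER BY A.id
""").fetchall()

# Unnesting with deduplicated IDs from B: equivalent to the IN subquery again.
dedup_rows = conn.execute("""
    SELECT A.id, A.name FROM A
    JOIN (SELECT DISTINCT id FROM B WHERE B.date > '2009-10-01') ids
      ON A.id = ids.id
    ORDER BY A.id
""").fetchall()

print(in_rows)     # [(1, 'a1')]
print(join_rows)   # [(1, 'a1'), (1, 'a1')]
print(dedup_rows)  # [(1, 'a1')]
```

The duplicated row in the second result is what a semi join avoids without needing a separate DISTINCT step.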

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-870) semi joins

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12774863#action_12774863 ] 

Namit Jain commented on HIVE-870:
---------------------------------

Can you also add more tests with STREAMTABLE?

Do you want to separate out the comment changes and file a new JIRA for them?
They are blowing up the number of files and making the patch difficult to review. If you
think that will help, please file a new JIRA and submit a patch for it - I will try to
take a look at it ASAP.


[jira] Updated: (HIVE-870) semi joins

Posted by "Ning Zhang (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ning Zhang updated HIVE-870:
----------------------------

    Attachment: Hive-870_3.patch

A new patch, Hive-870_3.patch, is attached. It reverts the change to QTestUtils.java that handled comments in unit tests, along with the output file changes that resulted from it.


[jira] Commented: (HIVE-870) semi joins

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12774860#action_12774860 ] 

Namit Jain commented on HIVE-870:
---------------------------------



PREHOOK: query: select a.key from t3 a left outer join t1 b on a.key = b.key left semi join t2 c on b.key = c.key sort by a.key

Can you check the output? I need to double-check it, but it looks wrong at first glance.


[jira] Updated: (HIVE-870) semi joins

Posted by "Ning Zhang (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ning Zhang updated HIVE-870:
----------------------------

    Attachment: Hive-870_2.patch

Uploading a new patch, Hive-870_2.patch. It fixes the issue in map-side left semi join and also fixes the case where a left semi join is mixed with a right/full outer join.


[jira] Commented: (HIVE-870) semi joins

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12774862#action_12774862 ] 

Namit Jain commented on HIVE-870:
---------------------------------

Ignore my earlier comment; I didn't see that it is an outer join and not a semi join.


[jira] Updated: (HIVE-870) Implement LEFT SEMI JOIN

Posted by "Carl Steinbach (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Carl Steinbach updated HIVE-870:
--------------------------------

        Summary: Implement LEFT SEMI JOIN  (was: semi joins)
    Component/s: Query Processor


[jira] Updated: (HIVE-870) semi joins

Posted by "Ning Zhang (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ning Zhang updated HIVE-870:
----------------------------

    Attachment: Hive-870.patch

Uploading Hive-870.patch. This patch includes the following changes:

1) Enhance the HiveQL syntax to support LEFT SEMI JOIN.
2) Introduce a new left semi join type in the CommonJoinOperator. This join operator exits early as soon as a match is found in the right-hand-side (RHS) table of the left semi join.
3) On the map side, add a select operator that projects only the join keys of the RHS table, followed by a map-side partial group-by operator that eliminates duplicate keys. Only the key is needed; the value is NULL.
4) If the RHS is used in a map-side join, only the select operator is introduced. The map-side group-by operator is not necessary.
5) Some miscellaneous cleanups (e.g., allowing '--' comments to appear anywhere in the unit test qfiles). Many of the unit test diffs are due to this change. All the unit tests for semi join are in semijoin.q.
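
The idea behind items 2) and 3) can be sketched in a few lines of single-process Python. This is not the code in the patch and does not use any Hive classes: the function names rhs_join_keys and left_semi_join and the sample rows are hypothetical, chosen only to show key-only projection with duplicate elimination on the RHS, followed by a join that stops at the first match.

```python
from typing import Any, Dict, Hashable, Iterable, Iterator, List

Row = Dict[str, Any]


def rhs_join_keys(rhs_rows: Iterable[Row], key: str) -> List[Hashable]:
    """Project only the join key of each RHS row and drop duplicate keys.

    Mirrors the select + partial group-by idea: the join needs the key only,
    so no other RHS column is carried along. (Hypothetical helper.)
    """
    seen = set()
    keys: List[Hashable] = []
    for row in rhs_rows:
        k = row[key]
        # NULL keys never match in an equi-join, so they can be skipped.
        if k is not None and k not in seen:
            seen.add(k)
            keys.append(k)
    return keys


def left_semi_join(lhs_rows: Iterable[Row], rhs_keys: List[Hashable],
                   key: str) -> Iterator[Row]:
    """Emit each LHS row at most once, stopping at the first matching RHS key."""
    for left in lhs_rows:
        for k in rhs_keys:
            if left[key] == k:
                yield left
                break  # early exit: further RHS matches cannot change the result


# Hypothetical sample rows.
a_rows = [{"id": 1, "name": "a1"}, {"id": 2, "name": "a2"}, {"id": 3, "name": "a3"}]
b_rows = [{"id": 1}, {"id": 1}, {"id": 4}, {"id": None}]

keys = rhs_join_keys(b_rows, "id")
result = list(left_semi_join(a_rows, keys, "id"))
print(keys)    # [1, 4]
print(result)  # [{'id': 1, 'name': 'a1'}]
```

Even without the deduplication step, the early exit alone keeps each left-hand row from being emitted more than once; removing duplicate keys beforehand only reduces how much RHS data has to be moved and scanned.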


[jira] Resolved: (HIVE-870) semi joins

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namit Jain resolved HIVE-870.
-----------------------------

       Resolution: Fixed
    Fix Version/s: 0.5.0
     Hadoop Flags: [Reviewed]

Committed. Thanks Ning


[jira] Commented: (HIVE-870) semi joins

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12773736#action_12773736 ] 

Namit Jain commented on HIVE-870:
---------------------------------

Talked with Ning offline.

Two things need to be tested:

1. Why can't we have a group by in the case of a map join?
2. There seems to be a problem in the case: A left semi join B right outer join C

