Posted to dev@pig.apache.org by "Pradeep Kamath (JIRA)" <ji...@apache.org> on 2008/08/06 19:40:44 UTC

[jira] Created: (PIG-361) JOIN and cogroup should handle NULLs correctly

JOIN and cogroup should handle NULLs correctly
----------------------------------------------

                 Key: PIG-361
                 URL: https://issues.apache.org/jira/browse/PIG-361
             Project: Pig
          Issue Type: Sub-task
    Affects Versions: types_branch
            Reporter: Pradeep Kamath
             Fix For: types_branch


JOIN should follow SQL semantics, i.e., if the join key, or any part of it, is null in the first table, it should not join with similar keys in the second table.

Cogroup should coalesce all NULL key rows into one group.
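
As an illustration of the JOIN rule above, here is a minimal Python sketch of a hash join that treats a key as unmatched when it, or any field of a composite key, is null. It is not Pig code; the names has_null, inner_join and key_of are hypothetical, and rows are modelled as plain tuples with None standing in for NULL.

```python
from collections import defaultdict


def has_null(key):
    """Return True if the key, or any field of a composite key, is None."""
    if isinstance(key, tuple):
        return any(field is None for field in key)
    return key is None


def inner_join(left, right, key_of):
    """Hash join in which a (partly) null key never matches anything."""
    index = defaultdict(list)
    for row in right:
        key = key_of(row)
        if not has_null(key):
            index[key].append(row)

    joined = []
    for row in left:
        key = key_of(row)
        if has_null(key):
            continue  # a null key matches nothing, not even another null
        for match in index.get(key, []):
            joined.append(row + match)
    return joined


A = [(1, "a"), (None, "b")]
B = [(1, "x"), (None, "y")]
print(inner_join(A, B, key_of=lambda row: row[0]))
```

With these inputs only the rows keyed by 1 are joined; the two rows with a null key are dropped rather than paired with each other.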

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Assigned: (PIG-361) JOIN and cogroup should handle NULLs correctly

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich reassigned PIG-361:
----------------------------------

    Assignee: Alan Gates



[jira] Updated: (PIG-361) JOIN and cogroup should handle NULLs correctly

Posted by "Alan Gates (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates updated PIG-361:
---------------------------

    Attachment: PIG-361.patch

This patch makes a number of changes.  It removes IndexedTuple.  Instead values are passed between map and reduce jobs as NullableTuples.  These extend WritableComparable and contain a tuple.  They also have bytes to indicate whether a tuple is null and which part of a join it comes from.

A new type PigNullableWritable has been added.  All of the NullableXWritable types now extend this (including NullableTuple).  Keys passed between map and reduce jobs are now of this type.  This allows the sorting to be done on the index but not the grouping or partitioning.

I also found a major problem in the SortPartitioner.  It was assuming all inputs were tuples and then applying the raw comparator.  But in 2.0 we do not use tuples in the case of a single key.  So I modified SortPartitioner to correctly determine the key type and use the correct type of comparator.
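
The idea of sorting on the index while grouping and partitioning without it can be sketched roughly as follows. This is a hedged Python illustration, not the classes in the patch: NullableKey, group_key, sort_key, partition and shuffle are hypothetical names, and the sketch does not model any special handling of null keys during grouping.

```python
from dataclasses import dataclass
from itertools import groupby
from typing import Any


@dataclass(frozen=True)
class NullableKey:
    """Hypothetical stand-in for a nullable, indexed shuffle key."""
    value: Any     # key value; ignored when is_null is True
    is_null: bool  # null flag
    index: int     # which input of the join/cogroup the record came from

    def group_key(self):
        # Used for grouping and partitioning: the index is left out.
        return (self.is_null, None if self.is_null else self.value)

    def sort_key(self):
        # Used for sorting: the grouping key with the index as tie-breaker.
        return self.group_key() + (self.index,)


def partition(key, num_reducers):
    """Pick a reducer from the grouping key only, so the index cannot split a key."""
    return hash(key.group_key()) % num_reducers


def shuffle(keys):
    """Sort on (key, index), then group on key alone."""
    ordered = sorted(keys, key=NullableKey.sort_key)
    return [list(group) for _, group in groupby(ordered, key=NullableKey.group_key)]


keys = [
    NullableKey(2, False, 1),
    NullableKey(1, False, 1),
    NullableKey(1, False, 0),
    NullableKey(None, True, 0),
]
for group in shuffle(keys):
    print(group)
```

Within each group the records arrive ordered by input index, yet records from different inputs that share a key still land in the same group and on the same reducer.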





[jira] Resolved: (PIG-361) JOIN and cogroup should handle NULLs correctly

Posted by "Alan Gates (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates resolved PIG-361.
----------------------------

    Resolution: Fixed

PIG-361.patch committed.



[jira] Updated: (PIG-361) JOIN and cogroup should handle NULLs correctly

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich updated PIG-361:
-------------------------------

    Priority: Critical  (was: Major)



[jira] Commented: (PIG-361) JOIN and cogroup should handle NULLs correctly

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628477#action_12628477 ] 

Olga Natkovich commented on PIG-361:
------------------------------------

After having further discussion, here is what I think is the right thing to do:

(1) Cogroup distinguishes between NULL keys from different relations by creating separate records

A = load ...
B = load ...
C = cogroup A by $0, B by $0;
...

Assuming that both A and B contain null values in the key column, C would look as follows:

{
....
NULL,  {.....}, {}
NULL, {}, {...}
....
}

The first record corresponds to all records of A with a NULL key and the second to all records of B with a NULL key.

(2) This is consistent with SQL semantics, in which NULLs are never equal to one another. JOIN will work as is, and an outer join expressed as COGROUP + FOREACH with bincond will work as it did in earlier versions.

(3) The required work is to add the relation id to the comparison function. Join optimization already does that. We will try to piggyback this issue onto join optimization.
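
A rough Python sketch of the behaviour proposed in (1) and (3): non-null keys are compared on the key alone, while null keys also carry the relation id, so NULLs from A and NULLs from B end up in separate records. This is an illustration only; the cogroup function and its key_of parameter are hypothetical, and None stands in for NULL.

```python
def cogroup(a, b, key_of=lambda row: row[0]):
    """Cogroup two relations, keeping null keys separate per relation."""
    groups = {}
    for rel_id, relation in enumerate((a, b)):
        for row in relation:
            key = key_of(row)
            # Non-null keys compare on the key only. Null keys also carry
            # the relation id, so nulls from a and nulls from b never merge.
            slot = (key, None) if key is not None else (None, rel_id)
            bags = groups.setdefault(slot, ([], []))
            bags[rel_id].append(row)
    return [(slot[0], bag_a, bag_b) for slot, (bag_a, bag_b) in groups.items()]


A = [(1, "a"), (None, "b"), (None, "c")]
B = [(1, "x"), (None, "y")]
for record in cogroup(A, B):
    print(record)
```

For these inputs the output has one record for key 1 with both bags filled, one NULL record holding only rows of A, and one NULL record holding only rows of B.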



[jira] Assigned: (PIG-361) JOIN and cogroup should handle NULLs correctly

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich reassigned PIG-361:
----------------------------------

    Assignee: Shravan Matthur Narayanamurthy  (was: Alan Gates)

Shravan, could you take a look please.

I think we want to preserve SQL semantics here:

JOIN / INNER COGROUP - throws away all NULLs
OUTER COGROUP + flatten with bincond - simulates outer joins, where missing data is padded with NULLs and NULLs are assumed to be all different, so they are never multiplied.
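
To make the two cases concrete, here is a small Python sketch that flattens an already cogrouped relation. It assumes the cogroup output keeps NULL keys of each input in separate records, and it models the bincond as "use one all-null row when a bag is empty". The names flatten_inner and flatten_outer and the width parameters are hypothetical.

```python
def flatten_inner(cogrouped):
    """Inner join: an empty bag on either side yields no rows for that key."""
    return [
        row_a + row_b
        for _, bag_a, bag_b in cogrouped
        for row_a in bag_a
        for row_b in bag_b
    ]


def flatten_outer(cogrouped, width_a, width_b):
    """Full outer join: an empty bag is replaced by a single all-null row."""
    pad_a = [(None,) * width_a]
    pad_b = [(None,) * width_b]
    out = []
    for _, bag_a, bag_b in cogrouped:
        for row_a in (bag_a or pad_a):      # bincond: empty bag -> null row
            for row_b in (bag_b or pad_b):  # bincond: empty bag -> null row
                out.append(row_a + row_b)
    return out


# Records are (key, bag from A, bag from B); None stands in for NULL.
cogrouped = [
    (1, [(1, "a")], [(1, "x")]),
    (None, [(None, "b"), (None, "c")], []),
    (None, [], [(None, "y")]),
]
print(flatten_inner(cogrouped))
print(flatten_outer(cogrouped, width_a=2, width_b=2))
```

Because the NULL-keyed rows of A and B sit in different records, the inner flatten drops them all, and the outer flatten pads each of them once instead of producing their cross product.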



[jira] Commented: (PIG-361) JOIN and cogroup should handle NULLs correctly

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12632774#action_12632774 ] 

Olga Natkovich commented on PIG-361:
------------------------------------

+1 on the patch

Couple of small comments:

(1) NullableBag and NullableTuple can use static factory to avoid "if" on every bag/tuple construction
(2) The index is currently attached to all data even if we only have 1 stream. It is only a single byte, but we could optimize a bit further here later.



[jira] Assigned: (PIG-361) JOIN and cogroup should handle NULLs correctly

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich reassigned PIG-361:
----------------------------------

    Assignee: Alan Gates  (was: Shravan Matthur Narayanamurthy)

Reassigning back to Alan since he is looking into join optimization.
