Posted to dev@pig.apache.org by "Olga Natkovich (JIRA)" <ji...@apache.org> on 2010/07/19 20:15:51 UTC

[jira] Created: (PIG-1506) Need to clarify the difference between null handling in JOIN and COGROUP

Need to clarify the difference between null handling in JOIN and COGROUP
------------------------------------------------------------------------

                 Key: PIG-1506
                 URL: https://issues.apache.org/jira/browse/PIG-1506
             Project: Pig
          Issue Type: Improvement
          Components: documentation
            Reporter: Olga Natkovich
            Assignee: Corinne Chandel
             Fix For: 0.8.0




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Resolved: (PIG-1506) Need to clarify the difference between null handling in JOIN and COGROUP

Posted by "Corinne Chandel (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Corinne Chandel resolved PIG-1506.
----------------------------------

    Resolution: Fixed

Pig-Latin-Ref-Manual-1 has been updated to include a new section (Null Values). The patch was submitted via PIG-1600.



[jira] Commented: (PIG-1506) Need to clarify the difference between null handling in JOIN and COGROUP

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904829#action_12904829 ] 

Olga Natkovich commented on PIG-1506:
-------------------------------------

I verified that the 0.8 code correctly handles multi-column keys that contain nulls.



[jira] Commented: (PIG-1506) Need to clarify the difference between null handling in JOIN and COGROUP

Posted by "Scott Carey (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904835#action_12904835 ] 

Scott Carey commented on PIG-1506:
----------------------------------

I have just confirmed that this works correctly on 0.7 but not on 0.5, so it was fixed in 0.6 or 0.7. I suppose I can take out some null guards from my scripts now :)

This was my test:

{code}
A = LOAD '/tmp/test.txt' as (a,b,c);
B = LOAD '/tmp/test.txt' as (a,b,c);
C = JOIN A by (a,b), B by (a,b);

DUMP A;
DUMP C;
{code}

With 0.5 I get:
A:
(fred,1,3)
(bob,,4)
C:
(bob,,4,bob,,4)
(fred,1,3,fred,1,3)

and with 0.7 C is:
(fred,1,3,fred,1,3)
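
The difference between the two results above can be reproduced with a small Python model. This is only a sketch of the two matching rules, not Pig's implementation: the function name inner_join and the null_matches_null flag are made up for illustration, None stands in for a Pig null, and output order is not modeled.

```python
def inner_join(left, right, key_idx, null_matches_null=False):
    """Nested-loop inner join on the columns listed in key_idx (hypothetical helper)."""
    out = []
    for l in left:
        lkey = tuple(l[i] for i in key_idx)
        for r in right:
            rkey = tuple(r[i] for i in key_idx)
            # SQL-style rule: a null in ANY key column means the row never matches.
            if not null_matches_null and (None in lkey or None in rkey):
                continue
            if lkey == rkey:
                out.append(l + r)
    return out


# Same two records as in the test above; the empty field is modeled as None.
rows = [("fred", 1, 3), ("bob", None, 4)]

# Behavior reported for 0.7: only the fully non-null key matches.
sql_style = inner_join(rows, rows, key_idx=(0, 1))

# Behavior reported for 0.5: the (bob, null) key matches itself as well.
null_equal = inner_join(rows, rows, key_idx=(0, 1), null_matches_null=True)

print(sql_style)
print(null_equal)
```

With the flag off, only the fred row survives; with it on, the bob row with the null key column is joined to itself as well, which is the 0.5 result shown above.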





[jira] Commented: (PIG-1506) Need to clarify the difference between null handling in JOIN and COGROUP

Posted by "Scott Carey (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904819#action_12904819 ] 

Scott Carey commented on PIG-1506:
----------------------------------

For an outer join, the SQL behavior for the above would be to output five rows, just as COGROUP would if flattened. So that seems fine to me. A self-join should be the same as a COGROUP with yourself, which is different from a simple GROUP.

However, there is a problem with inner join and nulls.
Pig JOIN is not like SQL with respect to nulls on multi-column joins. (I have not tried this on trunk, however.)

In SQL, if ANY of the columns in a multi-column join is null, the row is not output.

Try:

{code}
A = load 'small' as (name, age, gpa);
B = load 'small' as (name, age, gpa);
C = join A by (name,age), B by (name,age);
dump C;
{code}

The result for SQL would be one row of the form 
joe 5 2.5 joe 5 2.5
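
The five-row count mentioned at the start of this comment for a SQL outer join can be checked with a short Python model. It is a sketch under stated assumptions: the join is taken to be a full outer self-join on age, 'small' is assumed to hold the three records from the issue description (joe, sam, bob), None stands in for null, and full_outer_join is a hypothetical helper, not a Pig or SQL API.

```python
def full_outer_join(left, right, key_idx, width):
    """Full outer join on one column; null keys never match (hypothetical helper)."""
    pad = (None,) * width          # null padding for the missing side
    out = []
    matched_right = set()
    for l in left:
        found = False
        if l[key_idx] is not None:
            for j, r in enumerate(right):
                if r[key_idx] == l[key_idx]:
                    out.append(l + r)
                    matched_right.add(j)
                    found = True
        if not found:
            out.append(l + pad)    # unmatched left row
    for j, r in enumerate(right):
        if j not in matched_right:
            out.append(pad + r)    # unmatched right row
    return out


# Assumed contents of 'small': (name, age, gpa), with missing ages as None.
small = [("joe", 5, 2.5), ("sam", None, 3.0), ("bob", None, 3.5)]

outer = full_outer_join(small, small, key_idx=1, width=3)
for row in outer:
    print(row)
print(len(outer))
```

In this model the joe row matches itself once, and each of the two null-age rows appears twice, once padded on the right and once padded on the left, giving five rows in total.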





[jira] Commented: (PIG-1506) Need to clarify the difference between null handling in JOIN and COGROUP

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904785#action_12904785 ] 

Olga Natkovich commented on PIG-1506:
-------------------------------------

This is what we need to document:

In the case of GROUP/COGROUP, records with a NULL key from the same input are grouped together. For instance:

Input data:

joe     5       2.5
sam             3.0
bob             3.5

script:

A = load 'small' as (name, age, gpa);
B = group A by age;
dump B;

Output:

(5,{(joe,5,2.5)})
(,{(sam,,3.0),(bob,,3.5)})

Note that both records with null age are grouped together.

However, data with null keys from different inputs is considered different and will generate multiple tuples in the case of COGROUP. For instance:

Input: Self cogroup on the same input.

Script:

A = load 'small' as (name, age, gpa);
B = load 'small' as (name, age, gpa);
C = cogroup A by age, B by age;
dump C;

Output:

(5,{(joe,5,2.5)},{(joe,5,2.5)})
(,{(sam,,3.0),(bob,,3.5)},{})
(,{},{(sam,,3.0),(bob,,3.5)})

Note that there are 2 tuples in the output corresponding to the null key: one that contains tuples from the first input (with no match from the second) and one the other way around.

JOIN adds another interesting twist to this because it follows the SQL standard, which means that JOIN by default is an inner join, which throws away all the nulls.

Input: the same as for COGROUP

Script:

A = load 'small' as (name, age, gpa);
B = load 'small' as (name, age, gpa);
C = join A by age, B by age;
dump C;

Output:

(joe,5,2.5,joe,5,2.5)

Note that all tuples that had a NULL key were filtered out.
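
The three behaviors described above can be compared side by side in a short Python model. This is a sketch of the documented semantics only, not of how Pig implements them: the functions group, cogroup and join are illustrative stand-ins, bags are modeled as lists, and None stands in for a null age.

```python
def group(rows, key_idx):
    """GROUP: all rows with a null key from one input land in the same group."""
    groups = {}
    for row in rows:
        groups.setdefault(row[key_idx], []).append(row)
    return list(groups.items())


def cogroup(a, b, key_idx):
    """COGROUP: null keys from different inputs are kept in separate tuples."""
    out = {}
    for src, rows in enumerate((a, b)):
        for row in rows:
            key = row[key_idx]
            # Tag null keys with their input index so they never meet across inputs.
            slot = ("null", src) if key is None else ("val", key)
            entry = out.setdefault(slot, (key, [], []))
            entry[1 + src].append(row)
    return list(out.values())


def join(a, b, key_idx):
    """Inner JOIN: rows with a null key are dropped."""
    return [l + r for l in a for r in b
            if l[key_idx] is not None and l[key_idx] == r[key_idx]]


# The input data from above: (name, age, gpa).
small = [("joe", 5, 2.5), ("sam", None, 3.0), ("bob", None, 3.5)]

grouped = group(small, 1)
cogrouped = cogroup(small, small, 1)
joined = join(small, small, 1)

print(grouped)
print(cogrouped)
print(joined)
```

The model yields one null group for GROUP, two separate null tuples for COGROUP (one per input), and no null-key rows at all for JOIN, matching the three outputs listed above.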

