Posted to dev@pig.apache.org by "Olga Natkovich (JIRA)" <ji...@apache.org> on 2010/08/31 23:47:54 UTC

[jira] Commented: (PIG-1506) Need to clarify the difference between null handling in JOIN and COGROUP

    [ https://issues.apache.org/jira/browse/PIG-1506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904785#action_12904785 ] 

Olga Natkovich commented on PIG-1506:
-------------------------------------

This is what we need to document:

In the case of GROUP/COGROUP, records with a NULL key that come from the same input are grouped together. For instance:

Input data:

joe     5       2.5
sam             3.0
bob             3.5

Script:

A = load 'small' as (name, age, gpa);
B = group A by age;
dump B;

Output:

(5,{(joe,5,2.5)})
(,{(sam,,3.0),(bob,,3.5)})

Note that both records with null age are grouped together.
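
If the null-keyed records should not form a group of their own, one option is to filter them out before grouping. A minimal sketch, assuming the same 'small' input as above (the alias A_known is only an illustrative name):

```pig
A = load 'small' as (name, age, gpa);
-- drop records whose grouping key is null before grouping
A_known = filter A by age is not null;
B = group A_known by age;
dump B;
```

With the sample data this should leave only the group for age 5.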

However, null keys from different inputs are treated as distinct, so COGROUP generates a separate tuple for each input's null-keyed records. For instance:

Input: the same as for GROUP (a self-cogroup on the same input).

Script:

A = load 'small' as (name, age, gpa);
B = load 'small' as (name, age, gpa);
C = cogroup A by age, B by age;
dump C;

Output:

(5,{(joe,5,2.5)},{(joe,5,2.5)})
(,{(sam,,3.0),(bob,,3.5)},{})
(,{},{(sam,,3.0),(bob,,3.5)})

Note that there are 2 tuples in the output corresponding to the null key: one that contains tuples from the first input (with no match from the second) and one the other way around.
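
If the null-key tuples are not wanted in the COGROUP result, the output can be filtered on the group key. A sketch under the same assumptions as the script above (the alias D is hypothetical):

```pig
A = load 'small' as (name, age, gpa);
B = load 'small' as (name, age, gpa);
C = cogroup A by age, B by age;
-- 'group' is the name Pig gives to the key field of a GROUP/COGROUP result
D = filter C by group is not null;
dump D;
```

With the sample data this should keep only the tuple for age 5 and drop both null-key tuples.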

JOIN adds another interesting twist because it follows the SQL standard: by default JOIN is an inner join, which throws away all records with null keys.

Input: the same as for COGROUP

Script:

A = load 'small' as (name, age, gpa);
B = load 'small' as (name, age, gpa);
C = join A by age, B by age;
dump C;

Output:

(joe,5,2.5,joe,5,2.5)

Note that all tuples that had a NULL key were filtered out.
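
One way to see why the null keys disappear: an inner join behaves like a COGROUP followed by a FLATTEN of both bags, and flattening an empty bag produces no output rows. Each null-key tuple from the COGROUP has one empty bag, so nothing survives. The sketch below spells this out and also shows a left outer join, which is one option when null-keyed records from one side need to be kept. The aliases D and E are illustrative, and the comments describe expected behavior rather than verified output:

```pig
A = load 'small' as (name, age, gpa);
B = load 'small' as (name, age, gpa);

-- inner join spelled out as cogroup + flatten;
-- a tuple with an empty bag on either side yields no rows
C = cogroup A by age, B by age;
D = foreach C generate flatten(A), flatten(B);
dump D;

-- left outer join keeps every record of A, including those with a null key,
-- and pads the B fields with nulls when there is no match
E = join A by age left outer, B by age;
dump E;
```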


> Need to clarify the difference between null handling in JOIN and COGROUP
> ------------------------------------------------------------------------
>
>                 Key: PIG-1506
>                 URL: https://issues.apache.org/jira/browse/PIG-1506
>             Project: Pig
>          Issue Type: Improvement
>          Components: documentation
>            Reporter: Olga Natkovich
>            Assignee: Corinne Chandel
>             Fix For: 0.8.0
>
>

