Posted to dev@pig.apache.org by "Thejas M Nair (JIRA)" <ji...@apache.org> on 2010/06/23 00:33:51 UTC

[jira] Commented: (PIG-1461) support union operation that merges based on column names

    [ https://issues.apache.org/jira/browse/PIG-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12881420#action_12881420 ] 

Thejas M Nair commented on PIG-1461:
------------------------------------

This operator will throw an error if the schema for any of the input relations is undefined.

Users often need to look up the source relation downstream of the 'unionschema' operation. It would be convenient to project an additional pseudo column whose value is the name of the input relation.
i.e., the schema of U in the description becomes - U: {a:bytearray, b:bytearray, c:bytearray, source_relation:chararray}

This feature does not let a user do anything that was not possible before; it just makes the code easier to maintain, because you don't have to change the pig query when new columns are added.
The same results can be obtained using existing pig syntax, as shown in the following query -

L1 = load 'x' as (a,b);
L2 = load 'y' as (b,c);
F1 = foreach L1 generate a, b, null as c, 'F1' as source_relation;
F2 = foreach L2 generate null as a, b, c, 'F2' as source_relation;
U = union F1, F2;

Note that in this query, if the schema of L1 or L2 changes, you will need to change F1 or F2.
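
To make the intended semantics concrete, here is a rough sketch in Python rather than Pig. It only models the behavior described in this comment: columns are merged by name, missing columns are filled with null, an error is raised when a schema is undefined, and a source_relation column is appended. The function name union_by_name and the (schema, rows) representation are assumptions made for illustration. They are not part of Pig.

```python
def union_by_name(relations):
    """Model of a name-based union.

    `relations` maps a relation name to (schema, rows), where schema is a
    list of column names (or None if undefined) and rows is a list of tuples.
    This is a hypothetical helper, not a Pig API.
    """
    # Build the merged schema, keeping first-seen column order.
    merged = []
    for name, (schema, _rows) in relations.items():
        if schema is None:
            # Mirrors "throw an error if the schema ... is undefined".
            raise ValueError(f"relation {name!r} has no schema")
        for col in schema:
            if col not in merged:
                merged.append(col)

    # Re-project every row onto the merged schema, padding with None,
    # and append the name of the input relation as a pseudo column.
    out_rows = []
    for name, (schema, rows) in relations.items():
        for row in rows:
            by_name = dict(zip(schema, row))
            out_rows.append(tuple(by_name.get(c) for c in merged) + (name,))
    return merged + ["source_relation"], out_rows


schema, rows = union_by_name({
    "L1": (["a", "b"], [(1, 2)]),
    "L2": (["b", "c"], [(3, 4)]),
})
print(schema)  # ['a', 'b', 'c', 'source_relation']
print(rows)    # [(1, 2, None, 'L1'), (None, 3, 4, 'L2')]
```

Unlike the hand-written F1/F2 projections, nothing in this sketch needs to change when an input gains a column; the merged schema is recomputed from the inputs.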



> support union operation that merges based on column names
> ---------------------------------------------------------
>
>                 Key: PIG-1461
>                 URL: https://issues.apache.org/jira/browse/PIG-1461
>             Project: Pig
>          Issue Type: New Feature
>          Components: impl
>    Affects Versions: 0.8.0
>            Reporter: Thejas M Nair
>             Fix For: 0.8.0
>
>
> When the data has a schema, it often makes sense to union on the column names in the schema rather than on the position of the columns.
> The behavior of the existing union operator should remain backward compatible.
> This feature can be supported either with a new operator or by extending union to support a 'using' clause. I am thinking of having a new operator called either unionschema or merge. Does anybody have any other suggestions for the syntax?
> example -
> L1 = load 'x' as (a,b);
> L2 = load 'y' as (b,c);
> U = unionschema L1, L2;
> describe U;
> U: {a:bytearray, b:bytearray, c:bytearray}
