You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@pig.apache.org by "Thejas M Nair (JIRA)" <ji...@apache.org> on 2010/06/23 00:17:53 UTC

[jira] Created: (PIG-1461) support union operation that merges based on column names

support union operation that merges based on column names
---------------------------------------------------------

                 Key: PIG-1461
                 URL: https://issues.apache.org/jira/browse/PIG-1461
             Project: Pig
          Issue Type: New Feature
          Components: impl
    Affects Versions: 0.8.0
            Reporter: Thejas M Nair
             Fix For: 0.8.0


When the data has schema, it often makes sense to union on column names in schema rather than the position of the columns. 
The behavior of existing union operator should remain backward compatible .

This feature can be supported using either a new operator or extending union to support 'using' clause . I am thinking of having a new operator called either unionschema or merge . Does anybody have any other suggestions for the syntax ?

example -

L1 = load 'x' as (a,b);
L2 = load 'y' as (b,c);
U = unionschema L1, L2;

describe U;
U: {a:bytearray, b:byetarray, c:bytearray}



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-1461) support union operation that merges based on column names

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12895414#action_12895414 ] 

Thejas M Nair commented on PIG-1461:
------------------------------------

Regarding 5, there are some differences in the way schema merge is done in both cases. I have created PIG-1536 to discuss/address this  .
I will make changes to address other comments.

> support union operation that merges based on column names
> ---------------------------------------------------------
>
>                 Key: PIG-1461
>                 URL: https://issues.apache.org/jira/browse/PIG-1461
>             Project: Pig
>          Issue Type: New Feature
>          Components: impl
>    Affects Versions: 0.8.0
>            Reporter: Thejas M Nair
>            Assignee: Thejas M Nair
>             Fix For: 0.8.0
>
>         Attachments: PIG-1461.1.patch, PIG-1461.patch
>
>
> When the data has schema, it often makes sense to union on column names in schema rather than the position of the columns. 
> The behavior of existing union operator should remain backward compatible .
> This feature can be supported using either a new operator or extending union to support 'using' clause . I am thinking of having a new operator called either unionschema or merge . Does anybody have any other suggestions for the syntax ?
> example -
> L1 = load 'x' as (a,b);
> L2 = load 'y' as (b,c);
> U = unionschema L1, L2;
> describe U;
> U: {a:bytearray, b:byetarray, c:bytearray}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (PIG-1461) support union operation that merges based on column names

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thejas M Nair updated PIG-1461:
-------------------------------

    Attachment: PIG-1461.patch

> support union operation that merges based on column names
> ---------------------------------------------------------
>
>                 Key: PIG-1461
>                 URL: https://issues.apache.org/jira/browse/PIG-1461
>             Project: Pig
>          Issue Type: New Feature
>          Components: impl
>    Affects Versions: 0.8.0
>            Reporter: Thejas M Nair
>            Assignee: Thejas M Nair
>             Fix For: 0.8.0
>
>         Attachments: PIG-1461.patch
>
>
> When the data has schema, it often makes sense to union on column names in schema rather than the position of the columns. 
> The behavior of existing union operator should remain backward compatible .
> This feature can be supported using either a new operator or extending union to support 'using' clause . I am thinking of having a new operator called either unionschema or merge . Does anybody have any other suggestions for the syntax ?
> example -
> L1 = load 'x' as (a,b);
> L2 = load 'y' as (b,c);
> U = unionschema L1, L2;
> describe U;
> U: {a:bytearray, b:byetarray, c:bytearray}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (PIG-1461) support union operation that merges based on column names

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thejas M Nair updated PIG-1461:
-------------------------------

    Status: Patch Available  (was: Open)

> support union operation that merges based on column names
> ---------------------------------------------------------
>
>                 Key: PIG-1461
>                 URL: https://issues.apache.org/jira/browse/PIG-1461
>             Project: Pig
>          Issue Type: New Feature
>          Components: impl
>    Affects Versions: 0.8.0
>            Reporter: Thejas M Nair
>            Assignee: Thejas M Nair
>             Fix For: 0.8.0
>
>         Attachments: PIG-1461.patch
>
>
> When the data has schema, it often makes sense to union on column names in schema rather than the position of the columns. 
> The behavior of existing union operator should remain backward compatible .
> This feature can be supported using either a new operator or extending union to support 'using' clause . I am thinking of having a new operator called either unionschema or merge . Does anybody have any other suggestions for the syntax ?
> example -
> L1 = load 'x' as (a,b);
> L2 = load 'y' as (b,c);
> U = unionschema L1, L2;
> describe U;
> U: {a:bytearray, b:byetarray, c:bytearray}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (PIG-1461) support union operation that merges based on column names

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thejas M Nair updated PIG-1461:
-------------------------------

    Attachment: PIG-1461.2.patch

Patch with changes as suggested in code review.

> support union operation that merges based on column names
> ---------------------------------------------------------
>
>                 Key: PIG-1461
>                 URL: https://issues.apache.org/jira/browse/PIG-1461
>             Project: Pig
>          Issue Type: New Feature
>          Components: impl
>    Affects Versions: 0.8.0
>            Reporter: Thejas M Nair
>            Assignee: Thejas M Nair
>             Fix For: 0.8.0
>
>         Attachments: PIG-1461.1.patch, PIG-1461.2.patch, PIG-1461.patch
>
>
> When the data has schema, it often makes sense to union on column names in schema rather than the position of the columns. 
> The behavior of existing union operator should remain backward compatible .
> This feature can be supported using either a new operator or extending union to support 'using' clause . I am thinking of having a new operator called either unionschema or merge . Does anybody have any other suggestions for the syntax ?
> example -
> L1 = load 'x' as (a,b);
> L2 = load 'y' as (b,c);
> U = unionschema L1, L2;
> describe U;
> U: {a:bytearray, b:byetarray, c:bytearray}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-1461) support union operation that merges based on column names

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12895383#action_12895383 ] 

Olga Natkovich commented on PIG-1461:
-------------------------------------

The patch looks good. A couple of comments:

(1) Looks like there is a type in the code that loads data for testing:
w.println("5\tdef\t3\t{(2,a),(2,b)}]"); - contains an extra "]" at the end
(2) This is not related to the patch but to the documentation above. Please, add info that UNION supports 2 or more inputs.
(3) In mergeSchemasByAlias, I think it is safer to make copy of the schema rather than just assigning it for the corner case of 1 schema.
(4) Need to add a comment about inner bag schema to mergeFieldSchemaFirstLevelSameAlias
(5) General comment on schema merging - we have completely different code path for posiiton vs. alias based merge. I am worried that we will have subtly different semantics either now or later.

> support union operation that merges based on column names
> ---------------------------------------------------------
>
>                 Key: PIG-1461
>                 URL: https://issues.apache.org/jira/browse/PIG-1461
>             Project: Pig
>          Issue Type: New Feature
>          Components: impl
>    Affects Versions: 0.8.0
>            Reporter: Thejas M Nair
>            Assignee: Thejas M Nair
>             Fix For: 0.8.0
>
>         Attachments: PIG-1461.1.patch, PIG-1461.patch
>
>
> When the data has schema, it often makes sense to union on column names in schema rather than the position of the columns. 
> The behavior of existing union operator should remain backward compatible .
> This feature can be supported using either a new operator or extending union to support 'using' clause . I am thinking of having a new operator called either unionschema or merge . Does anybody have any other suggestions for the syntax ?
> example -
> L1 = load 'x' as (a,b);
> L2 = load 'y' as (b,c);
> U = unionschema L1, L2;
> describe U;
> U: {a:bytearray, b:byetarray, c:bytearray}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-1461) support union operation that merges based on column names

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12892892#action_12892892 ] 

Thejas M Nair commented on PIG-1461:
------------------------------------

The syntax - " union ... using 'merge' " introduces a new use of the " using '...' " clause. So far this clause has been used to indicate the implementation algorithm and it did not have any impact on the semantics. 
Instead a key word might be better if we are trying to avoid introducing another top level operator, similar to the case of outer joins -   eg - " union onschema L1, L2;"

More suggestions/opinions on the syntax for this feature are welcome.



> support union operation that merges based on column names
> ---------------------------------------------------------
>
>                 Key: PIG-1461
>                 URL: https://issues.apache.org/jira/browse/PIG-1461
>             Project: Pig
>          Issue Type: New Feature
>          Components: impl
>    Affects Versions: 0.8.0
>            Reporter: Thejas M Nair
>            Assignee: Thejas M Nair
>             Fix For: 0.8.0
>
>
> When the data has schema, it often makes sense to union on column names in schema rather than the position of the columns. 
> The behavior of existing union operator should remain backward compatible .
> This feature can be supported using either a new operator or extending union to support 'using' clause . I am thinking of having a new operator called either unionschema or merge . Does anybody have any other suggestions for the syntax ?
> example -
> L1 = load 'x' as (a,b);
> L2 = load 'y' as (b,c);
> U = unionschema L1, L2;
> describe U;
> U: {a:bytearray, b:byetarray, c:bytearray}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-1461) support union operation that merges based on column names

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12895082#action_12895082 ] 

Thejas M Nair commented on PIG-1461:
------------------------------------

Documentation for UNION ONSCHEMA:

Use the keyword ONSCHEMA with union so that the union is based on column names of the input relations, and not column position. 
If the following requirements are not met, the statement will throw an error :
 * All inputs to the union should have a non null schema.
 * The data type for columns with same name in different input schemas should be compatible. Numeric types are compatible, and if column having same name in different input schemas have different numeric types , an implicit conversion will happen. bytearray type is considered compatible with all other types, a cast will be added to convert to other type. Bags or tuples having different inner schema are considered incompatible.


Example - 
{code}
grunt> L1 = load 'f1' using (a : int, b : float);
grunt> dump L1;
(11,12.0)
(21,22.0)

grunt> L2 = load 'f1' using (a : long, c : chararray);
grunt> dump L2;
(11,a)
(12,b)
(13,c)

grunt> U = union onschema L1, L2;
grunt> describe U ;
U : {a : long, b : float, c : chararray}

grunt> dump U;
(11,12.0,)
(21,22.0,)
(11,,a)
(12,,b)
(13,,c)


{code}


> support union operation that merges based on column names
> ---------------------------------------------------------
>
>                 Key: PIG-1461
>                 URL: https://issues.apache.org/jira/browse/PIG-1461
>             Project: Pig
>          Issue Type: New Feature
>          Components: impl
>    Affects Versions: 0.8.0
>            Reporter: Thejas M Nair
>            Assignee: Thejas M Nair
>             Fix For: 0.8.0
>
>         Attachments: PIG-1461.1.patch, PIG-1461.patch
>
>
> When the data has schema, it often makes sense to union on column names in schema rather than the position of the columns. 
> The behavior of existing union operator should remain backward compatible .
> This feature can be supported using either a new operator or extending union to support 'using' clause . I am thinking of having a new operator called either unionschema or merge . Does anybody have any other suggestions for the syntax ?
> example -
> L1 = load 'x' as (a,b);
> L2 = load 'y' as (b,c);
> U = unionschema L1, L2;
> describe U;
> U: {a:bytearray, b:byetarray, c:bytearray}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (PIG-1461) support union operation that merges based on column names

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thejas M Nair updated PIG-1461:
-------------------------------

    Release Note: 
Documentation for UNION ONSCHEMA:

Use the keyword ONSCHEMA with union so that the union is based on column names of the input relations, and not column position.
If the following requirements are not met, the statement will throw an error :

    * All inputs to the union should have a non null schema.
    * The data type for columns with same name in different input schemas should be compatible. Numeric types are compatible, and if column having same name in different input schemas have different numeric types , an implicit conversion will happen. bytearray type is considered compatible with all other types, a cast will be added to convert to other type. Bags or tuples having different inner schema are considered incompatible.

Example -

grunt> L1 = load 'f1' using (a : int, b : float);
grunt> dump L1;
(11,12.0)
(21,22.0)

grunt> L2 = load 'f1' using (a : long, c : chararray);
grunt> dump L2;
(11,a)
(12,b)
(13,c)

grunt> U = union onschema L1, L2;
grunt> describe U ;
U : {a : long, b : float, c : chararray}

grunt> dump U;
(11,12.0,)
(21,22.0,)
(11,,a)
(12,,b)
(13,,c)

Like the default union, 'union onschema' also supports 2 or more inputs.



> support union operation that merges based on column names
> ---------------------------------------------------------
>
>                 Key: PIG-1461
>                 URL: https://issues.apache.org/jira/browse/PIG-1461
>             Project: Pig
>          Issue Type: New Feature
>          Components: impl
>    Affects Versions: 0.8.0
>            Reporter: Thejas M Nair
>            Assignee: Thejas M Nair
>             Fix For: 0.8.0
>
>         Attachments: PIG-1461.1.patch, PIG-1461.2.patch, PIG-1461.patch
>
>
> When the data has schema, it often makes sense to union on column names in schema rather than the position of the columns. 
> The behavior of existing union operator should remain backward compatible .
> This feature can be supported using either a new operator or extending union to support 'using' clause . I am thinking of having a new operator called either unionschema or merge . Does anybody have any other suggestions for the syntax ?
> example -
> L1 = load 'x' as (a,b);
> L2 = load 'y' as (b,c);
> U = unionschema L1, L2;
> describe U;
> U: {a:bytearray, b:byetarray, c:bytearray}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (PIG-1461) support union operation that merges based on column names

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thejas M Nair updated PIG-1461:
-------------------------------

    Status: Open  (was: Patch Available)

Canceling patch because of an issue found while i was creating example for documentation. 

> support union operation that merges based on column names
> ---------------------------------------------------------
>
>                 Key: PIG-1461
>                 URL: https://issues.apache.org/jira/browse/PIG-1461
>             Project: Pig
>          Issue Type: New Feature
>          Components: impl
>    Affects Versions: 0.8.0
>            Reporter: Thejas M Nair
>            Assignee: Thejas M Nair
>             Fix For: 0.8.0
>
>         Attachments: PIG-1461.patch
>
>
> When the data has schema, it often makes sense to union on column names in schema rather than the position of the columns. 
> The behavior of existing union operator should remain backward compatible .
> This feature can be supported using either a new operator or extending union to support 'using' clause . I am thinking of having a new operator called either unionschema or merge . Does anybody have any other suggestions for the syntax ?
> example -
> L1 = load 'x' as (a,b);
> L2 = load 'y' as (b,c);
> U = unionschema L1, L2;
> describe U;
> U: {a:bytearray, b:byetarray, c:bytearray}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (PIG-1461) support union operation that merges based on column names

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thejas M Nair updated PIG-1461:
-------------------------------

    Release Note: 
Documentation for UNION ONSCHEMA:

Use the keyword ONSCHEMA with union so that the union is based on column names of the input relations, and not column position.
If the following requirements are not met, the statement will throw an error :

    * All inputs to the union should have a non null schema.
    * The data type for columns with same name in different input schemas should be compatible. Numeric types are compatible, and if column having same name in different input schemas have different numeric types , an implicit conversion will happen. bytearray type is considered compatible with all other types, a cast will be added to convert to other type. Bags or tuples having different inner schema are considered incompatible.

Example -

grunt> L1 = load 'f1' using (a : int, b : float);
grunt> dump L1;
(11,12.0)
(21,22.0)

grunt> L2 = load 'f1' using (a : long, c : chararray);
grunt> dump L2;
(11,a)
(12,b)
(13,c)

grunt> U = union onschema L1, L2;
grunt> describe U ;
U : {a : long, b : float, c : chararray}

grunt> dump U;
(11,12.0,)
(21,22.0,)
(11,,a)
(12,,b)
(13,,c)

Note:
- Alias such as 'nm::c1' and 'c1' in two separate relations specified in 'union onschema' are considered mergeable and in the schema of the union, the merged column alias will be 'c1'.
- Alias such as 'nm1::c1' and 'nm2::c1' in two separate relations specified in 'union onschema' will not be merged together, in schema of the union there will be two columns with these names.

Example -

> describe f;
f: {l1::a: int, l1::b: int, l1::c: int}
> describe l1;
l1: {a: int, b: int}

> u = union onschema f,l1;
> desc u;
u: {a: int, b: int, l1::c: int} 

Like the default union, 'union onschema' also supports 2 or more inputs.



  was:
Documentation for UNION ONSCHEMA:

Use the keyword ONSCHEMA with union so that the union is based on column names of the input relations, and not column position.
If the following requirements are not met, the statement will throw an error :

    * All inputs to the union should have a non null schema.
    * The data type for columns with same name in different input schemas should be compatible. Numeric types are compatible, and if column having same name in different input schemas have different numeric types , an implicit conversion will happen. bytearray type is considered compatible with all other types, a cast will be added to convert to other type. Bags or tuples having different inner schema are considered incompatible.

Example -

grunt> L1 = load 'f1' using (a : int, b : float);
grunt> dump L1;
(11,12.0)
(21,22.0)

grunt> L2 = load 'f1' using (a : long, c : chararray);
grunt> dump L2;
(11,a)
(12,b)
(13,c)

grunt> U = union onschema L1, L2;
grunt> describe U ;
U : {a : long, b : float, c : chararray}

grunt> dump U;
(11,12.0,)
(21,22.0,)
(11,,a)
(12,,b)
(13,,c)

Like the default union, 'union onschema' also supports 2 or more inputs.




Adding release note section of PIG-1610 to this release note.


> support union operation that merges based on column names
> ---------------------------------------------------------
>
>                 Key: PIG-1461
>                 URL: https://issues.apache.org/jira/browse/PIG-1461
>             Project: Pig
>          Issue Type: New Feature
>          Components: impl
>    Affects Versions: 0.8.0
>            Reporter: Thejas M Nair
>            Assignee: Thejas M Nair
>             Fix For: 0.8.0
>
>         Attachments: PIG-1461.1.patch, PIG-1461.2.patch, PIG-1461.patch
>
>
> When the data has schema, it often makes sense to union on column names in schema rather than the position of the columns. 
> The behavior of existing union operator should remain backward compatible .
> This feature can be supported using either a new operator or extending union to support 'using' clause . I am thinking of having a new operator called either unionschema or merge . Does anybody have any other suggestions for the syntax ?
> example -
> L1 = load 'x' as (a,b);
> L2 = load 'y' as (b,c);
> U = unionschema L1, L2;
> describe U;
> U: {a:bytearray, b:byetarray, c:bytearray}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-1461) support union operation that merges based on column names

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12895067#action_12895067 ] 

Hadoop QA commented on PIG-1461:
--------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12451133/PIG-1461.patch
  against trunk revision 980930.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 6 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    -1 release audit.  The applied patch generated 405 release audit warnings (more than the trunk's current 403 warnings).

    +1 core tests.  The patch passed core unit tests.

    -1 contrib tests.  The patch failed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/370/testReport/
Release audit warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/370/artifact/trunk/patchprocess/releaseAuditDiffWarnings.txt
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/370/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/370/console

This message is automatically generated.

> support union operation that merges based on column names
> ---------------------------------------------------------
>
>                 Key: PIG-1461
>                 URL: https://issues.apache.org/jira/browse/PIG-1461
>             Project: Pig
>          Issue Type: New Feature
>          Components: impl
>    Affects Versions: 0.8.0
>            Reporter: Thejas M Nair
>            Assignee: Thejas M Nair
>             Fix For: 0.8.0
>
>         Attachments: PIG-1461.patch
>
>
> When the data has schema, it often makes sense to union on column names in schema rather than the position of the columns. 
> The behavior of existing union operator should remain backward compatible .
> This feature can be supported using either a new operator or extending union to support 'using' clause . I am thinking of having a new operator called either unionschema or merge . Does anybody have any other suggestions for the syntax ?
> example -
> L1 = load 'x' as (a,b);
> L2 = load 'y' as (b,c);
> U = unionschema L1, L2;
> describe U;
> U: {a:bytearray, b:byetarray, c:bytearray}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-1461) support union operation that merges based on column names

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12895308#action_12895308 ] 

Thejas M Nair commented on PIG-1461:
------------------------------------

Contrib tests pass on my local machine.
The release audit warnings are from jdiff of javadoc changes.
Patch is ready for review.

> support union operation that merges based on column names
> ---------------------------------------------------------
>
>                 Key: PIG-1461
>                 URL: https://issues.apache.org/jira/browse/PIG-1461
>             Project: Pig
>          Issue Type: New Feature
>          Components: impl
>    Affects Versions: 0.8.0
>            Reporter: Thejas M Nair
>            Assignee: Thejas M Nair
>             Fix For: 0.8.0
>
>         Attachments: PIG-1461.1.patch, PIG-1461.patch
>
>
> When the data has schema, it often makes sense to union on column names in schema rather than the position of the columns. 
> The behavior of existing union operator should remain backward compatible .
> This feature can be supported using either a new operator or extending union to support 'using' clause . I am thinking of having a new operator called either unionschema or merge . Does anybody have any other suggestions for the syntax ?
> example -
> L1 = load 'x' as (a,b);
> L2 = load 'y' as (b,c);
> U = unionschema L1, L2;
> describe U;
> U: {a:bytearray, b:byetarray, c:bytearray}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (PIG-1461) support union operation that merges based on column names

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thejas M Nair updated PIG-1461:
-------------------------------

    Attachment: PIG-1461.1.patch

> support union operation that merges based on column names
> ---------------------------------------------------------
>
>                 Key: PIG-1461
>                 URL: https://issues.apache.org/jira/browse/PIG-1461
>             Project: Pig
>          Issue Type: New Feature
>          Components: impl
>    Affects Versions: 0.8.0
>            Reporter: Thejas M Nair
>            Assignee: Thejas M Nair
>             Fix For: 0.8.0
>
>         Attachments: PIG-1461.1.patch, PIG-1461.patch
>
>
> When the data has schema, it often makes sense to union on column names in schema rather than the position of the columns. 
> The behavior of existing union operator should remain backward compatible .
> This feature can be supported using either a new operator or extending union to support 'using' clause . I am thinking of having a new operator called either unionschema or merge . Does anybody have any other suggestions for the syntax ?
> example -
> L1 = load 'x' as (a,b);
> L2 = load 'y' as (b,c);
> U = unionschema L1, L2;
> describe U;
> U: {a:bytearray, b:byetarray, c:bytearray}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-1461) support union operation that merges based on column names

Posted by "Alan Gates (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12892903#action_12892903 ] 

Alan Gates commented on PIG-1461:
---------------------------------

+1 on Thejas comment that so far using indicates implementation not semantic change and it's best to keep it that way.  "union onschema" seems fine, as this seems equivalent to "join outer".

> support union operation that merges based on column names
> ---------------------------------------------------------
>
>                 Key: PIG-1461
>                 URL: https://issues.apache.org/jira/browse/PIG-1461
>             Project: Pig
>          Issue Type: New Feature
>          Components: impl
>    Affects Versions: 0.8.0
>            Reporter: Thejas M Nair
>            Assignee: Thejas M Nair
>             Fix For: 0.8.0
>
>
> When the data has schema, it often makes sense to union on column names in schema rather than the position of the columns. 
> The behavior of existing union operator should remain backward compatible .
> This feature can be supported using either a new operator or extending union to support 'using' clause . I am thinking of having a new operator called either unionschema or merge . Does anybody have any other suggestions for the syntax ?
> example -
> L1 = load 'x' as (a,b);
> L2 = load 'y' as (b,c);
> U = unionschema L1, L2;
> describe U;
> U: {a:bytearray, b:byetarray, c:bytearray}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-1461) support union operation that merges based on column names

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12893307#action_12893307 ] 

Thejas M Nair commented on PIG-1461:
------------------------------------

The pseudo column containing the source relation, proposed in the first comment seems unnecessary. If user requires the source information to be available, they can project that in an additional foreach before the union. 



> support union operation that merges based on column names
> ---------------------------------------------------------
>
>                 Key: PIG-1461
>                 URL: https://issues.apache.org/jira/browse/PIG-1461
>             Project: Pig
>          Issue Type: New Feature
>          Components: impl
>    Affects Versions: 0.8.0
>            Reporter: Thejas M Nair
>            Assignee: Thejas M Nair
>             Fix For: 0.8.0
>
>
> When the data has schema, it often makes sense to union on column names in schema rather than the position of the columns. 
> The behavior of existing union operator should remain backward compatible .
> This feature can be supported using either a new operator or extending union to support 'using' clause . I am thinking of having a new operator called either unionschema or merge . Does anybody have any other suggestions for the syntax ?
> example -
> L1 = load 'x' as (a,b);
> L2 = load 'y' as (b,c);
> U = unionschema L1, L2;
> describe U;
> U: {a:bytearray, b:byetarray, c:bytearray}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (PIG-1461) support union operation that merges based on column names

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thejas M Nair updated PIG-1461:
-------------------------------

    Status: Patch Available  (was: Open)

> support union operation that merges based on column names
> ---------------------------------------------------------
>
>                 Key: PIG-1461
>                 URL: https://issues.apache.org/jira/browse/PIG-1461
>             Project: Pig
>          Issue Type: New Feature
>          Components: impl
>    Affects Versions: 0.8.0
>            Reporter: Thejas M Nair
>            Assignee: Thejas M Nair
>             Fix For: 0.8.0
>
>         Attachments: PIG-1461.1.patch, PIG-1461.patch
>
>
> When the data has schema, it often makes sense to union on column names in schema rather than the position of the columns. 
> The behavior of existing union operator should remain backward compatible .
> This feature can be supported using either a new operator or extending union to support 'using' clause . I am thinking of having a new operator called either unionschema or merge . Does anybody have any other suggestions for the syntax ?
> example -
> L1 = load 'x' as (a,b);
> L2 = load 'y' as (b,c);
> U = unionschema L1, L2;
> describe U;
> U: {a:bytearray, b:byetarray, c:bytearray}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-1461) support union operation that merges based on column names

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12895430#action_12895430 ] 

Thejas M Nair commented on PIG-1461:
------------------------------------

Regarding Documentation for UNION ONSCHEMA:  -
As Olga mentioned, like the default union, 'union onschema' also supports 2 or more inputs.

> support union operation that merges based on column names
> ---------------------------------------------------------
>
>                 Key: PIG-1461
>                 URL: https://issues.apache.org/jira/browse/PIG-1461
>             Project: Pig
>          Issue Type: New Feature
>          Components: impl
>    Affects Versions: 0.8.0
>            Reporter: Thejas M Nair
>            Assignee: Thejas M Nair
>             Fix For: 0.8.0
>
>         Attachments: PIG-1461.1.patch, PIG-1461.patch
>
>
> When the data has schema, it often makes sense to union on column names in schema rather than the position of the columns. 
> The behavior of existing union operator should remain backward compatible .
> This feature can be supported using either a new operator or extending union to support 'using' clause . I am thinking of having a new operator called either unionschema or merge . Does anybody have any other suggestions for the syntax ?
> example -
> L1 = load 'x' as (a,b);
> L2 = load 'y' as (b,c);
> U = unionschema L1, L2;
> describe U;
> U: {a:bytearray, b:byetarray, c:bytearray}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-1461) support union operation that merges based on column names

Posted by "Ashutosh Chauhan (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12881474#action_12881474 ] 

Ashutosh Chauhan commented on PIG-1461:
---------------------------------------

w.r.t language I think
{code}
 U = union L1, L2 using 'merge';
{code}
is better then 
{code}
U = unionschema L1,L2;
{code}

Because U is indeed union with duplicated columns eliminated. User doesn't need to learn about a new operator. 
Internally for Pig, its better to avoid introducing new physical operator if we can.


> support union operation that merges based on column names
> ---------------------------------------------------------
>
>                 Key: PIG-1461
>                 URL: https://issues.apache.org/jira/browse/PIG-1461
>             Project: Pig
>          Issue Type: New Feature
>          Components: impl
>    Affects Versions: 0.8.0
>            Reporter: Thejas M Nair
>             Fix For: 0.8.0
>
>
> When the data has schema, it often makes sense to union on column names in schema rather than the position of the columns. 
> The behavior of existing union operator should remain backward compatible .
> This feature can be supported using either a new operator or extending union to support 'using' clause . I am thinking of having a new operator called either unionschema or merge . Does anybody have any other suggestions for the syntax ?
> example -
> L1 = load 'x' as (a,b);
> L2 = load 'y' as (b,c);
> U = unionschema L1, L2;
> describe U;
> U: {a:bytearray, b:byetarray, c:bytearray}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-1461) support union operation that merges based on column names

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12895212#action_12895212 ] 

Hadoop QA commented on PIG-1461:
--------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12451175/PIG-1461.1.patch
  against trunk revision 981984.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 6 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    -1 release audit.  The applied patch generated 407 release audit warnings (more than the trunk's current 405 warnings).

    +1 core tests.  The patch passed core unit tests.

    -1 contrib tests.  The patch failed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/372/testReport/
Release audit warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/372/artifact/trunk/patchprocess/releaseAuditDiffWarnings.txt
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/372/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/372/console

This message is automatically generated.

> support union operation that merges based on column names
> ---------------------------------------------------------
>
>                 Key: PIG-1461
>                 URL: https://issues.apache.org/jira/browse/PIG-1461
>             Project: Pig
>          Issue Type: New Feature
>          Components: impl
>    Affects Versions: 0.8.0
>            Reporter: Thejas M Nair
>            Assignee: Thejas M Nair
>             Fix For: 0.8.0
>
>         Attachments: PIG-1461.1.patch, PIG-1461.patch
>
>
> When the data has schema, it often makes sense to union on column names in schema rather than the position of the columns. 
> The behavior of existing union operator should remain backward compatible .
> This feature can be supported using either a new operator or extending union to support 'using' clause . I am thinking of having a new operator called either unionschema or merge . Does anybody have any other suggestions for the syntax ?
> example -
> L1 = load 'x' as (a,b);
> L2 = load 'y' as (b,c);
> U = unionschema L1, L2;
> describe U;
> U: {a:bytearray, b:byetarray, c:bytearray}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Assigned: (PIG-1461) support union operation that merges based on column names

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich reassigned PIG-1461:
-----------------------------------

    Assignee: Thejas M Nair

> support union operation that merges based on column names
> ---------------------------------------------------------
>
>                 Key: PIG-1461
>                 URL: https://issues.apache.org/jira/browse/PIG-1461
>             Project: Pig
>          Issue Type: New Feature
>          Components: impl
>    Affects Versions: 0.8.0
>            Reporter: Thejas M Nair
>            Assignee: Thejas M Nair
>             Fix For: 0.8.0
>
>
> When the data has schema, it often makes sense to union on column names in schema rather than the position of the columns. 
> The behavior of existing union operator should remain backward compatible .
> This feature can be supported using either a new operator or extending union to support 'using' clause . I am thinking of having a new operator called either unionschema or merge . Does anybody have any other suggestions for the syntax ?
> example -
> L1 = load 'x' as (a,b);
> L2 = load 'y' as (b,c);
> U = unionschema L1, L2;
> describe U;
> U: {a:bytearray, b:byetarray, c:bytearray}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (PIG-1461) support union operation that merges based on column names

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thejas M Nair updated PIG-1461:
-------------------------------

          Status: Resolved  (was: Patch Available)
    Hadoop Flags: [Reviewed]
      Resolution: Fixed

Patch committed to trunk.

> support union operation that merges based on column names
> ---------------------------------------------------------
>
>                 Key: PIG-1461
>                 URL: https://issues.apache.org/jira/browse/PIG-1461
>             Project: Pig
>          Issue Type: New Feature
>          Components: impl
>    Affects Versions: 0.8.0
>            Reporter: Thejas M Nair
>            Assignee: Thejas M Nair
>             Fix For: 0.8.0
>
>         Attachments: PIG-1461.1.patch, PIG-1461.2.patch, PIG-1461.patch
>
>
> When the data has schema, it often makes sense to union on column names in schema rather than the position of the columns. 
> The behavior of existing union operator should remain backward compatible .
> This feature can be supported using either a new operator or extending union to support 'using' clause . I am thinking of having a new operator called either unionschema or merge . Does anybody have any other suggestions for the syntax ?
> example -
> L1 = load 'x' as (a,b);
> L2 = load 'y' as (b,c);
> U = unionschema L1, L2;
> describe U;
> U: {a:bytearray, b:byetarray, c:bytearray}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-1461) support union operation that merges based on column names

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12881420#action_12881420 ] 

Thejas M Nair commented on PIG-1461:
------------------------------------

This operator will throw an error if the schema for any of the input relations is undefined.

Users often need to lookup  the source relation downstream after the 'unionschema' operation. It will be convenient to project an additional pseudo column whose value is the name of the input relation.
ie, the schema of U in description becomes - U : {a:bytearray, b:bytearray, c:bytearray, source_relation : chararray } 

This feature does not enable a user to do something that was not possible earlier, it just makes the code more easy to maintain - you don't have to change the pig query if you have new columns .
The same results can be obtained using existing pig syntax as shown following query -

L1 = load 'x' as (a,b);
L2 = load 'y' as (b,c);
F1 = foreach L1 generate a, b, null as c, source_relation as 'F1';
F2 = foreach L1 generate null as a, b, c, source_relation as 'F2';
U = union F1, F2;

Note that, in this query if L1 or L2 schema changes, you will need to change F1 or F2 . 



> support union operation that merges based on column names
> ---------------------------------------------------------
>
>                 Key: PIG-1461
>                 URL: https://issues.apache.org/jira/browse/PIG-1461
>             Project: Pig
>          Issue Type: New Feature
>          Components: impl
>    Affects Versions: 0.8.0
>            Reporter: Thejas M Nair
>             Fix For: 0.8.0
>
>
> When the data has schema, it often makes sense to union on column names in schema rather than the position of the columns. 
> The behavior of existing union operator should remain backward compatible .
> This feature can be supported using either a new operator or extending union to support 'using' clause . I am thinking of having a new operator called either unionschema or merge . Does anybody have any other suggestions for the syntax ?
> example -
> L1 = load 'x' as (a,b);
> L2 = load 'y' as (b,c);
> U = unionschema L1, L2;
> describe U;
> U: {a:bytearray, b:byetarray, c:bytearray}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.