Posted to dev@pig.apache.org by Alan Gates <ga...@yahoo-inc.com> on 2008/05/05 20:21:13 UTC

Automatic alias generation in cogroups

Currently in pig, aliases are generally only assigned by the user.  
There is one exception to this rule, which is (co)group.  Consider a 
script like:

a = load 'myfile';
b = load 'anotherfile';
c = cogroup a by $0, b by $0;

The relation c will have the aliases: group, a, b without the user 
having assigned those names.
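
For readers who want to see the resulting shape concretely, here is a rough Python model of what the cogroup above produces. It is only a sketch: the cogroup function, the sample rows, and the list-of-tuples representation are illustrative assumptions, not Pig's implementation.

```python
from collections import defaultdict

def cogroup(relations, key_index=0):
    """Toy model of cogroup. `relations` maps an input alias to a list of tuples.

    Returns (schema, rows), where each row is (key, bag_for_input_1, bag_for_input_2, ...).
    """
    keys = []  # distinct keys, in first-seen order
    buckets = {alias: defaultdict(list) for alias in relations}
    for alias, rows in relations.items():
        for row in rows:
            key = row[key_index]
            if key not in keys:
                keys.append(key)
            buckets[alias][key].append(row)
    # Automatic aliasing: 'group' for the key, then each bag named after its input.
    schema = ["group"] + list(relations)
    out = [tuple([key] + [buckets[alias][key] for alias in relations]) for key in keys]
    return schema, out

# Made-up sample data standing in for 'myfile' and 'anotherfile'.
a = [("fred", 1), ("jane", 2)]
b = [("fred", 10), ("fred", 11)]
schema, c = cogroup({"a": a, "b": b})
print(schema)  # ['group', 'a', 'b']
```

Nothing in the call site mentions the names 'group', 'a', or 'b' as field names; they appear in the schema only because the operator puts them there.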

There are a couple of problems with this.  First, we've had a number of 
users complain that this is confusing.  a and b are suddenly overloaded 
terms in the script.  Consider, for example, that both of the following 
lines are possible and refer to entirely different meanings for 'a':

d = filter a by $0 eq 'fred';
d = foreach c generate COUNT(a);

In the first line, 'a' refers to the relation produced by the load.  In 
the second, it refers to the bag that is the second field ($1) of the 
relation 'c'.  The same holds for 'group' which is now both a keyword 
and an alias (yuck!).
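
The overloading can be pictured as two separate lookups. The Python below is a hypothetical model of that lookup (the resolve function and its rule of checking the enclosing schema first are assumptions made for illustration, not Pig's actual resolver):

```python
# Top-level relation names defined by the script.
relations = {"a", "b", "c"}

# Schema that cogroup automatically assigns to relation c.
c_schema = ["group", "a", "b"]

def resolve(name, foreach_schema=None):
    """Say what `name` refers to, either at top level or inside a
    foreach over a relation whose schema is `foreach_schema`."""
    if foreach_schema is not None and name in foreach_schema:
        # Inside foreach, field names of the iterated relation win.
        return "field $%d of the enclosing relation" % foreach_schema.index(name)
    if name in relations:
        return "relation " + name
    raise NameError(name)

print(resolve("a"))            # relation a
print(resolve("a", c_schema))  # field $1 of the enclosing relation
```

The same token resolves to two different things depending on where it appears, which is the source of the confusion described above.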

The second issue is that this is inconsistent with the rest of the
language.  Everywhere else Pig Latin leaves it to the user to define
aliases, but here it assigns them automatically.

So the proposal is to remove this automatic aliasing from cogroup.  
Cogroup would support AS, so that users could define aliases for these 
bags if they desired.  This may be a little awkward, as users need to
remember to provide an alias for the group key before aliasing the bags.
For example, taking the script above:

c = cogroup a by $0, b by $0 as name, file1, file2;

So name would now be the alias for the group key (formerly aliased as 
'group'), file1 for the first bag (formerly 'a') and file2 for the 
second bag (formerly 'b').
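
To make the positional rule explicit, here is a small Python sketch of how the names in the AS clause would line up with the old automatic aliases. The function name is hypothetical, and rejecting a wrong number of names is an assumption on my part; the proposal itself only fixes the order (group key first, then the bags in input order).

```python
def rename_cogroup_fields(input_aliases, as_names):
    """Pair each old automatic alias with the user-supplied AS name.

    The first AS name goes to the group key; the rest go to the bags,
    in the order the inputs were listed in the cogroup.
    """
    old = ["group"] + list(input_aliases)
    if len(as_names) != len(old):
        # Assumed behaviour: require one name per output field.
        raise ValueError("expected %d names, got %d" % (len(old), len(as_names)))
    return dict(zip(old, as_names))

mapping = rename_cogroup_fields(["a", "b"], ["name", "file1", "file2"])
print(mapping)  # {'group': 'name', 'a': 'file1', 'b': 'file2'}
```

Forgetting the group key name, as in "as file1, file2", would fail under this sketch, which is the difficulty mentioned above.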

Everything said here applies to group as well as cogroup.

Obviously this change isn't backward compatible.

Thoughts?

Alan.

Re: Automatic alias generation in cogroups

Posted by Benjamin Reed <br...@yahoo-inc.com>.
I agree that the aliases used should be overridable, but I much prefer the 
current way as a default rather than $0, $1, ... Once you know what's 
happening it makes sense and it's easy to use.

The complaint about overloading reflects a misunderstanding of the scoping of 
foreach. If users don't understand it now, schemas are bound to confuse them 
as well if there is any kind of conflict.

Adding the AS to cogroup sounds good. What about the fields of group?

cogroup a by (age, height), b by (avgAge, avgAge);

Shouldn't you be able to pick the schema of group?


ben
