Posted to dev@pig.apache.org by Alan Gates <ga...@yahoo-inc.com> on 2008/05/05 20:21:13 UTC
Automatic alias generation in cogroups
Currently in pig, aliases are generally only assigned by the user.
There is one exception to this rule, which is (co)group. Consider a
script like:

a = load 'myfile';
b = load 'anotherfile';
c = cogroup a by $0, b by $0;

The relation c will have the aliases: group, a, b without the user
having assigned those names.
There are a couple of problems with this. First, we've had a number of
users complain that this is confusing. a and b are suddenly overloaded
terms in the script. Consider, for example, that both of the following
lines are possible and refer to entirely different meanings for 'a':

d = filter a by $0 eq 'fred';
d = foreach c generate count(a);

In the first line, 'a' refers to the relation produced by the load. In
the second, it refers to the bag that is the second field ($1) of the
relation 'c'. The same holds for 'group' which is now both a keyword
and an alias (yuck!).
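
To make the two meanings concrete, here is a small sketch that reuses
the script above and refers to the same bag both by its automatic
alias and by position. COUNT is written in capitals on the assumption
that builtin function names are case sensitive; the relation names d
and e are only for illustration.

```pig
a = load 'myfile';
b = load 'anotherfile';
c = cogroup a by $0, b by $0;
-- c has three fields: $0 is the group key, $1 is the bag of
-- tuples from a, $2 is the bag of tuples from b
d = foreach c generate COUNT(a);   -- the bag, by its automatic alias
e = foreach c generate COUNT($1);  -- the same bag, by position
```

Inside the foreach, 'a' is only a name for field $1 of c; it no
longer means the relation produced by the load.
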
The second issue is that this is inconsistent with the rest of the
language. Everywhere else Pig Latin leaves it to the user to define
aliases, but here it assigns them automatically.
So the proposal is to remove this automatic aliasing from cogroup.
Cogroup would support AS, so that users could define aliases for these
bags if they desired. This may be a little difficult, as users need to
remember to provide an alias for the group before aliasing the bags.
For example, taking the script above:

c = cogroup a by $0, b by $0 as name, file1, file2;

So name would now be the alias for the group key (formerly aliased as
'group'), file1 for the first bag (formerly 'a') and file2 for the
second bag (formerly 'b').
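
As a rough sketch of how a script might read under this proposal (the
AS clause on cogroup is the proposed syntax, not something Pig accepts
today, and COUNT is capitalised on the assumption that builtin names
are case sensitive):

```pig
a = load 'myfile';
b = load 'anotherfile';
-- proposed: the user names the group key and both bags explicitly
c = cogroup a by $0, b by $0 as name, file1, file2;
-- 'a' and 'b' now only ever mean the loaded relations
d = foreach c generate name, COUNT(file1), COUNT(file2);
```

A user who leaves off the AS clause would presumably fall back to
positional references ($0, $1, $2) for the fields of c.
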
Everything said here applies to group as well as cogroup.
Obviously this change isn't backward compatible.
Thoughts?
Alan.
Re: Automatic alias generation in cogroups
Posted by Benjamin Reed <br...@yahoo-inc.com>.
I agree that the aliases used should be overridable, but I much prefer
the current way as a default rather than $0, $1, ... Once you know what's
happening it makes sense and it's easy to use.
The complaint about overloading reflects a misunderstanding of the scoping of
foreach. If they don't understand it now, schemas are bound to confuse as
well if there is any kind of conflict.
Adding the AS to cogroup sounds good. What about the fields of group?

cogroup a by (age, height), b by (avgAge, avgAge);

shouldn't you be able to pick the schema of group?
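
For context, a sketch of the situation being asked about. With a
multi-field key, group is itself a tuple, so its parts are reached by
position. The load schemas below are assumptions made up for the
example, and the commented-out AS form is purely hypothetical syntax,
shown only to illustrate the question.

```pig
a = load 'myfile' as (age, height);
b = load 'anotherfile' as (avgAge);
c = cogroup a by (age, height), b by (avgAge, avgAge);
-- group is a two-field tuple here; address its fields positionally
d = foreach c generate group.$0, group.$1, COUNT(a), COUNT(b);
-- hypothetical: naming the fields of group as part of AS
-- c = cogroup a by (age, height), b by (avgAge, avgAge)
--       as (x, y), file1, file2;
```
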
ben