Posted to user@pig.apache.org by hc busy <hc...@gmail.com> on 2010/05/06 08:24:01 UTC

ugh! casting migraine

Okay, I have to blow off some steam here. Did you know that if

describe A;
A: {id: int, bad: (a: int,b: int,z: int)}

and I do

B = foreach A generate id, FLATTEN(bad) as c;

this actually runs without error: c takes the value of a, and an
anonymous field is then created for b? (So b is not dropped by this
cast.)

I wonder whether the "B =" statement should either generate an error OR
rename a to c and drop column b.
The statement:

B = foreach A generate id, FLATTEN(bad) as (c,d);
describe B;
B: {id: int,c: int,d:int}

seems to make more sense than silently keeping the unnamed fields.

Re: ugh! casting migraine

Posted by hc busy <hc...@gmail.com>.
I guess it's a Turing-hard problem to generate warnings and errors. Even in
SQL one can do

select author as book_title from books;

and get things mixed up...


On Thu, May 6, 2010 at 2:21 PM, Scott Carey <sc...@richrelevance.com> wrote:

>
> On May 6, 2010, at 12:14 AM, Dmitriy Ryaboy wrote:
>
> > Does it surprise you that "select a as foo, b, d" returns 3 columns?
> > You only gave one alias... this works the same way.
> >
> > It's the opposite that surprises me -- that if you load multi-column
> > data and only provide names for the first few columns, you can't
> > access the rest by ordinal.
> >
> > -D
> >
>
> If you have
>
> X: { a: int, b: int, c: int}
>
> Y = FOREACH X GENERATE a, b;
> does not leave 'c' in there as $2.  These aren't exactly the same, but it
> is where the confusion is coming from.
>
> The confusion is that FOREACH ... GENERATE is a projection operation, and
> in the case cited here it does not project and remove unreferenced fields.
> To me, it is not surprising that FLATTEN on a tuple with an alias
> assignment doesn't remove unnamed fields, but it is somewhat surprising that
> the FOREACH ... GENERATE wrapping it doesn't.
>
> B1 = FOREACH A GENERATE id, FLATTEN(bad);
> B = FOREACH B1 GENERATE id, bad::a as a;
>
> works.
>
> At least in 0.5, the statement below works inconsistently ('.' as a tuple
> dereference projection kills combiner optimization, and on occasion fails
> to run in much more complicated scenarios, so I avoid it):
> B = FOREACH A GENERATE id, bad.a as a;
>
>
>
> The confusion is that FOREACH ... GENERATE is the only supported means of
> projection, but it doesn't always project the fields listed.  In a FOREACH
> ... GENERATE the projection occurs _BEFORE_ alias assignment.
>
>
>
>

Re: ugh! casting migraine

Posted by Scott Carey <sc...@richrelevance.com>.
On May 6, 2010, at 12:14 AM, Dmitriy Ryaboy wrote:

> Does it surprise you that "select a as foo, b, d" returns 3 columns?
> You only gave one alias... this works the same way.
> 
> It's the opposite that surprises me -- that if you load multi-column
> data and only provide names for the first few columns, you can't
> access the rest by ordinal.
> 
> -D
> 

If you have 

X: { a: int, b: int, c: int}

Y = FOREACH X GENERATE a, b;
does not leave 'c' in there as $2.  These aren't exactly the same, but it is where the confusion is coming from.

The confusion is that FOREACH ... GENERATE is a projection operation, and in the case cited here it does not project and remove unreferenced fields.
To me, it is not surprising that FLATTEN on a tuple with an alias assignment doesn't remove unnamed fields, but it is somewhat surprising that the FOREACH ... GENERATE wrapping it doesn't.

B1 = FOREACH A GENERATE id, FLATTEN(bad);
B = FOREACH B1 GENERATE id, bad::a as a;

works.

At least in 0.5, the statement below works inconsistently ('.' as a tuple dereference projection kills combiner optimization, and on occasion fails to run in much more complicated scenarios, so I avoid it):
B = FOREACH A GENERATE id, bad.a as a;



The confusion is that FOREACH ... GENERATE is the only supported means of projection, but it doesn't always project the fields listed.  In a FOREACH ... GENERATE the projection occurs _BEFORE_ alias assignment.
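
The two-step workaround above can be written out as a complete script. This is only a sketch: the input path 'data/a.txt' and the LOAD schema are assumptions added for illustration (the thread shows only the output of "describe A"), and the behavior is as reported here for 0.5, not checked against other releases.

```pig
-- Assumed input: the path is a placeholder, and the schema is written to
-- match the "describe A" output quoted in this thread.
A = LOAD 'data/a.txt' AS (id:int, bad:tuple(a:int, b:int, z:int));

-- Step 1: flatten with no alias, so every field of the tuple is expanded
-- and can be referenced with the bad:: prefix.
B1 = FOREACH A GENERATE id, FLATTEN(bad);

-- Step 2: project explicitly. Only the fields listed here are kept, so
-- b and z are dropped and a is renamed to c.
B = FOREACH B1 GENERATE id, bad::a AS c;

DESCRIBE B;
```

Doing the rename in a second FOREACH keeps the projection and the alias assignment in separate statements, which avoids relying on how a single "FLATTEN(bad) as c" treats the fields that were not named.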






Re: ugh! casting migraine

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
Does it surprise you that "select a as foo, b, d" returns 3 columns?
You only gave one alias... this works the same way.

It's the opposite that surprises me -- that if you load multi-column
data and only provide names for the first few columns, you can't
access the rest by ordinal.

-D
