Posted to user@pig.apache.org by Mat Kelcey <ma...@gmail.com> on 2011/07/11 07:47:01 UTC

trouble with syntax for flatten in a foreach

hi,

i've got a pretty simple transform of data i need to do and i can't for the
life of me work it out.
i feel like i'm missing something trivial...

i want to go from this...
person key    value
bob    age    25
bob    colour red
fred   age    30
fred   food   bagels

to this...
person age colour food
bob    25  red    null
fred   30  null   bagels

here's the best i can do....

> data = load 'blah' as (uid:chararray, key:chararray, value:chararray);
-- data: {uid: chararray,key: chararray,value: chararray}
(bob,age,25)
(bob,colour,red)
(fred,age,30)
(fred,food,bagels)

> split data into
    by_age    if key=='age',
    by_colour if key=='colour',
    by_food   if key=='food';

> cogrouped = cogroup by_age by uid, by_colour by uid, by_food by uid;
-- cogrouped: {group: chararray,
--   by_age: {(uid: chararray,key: chararray,value: chararray)},
--   by_colour: {(uid: chararray,key: chararray,value: chararray)},
--   by_food: {(uid: chararray,key: chararray,value: chararray)}}
(bob,{(bob,age,25)},{(bob,colour,red)},{})
(fred,{(fred,age,30)},{},{(fred,food,bagels)})

> flattened = foreach cogrouped generate group as uid, by_age.value as age,
by_colour.value as colour, by_food.value as food;
-- flattened: {uid: chararray,age: {(value: chararray)},
--   colour: {(value: chararray)},food: {(value: chararray)}}
(bob,{(25)},{(red)},{})
(fred,{(30)},{},{(bagels)})

with any attempt to call flatten on the bags, eg
> flattened = foreach cogrouped generate group as uid,
flatten(by_food.value) as food;
i lose the entries that had an empty bag for food (eg bob in this case)
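
The lost rows are consistent with how flatten un-nests a bag: it emits one output row per tuple in the bag, so an empty bag emits nothing and the whole row disappears. A minimal sketch against the cogrouped relation above (the alias only_food is made up for illustration):

```pig
-- flatten produces one row per tuple in the bag;
-- bob's by_food bag is {}, so no row is produced for bob
only_food = foreach cogrouped generate group as uid, flatten(by_food.value) as food;
dump only_food;
-- (fred,bagels)
```

So to keep bob, the bag handed to flatten needs at least one tuple in it, or the bag has to be collapsed to a scalar some other way.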

i've got a feeling IsEmpty might get me somewhere, since this much works:

> flattened = foreach cogrouped generate
   group as uid,
   (IsEmpty(by_food.value) ? 0 : 1);
(bob,0)
(fred,1)

but any attempt to return a real value from the bincond fails; i can't get the
syntax right.
> flattened = foreach cogrouped generate
       group as uid,
       (IsEmpty(by_food.value) ? {} : by_food.value);

not sure how to define an empty bag for the true branch of the bincond?

i must be missing something fundamental somewhere.
help me obi-wan kenobi, you're my only hope.

cheers,
mat

Re: trouble with syntax for flatten in a foreach

Posted by Mat Kelcey <ma...@gmail.com>.
i take it all back
 generate group as uid,
  flatten((IsEmpty(fil_height) ? {('')} : fil_height.value)) as height;

does work

thanks for the help
mat
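
For reference, here is a sketch of how that working line might look when applied to all three keys from the original question, inside the nested foreach from Thejas's reply. The aliases grouped and pivoted are assumptions for illustration, and the thread only confirms the single-column form. Note that with the {('')} placeholder a missing key comes out as an empty string rather than null, and a person with two rows for the same key would produce more than one output row.

```pig
-- sketch only: Mat's bincond + flatten form, repeated for each key
data = load 'blah' as (uid:chararray, key:chararray, value:chararray);
grouped = group data by uid;
pivoted = foreach grouped {
    -- one filtered bag per key; a bag is empty when the person has no such key
    fil_age    = filter data by key == 'age';
    fil_colour = filter data by key == 'colour';
    fil_food   = filter data by key == 'food';
    -- swap each empty bag for a one-tuple placeholder so flatten keeps the row
    generate group as uid,
        flatten((IsEmpty(fil_age)    ? {('')} : fil_age.value))    as age,
        flatten((IsEmpty(fil_colour) ? {('')} : fil_colour.value)) as colour,
        flatten((IsEmpty(fil_food)   ? {('')} : fil_food.value))   as food;
}
```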

On 11 July 2011 15:44, Mat Kelcey <ma...@gmail.com> wrote:

> Thanks Thejas,
> I was using pig 0.9 (last night's trunk) and couldn't get the bincond +
> flatten combo to work...
> I'll reproduce tonight (if I get time) and reply with the exact error messages...
> Cheers,
> Mat
>
> On 11 July 2011 12:21, Thejas Nair <th...@hortonworks.com> wrote:
>
>> The nested-foreach statement is your friend!
>>
>> l = load 'b.pig' as (uid:chararray, key:chararray, value:chararray);
>> g = group l by uid;
>> f = foreach g {
>>     fil_age = filter l by key == 'age';
>>     fil_colour = filter l by key == 'colour';
>>     fil_food = filter l by key == 'food';
>>
>>     generate group as uid,
>>              MAX(fil_age.value) as age,
>>              MAX(fil_colour.value) as colour,
>>              MAX(fil_food.value) as food;
>> }
>>
>> I have used Jacob's idea of using MAX; I think that's cleaner than
>> flatten + bincond for this use case.
>>
>> The flatten + bincond syntax in your example should work in 0.9, which has
>> some fixes for schema merging issues.
>>
>> -Thejas

Re: trouble with syntax for flatten in a foreach

Posted by Thejas Nair <th...@hortonworks.com>.
The nested-foreach statement is your friend!

l = load 'b.pig' as (uid:chararray, key:chararray, value:chararray);
g = group l by uid;
f = foreach g {
    fil_age = filter l by key == 'age';
    fil_colour = filter l by key == 'colour';
    fil_food = filter l by key == 'food';

    generate group as uid,
             MAX(fil_age.value) as age,
             MAX(fil_colour.value) as colour,
             MAX(fil_food.value) as food;
}

I have used Jacob's idea of using MAX; I think that's cleaner than
flatten + bincond for this use case.

The flatten + bincond syntax in your example should work in 0.9, which has
some fixes for schema merging issues.

-Thejas
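
A note on why MAX fits here, offered as a sketch rather than something spelled out in the thread: with the sample data each filtered bag holds at most one value per person, so MAX simply returns that value, and MAX over an empty bag returns null, which lines up with the nulls in the desired output. Continuing from the relation f above:

```pig
-- with the four sample rows, each fil_* bag has zero or one tuple:
--   one tuple -> MAX returns that value unchanged
--   empty bag -> MAX returns null (dump prints it as an empty field)
dump f;
-- expected rows (row order is not guaranteed):
-- (bob,25,red,)
-- (fred,30,,bagels)
```

One caveat: the values are chararrays, so if a person ever had several rows for the same key, MAX would pick the largest string in lexical order rather than flagging the duplicate.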


