Posted to user@pig.apache.org by Anthony Urso <an...@cs.ucla.edu> on 2010/01/18 08:53:57 UTC

How to run a UDF on the result of GROUP BY

I can't figure out how to run a UDF on the result of "GROUP BY" from
the current documentation.

I'd like to do something along these lines:

A = LOAD 'A';
B = LOAD 'B';
C = JOIN A BY $0, B by $0;
D = GROUP C BY A::$0;
E = FOREACH D GENERATE FLATTEN(my.UDF(??????));
STORE E INTO 'E';

Specifically, what goes in the UDF call?  I've tried to use "D" there,
but it errors out claiming a bad alias.  Using "C" compiles, but does
not seem right to me.

Also, what is actually passed to the UDF.exec method in this case?

Thanks,
Anthony

Re: How to run a UDF on the result of GROUP BY

Posted by Dmitriy Ryaboy <dv...@gmail.com>.

Anthony,
What's happening is that a UDF gets called on fields, not on the whole
relation. After grouping, you have a relation D with two fields: "group"
(the grouping key) and "C" (a bag holding the tuples of C that share that
key). So when you say "foreach D generate" you are iterating over
pairs (group, C). You can call a UDF on group, on C, or on *.
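
To make the shape of D concrete, here is a rough sketch in plain Java. It
does not use the Pig API at all: java.util lists stand in for tuples and
bags, and the class name GroupShape, the method groupByFirstField and the
sample rows are all made up for illustration. It only mimics what grouping
on the first field produces, namely one (group, C) pair per distinct key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in, not Pig code: a List<Object> plays the role of a
// tuple and a List of such lists plays the role of a bag.
final class GroupShape {

    // Mimics grouping on the first field: each distinct key maps to the
    // "bag" of all rows that carry that key.
    static Map<Object, List<List<Object>>> groupByFirstField(List<List<Object>> rows) {
        Map<Object, List<List<Object>>> grouped = new LinkedHashMap<>();
        for (List<Object> row : rows) {
            grouped.computeIfAbsent(row.get(0), k -> new ArrayList<>()).add(row);
        }
        return grouped;
    }

    public static void main(String[] args) {
        // Made-up rows standing in for the joined relation C.
        List<List<Object>> c = new ArrayList<>();
        c.add(Arrays.<Object>asList("k1", "a1", "k1", "b1"));
        c.add(Arrays.<Object>asList("k2", "a2", "k2", "b2"));
        c.add(Arrays.<Object>asList("k1", "a3", "k1", "b3"));

        // Each entry corresponds to one record of D: (group, C).
        for (Map.Entry<Object, List<List<Object>>> d : groupByFirstField(c).entrySet()) {
            System.out.println("group=" + d.getKey() + "  C=" + d.getValue());
        }
    }
}
```

In this picture, a UDF called on C sees only the bag, a UDF called on group
sees only the key, and a UDF called on * sees both fields of the record.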

-D

On Mon, Jan 18, 2010 at 12:44 AM, Pallavi Palleti
<pa...@corp.aol.com> wrote:
> If you want to apply a UDF to each record produced by the GROUP BY, pass C
> to the UDF and cast the first field of the input tuple to DataBag in the
> exec(Tuple) method.
>
> Ex:
>
> E = FOREACH D GENERATE FLATTEN(my.UDF(C));
>
>
> <your return type> exec(Tuple t)
> {
>    // field 0 of the input tuple is the bag C
>    DataBag yourNameBag = (DataBag) t.get(0);
>    .......
> }
>
> Thanks
> Pallavi

Re: How to run a UDF on the result of GROUP BY

Posted by Pallavi Palleti <pa...@corp.aol.com>.

If you want to apply a UDF to each record produced by the GROUP BY, pass C
to the UDF and cast the first field of the input tuple to DataBag in the
exec(Tuple) method.

Ex:

E = FOREACH D GENERATE FLATTEN(my.UDF(C));


<your return type> exec(Tuple t)
{
     // field 0 of the input tuple is the bag C
     DataBag yourNameBag = (DataBag) t.get(0);
     .......
}
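
As a rough illustration of what the body of such a method might do with the
bag, here is a sketch in plain Java. It deliberately avoids the Pig classes
so that it runs on its own: a List<Object> stands in for Tuple, a list of
such lists stands in for DataBag, and the class name BagUdfSketch and the
(key, row count) result are assumptions made up for this example. A real
UDF would extend Pig's EvalFunc and work with Tuple and DataBag instead.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in, not Pig code: models the control flow of an
// exec method that receives the bag C as field 0 of its input.
final class BagUdfSketch {

    // 'input' plays the role of the argument tuple. When the script passes
    // C to the UDF, the tuple has a single field: the bag for one group.
    @SuppressWarnings("unchecked")
    static List<Object> exec(List<Object> input) {
        if (input == null || input.isEmpty() || input.get(0) == null) {
            return null; // nothing to process
        }
        // Same idea as casting field 0 to a bag in a real UDF.
        List<List<Object>> bag = (List<List<Object>>) input.get(0);
        if (bag.isEmpty()) {
            return null;
        }
        long rowCount = 0;
        for (List<Object> row : bag) {
            rowCount++; // a real UDF would inspect each row here
        }
        // Return a tuple-like result: (first field of the first row, row count).
        return Arrays.<Object>asList(bag.get(0).get(0), rowCount);
    }

    public static void main(String[] args) {
        // Made-up bag for a single group.
        List<Object> bag = Arrays.<Object>asList(
                Arrays.<Object>asList("k1", "a1"),
                Arrays.<Object>asList("k1", "a2"));
        System.out.println(exec(Arrays.<Object>asList(bag)));
    }
}
```

The method is invoked once per record of D, so it sees one group's bag at a
time rather than the whole relation.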

Thanks
Pallavi


Re: How to run a UDF on the result of GROUP BY

Posted by Rekha Joshi <re...@yahoo-inc.com>.

It depends on how you have coded your UDF. For some examples of how to work with UDFs, refer to http://hadoop.apache.org/pig/docs/r0.5.0/udf.html

Cheers,
/R

