Posted to user@pig.apache.org by Anthony Urso <an...@cs.ucla.edu> on 2010/01/18 08:53:57 UTC
How to run a UDF on the result of GROUP BY
I can't figure out how to run a UDF on the result of "GROUP BY" from
the current documentation.
I'd like to do something along these lines:
A = LOAD 'A';
B = LOAD 'B';
C = JOIN A BY $0, B by $0;
D = GROUP C BY A::$0;
E = FOREACH D GENERATE FLATTEN(my.UDF(??????));
STORE E INTO 'E';
Specifically, what goes in the UDF call? I've tried to use "D" there,
but it errors out claiming a bad alias. Using "C" compiles, but does
not seem right to me.
Also, what is actually passed to the UDF.exec method in this case?
Thanks,
Anthony
Re: How to run a UDF on the result of GROUP BY
Posted by Dmitriy Ryaboy <dv...@gmail.com>.
Anthony,
What's happening is that a UDF gets called on fields, not on the whole
relation. After grouping, you have a relation D with fields "group"
and "C". So when you say "foreach D generate" you are iterating over
pairs (group, C). You can call a UDF on group, on C, or on *.
-D
On Mon, Jan 18, 2010 at 12:44 AM, Pallavi Palleti
<pa...@corp.aol.com> wrote:
> If you want to apply a UDF to each record generated by the GROUP BY,
> pass C to the UDF and cast the first field of the input tuple to a
> DataBag in the exec(Tuple) method.
>
> Ex:
>
> E = FOREACH D GENERATE FLATTEN(my.UDF(C));
>
>
> <your return type> exec(Tuple t) throws IOException
> {
>     DataBag yourNameBag = (DataBag) t.get(0);
>     .......
> }
>
> Thanks
> Pallavi
Re: How to run a UDF on the result of GROUP BY
Posted by Pallavi Palleti <pa...@corp.aol.com>.
If you want to apply a UDF to each record generated by the GROUP BY,
pass C to the UDF and cast the first field of the input tuple to a
DataBag in the exec(Tuple) method.
Ex:
E = FOREACH D GENERATE FLATTEN(my.UDF(C));
<your return type> exec(Tuple t) throws IOException
{
    DataBag yourNameBag = (DataBag) t.get(0);
    .......
}
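To make the shape of the data concrete, here is a rough sketch in plain Java. It is only a model: List<Object> stands in for a Pig Tuple and List<List<Object>> for a DataBag, and the class name, method names, and sample rows are all made up for illustration (none of them are Pig's API). It shows that each exec call gets a one-field tuple whose first field is the whole bag for one group.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model only: List<Object> stands in for a Pig Tuple and
// List<List<Object>> for a DataBag. None of these names are Pig's API.
final class GroupedUdfSketch {

    // Models "D = GROUP C BY <first field>": one entry per key, whose value
    // is the bag of C rows sharing that key (the "C" field of D).
    static Map<Object, List<List<Object>>> groupByFirstField(List<List<Object>> rows) {
        Map<Object, List<List<Object>>> grouped = new LinkedHashMap<>();
        for (List<Object> row : rows) {
            grouped.computeIfAbsent(row.get(0), key -> new ArrayList<>()).add(row);
        }
        return grouped;
    }

    // Models exec(Tuple t) when the script calls my.UDF(C): the argument
    // tuple has one field, and that field is the whole bag for one group.
    @SuppressWarnings("unchecked")
    static long exec(List<Object> t) {
        List<List<Object>> bag = (List<List<Object>>) t.get(0);
        return bag.size(); // hypothetical UDF body: count the rows in the bag
    }

    public static void main(String[] args) {
        // Hypothetical rows of C; the first field is the grouping key.
        List<List<Object>> c = new ArrayList<>();
        c.add(Arrays.<Object>asList("k1", "a"));
        c.add(Arrays.<Object>asList("k2", "b"));
        c.add(Arrays.<Object>asList("k1", "c"));

        // One exec call per (group, C) pair, as in FOREACH D GENERATE my.UDF(C).
        for (Map.Entry<Object, List<List<Object>>> d : groupByFirstField(c).entrySet()) {
            List<Object> t = Collections.<Object>singletonList(d.getValue());
            System.out.println(d.getKey() + " -> " + exec(t));
        }
    }
}
```

In a real UDF the cast would be to DataBag and the loop would iterate over Tuple objects, but the nesting is the same: argument tuple, then bag, then rows.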
Thanks
Pallavi
Re: How to run a UDF on the result of GROUP BY
Posted by Rekha Joshi <re...@yahoo-inc.com>.
It depends on how you have coded your UDF. For some examples of how to work with UDFs, refer to http://hadoop.apache.org/pig/docs/r0.5.0/udf.html
Cheers,
/R