Posted to user@pig.apache.org by Dmitriy Lyubimov <dl...@apache.org> on 2011/04/16 03:20:49 UTC

Algebraic UDF with one bag and one non-bag parameter

Hi,

Is it possible to create an aggregating function with two parameters, one
of which is a bag and the other is not?

In particular, I want to use that to work around the lack of
per-invocation configuration for function instances.

Say I have a function that can aggregate over some period of history,
for example aggregate(days, sampleBag).

sampleBag is a bag of tuples of the form (value, time). I want to use
the function multiple times in the same script: one invocation to
aggregate exponentially over 30 days and another to aggregate the same
bag over 7 days. The exponential scale depends on this time parameter.
So I want to use it in something like

B = foreach A generate aggregate(30, sampleBag) as agg30days,
aggregate(7, sampleBag) as agg7days;

Question 1 -- is this even a valid format for a function implementing Algebraic?
Question 2 -- would I also be able to use the Accumulator interface?

If not, how can I parameterize invocations? I know the UDF manual says I
really can't, so if that is the way it is, it would be very, very sad.
I would really hate to create versions such as
aggregate7days(bag), aggregate30days(bag)...

Thanks.
-d

Re: Algebraic UDF with one bag and one non-bag parameter

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
>
> DEFINE func mypackage.myfunc(parameter);

Thanks! This is so cool. Holy grail, literally. I think this was not
available, at least in 0.6? Since when has this been available for eval
funcs?


>
> So you could also instantiate 2 versions.
>

Re: Algebraic UDF with one bag and one non-bag parameter

Posted by Jonathan Coveney <jc...@gmail.com>.
1) You absolutely can do what you want to do. Literally just make it the
second input, and in your UDF's exec method you'll have something like...
DataBag inbag = (DataBag)input.get(0); //the input
Whatever thing = (Whatever)input.get(1); //and so on
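
To make the positional-argument idea concrete, here is a rough plain-Java sketch. It is not real Pig code: a List<Object> stands in for the input tuple, a list of {value, ageInDays} pairs stands in for the bag, and the class name and the exp(-age/days) weighting are assumptions made up for illustration.

```java
import java.util.List;

// Hypothetical stand-in for a UDF; List<Object> plays the role of the input tuple.
final class AggregateWithDays {
    // input.get(0): the bag, modeled here as a list of {value, ageInDays} pairs
    // input.get(1): the per-call "days" parameter
    static double exec(List<Object> input) {
        @SuppressWarnings("unchecked")
        List<double[]> inbag = (List<double[]>) input.get(0);
        int days = (Integer) input.get(1);

        double sum = 0.0;
        for (double[] sample : inbag) {
            // Assumed weighting: older samples decay as exp(-age / days).
            sum += sample[0] * Math.exp(-sample[1] / days);
        }
        return sum;
    }
}
```

The point is only that the non-bag argument arrives next to the bag on every call, so the same function can be invoked with 30 in one place and 7 in another.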

But beyond that, you can pass parameters to the constructor like so:

DEFINE func mypackage.myfunc(parameter);

So you could also instantiate 2 versions.
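
A minimal plain-Java sketch of the constructor approach, again without the Pig classes. The class name ExpAggregate, the {value, ageInDays} sample layout, and the weighted-average formula are all hypothetical. The constructor takes a String because, as far as I know, Pig passes DEFINE arguments to the constructor as strings.

```java
import java.util.List;

// Hypothetical stand-in for an eval func configured once per DEFINE.
final class ExpAggregate {
    private final double days;

    // The time scale arrives as a string and is parsed once, at construction.
    ExpAggregate(String days) {
        this.days = Double.parseDouble(days);
    }

    // Each sample is {value, ageInDays}; returns an exponentially weighted average.
    double exec(List<double[]> sampleBag) {
        double weighted = 0.0;
        double weights = 0.0;
        for (double[] sample : sampleBag) {
            double w = Math.exp(-sample[1] / days); // assumed decay formula
            weighted += w * sample[0];
            weights += w;
        }
        // An empty bag yields 0.0 rather than NaN (an arbitrary choice for this sketch).
        return weights == 0.0 ? 0.0 : weighted / weights;
    }
}
```

Two instances, new ExpAggregate("30") and new ExpAggregate("7"), then play the role of two DEFINEs of the same class under different aliases, so the script never needs separate aggregate7days/aggregate30days classes.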
