Posted to user@pig.apache.org by Dmitriy Ryaboy <dv...@gmail.com> on 2010/06/01 01:50:11 UTC

Re: UDF question

Asif,
You have to group the tuples into a bag first. In fact, COUNT requires this
too; only DISTINCT doesn't, and that's because DISTINCT isn't a built-in
function at all but a whole separate operator (don't worry about it if that
doesn't make sense).
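For what it's worth, here is a rough, untested sketch of the difference,
reusing the ds relation from your script (the aliases A, n and u are just
names I made up for illustration):

```pig
-- COUNT is a function over a bag, so the tuples have to be grouped first.
A = GROUP ds ALL;
n = FOREACH A GENERATE COUNT(ds);

-- DISTINCT is a relational operator, so it applies to the relation directly.
u = DISTINCT ds;
```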
Depending on how you define your periods, you may be able to avoid the
GROUP ALL by grouping on something narrower, such as the time truncated to
the hour.
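A rough, untested sketch of that idea, reusing ds and PeriodSearchFunc from
your script. The bucket width of 10.0 time units is purely an assumption on
my part, and the aliases keyed, grouped and bucket are made up; pick whatever
window suits your data:

```pig
-- Tag every row with a bucket id: the time truncated to a fixed width.
-- The width 10.0 is a placeholder, not a recommendation.
keyed = FOREACH ds GENERATE (long)(times / 10.0) AS bucket,
                            times, mag1, err1, mag2, err2;

-- Group on the bucket instead of using GROUP ALL.
grouped = GROUP keyed BY bucket;

-- Each group carries a bag named keyed; project away the bucket column so
-- the UDF sees the same five fields it saw with GROUP ALL.
B = FOREACH grouped GENERATE group AS bucket,
        FLATTEN(PeriodSearchFunc(keyed.(times, mag1, err1, mag2, err2)));
```

Keep in mind that the function then only sees one bucket at a time, so this
presumably only helps if the periods you are looking for fit inside a bucket.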

-Dmitriy

On Mon, May 31, 2010 at 7:16 AM, Asif Jan <as...@gmail.com> wrote:

> Thanks,
>
> I was confused by the input to the exec method, i.e. the Tuple. Now I
> understand that each object in the tuple can be of a simple or complex type.
>
> I have one more question though. The only way I was able to make my
> function work was:
>
> grunt>  ds = LOAD 'data/timeseries' using PigStorage('\t') as
> (times:double, mag1:double, err1:double, mag2:double, err2:double);
> grunt>  A = group ds all;
> grunt> B = foreach A {result = PeriodSearchFunc(ds); generate
> flatten(result);};
>
> i.e. I was forced to wrap it in a Bag and then use foreach. Is it possible
> to use it as follows:
>
>
> grunt>  ds = LOAD 'data/timeseries' using PigStorage('\t') as
> (times:double, mag1:double, err1:double, mag2:double, err2:double);
> grunt> B = PeriodSearchFunc(ds);
>
> (in the same manner as the DISTINCT or COUNT built-ins)
>
> thanks again
>
>
>
> On May 29, 2010, at 3:01 AM, Dmitriy Ryaboy wrote:
>
>> Sounds like you want an EvalFunc that returns a Bag of Tuples, with each
>> tuple having 2 fields. Pretty straightforward.
>> You don't have to implement the algebraic interface (or the accumulator
>> interface) -- those are optimizations for working with large datasets, and
>> not required for anything other than scalability.
>>
>> (hc -- chickens won't come out cause pig won't know how to serialize the
>> thing. You have to turn your chicken into a bytearray).
>>
>> -D
>>
>>
>> On Fri, May 28, 2010 at 5:29 PM, hc busy <hc...@gmail.com> wrote:
>>
>>> Couldn't you give EvalFunc<any return type> any return type? So you can
>>> just return a Bag that contains tuples of tuples, right? And it's easy
>>> because Tuple is an unparameterized type (and so is Bag), so you'd declare
>>>
>>>
>>> class myUdf extends EvalFunc<Bag>{...}
>>>
>>> I haven't tried this, but sometimes I'm tempted to return something
>>> weird like
>>>
>>> EvalFunc<Chicken>
>>>
>>> and see chickens come out of pig. ;-) heheheheeee
>>>
>>>
>>> Anyways, in all seriousness, there is a UDF that converts data to a bag
>>> (well, currently a contrib UDF, but it may make it into builtin) that I
>>> wrote called ToBag. Here's the initial declaration for it:
>>>
>>> public class ToBag extends EvalFunc<DataBag>
>>>
>>>
>>> Your class would be declared similarly.
>>>
>>> On Fri, May 28, 2010 at 7:50 AM, Asif Jan <as...@gmail.com> wrote:
>>>
>>>> Hello
>>>>
>>>> I need some help to get started with using Pig UDF.
>>>>
>>>> I have time series data (time, magA, errA, magB, errB) e.g.
>>>>
>>>> (2345.59777,19.875,0.481,20.225,0.482)
>>>> (2347.59568,19.371,0.3,20.227,0.743)
>>>> (2351.6075,19.063,0.193,20.768,1.085)
>>>> (2354.59702,20.689,3.047,20.873,1.758)
>>>> (2356.63223,21.23,3.341,20.562,1.242)
>>>>
>>>>
>>>> and I need to apply an algorithm that searches for periods in the data.
>>>> The input to the algorithm is the (time, magX, errX) arrays. The algo
>>>> returns a List of all periods found. Each entry in the List is a
>>>> (period_value, period_significance) pair.
>>>>
>>>>
>>>> How can I wrap that algo as a UDF? Do I have to use algebraic functions
>>>> (but I saw that they could only return scalar values)? What I need to
>>>> return from the function is something like
>>>>
>>>> (1000.0,0.57)
>>>> (234, .45)
>>>> (100, 0.023)
>>>> (6, 0.003)
>>>>
>>>>
>>>> thanks a lot
>>>>