Posted to user@pig.apache.org by Asif Jan <as...@gmail.com> on 2010/05/28 16:50:06 UTC

UDF question

Hello

I need some help to get started with using Pig UDF.

I have time series data (time, magA, errA, magB, errB) e.g.

(2345.59777,19.875,0.481,20.225,0.482)
(2347.59568,19.371,0.3,20.227,0.743)
(2351.6075,19.063,0.193,20.768,1.085)
(2354.59702,20.689,3.047,20.873,1.758)
(2356.63223,21.23,3.341,20.562,1.242)


and I need to apply an algorithm that searches for periods in the
data. The input to the algorithm is the (time, magX, errX) arrays.
The algorithm returns a List of all periods found. Each entry in the
List is a (period_value, period_significance) pair.


How can I wrap that algorithm as a UDF? Do I have to use algebraic
functions (I saw that they can only return scalar values)? What I
need to return from the function is something like

(1000.0,0.57)
(234, .45)
(100, 0.023)
(6, 0.003)


thanks a lot

map join

Posted by Gang Luo <lg...@yahoo.com.cn>.

Hi all,
Does the map join in Pig use the distributed cache by default? Can the user change this? Thanks.

-Gang

Re: UDF question

Posted by Dmitriy Ryaboy <dv...@gmail.com>.

Asif,
You have to group tuples into a bag. In fact, COUNT requires this too;
only DISTINCT doesn't, and that's because it's not a built-in function
but a whole separate operator (don't worry about it if that doesn't make
sense).
You may be able to avoid doing a group all, depending on how you define
your periods, by doing things like grouping on time truncated to an hour.

-Dmitriy
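
As a rough illustration of the "grouping on time truncated" idea, here is a
plain-Java sketch, not Pig code. It assumes each row is a double[] whose first
element is the time; the class name, the method name, and the bucket width of
5.0 are arbitrary choices made for this example (the thread does not say what
unit the time column uses). Each resulting bucket plays the role of one group's
bag, which a UDF would then receive in a single call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: models "group by truncated time" with plain collections.
class TimeBucketSketch {

    // Groups rows by their time value (row[0]) truncated to a multiple of bucketWidth.
    static Map<Long, List<double[]>> groupByTruncatedTime(List<double[]> rows, double bucketWidth) {
        Map<Long, List<double[]>> buckets = new TreeMap<>();
        for (double[] row : rows) {
            // The bucket key is the index of the interval the time falls into.
            long key = (long) Math.floor(row[0] / bucketWidth);
            buckets.computeIfAbsent(key, k -> new ArrayList<>()).add(row);
        }
        return buckets;
    }

    public static void main(String[] args) {
        // (time, magA, errA) rows taken from the sample data in the original post.
        List<double[]> rows = new ArrayList<>();
        rows.add(new double[] {2345.59777, 19.875, 0.481});
        rows.add(new double[] {2347.59568, 19.371, 0.3});
        rows.add(new double[] {2351.6075, 19.063, 0.193});
        rows.add(new double[] {2354.59702, 20.689, 3.047});
        rows.add(new double[] {2356.63223, 21.23, 3.341});

        // Print how many rows landed in each bucket.
        for (Map.Entry<Long, List<double[]>> e : groupByTruncatedTime(rows, 5.0).entrySet()) {
            System.out.println("bucket " + e.getKey() + ": " + e.getValue().size() + " rows");
        }
    }
}
```

Whether bucketing like this is acceptable depends on the period search itself:
a period longer than one bucket cannot be found from that bucket's rows alone.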

On Mon, May 31, 2010 at 7:16 AM, Asif Jan <as...@gmail.com> wrote:

> Thanks,
>
> I was confused about the input to the exec method, i.e. the Tuple. Now I
> understand that each object in the tuple can be of simple or complex type.
>
> I have one more question though. The only way I was able to make my
> function work was:
>
> grunt> ds = LOAD 'data/timeseries' using PigStorage('\t') as
> (times:double, mag1:double, err1:double, mag2:double, err2:double);
> grunt> A = group ds all;
> grunt> B = foreach A {result = PeriodSearchFunc(ds); generate
> flatten(result);};
>
> i.e. I was forced to wrap it in a Bag and then use foreach. Is it possible
> to use it as follows:
>
>
> grunt> ds = LOAD 'data/timeseries' using PigStorage('\t') as
> (times:double, mag1:double, err1:double, mag2:double, err2:double);
> grunt> B = PeriodSearchFunc(ds);
>
> (in the same manner as the DISTINCT or COUNT built-ins)
>
> thanks again

Re: UDF question

Posted by Asif Jan <as...@gmail.com>.

Thanks,

I was confused about the input to the exec method, i.e. the Tuple. Now I
understand that each object in the tuple can be of simple or complex type.

I have one more question though. The only way I was able to make my
function work was:

grunt> ds = LOAD 'data/timeseries' using PigStorage('\t') as
(times:double, mag1:double, err1:double, mag2:double, err2:double);
grunt> A = group ds all;
grunt> B = foreach A {result = PeriodSearchFunc(ds); generate
flatten(result);};

i.e. I was forced to wrap it in a Bag and then use foreach. Is it
possible to use it as follows:


grunt> ds = LOAD 'data/timeseries' using PigStorage('\t') as
(times:double, mag1:double, err1:double, mag2:double, err2:double);
grunt> B = PeriodSearchFunc(ds);

(in the same manner as the DISTINCT or COUNT built-ins)

thanks again


On May 29, 2010, at 3:01 AM, Dmitriy Ryaboy wrote:

> Sounds like you want an EvalFunc that returns a Bag of Tuples, with each
> tuple having 2 fields. Pretty straightforward.
> You don't have to implement the algebraic interface (or the accumulator
> interface) -- those are optimizations for working with large datasets, and
> not required for anything other than scalability.
>
> (hc -- chickens won't come out because Pig won't know how to serialize the
> thing. You have to turn your chicken into a bytearray.)
>
> -D

Re: UDF question

Posted by Dmitriy Ryaboy <dv...@gmail.com>.

Sounds like you want an EvalFunc that returns a Bag of Tuples, with each
tuple having 2 fields. Pretty straightforward.
You don't have to implement the algebraic interface (or the accumulator
interface) -- those are optimizations for working with large datasets, and
not required for anything other than scalability.

(hc -- chickens won't come out because Pig won't know how to serialize the
thing. You have to turn your chicken into a bytearray.)

-D
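
To make the shape of such a function concrete, here is a minimal plain-Java
sketch of the data flow, written without the Pig jar so that it runs on its
own. It is not a Pig UDF: List<double[]> stands in for both the input bag and
the returned bag, and the names PeriodSearchSketch, exec, and findPeriods are
hypothetical. The findPeriods body is a dummy placeholder, since the thread
never shows the actual period-search algorithm. In a real UDF the class would
extend EvalFunc<DataBag>, and the rows and results would be Pig Tuple objects
inside a DataBag.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java stand-in for the body of a bag-returning UDF.
class PeriodSearchSketch {

    // Hypothetical placeholder for the real period-search algorithm.
    // Returns (period_value, period_significance) pairs.
    static List<double[]> findPeriods(double[] time, double[] mag, double[] err) {
        List<double[]> periods = new ArrayList<>();
        if (time.length >= 2) {
            // Dummy result only: the observed time span with a significance of 0.
            double span = time[time.length - 1] - time[0];
            periods.add(new double[] {span, 0.0});
        }
        return periods;
    }

    // Mirrors what the UDF's exec method would do: unpack the rows of the bag
    // into column arrays, run the search, and return two-field entries.
    static List<double[]> exec(List<double[]> rows, int magCol, int errCol) {
        int n = rows.size();
        double[] time = new double[n];
        double[] mag = new double[n];
        double[] err = new double[n];
        for (int i = 0; i < n; i++) {
            double[] row = rows.get(i);
            time[i] = row[0];
            mag[i] = row[magCol];
            err[i] = row[errCol];
        }
        return findPeriods(time, mag, err);
    }

    public static void main(String[] args) {
        // Two rows from the sample data in the original post.
        List<double[]> rows = new ArrayList<>();
        rows.add(new double[] {2345.59777, 19.875, 0.481, 20.225, 0.482});
        rows.add(new double[] {2347.59568, 19.371, 0.3, 20.227, 0.743});

        // Use columns 1 and 2 (magA, errA); columns 3 and 4 would select magB, errB.
        for (double[] p : exec(rows, 1, 2)) {
            System.out.println("(" + p[0] + "," + p[1] + ")");
        }
    }
}
```

Because the result is a collection of two-field entries rather than a scalar,
nothing here needs the algebraic or accumulator interfaces.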


On Fri, May 28, 2010 at 5:29 PM, hc busy <hc...@gmail.com> wrote:

> Couldn't you give EvalFunc<any return type> any return type? So you can just
> return a Bag that contains tuples of tuples, right? And it's easy because
> Tuple is an unparameterized type (and so is DataBag), so you'd declare
>
>
> class myUdf extends EvalFunc<DataBag>{...}
>
> I haven't tried this, but sometimes I'm tempted to return something weird
> like
>
> EvalFunc<Chicken>
>
> and see chickens come out of pig. ;-) heheheheeee
>
>
> Anyways, in all seriousness, there is a UDF that converts data to a bag
> (well, currently a contrib UDF, but it may make it into the builtins) that I
> wrote, called ToBag. Here's the initial declaration for it:
>
> public class ToBag extends EvalFunc<DataBag>
>
>
> Your class would be declared similarly.

Re: UDF question

Posted by hc busy <hc...@gmail.com>.

Oh, that's a good point, you can't just return arbitrary types... even if I
derive from a base class. Interesting.

Well, the combination of toTuple and toBag will accomplish many tasks. One
thing that I had to do was collapse three columns into one row. (You won't
believe how many companies have legacy databases like this, or how much money
flows through this kind of system out there. ;-)

So I do

FOREACH input_table GENERATE k0, f0, toBag(toTuple(k1, column_1),toTuple(k1,
column_2), toTuple(f1, column3));

And this gets me where I needed to be. It's similar to what Asif was asking,
except he wants to do a more complicated combination inside his UDF. But
if it's time series, wouldn't we get where we need to be with a group and
order by?
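
For readers unfamiliar with the toBag/toTuple combination in the statement
above, the nesting it produces can be modeled in plain Java. This is only a
sketch of the resulting structure, not the contrib UDFs themselves: the two
static methods below are stand-ins that share the UDF names, and the string
values are made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java model of the structure built by toBag(toTuple(...), ...).
class ColumnsToBagSketch {

    // Stand-in for toTuple: packs its arguments into one ordered list.
    static List<Object> toTuple(Object... fields) {
        return new ArrayList<>(Arrays.asList(fields));
    }

    // Stand-in for toBag: collects the given tuples into one list.
    @SafeVarargs
    static List<List<Object>> toBag(List<Object>... tuples) {
        return new ArrayList<>(Arrays.asList(tuples));
    }

    public static void main(String[] args) {
        // Three (key, value) pairs built from the columns of one input row.
        List<List<Object>> bag = toBag(
                toTuple("k1", "value of column_1"),
                toTuple("k1", "value of column_2"),
                toTuple("f1", "value of column3"));

        // Each element of the bag is one two-field tuple.
        for (List<Object> tuple : bag) {
            System.out.println(tuple);
        }
    }
}
```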

I'd like to mention again, I'd really like to see nested foreach, group,
cross, and union be allowed into the set of nested_op inside foreach.


Re: UDF question

Posted by hc busy <hc...@gmail.com>.

Couldn't you give EvalFunc<any return type> any return type? So you can just
return a Bag that contains tuples of tuples, right? And it's easy because
Tuple is an unparameterized type (and so is DataBag), so you'd declare

class myUdf extends EvalFunc<DataBag>{...}

I haven't tried this, but sometimes I'm tempted to return something weird
like

EvalFunc<Chicken>

and see chickens come out of pig. ;-) heheheheeee

Anyways, in all seriousness, there is a UDF that converts data to a bag (well,
currently a contrib UDF, but it may make it into the builtins) that I wrote,
called ToBag. Here's the initial declaration for it:

public class ToBag extends EvalFunc<DataBag>

Your class would be declared similarly.

On Fri, May 28, 2010 at 7:50 AM, Asif Jan <as...@gmail.com> wrote:

> Hello
>
> I need some help to get started with using Pig UDF.
>
> I have time series data (time, magA, errA, magB, errB) e.g.
>
> (2345.59777,19.875,0.481,20.225,0.482)
> (2347.59568,19.371,0.3,20.227,0.743)
> (2351.6075,19.063,0.193,20.768,1.085)
> (2354.59702,20.689,3.047,20.873,1.758)
> (2356.63223,21.23,3.341,20.562,1.242)
>
>
> and I need to apply an algorithm that searches for periods in the data.
> The input to the algorithm is the (time, magX, errX) arrays. The algorithm
> returns a List of all periods found. Each entry in the List is a
> (period_value, period_significance) pair.
>
>
> How can I wrap that algorithm as a UDF? Do I have to use algebraic functions
> (I saw that they can only return scalar values)? What I need to
> return from the function is something like
>
> (1000.0,0.57)
> (234, .45)
> (100, 0.023)
> (6, 0.003)
>
>
> thanks a lot