Posted to user@pig.apache.org by Benoit Mathieu <bm...@deezer.com> on 2012/07/25 18:32:26 UTC

when Algebraic UDF are used ?

Hi pig users,

I have written my own algebraic UDF in Java, but Pig does not seem to use
the algebraic interface at all. (I put log messages in my
Initial, Intermed and Final functions, and they are never logged.)
Pig only calls the main "exec" function.

My UDF needs its input bag to be sorted.
Here is my Pig script:

A = LOAD '...' USING PigStorage() AS (k1:int,k2:int,value:int);
B = GROUP A BY k1;
C = FOREACH B {
  tmp = ORDER A.(k2,value) BY k2;
  GENERATE group, MyUDF(tmp);
}
...


Does anyone know why Pig does not use the algebraic interface?

thanks,

Benoit

Re: when Algebraic UDF are used ?

Posted by pablomar <pa...@gmail.com>.
Side note: sorry if that sounded bad. It was not an RTFM response. I just
sent you the best explanation I could, and that book explains it better
than I can.



Re: when Algebraic UDF are used ?

Posted by Benoit Mathieu <bm...@deezer.com>.
Thanks !

++
benoit


Re: when Algebraic UDF are used ?

Posted by pablomar <pa...@gmail.com>.
From the same book
(http://ofps.oreilly.com/titles/9781449302641/writing_udfs.html):

"Memory Issues in Eval Funcs

Some operations you will do in your UDFs will require more memory than is
available. As an example you may want to build a UDF that calculates the
cumulative sum of a set of inputs. This will return a bag of values since
for each input it needs to return the intermediate sum at that input.

Pig's bags handle spilling data to disk automatically when they pass a
certain size threshold, or when only a certain amount of heap space
remains. Spilling to disk is expensive, and whenever possible should be
avoided. But if you must store large amounts of data in a bag, Pig will
manage it.

Bags are the only Pig datatype that know how to spill. Tuple and maps must
fit into memory. Bags that are too large to fit in memory can still be
referenced in a tuple or a map. This will not be counted as those tuples or
maps not fitting into memory"
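
The cumulative-sum example mentioned in the quote can be sketched in plain Java. This is only an illustration of the computation: a List stands in for a Pig bag, the class and method names are made up for this note, and nothing here uses the Pig API or handles spilling. It does show why the result is itself a bag: there is one output value per input value, and the running total depends on the order of the input.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustration only: a List stands in for a Pig bag; no Pig classes are used.
class CumulativeSumSketch {
    // Returns one running total per input value, in input order.
    static List<Long> cumulativeSum(List<Integer> values) {
        List<Long> out = new ArrayList<>(values.size());
        long running = 0;
        for (int v : values) {
            running += v;      // total of everything seen so far
            out.add(running);  // one output element per input element
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(cumulativeSum(Arrays.asList(3, 1, 4)));
    }
}
```

Because the output grows with the input, a real UDF of this kind would presumably rely on Pig's bag implementation to spill when the result gets large, as the quoted passage describes.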




Re: when Algebraic UDF are used ?

Posted by Benoit Mathieu <bm...@deezer.com>.
Thanks for your answers.

So, I have further questions.
Sorting the bag myself in my UDF would solve my problem, but I don't know
what happens with bags that do not fit in memory.
How does Pig manage large bags? How are they passed to a UDF?

++
benoit



Re: when Algebraic UDF are used ?

Posted by Alan Gates <ga...@hortonworks.com>.
Pig can't use the algebraic interface in this case because the data has to be sorted (which means Pig has to see all the data) before it is passed to your UDF. If you remove the ORDER statement, the algebraic portion of your UDF will be invoked.

Alan.
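
A small plain-Java sketch may help show why a computation that needs the whole sorted input cannot be split into partial results. It does not use the Pig API, and the maxGap function is a hypothetical example chosen for this note, not the UDF from the original question. The two chunks stand in for the pieces of one group that different map tasks might see.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Illustration only: chunks stand in for the parts of a group seen by different mappers.
class OrderDependenceSketch {
    // Decomposable: the sum of per-chunk sums equals the sum over all the data.
    static long sum(List<Integer> xs) {
        long s = 0;
        for (int x : xs) {
            s += x;
        }
        return s;
    }

    // Needs the whole sorted input: largest difference between neighbouring values.
    static int maxGap(List<Integer> xs) {
        List<Integer> sorted = new ArrayList<>(xs);
        Collections.sort(sorted);
        int best = 0;
        for (int i = 1; i < sorted.size(); i++) {
            best = Math.max(best, sorted.get(i) - sorted.get(i - 1));
        }
        return best;
    }

    public static void main(String[] args) {
        List<Integer> chunk1 = Arrays.asList(1, 10);
        List<Integer> chunk2 = Arrays.asList(2, 11);
        List<Integer> all = Arrays.asList(1, 10, 2, 11);

        // Partial sums combine to the right answer.
        System.out.println(sum(chunk1) + sum(chunk2) == sum(all));
        // Partial gaps do not: each chunk misses the neighbours held by the other chunk.
        System.out.println(Math.max(maxGap(chunk1), maxGap(chunk2)) == maxGap(all));
    }
}
```

In the script from the question, the nested ORDER plays the role of the sort inside maxGap: it has to run over the complete bag for a group before the UDF is called, so there is no point at which partial results could be combined.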



Re: when Algebraic UDF are used ?

Posted by pablomar <pa...@gmail.com>.
according to: http://ofps.oreilly.com/titles/9781449302641/writing_udfs.html

"Implementing Algebraic does not guarantee that the algebraic
implementation will always be used. Pig only chooses the algebraic
implementation if all UDFs in the same foreach statement are algebraic.
This is because our testing has shown that using the combiner with data
that cannot be combined significantly slows down the job. And there is no
way in Hadoop to route some data to the combiner (for algebraic functions)
and some straight to the reducer (for non-algebraic). This means that your
UDF must always implement the exec method, even if you hope it will always
be used in the algebraic mode. It is also an additional motivation to
implement algebraic for your UDFs when possible."
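
The requirement that exec must always be implemented can be illustrated with a plain-Java count. This is a simplified sketch, not the Pig API: in Pig, the Algebraic interface exposes getInitial(), getIntermed() and getFinal(), which return the class names of the functions for each stage, whereas the static methods below (initial, intermed, fin) are hypothetical stand-ins with simplified signatures. The point is only that both paths have to give the same answer, because Pig may pick either one.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustration only: simplified stand-ins for the exec path and the three algebraic stages.
class CountSketch {
    // Non-algebraic path: sees the whole bag for a group at once.
    static long exec(List<String> bag) {
        return bag.size();
    }

    // Initial stage: turns one input record into a partial count.
    static long initial(String record) {
        return 1L;
    }

    // Intermediate stage: merges partial counts; it may run any number of times.
    static long intermed(List<Long> partials) {
        long total = 0;
        for (long p : partials) {
            total += p;
        }
        return total;
    }

    // Final stage: merges the remaining partial counts into the result.
    static long fin(List<Long> partials) {
        return intermed(partials);
    }

    // Runs the algebraic path over records that arrive split into chunks.
    static long algebraic(List<List<String>> chunks) {
        List<Long> merged = new ArrayList<>();
        for (List<String> chunk : chunks) {
            List<Long> partials = new ArrayList<>();
            for (String record : chunk) {
                partials.add(initial(record));
            }
            merged.add(intermed(partials)); // one combined value per chunk
        }
        return fin(merged);
    }

    public static void main(String[] args) {
        List<String> bag = Arrays.asList("a", "b", "c", "d", "e");
        List<List<String>> chunks = Arrays.asList(bag.subList(0, 3), bag.subList(3, 5));
        System.out.println(exec(bag) == algebraic(chunks));
    }
}
```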



Re: when Algebraic UDF are used ?

Posted by Benoit Mathieu <bm...@deezer.com>.
I'm using Pig 0.9.2 from the CDH4 packaging.

++
benoit
