Posted to dev@pig.apache.org by Pradeep Kamath <pr...@yahoo-inc.com> on 2008/12/15 19:52:09 UTC

Algebraic UDFs in Pig

Hi,

  Currently the Algebraic interface lets a UDF writer supply Initial,
Intermediate and Final classes (each of which should extend EvalFunc).
The idea is that the UDF can be called in stages: Initial.exec() in the
map, Intermediate.exec() in the combiner and Final.exec() in the reduce.
A UDF (say COUNT) that implements Algebraic also extends EvalFunc, so it
has an exec() method of its own. Currently Pig calls this top-level
exec() when the UDF is not "combinable". When it is "combinable", Pig
currently calls Initial.exec() in the combiner and Final.exec() in the
reduce. As part of https://issues.apache.org/jira/browse/PIG-563, I will
be changing the "combinable" case to call Initial.exec() in the map,
Intermediate.exec() in the combiner and Final.exec() in the reduce.
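
To make the staged call sequence concrete, here is a minimal, self-contained
sketch in plain Java. It does not use Pig's real classes: `Stage`,
`StagedCountSketch` and the `List`-based "bags" are hypothetical stand-ins
for EvalFunc, a COUNT-like UDF and Pig bags, and the map, combiner and reduce
phases are simulated with ordinary loops.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for EvalFunc: a single exec() per stage.
interface Stage<I, O> {
    O exec(I input);
}

// Hypothetical COUNT-like UDF exposing the three stages named above.
class StagedCountSketch {
    // Map: called once per record, produces a partial count.
    static final Stage<String, Long> INITIAL = record -> 1L;
    // Combiner: merges the partial counts produced by one map task.
    static final Stage<List<Long>, Long> INTERMEDIATE = StagedCountSketch::sum;
    // Reduce: merges the combiner outputs into the final count.
    static final Stage<List<Long>, Long> FINAL = StagedCountSketch::sum;

    static long sum(List<Long> partials) {
        long total = 0;
        for (long p : partials) {
            total += p;
        }
        return total;
    }

    // Simulates the combinable plan; each inner list is one map task's input.
    static long runStaged(List<List<String>> mapTasks) {
        List<Long> combinerOutputs = new ArrayList<>();
        for (List<String> task : mapTasks) {
            List<Long> partials = new ArrayList<>();
            for (String record : task) {
                partials.add(INITIAL.exec(record));              // map
            }
            combinerOutputs.add(INTERMEDIATE.exec(partials));    // combiner
        }
        return FINAL.exec(combinerOutputs);                      // reduce
    }

    public static void main(String[] args) {
        List<List<String>> mapTasks = Arrays.asList(
            Arrays.asList("a", "b", "c"),
            Arrays.asList("d", "e"));
        System.out.println(runStaged(mapTasks)); // prints 5
    }
}
```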

 

There are two options for the non-combinable case:

1)       Keep the behavior described above: the top-level UDF's exec() is
called when the combiner is not used; when the combiner is used,
Initial.exec() is called in the map, Intermediate.exec() in the combiner
and Final.exec() in the reduce.

*         Pros: 

a.       Initial.exec() can be optimized with the knowledge that it is
only ever called in the map. For example, in a UDF like COUNT, the
implementation of Initial.exec() can simply emit the Integer 1.

*         Cons: 

a.       The UDF writer potentially has to write two code paths: one
where UDF.exec() computes the complete result in the reduce, and another
where Initial.exec(), Intermediate.exec() and Final.exec() compute it in
parts in the map, combiner and reduce respectively.
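
As a rough illustration of option 1, the sketch below shows the two code
paths for a COUNT-like UDF in plain Java. Everything here is hypothetical:
`CountOptionOneSketch`, `finalStage` (renamed because `final` is a reserved
word in Java) and the use of `List` in place of a Pig bag are assumptions
made for the example, not Pig's actual API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical COUNT under option 1; Lists stand in for Pig bags.
class CountOptionOneSketch {
    // Path 1 (combiner not used): the top-level exec() runs in the reduce
    // with the whole bag and computes the result in one call.
    static long exec(List<String> wholeBag) {
        return wholeBag.size();
    }

    // Path 2 (combiner used): Initial only ever runs in the map, so it can
    // ignore its input and simply emit 1.
    static long initial(String record) {
        return 1L;
    }

    static long intermediate(List<Long> partials) {
        return sum(partials);
    }

    static long finalStage(List<Long> partials) {
        return sum(partials);
    }

    static long sum(List<Long> partials) {
        long total = 0;
        for (long p : partials) {
            total += p;
        }
        return total;
    }

    // Simulates one map task, one combiner call and the reduce.
    static long runWithCombiner(List<String> records) {
        List<Long> ones = new ArrayList<>();
        for (String record : records) {
            ones.add(initial(record));                         // map
        }
        long combined = intermediate(ones);                    // combiner
        return finalStage(Arrays.asList(combined));            // reduce
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList("a", "b", "c", "d");
        System.out.println(exec(records));            // prints 4
        System.out.println(runWithCombiner(records)); // prints 4
    }
}
```

Both paths have to agree on the result, which is the duplicated effort the
con above refers to.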

 

2)       If a UDF implements Algebraic, Pig guarantees that
Initial.exec() is called first and Final.exec() later. If the UDF is
combinable, these are called from the map and the reduce respectively,
and Intermediate.exec() is called from the combiner. If the UDF is NOT
combinable, Initial.exec() is called first in the reduce; its output is
then put in a bag and passed to a call of Final.exec(). In both cases
the top-level exec() of the UDF is never called.

*         Pros: 

a.       Initial.exec() and Final.exec() are guaranteed to be called in
both the combinable and non-combinable cases.

*         Cons: 

a.       The UDF writer has to provide a dummy implementation of
UDF.exec() to satisfy EvalFunc, even though UDF.exec() is never called.

b.       The UDF writer has to make sure that Initial.exec() and
Final.exec() work in both the combinable and non-combinable cases.

c.       There are performance penalties. In the combinable case,
Initial.exec() cannot be optimized, since there is no guarantee that it
is always called in the map. In the non-combinable case, the call to
Initial.exec() receives all the input, so the result could be computed
in that call itself. Despite this, Pig will have to take the result of
Initial.exec(), put it in a bag and call Final.exec(), which can be
highly inefficient.
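
The non-combinable path of option 2 can be sketched as follows, again in
plain Java with hypothetical names (`CountOptionTwoSketch`, `finalStage`,
`runInReduce`) and a `List` standing in for a Pig bag. It is only meant to
show where the extra bag and the extra call come from.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical COUNT under option 2; Lists stand in for Pig bags.
class CountOptionTwoSketch {
    // Initial may now see the whole bag in the reduce, so it has to count
    // its input instead of emitting a constant 1.
    static long initial(List<String> bag) {
        return bag.size();
    }

    static long finalStage(List<Long> partials) {
        long total = 0;
        for (long p : partials) {
            total += p;
        }
        return total;
    }

    // Non-combinable plan: both calls happen in the reduce.
    static long runInReduce(List<String> wholeBag) {
        // The answer is already complete after this call...
        long alreadyComplete = initial(wholeBag);
        // ...but it still gets wrapped in a one-element bag for Final.
        List<Long> wrapped = Collections.singletonList(alreadyComplete);
        return finalStage(wrapped);
    }

    public static void main(String[] args) {
        System.out.println(runInReduce(Arrays.asList("a", "b", "c"))); // prints 3
    }
}
```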

 

I would vote for option 1 since it is much better from a performance
angle.

 

Please provide Comments/Suggestions on the proposal.

 

Thanks,

Pradeep

Re: Algebraic UDFs in Pig

Posted by Chris Olston <ol...@yahoo-inc.com>.

You could have the abstract class do:

> exec {
>    initial();
>    final();
> }

so that by default the code paths are the same, and let subclasses override
that method if they want to do a specific performance optimization.
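
A minimal Java sketch of this suggestion, under stated assumptions:
`AlgebraicCountBase`, `DefaultCount` and `FastCount` are hypothetical
classes, `finalStage` is used because `final` is a reserved word in Java,
and a `List` stands in for a Pig bag. None of this is Pig's actual API.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical abstract base whose exec() chains the two stages by default.
abstract class AlgebraicCountBase<T> {
    abstract long initial(List<T> bag);

    abstract long finalStage(List<Long> partials);

    // Default: the non-combinable path reuses the staged code.
    long exec(List<T> bag) {
        return finalStage(Collections.singletonList(initial(bag)));
    }
}

// Writes only the staged code and inherits exec().
class DefaultCount extends AlgebraicCountBase<String> {
    @Override
    long initial(List<String> bag) {
        return bag.size();
    }

    @Override
    long finalStage(List<Long> partials) {
        long total = 0;
        for (long p : partials) {
            total += p;
        }
        return total;
    }
}

// Overrides exec() as a specific performance optimization.
class FastCount extends DefaultCount {
    @Override
    long exec(List<String> bag) {
        return bag.size(); // skips the intermediate one-element bag
    }
}

class AlgebraicCountDemo {
    public static void main(String[] args) {
        List<String> bag = Arrays.asList("a", "b", "c");
        System.out.println(new DefaultCount().exec(bag)); // prints 3
        System.out.println(new FastCount().exec(bag));    // prints 3
    }
}
```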

-Chris


On 12/16/08 10:11 AM, "Alan Gates" <ga...@yahoo-inc.com> wrote:

> +1 for 1.  We definitely want to enable the performance
> optimizations.  And the Con listed for one (double code
> implementations) is minimal in the cases where the writer isn't going
> to make performance optimizations because exec() can be done as:
> 
> exec {
>    initial();
>    final();
> }
> 
> This is a very minor burden.
> 
> Alan.

--
Christopher Olston, Ph.D.
Sr. Research Scientist
Yahoo! Research

Re: Algebraic UDFs in Pig

Posted by Alan Gates <ga...@yahoo-inc.com>.

+1 for 1.  We definitely want to enable the performance
optimizations.  And the con listed for option 1 (two code
implementations) is minimal in the cases where the writer isn't going
to make performance optimizations, because exec() can be done as:

exec {
   initial();
   final();
}

This is a very minor burden.

Alan.
