Posted to dev@pig.apache.org by Alan Gates <ga...@yahoo-inc.com> on 2008/07/02 22:43:32 UTC

UDFs and types

With the introduction of types (see 
http://issues.apache.org/jira/browse/PIG-157) we need to decide how 
EvalFunc will interact with the types.  The original proposal was that 
the DEFINE keyword would be modified to allow specification of types for 
the UDF.  This has a couple of problems.  One, DEFINE is already used to 
specify constructor arguments.  Using it to also specify types will be 
confusing.  Two, it has been pointed out that this type information is a 
property of the UDF and should therefore be declared by the UDF, not in 
the script.

Separately, as a way to allow simple function overloading, a change had 
been proposed to the EvalFunc interface to allow an EvalFunc to specify 
that for a given type, a different instance of EvalFunc should be used 
(see https://issues.apache.org/jira/browse/PIG-276).

I would like to propose that we expand the changes in PIG-276 to be more 
general.  Rather than adding classForType() as proposed in PIG-276, 
EvalFunc will instead add a function:

public Map<Schema, FuncSpec> getArgToFuncMapping() {
    return null;
}

Where FuncSpec is a new class that contains the name of the class that 
implements the UDF along with any necessary arguments for the constructor.
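
For illustration, a minimal sketch of what FuncSpec might look like; 
the field and accessor names here are just assumptions, not a settled 
design:

public class FuncSpec {
    // Fully qualified name of the class implementing the UDF.
    private final String className;
    // Constructor arguments to pass when instantiating it, if any.
    private final String[] ctorArgs;

    public FuncSpec(String className, String... ctorArgs) {
        this.className = className;
        this.ctorArgs = ctorArgs;
    }

    public String getClassName() { return className; }
    public String[] getCtorArgs() { return ctorArgs; }
}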

The type checker will then, as part of type checking LOUserFunc, call 
this function.  If it receives a null, it will simply leave the UDF as 
is and assume that the UDF can handle whatever datatype is provided to 
it.  This will cover most existing UDFs, which will not override the 
default implementation.

If a UDF wants to override the default, it should return a map that 
gives a FuncSpec for each type of schema that it can support.  For 
example, for the UDF concat, the map would have two entries:
key: schema(chararray, chararray) value: StringConcat
key: schema(bytearray, bytearray) value: ByteConcat
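
In code, concat's override might look roughly like the sketch below; 
the schema(...) helper is hypothetical shorthand for however Schema 
instances end up being constructed, not an existing call:

public Map<Schema, FuncSpec> getArgToFuncMapping() {
    // Sketch only: schema(...) is a made-up helper, and the class
    // names are the ones from the example above.
    Map<Schema, FuncSpec> funcMap = new HashMap<Schema, FuncSpec>();
    funcMap.put(schema("chararray", "chararray"),
                new FuncSpec("StringConcat"));
    funcMap.put(schema("bytearray", "bytearray"),
                new FuncSpec("ByteConcat"));
    return funcMap;
}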

The type checker will then take the schema of what is being passed to 
the UDF and look it up in the map.  If it finds an entry, it will use 
the associated FuncSpec.  If it does not, it will throw an exception 
saying that the EvalFunc cannot be used with those types.
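
In the type checker this lookup could amount to something like the 
following sketch, assuming the argument schema of the call is in hand 
as argSchema (the variable names and the exception used are 
placeholders, not existing code):

Map<Schema, FuncSpec> mapping = udf.getArgToFuncMapping();
if (mapping != null) {
    FuncSpec spec = mapping.get(argSchema);
    if (spec == null) {
        // No entry for this argument schema: reject the call.
        throw new RuntimeException("EvalFunc " + udf.getClass().getName()
            + " cannot be used with arguments of schema " + argSchema);
    }
    // Otherwise rewire the LOUserFunc to instantiate spec instead.
}
// A null mapping means the UDF takes whatever it is given; leave it alone.

Note that using Schema as a map key this way assumes Schema implements 
equals() and hashCode() consistently.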

At this point, the type checker will make no effort to find a best fit 
function.  Either the fit is perfect, or it will not be done.  In the 
future we would like to modify the type checker to select a best fit.  
For example, if a UDF says it can handle schema(long) and the type 
checker finds it has schema(int), it can insert a cast to deal with 
that.  But in the first pass we will ignore this and depend on the user 
to insert the casts.

Thoughts?

Alan.

RE: UDFs and types

Posted by Santhosh Srinivasan <sm...@yahoo-inc.com>.
Paolo [CC'ed] observed that currently, if the return type of the UDF is
a bag or a tuple, the contents of the bag/tuple are not known at type
checking time. In addition to the input parameter types, the return
type of the UDF should also be a schema. This will make the inputs and
outputs well defined and help the type checker enforce type correctness
and handle promotion.
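
One possible shape for this, purely as a sketch (the method name below
is hypothetical, not a committed interface change):

// Hypothetical addition to EvalFunc: report the schema of the result
// given the schema of the input, so the contents of a returned bag or
// tuple are visible to the type checker.
public Schema outputSchema(Schema input) {
    return null;   // null = unknown; preserves today's behavior
}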

I found a paper that describes algorithms for fast type inclusion
tests (determining whether one type is a subtype of another).

http://www.cs.purdue.edu/homes/jv/pubs/oopsla97.pdf

Santhosh 

Re: UDFs and types

Posted by pi song <pi...@gmail.com>.
You're right. The real problem will be defining rules.

How about this?
0) We do only non-nested types first.
1) All number types can be cast to bigger types:
    int -> long -> float -> double
2) bytearray can be cast to chararray or double (chararray takes
precedence)
3) Matches on the left are more important than on the right. For example:-

Input:-
(int, long)

Candidates:-
(int, float)
(float, long)

will match (int, float)
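
As a sketch of how rules 1) through 3) might be coded up for flat
schemas, with types represented as plain strings purely for
illustration (none of these names exist in Pig):

import java.util.Arrays;
import java.util.List;

public class BestFit {
    private static final List<String> NUMERIC_ORDER =
        Arrays.asList("int", "long", "float", "double");

    // Rules 1) and 2): can 'from' be implicitly cast to 'to'?
    // (The chararray-over-double preference for bytearray would need
    // an extra tie-break, not shown here.)
    static boolean castable(String from, String to) {
        int f = NUMERIC_ORDER.indexOf(from);
        int t = NUMERIC_ORDER.indexOf(to);
        if (f >= 0 && t >= 0) return f < t;   // numeric widening only
        if (from.equals("bytearray"))
            return to.equals("chararray") || to.equals("double");
        return false;
    }

    // Rule 3): score each candidate per position (2 = exact match,
    // 1 = castable) and compare scores left to right, so leftmost
    // positions dominate.
    static List<String> bestFit(List<String> input,
                                List<List<String>> candidates) {
        List<String> best = null;
        int[] bestScore = null;
        for (List<String> cand : candidates) {
            if (cand.size() != input.size()) continue;
            int[] score = new int[input.size()];
            boolean fits = true;
            for (int i = 0; i < input.size() && fits; i++) {
                if (input.get(i).equals(cand.get(i))) score[i] = 2;
                else if (castable(input.get(i), cand.get(i))) score[i] = 1;
                else fits = false;
            }
            if (fits && (best == null || leftGreater(score, bestScore))) {
                best = cand;
                bestScore = score;
            }
        }
        return best;
    }

    static boolean leftGreater(int[] a, int[] b) {
        for (int i = 0; i < a.length; i++) {
            if (a[i] != b[i]) return a[i] > b[i];
        }
        return false;
    }
}

With input (int, long) and candidates (int, float) and (float, long),
this picks (int, float), as in the example above.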

Re: UDFs and types

Posted by Benjamin Reed <br...@yahoo-inc.com>.
You rock Pi!

It might be good to agree on best-fit rules. There are obvious ones: int
-> long, float -> double, but what about long -> int, long -> float, and
string -> float?

There are also recursive fits, which might be purely theoretical:
tuples of the form (long, {float}) fitting to (double, {long}) or (int,
{long}). (That example might be invalid depending on the first answer,
but hopefully you get the idea.)

ben

Re: UDFs and types

Posted by pi song <pi...@gmail.com>.
+1 Agree.

I will try to make "best fit" happen in 24 hours after you commit the new
UDF design.


RE: UDFs and types

Posted by Olga Natkovich <ol...@yahoo-inc.com>.
Sounds good to me.

Olga 

Re: UDFs and types

Posted by Tanton Gibbs <ta...@gmail.com>.
What about using annotations for this?

Could we create an annotation, say @UDF, that allowed us to specify an
input schema?

I imagine you could put quite a bit of information into the annotation,
such as the function name, input args, return type, etc.
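
As a sketch of what such an annotation might look like (everything
below is hypothetical; nothing like this exists in Pig today):

import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Hypothetical @UDF annotation; the field names are illustrative only.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
public @interface UDF {
    String name();            // function name as used in scripts
    String[] inputSchema();   // e.g. {"chararray", "chararray"}
    String returnType();      // e.g. "chararray"
}

A UDF class would then be tagged with something like @UDF(name =
"CONCAT", inputSchema = {"chararray", "chararray"}, returnType =
"chararray"), and the type checker could read that via reflection
instead of calling a method on an instance.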
