Posted to dev@pig.apache.org by Alan Gates <ga...@yahoo-inc.com> on 2008/07/02 22:43:32 UTC
UDFs and types
With the introduction of types (see
http://issues.apache.org/jira/browse/PIG-157) we need to decide how
EvalFunc will interact with the types. The original proposal was that
the DEFINE keyword would be modified to allow specification of types for
the UDF. This has a couple of problems. One, DEFINE is already used to
specify constructor arguments. Using it to also specify types will be
confusing. Two, it has been pointed out that this type information is a
property of the UDF and should therefore be declared by the UDF, not in
the script.
Separately, as a way to allow simple function overloading, a change had
been proposed to the EvalFunc interface to allow an EvalFunc to specify
that for a given type, a different instance of EvalFunc should be used
(see https://issues.apache.org/jira/browse/PIG-276).
I would like to propose that we expand the changes in PIG-276 to be more
general. Rather than adding classForType() as proposed in PIG-276,
EvalFunc will instead add a function:
public Map<Schema, FuncSpec> getArgToFuncMapping() {
    return null;
}
Where FuncSpec is a new class that contains the name of the class that
implements the UDF along with any necessary arguments for the constructor.
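FuncSpec did not exist yet when this was written; a minimal sketch of the shape being proposed (field and method names here are illustrative, not Pig's eventual API) might look like:

```java
// Minimal sketch of the proposed FuncSpec: the name of the class that
// implements the UDF, plus any arguments its constructor needs.
class FuncSpec {
    private final String className;
    private final String[] ctorArgs;

    FuncSpec(String className, String... ctorArgs) {
        this.className = className;
        this.ctorArgs = ctorArgs;
    }

    String getClassName() { return className; }
    String[] getCtorArgs() { return ctorArgs; }
}
```

A UDF with no constructor arguments would simply pass the class name alone.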
The type checker will then, as part of type checking LOUserFunc, make a
call to this function. If it receives a null, it will simply leave the
UDF as is, and make the assumption that the UDF can handle whatever
datatype is being provided to it. This will cover most existing UDFs,
which will not override the default implementation.
If a UDF wants to override the default, it should return a map that
gives a FuncSpec for each type of schema that it can support. For
example, for the UDF concat, the map would have two entries:
key: schema(chararray, chararray) value: StringConcat
key: schema(bytearray, bytearray) value: ByteConcat
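As a toy illustration of the proposed lookup, with List&lt;String&gt; standing in for Pig's Schema class and plain strings standing in for FuncSpec (neither final API existed yet):

```java
import java.util.*;

// Toy version of the concat example: a map from argument schemas to the
// implementing function, consulted by the type checker.
class ArgToFuncLookup {
    static final Map<List<String>, String> MAPPING = new HashMap<>();
    static {
        MAPPING.put(Arrays.asList("chararray", "chararray"), "StringConcat");
        MAPPING.put(Arrays.asList("bytearray", "bytearray"), "ByteConcat");
    }

    // Exact match only, as in the first pass: no best-fit search, no casts.
    static String resolve(List<String> argSchema) {
        String spec = MAPPING.get(argSchema);
        if (spec == null) {
            throw new IllegalArgumentException(
                "concat cannot be used with argument types " + argSchema);
        }
        return spec;
    }
}
```

Looking up (chararray, chararray) yields StringConcat; anything not in the map throws, which mirrors the behavior described next.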
The type checker will then take the schema of what is being passed to it
and perform a lookup in the map. If it finds an entry, it will use the
associated FuncSpec. If it does not, it will throw an exception saying
that the EvalFunc cannot be used with those types.
At this point, the type checker will make no effort to find a best fit
function. Either the fit is perfect, or it will not be done. In the
future we would like to modify the type checker to select a best fit.
For example, if a UDF says it can handle schema(long) and the type
checker finds it has schema(int), it can insert a cast to deal with
that. But in the first pass we will ignore this and depend on the user
to insert the casts.
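The future widening idea can be sketched as follows; the int -> long -> float -> double chain matches Pig's numeric types, but the code is only an illustration, not the type checker's actual implementation:

```java
import java.util.*;

// Sketch of the future best-fit idea: if the UDF declares schema(long) and
// the checker finds schema(int), a widening cast could bridge the gap.
class WideningCheck {
    static final List<String> CHAIN =
        Arrays.asList("int", "long", "float", "double");

    // True if a value of type 'from' can be widened to type 'to',
    // i.e. 'to' sits at or beyond 'from' in the chain.
    static boolean canWiden(String from, String to) {
        int i = CHAIN.indexOf(from), j = CHAIN.indexOf(to);
        return i >= 0 && j >= 0 && i <= j;
    }
}
```

Under this rule int widens to long but not the reverse, so the checker could insert the cast automatically rather than failing the lookup.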
Thoughts?
Alan.
RE: UDFs and types
Posted by Santhosh Srinivasan <sm...@yahoo-inc.com>.
Paolo [CC'ed] observed that currently, if the return type of the UDF is
a bag or a tuple, the contents of the bag/tuple are not known at type
checking time. In addition to the input parameter types, the return type
of the UDF should also be a schema. This will make the inputs and
outputs well defined and help the type checker enforce type checking and
promotion.
I found a paper that describes algorithms to do fast type-inclusion
tests (i.e., whether one type is a subtype of another).
http://www.cs.purdue.edu/homes/jv/pubs/oopsla97.pdf
Santhosh
Re: UDFs and types
Posted by pi song <pi...@gmail.com>.
You're right. The real problem will be defining rules.
How about this:
0) We do only non-nested types first.
1) All numeric types can be cast to bigger types:
int -> long -> float -> double
2) bytearray can be cast to chararray or double (chararray takes
precedence)
3) Matches on the left are more important than matches on the right. For
example:
Input:
(int, long)
Candidates:
(int, float)
(float, long)
will match (int, float)
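A sketch of how these three rules might combine when ranking candidates (plain strings stand in for Pig's types; this illustrates the proposal, not committed code):

```java
import java.util.*;

// Rule 1: numeric widening int -> long -> float -> double.
// Rule 2: bytearray may be cast to chararray or double.
// Rule 3: when several candidates fit, exact matches further left win.
class BestFit {
    static final List<String> NUM = Arrays.asList("int", "long", "float", "double");

    // Can an argument of type 'from' be accepted where 'to' is declared?
    static boolean fits(String from, String to) {
        if (from.equals(to)) return true;
        int i = NUM.indexOf(from), j = NUM.indexOf(to);
        if (i >= 0 && j >= 0) return i <= j;                    // rule 1
        return from.equals("bytearray")
            && (to.equals("chararray") || to.equals("double")); // rule 2
    }

    // Return the best-fitting candidate schema, or null if none fits.
    static List<String> choose(List<String> input, List<List<String>> candidates) {
        List<String> best = null;
        long bestScore = -1;
        for (List<String> cand : candidates) {
            if (cand.size() != input.size()) continue;
            boolean ok = true;
            long score = 0;                                // rule 3: weigh
            for (int k = 0; k < input.size() && ok; k++) { // leftmost exact
                ok = fits(input.get(k), cand.get(k));      // matches heaviest
                if (input.get(k).equals(cand.get(k))) score |= 1L << (62 - k);
            }
            if (ok && score > bestScore) { best = cand; bestScore = score; }
        }
        return best;
    }
}
```

With input (int, long) and candidates (int, float) and (float, long), both fit, but (int, float) wins because its exact match is further left.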
Re: UDFs and types
Posted by Benjamin Reed <br...@yahoo-inc.com>.
You rock Pi!
It might be good to agree on best-fit rules. There are obvious ones: int
-> long, float -> double, but what about long -> int, long -> float, and
string -> float?
There are also recursive fits, which might be purely theoretical:
tuples of the form (long, {float}) fit to (double, {long}) or (int,
{long}). (That example might be invalid depending on the first answer,
but hopefully you get the idea.)
ben
Re: UDFs and types
Posted by pi song <pi...@gmail.com>.
+1 Agree.
I will try to make "best fit" happen within 24 hours after you commit
the new UDF design.
RE: UDFs and types
Posted by Olga Natkovich <ol...@yahoo-inc.com>.
Sounds good to me.
Olga
Re: UDFs and types
Posted by Tanton Gibbs <ta...@gmail.com>.
What about using annotations for this?
Could we create an annotation, say @UDF, that allows us to specify an
input schema?
I imagine you could put quite a bit of information into the annotation,
such as the function name, input args, return type, etc.
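A hypothetical @UDF annotation along these lines (none of these element names exist in Pig; they only illustrate the idea) could look like:

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Hypothetical annotation carrying the type information the UDF author
// would otherwise return from getArgToFuncMapping().
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface UDF {
    String name();
    String[] inputArgs();
    String returnType();
}

// A UDF author would declare types once, on the class itself:
@UDF(name = "StringConcat",
     inputArgs = {"chararray", "chararray"},
     returnType = "chararray")
class StringConcat { /* exec(...) omitted */ }
```

The type checker could then read the declaration reflectively, e.g. via StringConcat.class.getAnnotation(UDF.class), instead of instantiating the UDF to call a method on it.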