Posted to user@hive.apache.org by Mark Tozzi <ma...@gmail.com> on 2010/07/19 22:26:38 UTC
UDF Return type best practice question
Hi All,
I've been working with UDFs in Hive a lot lately, usually to implement
some manner of small lookup which isn't worth the overhead of a join,
or which for some other reason is preferable as a function rather than
a join. This gets me into situations where I end up wanting one UDF to
have multiple return types. For example, a geo IP look-up would return
an integer for an area code look-up or a string for a country name
look-up. It seems the two ways to handle this are either to write a
different UDF for each return type (or potentially for each look-up),
or to always return a String and use the Hive built-in cast function
"cast(expr as <type>)" on the return value. So far I've been favoring
the second, as the first seems to lead to a proliferation of nearly
identical classes, but I'm wondering if someone with more experience
might have a suggestion as to why one might be better than the other,
or indeed if there is a third solution that I have overlooked.
Thanks,
--Mark Tozzi
Re: UDF Return type best practice question
Posted by Edward Capriolo <ed...@gmail.com>.
The GenericUDF interface allows you to define the return type based on
the parameters passed to it. Thus it is more flexible than a UDF.
Check out the CASE generic UDF to see how this is done.
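
To illustrate the idea only, here is a dependency-free Java sketch of the two-phase pattern: the return type is decided once from the arguments, and the per-row work then produces values of that type. It does not use Hive's actual classes. GeoRecord, ResolvedLookup, GeoLookup, the field names "area_code" and "country", and the sample data are all hypothetical, and a real GenericUDF would report its return type to Hive rather than expose a java.lang.Class.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical stand-in for one geo-IP record; the fields are illustrative only.
final class GeoRecord {
    final int areaCode;
    final String country;

    GeoRecord(int areaCode, String country) {
        this.areaCode = areaCode;
        this.country = country;
    }
}

// Result of the one-time resolution step: a declared return type plus
// the extractor that produces values of that type.
final class ResolvedLookup {
    final Class<?> returnType;
    private final Function<GeoRecord, Object> extractor;

    ResolvedLookup(Class<?> returnType, Function<GeoRecord, Object> extractor) {
        this.returnType = returnType;
        this.extractor = extractor;
    }

    Object extract(GeoRecord record) {
        return extractor.apply(record);
    }
}

final class GeoLookup {
    // Made-up sample data; 192.0.2.1 is a documentation-only address.
    private static final Map<String, GeoRecord> TABLE = new HashMap<>();
    static {
        TABLE.put("192.0.2.1", new GeoRecord(212, "United States"));
    }

    // Plays the role of the one-time setup step: inspect the argument
    // and fix the return type before any rows are processed.
    static ResolvedLookup resolve(String field) {
        switch (field) {
            case "area_code":
                return new ResolvedLookup(Integer.class, r -> r.areaCode);
            case "country":
                return new ResolvedLookup(String.class, r -> r.country);
            default:
                throw new IllegalArgumentException("unknown field: " + field);
        }
    }

    // Plays the role of the per-row step: returns null when the IP is unknown.
    static Object evaluate(ResolvedLookup lookup, String ip) {
        GeoRecord record = TABLE.get(ip);
        return record == null ? null : lookup.extract(record);
    }

    public static void main(String[] args) {
        ResolvedLookup areaCode = resolve("area_code");
        ResolvedLookup country = resolve("country");
        System.out.println(evaluate(areaCode, "192.0.2.1")); // prints 212
        System.out.println(evaluate(country, "192.0.2.1"));  // prints United States
    }
}
```

One class serves both look-ups, and an unsupported field is rejected in resolve() before any row is evaluated, which is the benefit over one near-identical class per return type.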
RE: UDF Return type best practice question
Posted by Ashish Thusoo <at...@facebook.com>.
You could do this through a UDTF.
http://wiki.apache.org/hadoop/Hive/DeveloperGuide/UDTF
Ashish
Re: UDF Return type best practice question
Posted by John Sichi <js...@facebook.com>.
You can use the GenericUDF option (rather than the reflective option, i.e. the plain UDF class whose evaluate() methods are resolved by reflection) to avoid duplicating code. For an example, see GenericUDFIf, which implements the if(cond,expr1,expr2) expression.
http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFIf.java
JVS