Posted to user@hive.apache.org by Tim Robertson <ti...@gmail.com> on 2010/04/27 16:22:02 UTC

Java UDF

Hi,

I currently run a MapReduce job to rewrite a tab-delimited file, and then I
use Hive for everything after that stage.

Am I correct in thinking that I can create a Jar with my own method which
can then be called in SQL?

Would the syntax be:

  hive> ADD JAR /tmp/parse.jar;
  hive> INSERT OVERWRITE TABLE target
        SELECT s.id, s.canonical, parsedName
        FROM source s MAP s.canonical USING 'parse' AS parsedName;

and 'parse' would be a MapReduce job?  If so, what are the input and output
formats for 'parse'?  Or is it perhaps a class implementing an interface,
with Hive taking care of the rest?

Thanks for any pointers,
Tim

Re: Java UDF

Posted by Tim Robertson <ti...@gmail.com>.
Thanks Edward,

I get where you are coming from now with that explanation.

Cheers,
Tim

Re: Java UDF

Posted by Edward Capriolo <ed...@gmail.com>.
Tim,

I think you are on the right track with the UDF approach.

You could accomplish something similar with a SerDe, except that from the
client perspective it would be more "transparent".

A UDF is a bit more reusable than a SerDe. You can only choose a SerDe once,
when the table is created, but a UDF is applied to the result set.

Edward

Re: Java UDF

Posted by Tim Robertson <ti...@gmail.com>.
Hmmm... I am not trying to serialize or deserialize custom content, but
simply to take an input String (Text), run some Java, and return a new Text
by calling a function.

Looking at public class UDFYear extends UDF, the annotation at the top
suggests that extending UDF and adding the annotation might be enough.

I'll try it anyways...
Tim
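
For reference, a minimal sketch of that pattern, modeled on the built-in
UDFYear. The package, class name, and placeholder parse logic below are
illustrative, not code from this thread (note that a literal my.package
would not compile, since "package" is a reserved word in Java):

  package com.example.hive.udf;  // illustrative package name

  import org.apache.hadoop.hive.ql.exec.Description;
  import org.apache.hadoop.hive.ql.exec.UDF;
  import org.apache.hadoop.io.Text;

  // Same shape as UDFYear: a @Description annotation on top, and a class
  // extending UDF with an evaluate() method that Hive resolves by signature.
  @Description(name = "canonical",
      value = "_FUNC_(str) - returns the canonical form of str")
  public class Canonical extends UDF {
    private final Text result = new Text();  // reused across rows, like the built-in UDFs

    public Text evaluate(Text source) {
      if (source == null) {
        return null;  // a NULL column arrives as a Java null
      }
      // Placeholder "canonicalize" step; real name parsing would go here.
      result.set(source.toString().trim().toLowerCase());
      return result;
    }
  }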


Re: Java UDF

Posted by Adam O'Donnell <ad...@immunet.com>.
It sounds like what you want is a custom SerDe.  I have tried to write
one but ran into some difficulty.

-- 
Adam J. O'Donnell, Ph.D.
Immunet Corporation
Cell: +1 (267) 251-0070

Re: Java UDF

Posted by Tim Robertson <ti...@gmail.com>.
Thanks Edward,

You are indeed correct - I am confused!

So I checked out the source and poked around.  If I were to extend UDF and
implement public Text evaluate(Text source), would I be heading along the
right lines to do what you describe above?

Thanks,
Tim

Re: Java UDF

Posted by Edward Capriolo <ed...@gmail.com>.
Tim,

A UDF is an SQL function, like toString() or max().
An InputFormat teaches Hive to read data from key/value files.
A SerDe tells Hive how to parse input data into columns.
Finally, the map(), reduce(), and transform() keywords you described are a
way to pipe data to an external process and read the results back in,
almost like a UDF that is not native to Hive.

So you have munged up four concepts together :) Do not feel bad, however; I
struggled through an InputFormat for the last month.

It sounds most like you want a UDF that takes a string and returns a
canonical representation.


  hive> ADD JAR /tmp/parse.jar;
  hive> CREATE TEMPORARY FUNCTION canonical AS 'my.package.canonical';
  hive> SELECT canonical(my_column) FROM source;

Regards,
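
Since evaluate() is an ordinary Java method, a UDF of this shape can be
smoke-tested without a Hive installation before the jar is built and added.
The harness below assumes the illustrative Canonical class sketched earlier
in the thread, not anything posted here:

  import org.apache.hadoop.io.Text;

  import com.example.hive.udf.Canonical;  // the illustrative UDF sketch from earlier

  public class CanonicalSmokeTest {
    public static void main(String[] args) {
      Canonical udf = new Canonical();
      // Plain method calls; no Hive runtime is involved.
      System.out.println(udf.evaluate(new Text("  Puma Concolor  ")));  // -> puma concolor
      System.out.println(udf.evaluate(null));  // -> null, mirroring SQL NULL handling
    }
  }

With that class name, the registration would be CREATE TEMPORARY FUNCTION
canonical AS 'com.example.hive.udf.Canonical' rather than the placeholder
'my.package.canonical'.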