You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@hive.apache.org by Raihan Jamal <ja...@gmail.com> on 2012/08/07 03:39:48 UTC

Custom UserDefinedFunction in Hive

*Problem*

I created the below UserDefinedFunction to get the yesterday's day in the
format I wanted as I will be passing the format into this below method from
the query.



*public final class YesterdayDate extends UDF {*

* *

*                public String evaluate(final String format) { *

*                                DateFormat dateFormat = new
SimpleDateFormat(format); *

*                                Calendar cal = Calendar.getInstance();*

*                                cal.add(Calendar.DATE, -1);     *

*                                return
dateFormat.format(cal.getTime()).toString(); *

*                } *

*}*





So whenever I try to run the query like below by adding the jar to
classpath and creating the temporary function yesterdaydate, I always get
zero result back-



hive> create temporary function *yesterdaydate* as
'com.example.hive.udf.YesterdayDate';

OK

Time taken: 0.512 seconds



Below is the query I am running-



*hive> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd') LIMIT 10;*

*OK*

* *

And I always get zero result back but the data is there in that table for
Aug 5th.**



What wrong I am doing? Any suggestions will be appreciated.





NOTE:- As I am working with Hive 0.6 so it doesn’t support variable
substitution thing, so I cannot use hiveconf here and the above table has
been partitioned on dt(date) column.**

Re: Custom UserDefinedFunction in Hive

Posted by "kulkarni.swarnim@gmail.com" <ku...@gmail.com>.
Have you tried using EXPLAIN[1] on your query? I usually like to use that
to get a better understanding of what my query is actually doing and
debugging at other times.

[1] https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Explain

On Tue, Aug 7, 2012 at 12:20 PM, Raihan Jamal <ja...@gmail.com> wrote:

> Hi Jan,
>
>
> I figured that out, it is working fine for me now. The only question I
> have is, if I am doing like this-
>
>
>
> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd') LIMIT 10;
>
>
>
> Then the above query will be evaluated as below right?
>
>
>
> SELECT * FROM REALTIME where dt= ‘20120806’ LIMIT 10;
>
>
>
> So that means it will look for data in the corresponding dt partition *(20120806)
> *only right as above table is partitioned on dt column ? And it will not
> scan the whole table right?**
>
>
>
> *Raihan Jamal*
>
>
>
> On Mon, Aug 6, 2012 at 10:56 PM, Jan Dolinár <do...@gmail.com> wrote:
>
>> Hi Jamal,
>>
>> Check if the function really returns what it should and that your data
>> are really in yyyyMMdd format. You can do this by simple query like this:
>>
>> SELECT dt, yesterdaydate('yyyyMMdd') FROM REALTIME LIMIT 1;
>>
>> I don't see anything wrong with the function itself, it works well for me
>> (although I tested it in hive 0.7.1). The only thing I would change about
>> it would be to optimize it by calling 'new' only at the time of
>> construction and reusing the object when the function is called, but that
>> should not affect the functionality at all.
>>
>> Best regards,
>> Jan
>>
>>
>>
>>
>> On Tue, Aug 7, 2012 at 3:39 AM, Raihan Jamal <ja...@gmail.com>wrote:
>>
>>> *Problem*
>>>
>>> I created the below UserDefinedFunction to get the yesterday's day in
>>> the format I wanted as I will be passing the format into this below method
>>> from the query.
>>>
>>>
>>>
>>> *public final class YesterdayDate extends UDF {*
>>>
>>> * *
>>>
>>> *                public String evaluate(final String format) { *
>>>
>>> *                                DateFormat dateFormat = new
>>> SimpleDateFormat(format); *
>>>
>>> *                                Calendar cal = Calendar.getInstance();*
>>>
>>> *                                cal.add(Calendar.DATE, -1);     *
>>>
>>> *                                return
>>> dateFormat.format(cal.getTime()).toString(); *
>>>
>>> *                } *
>>>
>>> *}*
>>>
>>>
>>>
>>>
>>>
>>> So whenever I try to run the query like below by adding the jar to
>>> classpath and creating the temporary function yesterdaydate, I always get
>>> zero result back-
>>>
>>>
>>>
>>> hive> create temporary function *yesterdaydate* as
>>> 'com.example.hive.udf.YesterdayDate';
>>>
>>> OK
>>>
>>> Time taken: 0.512 seconds
>>>
>>>
>>>
>>> Below is the query I am running-
>>>
>>>
>>>
>>> *hive> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd') LIMIT
>>> 10;*
>>>
>>> *OK*
>>>
>>> * *
>>>
>>> And I always get zero result back but the data is there in that table
>>> for Aug 5th.**
>>>
>>>
>>>
>>> What wrong I am doing? Any suggestions will be appreciated.
>>>
>>>
>>>
>>>
>>>
>>> NOTE:- As I am working with Hive 0.6 so it doesn’t support variable
>>> substitution thing, so I cannot use hiveconf here and the above table has
>>> been partitioned on dt(date) column.**
>>>
>>
>>
>


-- 
Swarnim

Re: Custom UserDefinedFunction in Hive

Posted by Raihan Jamal <ja...@gmail.com>.
@kulkarni,



When I did explain on my query, I got these things, I am not sure how to
understand these thing. Any help will be appreciated whether my approach is
right or not?-



hive> EXPLAIN SELECT * FROM PDS_ATTRIBUTE_DATA_REALTIME where
dt=yesterdaydate('yyyyMMdd', 2) LIMIT 5;

OK

ABSTRACT SYNTAX TREE:

  (TOK_QUERY (TOK_FROM (TOK_TABREF PDS_ATTRIBUTE_DATA_REALTIME))
(TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT
(TOK_SELEXPR TOK_ALLCOLREF)) (TOK_WHERE (= (TOK_TABLE_OR_COL dt)
(TOK_FUNCTION yesterdaydate 'yyyyMMdd' 2))) (TOK_LIMIT 5)))



STAGE DEPENDENCIES:

  Stage-0 is a root stage



STAGE PLANS:

  Stage: Stage-0

    Fetch Operator

      limit: 5





Time taken: 12.126 seconds




*Raihan Jamal*



On Tue, Aug 7, 2012 at 10:56 AM, Jan Dolinár <do...@gmail.com> wrote:

> Oops, sorry I made a copy&paste mistake :) The annotation should read
> @*UDFType(deterministic=true*)
>
> Jan
>
> On Tue, Aug 7, 2012 at 7:37 PM, Jan Dolinár <do...@gmail.com> wrote:
>
>> I'm afraid that  he query
>>
>> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd') LIMIT 10;
>>
>> will scan entire table, because the functions is evaluated at runtime, so
>> Hive doesn't know what the value is when it decides which files to scan. I
>> am not 100% sure though, you should try it.
>>
>> Also, you might want to try to add annotation to your UDF saying that the
>> function is deterministic:
>> @*UDFType(deterministic=false*)
>>
>> I think Hive might be clever enough to evaluate it early enough to use
>> the partition pruning correctly, since it operates on constant expression.
>> But again, I'm not really sure, maybe someone with deeper knowledge of Hive
>> optimizations will tell us more. It is actually quite interesting question.
>>
>> Another way to help Hive with the optimizations might be to skip passing
>> the format string argument, if you have all dates in same format, you can
>> call the function just like 'yesterdaydate()' and hardcode the format in
>> the function.
>>
>> Jan
>>
>>
>> On Tue, Aug 7, 2012 at 7:20 PM, Raihan Jamal <ja...@gmail.com>wrote:
>>
>>> Hi Jan,
>>>
>>>
>>> I figured that out, it is working fine for me now. The only question I
>>> have is, if I am doing like this-
>>>
>>>
>>>
>>> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd') LIMIT 10;
>>>
>>>
>>>
>>> Then the above query will be evaluated as below right?
>>>
>>>
>>>
>>> SELECT * FROM REALTIME where dt= ‘20120806’ LIMIT 10;
>>>
>>>
>>>
>>> So that means it will look for data in the corresponding dt partition *(20120806)
>>> *only right as above table is partitioned on dt column ? And it will
>>> not scan the whole table right?**
>>>
>>>
>>>
>>> *Raihan Jamal*
>>>
>>>
>>>
>>> On Mon, Aug 6, 2012 at 10:56 PM, Jan Dolinár <do...@gmail.com>wrote:
>>>
>>>> Hi Jamal,
>>>>
>>>> Check if the function really returns what it should and that your data
>>>> are really in yyyyMMdd format. You can do this by simple query like this:
>>>>
>>>> SELECT dt, yesterdaydate('yyyyMMdd') FROM REALTIME LIMIT 1;
>>>>
>>>> I don't see anything wrong with the function itself, it works well for
>>>> me (although I tested it in hive 0.7.1). The only thing I would change
>>>> about it would be to optimize it by calling 'new' only at the time of
>>>> construction and reusing the object when the function is called, but that
>>>> should not affect the functionality at all.
>>>>
>>>> Best regards,
>>>> Jan
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, Aug 7, 2012 at 3:39 AM, Raihan Jamal <ja...@gmail.com>wrote:
>>>>
>>>>> *Problem*
>>>>>
>>>>> I created the below UserDefinedFunction to get the yesterday's day in
>>>>> the format I wanted as I will be passing the format into this below method
>>>>> from the query.
>>>>>
>>>>>
>>>>>
>>>>> *public final class YesterdayDate extends UDF {*
>>>>>
>>>>> * *
>>>>>
>>>>> *                public String evaluate(final String format) { *
>>>>>
>>>>> *                                DateFormat dateFormat = new
>>>>> SimpleDateFormat(format); *
>>>>>
>>>>> *                                Calendar cal =
>>>>> Calendar.getInstance();*
>>>>>
>>>>> *                                cal.add(Calendar.DATE, -1);     *
>>>>>
>>>>> *                                return
>>>>> dateFormat.format(cal.getTime()).toString(); *
>>>>>
>>>>> *                } *
>>>>>
>>>>> *}*
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> So whenever I try to run the query like below by adding the jar to
>>>>> classpath and creating the temporary function yesterdaydate, I always get
>>>>> zero result back-
>>>>>
>>>>>
>>>>>
>>>>> hive> create temporary function *yesterdaydate* as
>>>>> 'com.example.hive.udf.YesterdayDate';
>>>>>
>>>>> OK
>>>>>
>>>>> Time taken: 0.512 seconds
>>>>>
>>>>>
>>>>>
>>>>> Below is the query I am running-
>>>>>
>>>>>
>>>>>
>>>>> *hive> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd')
>>>>> LIMIT 10;*
>>>>>
>>>>> *OK*
>>>>>
>>>>> * *
>>>>>
>>>>> And I always get zero result back but the data is there in that table
>>>>> for Aug 5th.**
>>>>>
>>>>>
>>>>>
>>>>> What wrong I am doing? Any suggestions will be appreciated.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> NOTE:- As I am working with Hive 0.6 so it doesn’t support variable
>>>>> substitution thing, so I cannot use hiveconf here and the above table has
>>>>> been partitioned on dt(date) column.**
>>>>>
>>>>
>>>>
>>>
>>
>

Re: Custom UserDefinedFunction in Hive

Posted by Raihan Jamal <ja...@gmail.com>.
@kulkarni,



When I did explain on my query, I got these things, I am not sure how to
understand these thing. Any help will be appreciated whether my approach is
right or not?-



hive> EXPLAIN SELECT * FROM PDS_ATTRIBUTE_DATA_REALTIME where
dt=yesterdaydate('yyyyMMdd', 2) LIMIT 5;

OK

ABSTRACT SYNTAX TREE:

  (TOK_QUERY (TOK_FROM (TOK_TABREF PDS_ATTRIBUTE_DATA_REALTIME))
(TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT
(TOK_SELEXPR TOK_ALLCOLREF)) (TOK_WHERE (= (TOK_TABLE_OR_COL dt)
(TOK_FUNCTION yesterdaydate 'yyyyMMdd' 2))) (TOK_LIMIT 5)))



STAGE DEPENDENCIES:

  Stage-0 is a root stage



STAGE PLANS:

  Stage: Stage-0

    Fetch Operator

      limit: 5





Time taken: 12.126 seconds




*Raihan Jamal*



On Tue, Aug 7, 2012 at 10:56 AM, Jan Dolinár <do...@gmail.com> wrote:

> Oops, sorry I made a copy&paste mistake :) The annotation should read
> @*UDFType(deterministic=true*)
>
> Jan
>
> On Tue, Aug 7, 2012 at 7:37 PM, Jan Dolinár <do...@gmail.com> wrote:
>
>> I'm afraid that  he query
>>
>> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd') LIMIT 10;
>>
>> will scan entire table, because the functions is evaluated at runtime, so
>> Hive doesn't know what the value is when it decides which files to scan. I
>> am not 100% sure though, you should try it.
>>
>> Also, you might want to try to add annotation to your UDF saying that the
>> function is deterministic:
>> @*UDFType(deterministic=false*)
>>
>> I think Hive might be clever enough to evaluate it early enough to use
>> the partition pruning correctly, since it operates on constant expression.
>> But again, I'm not really sure, maybe someone with deeper knowledge of Hive
>> optimizations will tell us more. It is actually quite interesting question.
>>
>> Another way to help Hive with the optimizations might be to skip passing
>> the format string argument, if you have all dates in same format, you can
>> call the function just like 'yesterdaydate()' and hardcode the format in
>> the function.
>>
>> Jan
>>
>>
>> On Tue, Aug 7, 2012 at 7:20 PM, Raihan Jamal <ja...@gmail.com>wrote:
>>
>>> Hi Jan,
>>>
>>>
>>> I figured that out, it is working fine for me now. The only question I
>>> have is, if I am doing like this-
>>>
>>>
>>>
>>> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd') LIMIT 10;
>>>
>>>
>>>
>>> Then the above query will be evaluated as below right?
>>>
>>>
>>>
>>> SELECT * FROM REALTIME where dt= ‘20120806’ LIMIT 10;
>>>
>>>
>>>
>>> So that means it will look for data in the corresponding dt partition *(20120806)
>>> *only right as above table is partitioned on dt column ? And it will
>>> not scan the whole table right?**
>>>
>>>
>>>
>>> *Raihan Jamal*
>>>
>>>
>>>
>>> On Mon, Aug 6, 2012 at 10:56 PM, Jan Dolinár <do...@gmail.com>wrote:
>>>
>>>> Hi Jamal,
>>>>
>>>> Check if the function really returns what it should and that your data
>>>> are really in yyyyMMdd format. You can do this by simple query like this:
>>>>
>>>> SELECT dt, yesterdaydate('yyyyMMdd') FROM REALTIME LIMIT 1;
>>>>
>>>> I don't see anything wrong with the function itself, it works well for
>>>> me (although I tested it in hive 0.7.1). The only thing I would change
>>>> about it would be to optimize it by calling 'new' only at the time of
>>>> construction and reusing the object when the function is called, but that
>>>> should not affect the functionality at all.
>>>>
>>>> Best regards,
>>>> Jan
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, Aug 7, 2012 at 3:39 AM, Raihan Jamal <ja...@gmail.com>wrote:
>>>>
>>>>> *Problem*
>>>>>
>>>>> I created the below UserDefinedFunction to get the yesterday's day in
>>>>> the format I wanted as I will be passing the format into this below method
>>>>> from the query.
>>>>>
>>>>>
>>>>>
>>>>> *public final class YesterdayDate extends UDF {*
>>>>>
>>>>> * *
>>>>>
>>>>> *                public String evaluate(final String format) { *
>>>>>
>>>>> *                                DateFormat dateFormat = new
>>>>> SimpleDateFormat(format); *
>>>>>
>>>>> *                                Calendar cal =
>>>>> Calendar.getInstance();*
>>>>>
>>>>> *                                cal.add(Calendar.DATE, -1);     *
>>>>>
>>>>> *                                return
>>>>> dateFormat.format(cal.getTime()).toString(); *
>>>>>
>>>>> *                } *
>>>>>
>>>>> *}*
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> So whenever I try to run the query like below by adding the jar to
>>>>> classpath and creating the temporary function yesterdaydate, I always get
>>>>> zero result back-
>>>>>
>>>>>
>>>>>
>>>>> hive> create temporary function *yesterdaydate* as
>>>>> 'com.example.hive.udf.YesterdayDate';
>>>>>
>>>>> OK
>>>>>
>>>>> Time taken: 0.512 seconds
>>>>>
>>>>>
>>>>>
>>>>> Below is the query I am running-
>>>>>
>>>>>
>>>>>
>>>>> *hive> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd')
>>>>> LIMIT 10;*
>>>>>
>>>>> *OK*
>>>>>
>>>>> * *
>>>>>
>>>>> And I always get zero result back but the data is there in that table
>>>>> for Aug 5th.**
>>>>>
>>>>>
>>>>>
>>>>> What wrong I am doing? Any suggestions will be appreciated.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> NOTE:- As I am working with Hive 0.6 so it doesn’t support variable
>>>>> substitution thing, so I cannot use hiveconf here and the above table has
>>>>> been partitioned on dt(date) column.**
>>>>>
>>>>
>>>>
>>>
>>
>

Re: Custom UserDefinedFunction in Hive

Posted by Jan Dolinár <do...@gmail.com>.
Oops, sorry I made a copy&paste mistake :) The annotation should read
@*UDFType(deterministic=true*)

Jan

On Tue, Aug 7, 2012 at 7:37 PM, Jan Dolinár <do...@gmail.com> wrote:

> I'm afraid that  he query
>
> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd') LIMIT 10;
>
> will scan entire table, because the functions is evaluated at runtime, so
> Hive doesn't know what the value is when it decides which files to scan. I
> am not 100% sure though, you should try it.
>
> Also, you might want to try to add annotation to your UDF saying that the
> function is deterministic:
> @*UDFType(deterministic=false*)
>
> I think Hive might be clever enough to evaluate it early enough to use the
> partition pruning correctly, since it operates on constant expression. But
> again, I'm not really sure, maybe someone with deeper knowledge of Hive
> optimizations will tell us more. It is actually quite interesting question.
>
> Another way to help Hive with the optimizations might be to skip passing
> the format string argument, if you have all dates in same format, you can
> call the function just like 'yesterdaydate()' and hardcode the format in
> the function.
>
> Jan
>
>
> On Tue, Aug 7, 2012 at 7:20 PM, Raihan Jamal <ja...@gmail.com>wrote:
>
>> Hi Jan,
>>
>>
>> I figured that out, it is working fine for me now. The only question I
>> have is, if I am doing like this-
>>
>>
>>
>> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd') LIMIT 10;
>>
>>
>>
>> Then the above query will be evaluated as below right?
>>
>>
>>
>> SELECT * FROM REALTIME where dt= ‘20120806’ LIMIT 10;
>>
>>
>>
>> So that means it will look for data in the corresponding dt partition *(20120806)
>> *only right as above table is partitioned on dt column ? And it will not
>> scan the whole table right?**
>>
>>
>>
>> *Raihan Jamal*
>>
>>
>>
>> On Mon, Aug 6, 2012 at 10:56 PM, Jan Dolinár <do...@gmail.com> wrote:
>>
>>> Hi Jamal,
>>>
>>> Check if the function really returns what it should and that your data
>>> are really in yyyyMMdd format. You can do this by simple query like this:
>>>
>>> SELECT dt, yesterdaydate('yyyyMMdd') FROM REALTIME LIMIT 1;
>>>
>>> I don't see anything wrong with the function itself, it works well for
>>> me (although I tested it in hive 0.7.1). The only thing I would change
>>> about it would be to optimize it by calling 'new' only at the time of
>>> construction and reusing the object when the function is called, but that
>>> should not affect the functionality at all.
>>>
>>> Best regards,
>>> Jan
>>>
>>>
>>>
>>>
>>> On Tue, Aug 7, 2012 at 3:39 AM, Raihan Jamal <ja...@gmail.com>wrote:
>>>
>>>> *Problem*
>>>>
>>>> I created the below UserDefinedFunction to get the yesterday's day in
>>>> the format I wanted as I will be passing the format into this below method
>>>> from the query.
>>>>
>>>>
>>>>
>>>> *public final class YesterdayDate extends UDF {*
>>>>
>>>> * *
>>>>
>>>> *                public String evaluate(final String format) { *
>>>>
>>>> *                                DateFormat dateFormat = new
>>>> SimpleDateFormat(format); *
>>>>
>>>> *                                Calendar cal = Calendar.getInstance();
>>>> *
>>>>
>>>> *                                cal.add(Calendar.DATE, -1);     *
>>>>
>>>> *                                return
>>>> dateFormat.format(cal.getTime()).toString(); *
>>>>
>>>> *                } *
>>>>
>>>> *}*
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> So whenever I try to run the query like below by adding the jar to
>>>> classpath and creating the temporary function yesterdaydate, I always get
>>>> zero result back-
>>>>
>>>>
>>>>
>>>> hive> create temporary function *yesterdaydate* as
>>>> 'com.example.hive.udf.YesterdayDate';
>>>>
>>>> OK
>>>>
>>>> Time taken: 0.512 seconds
>>>>
>>>>
>>>>
>>>> Below is the query I am running-
>>>>
>>>>
>>>>
>>>> *hive> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd')
>>>> LIMIT 10;*
>>>>
>>>> *OK*
>>>>
>>>> * *
>>>>
>>>> And I always get zero result back but the data is there in that table
>>>> for Aug 5th.**
>>>>
>>>>
>>>>
>>>> What wrong I am doing? Any suggestions will be appreciated.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> NOTE:- As I am working with Hive 0.6 so it doesn’t support variable
>>>> substitution thing, so I cannot use hiveconf here and the above table has
>>>> been partitioned on dt(date) column.**
>>>>
>>>
>>>
>>
>

Re: Custom UserDefinedFunction in Hive

Posted by Jan Dolinár <do...@gmail.com>.
Right, no need for the function at all. Sorry it is getting late here and
my brain refuses to work any more :)


On Tue, Aug 7, 2012 at 8:39 PM, Techy Teck <co...@gmail.com> wrote:

> Then that means I don't need to create that userdefinedfunction right?
>
>
>
> On Tue, Aug 7, 2012 at 11:32 AM, Jan Dolinár <do...@gmail.com> wrote:
>
>> Hi Jamal,
>>
>> date is standard linux/unix tool, see the manual page:
>> http://linux.die.net/man/1/date.
>>
>> The $(...) tells the shell to execute the command and insert it's output
>> into the string. So in this case it will execute command
>> date -d -1day +%Y%m%d
>> which returns yesterday date in the format you need.
>>
>> Jan
>>
>>
>> On Tue, Aug 7, 2012 at 8:24 PM, Raihan Jamal <ja...@gmail.com>wrote:
>>
>>> Yes it supports -e option, but in your query what is date?
>>>
>>> hive -e "CREATE TEMPORARY FUNCTION yesterdaydate
>>> AS 'com.example.hive.udf.YesterdayDate';
>>> SELECT * FROM REALTIME where dt=$(*date* -d -1day +%Y%m%d) LIMIT 10;"
>>>
>>>
>>>
>>> *Raihan Jamal*
>>>
>>>
>>>
>>> On Tue, Aug 7, 2012 at 11:18 AM, Jan Dolinár <do...@gmail.com>wrote:
>>>
>>>> By the way, even without hiveconf, you can run hive from shell like
>>>> this to achieve what you want using shell capabilities:
>>>>
>>>> hive -e "CREATE TEMPORARY FUNCTION yesterdaydate
>>>> AS 'com.example.hive.udf.YesterdayDate';
>>>> SELECT * FROM REALTIME where dt=$(date -d -1day +%Y%m%d) LIMIT 10;"
>>>>
>>>> At least if hive 6.0 supports -e option, I don't have where to check
>>>> that.
>>>>
>>>> Jan
>>>>
>>>>
>>>> On Tue, Aug 7, 2012 at 8:11 PM, Vijay <te...@gmail.com> wrote:
>>>>
>>>>> Given the implementation of the UDF, I don't think hive would be able
>>>>> to use partition pruning. Especially the version you're using. I'd
>>>>> really recommend upgrading to a later version that has the hiveconf
>>>>> support. That can save a lot of trouble rather than trying to get
>>>>> things working on 0.6
>>>>>
>>>>> On Tue, Aug 7, 2012 at 11:07 AM, Raihan Jamal <ja...@gmail.com>
>>>>> wrote:
>>>>> > Hi Jan,
>>>>> >
>>>>> > I have date in different format also, so that is the reason I was
>>>>> thinking
>>>>> > to do by this approach. How can I make sure this will work on the
>>>>> selected
>>>>> > partition only and it will not scan the entire table. I will add your
>>>>> > suggestion in my UDF as deterministic thing.
>>>>> >
>>>>> > My simple question here is- How to get the Yesterdays date which I
>>>>> can use
>>>>> > on the Date Partition I cannot use hiveconf here as I am working
>>>>> with Hive
>>>>> > 0.6
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> > Raihan Jamal
>>>>> >
>>>>> >
>>>>> >
>>>>> > On Tue, Aug 7, 2012 at 10:37 AM, Jan Dolinár <do...@gmail.com>
>>>>> wrote:
>>>>> >>
>>>>> >> I'm afraid that  he query
>>>>> >>
>>>>> >> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd') LIMIT 10;
>>>>> >>
>>>>> >> will scan entire table, because the functions is evaluated at
>>>>> runtime, so
>>>>> >> Hive doesn't know what the value is when it decides which files to
>>>>> scan. I
>>>>> >> am not 100% sure though, you should try it.
>>>>> >>
>>>>> >> Also, you might want to try to add annotation to your UDF saying
>>>>> that the
>>>>> >> function is deterministic:
>>>>> >> @UDFType(deterministic=false)
>>>>> >>
>>>>> >> I think Hive might be clever enough to evaluate it early enough to
>>>>> use the
>>>>> >> partition pruning correctly, since it operates on constant
>>>>> expression. But
>>>>> >> again, I'm not really sure, maybe someone with deeper knowledge of
>>>>> Hive
>>>>> >> optimizations will tell us more. It is actually quite interesting
>>>>> question.
>>>>> >>
>>>>> >> Another way to help Hive with the optimizations might be to skip
>>>>> passing
>>>>> >> the format string argument, if you have all dates in same format,
>>>>> you can
>>>>> >> call the function just like 'yesterdaydate()' and hardcode the
>>>>> format in the
>>>>> >> function.
>>>>> >>
>>>>> >> Jan
>>>>> >>
>>>>> >>
>>>>> >> On Tue, Aug 7, 2012 at 7:20 PM, Raihan Jamal <jamalraihan@gmail.com
>>>>> >
>>>>> >> wrote:
>>>>> >>>
>>>>> >>> Hi Jan,
>>>>> >>>
>>>>> >>>
>>>>> >>>
>>>>> >>> I figured that out, it is working fine for me now. The only
>>>>> question I
>>>>> >>> have is, if I am doing like this-
>>>>> >>>
>>>>> >>>
>>>>> >>>
>>>>> >>> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd') LIMIT
>>>>> 10;
>>>>> >>>
>>>>> >>>
>>>>> >>>
>>>>> >>> Then the above query will be evaluated as below right?
>>>>> >>>
>>>>> >>>
>>>>> >>>
>>>>> >>> SELECT * FROM REALTIME where dt= ‘20120806’ LIMIT 10;
>>>>> >>>
>>>>> >>>
>>>>> >>>
>>>>> >>> So that means it will look for data in the corresponding dt
>>>>> partition
>>>>> >>> (20120806) only right as above table is partitioned on dt column ?
>>>>> And it
>>>>> >>> will not scan the whole table right?
>>>>> >>>
>>>>> >>>
>>>>> >>>
>>>>> >>>
>>>>> >>> Raihan Jamal
>>>>> >>>
>>>>> >>>
>>>>> >>>
>>>>> >>> On Mon, Aug 6, 2012 at 10:56 PM, Jan Dolinár <do...@gmail.com>
>>>>> wrote:
>>>>> >>>>
>>>>> >>>> Hi Jamal,
>>>>> >>>>
>>>>> >>>> Check if the function really returns what it should and that your
>>>>> data
>>>>> >>>> are really in yyyyMMdd format. You can do this by simple query
>>>>> like this:
>>>>> >>>>
>>>>> >>>> SELECT dt, yesterdaydate('yyyyMMdd') FROM REALTIME LIMIT 1;
>>>>> >>>>
>>>>> >>>> I don't see anything wrong with the function itself, it works
>>>>> well for
>>>>> >>>> me (although I tested it in hive 0.7.1). The only thing I would
>>>>> change about
>>>>> >>>> it would be to optimize it by calling 'new' only at the time of
>>>>> construction
>>>>> >>>> and reusing the object when the function is called, but that
>>>>> should not
>>>>> >>>> affect the functionality at all.
>>>>> >>>>
>>>>> >>>> Best regards,
>>>>> >>>> Jan
>>>>> >>>>
>>>>> >>>>
>>>>> >>>>
>>>>> >>>>
>>>>> >>>> On Tue, Aug 7, 2012 at 3:39 AM, Raihan Jamal <
>>>>> jamalraihan@gmail.com>
>>>>> >>>> wrote:
>>>>> >>>>>
>>>>> >>>>> Problem
>>>>> >>>>>
>>>>> >>>>> I created the below UserDefinedFunction to get the yesterday's
>>>>> day in
>>>>> >>>>> the format I wanted as I will be passing the format into this
>>>>> below method
>>>>> >>>>> from the query.
>>>>> >>>>>
>>>>> >>>>>
>>>>> >>>>>
>>>>> >>>>> public final class YesterdayDate extends UDF {
>>>>> >>>>>
>>>>> >>>>>
>>>>> >>>>>
>>>>> >>>>>                 public String evaluate(final String format) {
>>>>> >>>>>
>>>>> >>>>>                                 DateFormat dateFormat = new
>>>>> >>>>> SimpleDateFormat(format);
>>>>> >>>>>
>>>>> >>>>>                                 Calendar cal =
>>>>> Calendar.getInstance();
>>>>> >>>>>
>>>>> >>>>>                                 cal.add(Calendar.DATE, -1);
>>>>> >>>>>
>>>>> >>>>>                                 return
>>>>> >>>>> dateFormat.format(cal.getTime()).toString();
>>>>> >>>>>
>>>>> >>>>>                 }
>>>>> >>>>>
>>>>> >>>>> }
>>>>> >>>>>
>>>>> >>>>>
>>>>> >>>>>
>>>>> >>>>>
>>>>> >>>>>
>>>>> >>>>> So whenever I try to run the query like below by adding the jar
>>>>> to
>>>>> >>>>> classpath and creating the temporary function yesterdaydate, I
>>>>> always get
>>>>> >>>>> zero result back-
>>>>> >>>>>
>>>>> >>>>>
>>>>> >>>>>
>>>>> >>>>> hive> create temporary function yesterdaydate as
>>>>> >>>>> 'com.example.hive.udf.YesterdayDate';
>>>>> >>>>>
>>>>> >>>>> OK
>>>>> >>>>>
>>>>> >>>>> Time taken: 0.512 seconds
>>>>> >>>>>
>>>>> >>>>>
>>>>> >>>>>
>>>>> >>>>> Below is the query I am running-
>>>>> >>>>>
>>>>> >>>>>
>>>>> >>>>>
>>>>> >>>>> hive> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd')
>>>>> LIMIT
>>>>> >>>>> 10;
>>>>> >>>>>
>>>>> >>>>> OK
>>>>> >>>>>
>>>>> >>>>>
>>>>> >>>>>
>>>>> >>>>> And I always get zero result back but the data is there in that
>>>>> table
>>>>> >>>>> for Aug 5th.
>>>>> >>>>>
>>>>> >>>>>
>>>>> >>>>>
>>>>> >>>>> What wrong I am doing? Any suggestions will be appreciated.
>>>>> >>>>>
>>>>> >>>>>
>>>>> >>>>>
>>>>> >>>>>
>>>>> >>>>>
>>>>> >>>>> NOTE:- As I am working with Hive 0.6 so it doesn’t support
>>>>> variable
>>>>> >>>>> substitution thing, so I cannot use hiveconf here and the above
>>>>> table has
>>>>> >>>>> been partitioned on dt(date) column.
>>>>> >>>>
>>>>> >>>>
>>>>> >>>
>>>>> >>
>>>>> >
>>>>>
>>>>
>>>>
>>>
>>
>

Re: Custom UserDefinedFunction in Hive

Posted by Raihan Jamal <ja...@gmail.com>.
Let me try that and I will update on this thread.



*Raihan Jamal*



On Tue, Aug 7, 2012 at 11:39 AM, Techy Teck <co...@gmail.com> wrote:

> Then that means I don't need to create that userdefinedfunction right?
>
>
>
> On Tue, Aug 7, 2012 at 11:32 AM, Jan Dolinár <do...@gmail.com> wrote:
>
>> Hi Jamal,
>>
>> date is standard linux/unix tool, see the manual page:
>> http://linux.die.net/man/1/date.
>>
>> The $(...) tells the shell to execute the command and insert it's output
>> into the string. So in this case it will execute command
>> date -d -1day +%Y%m%d
>> which returns yesterday date in the format you need.
>>
>> Jan
>>
>>
>> On Tue, Aug 7, 2012 at 8:24 PM, Raihan Jamal <ja...@gmail.com>wrote:
>>
>>> Yes it supports -e option, but in your query what is date?
>>>
>>> hive -e "CREATE TEMPORARY FUNCTION yesterdaydate
>>> AS 'com.example.hive.udf.YesterdayDate';
>>> SELECT * FROM REALTIME where dt=$(*date* -d -1day +%Y%m%d) LIMIT 10;"
>>>
>>>
>>>
>>> *Raihan Jamal*
>>>
>>>
>>>
>>> On Tue, Aug 7, 2012 at 11:18 AM, Jan Dolinár <do...@gmail.com>wrote:
>>>
>>>> By the way, even without hiveconf, you can run hive from shell like
>>>> this to achieve what you want using shell capabilities:
>>>>
>>>> hive -e "CREATE TEMPORARY FUNCTION yesterdaydate
>>>> AS 'com.example.hive.udf.YesterdayDate';
>>>> SELECT * FROM REALTIME where dt=$(date -d -1day +%Y%m%d) LIMIT 10;"
>>>>
>>>> At least if hive 6.0 supports -e option, I don't have where to check
>>>> that.
>>>>
>>>> Jan
>>>>
>>>>
>>>> On Tue, Aug 7, 2012 at 8:11 PM, Vijay <te...@gmail.com> wrote:
>>>>
>>>>> Given the implementation of the UDF, I don't think hive would be able
>>>>> to use partition pruning. Especially the version you're using. I'd
>>>>> really recommend upgrading to a later version that has the hiveconf
>>>>> support. That can save a lot of trouble rather than trying to get
>>>>> things working on 0.6
>>>>>
>>>>> On Tue, Aug 7, 2012 at 11:07 AM, Raihan Jamal <ja...@gmail.com>
>>>>> wrote:
>>>>> > Hi Jan,
>>>>> >
>>>>> > I have date in different format also, so that is the reason I was
>>>>> thinking
>>>>> > to do by this approach. How can I make sure this will work on the
>>>>> selected
>>>>> > partition only and it will not scan the entire table. I will add your
>>>>> > suggestion in my UDF as deterministic thing.
>>>>> >
>>>>> > My simple question here is- How to get the Yesterdays date which I
>>>>> can use
>>>>> > on the Date Partition I cannot use hiveconf here as I am working
>>>>> with Hive
>>>>> > 0.6
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> > Raihan Jamal
>>>>> >
>>>>> >
>>>>> >
>>>>> > On Tue, Aug 7, 2012 at 10:37 AM, Jan Dolinár <do...@gmail.com>
>>>>> wrote:
>>>>> >>
>>>>> >> I'm afraid that  he query
>>>>> >>
>>>>> >> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd') LIMIT 10;
>>>>> >>
>>>>> >> will scan entire table, because the functions is evaluated at
>>>>> runtime, so
>>>>> >> Hive doesn't know what the value is when it decides which files to
>>>>> scan. I
>>>>> >> am not 100% sure though, you should try it.
>>>>> >>
>>>>> >> Also, you might want to try to add annotation to your UDF saying
>>>>> that the
>>>>> >> function is deterministic:
>>>>> >> @UDFType(deterministic=false)
>>>>> >>
>>>>> >> I think Hive might be clever enough to evaluate it early enough to
>>>>> use the
>>>>> >> partition pruning correctly, since it operates on constant
>>>>> expression. But
>>>>> >> again, I'm not really sure, maybe someone with deeper knowledge of
>>>>> Hive
>>>>> >> optimizations will tell us more. It is actually quite interesting
>>>>> question.
>>>>> >>
>>>>> >> Another way to help Hive with the optimizations might be to skip
>>>>> passing
>>>>> >> the format string argument, if you have all dates in same format,
>>>>> you can
>>>>> >> call the function just like 'yesterdaydate()' and hardcode the
>>>>> format in the
>>>>> >> function.
>>>>> >>
>>>>> >> Jan
>>>>> >>
>>>>> >>
>>>>> >> On Tue, Aug 7, 2012 at 7:20 PM, Raihan Jamal <jamalraihan@gmail.com
>>>>> >
>>>>> >> wrote:
>>>>> >>>
>>>>> >>> Hi Jan,
>>>>> >>>
>>>>> >>>
>>>>> >>>
>>>>> >>> I figured that out, it is working fine for me now. The only
>>>>> question I
>>>>> >>> have is, if I am doing like this-
>>>>> >>>
>>>>> >>>
>>>>> >>>
>>>>> >>> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd') LIMIT
>>>>> 10;
>>>>> >>>
>>>>> >>>
>>>>> >>>
>>>>> >>> Then the above query will be evaluated as below right?
>>>>> >>>
>>>>> >>>
>>>>> >>>
>>>>> >>> SELECT * FROM REALTIME where dt= ‘20120806’ LIMIT 10;
>>>>> >>>
>>>>> >>>
>>>>> >>>
>>>>> >>> So that means it will look for data in the corresponding dt
>>>>> partition
>>>>> >>> (20120806) only right as above table is partitioned on dt column ?
>>>>> And it
>>>>> >>> will not scan the whole table right?
>>>>> >>>
>>>>> >>>
>>>>> >>>
>>>>> >>>
>>>>> >>> Raihan Jamal
>>>>> >>>
>>>>> >>>
>>>>> >>>
>>>>> >>> On Mon, Aug 6, 2012 at 10:56 PM, Jan Dolinár <do...@gmail.com>
>>>>> wrote:
>>>>> >>>>
>>>>> >>>> Hi Jamal,
>>>>> >>>>
>>>>> >>>> Check if the function really returns what it should and that your
>>>>> data
>>>>> >>>> are really in yyyyMMdd format. You can do this by simple query
>>>>> like this:
>>>>> >>>>
>>>>> >>>> SELECT dt, yesterdaydate('yyyyMMdd') FROM REALTIME LIMIT 1;
>>>>> >>>>
>>>>> >>>> I don't see anything wrong with the function itself, it works
>>>>> well for
>>>>> >>>> me (although I tested it in hive 0.7.1). The only thing I would
>>>>> change about
>>>>> >>>> it would be to optimize it by calling 'new' only at the time of
>>>>> construction
>>>>> >>>> and reusing the object when the function is called, but that
>>>>> should not
>>>>> >>>> affect the functionality at all.
>>>>> >>>>
>>>>> >>>> Best regards,
>>>>> >>>> Jan
>>>>> >>>>
>>>>> >>>>
>>>>> >>>>
>>>>> >>>>
>>>>> >>>> On Tue, Aug 7, 2012 at 3:39 AM, Raihan Jamal <
>>>>> jamalraihan@gmail.com>
>>>>> >>>> wrote:
>>>>> >>>>>
>>>>> >>>>> Problem
>>>>> >>>>>
>>>>> >>>>> I created the below UserDefinedFunction to get the yesterday's
>>>>> day in
>>>>> >>>>> the format I wanted as I will be passing the format into this
>>>>> below method
>>>>> >>>>> from the query.
>>>>> >>>>>
>>>>> >>>>>
>>>>> >>>>>
>>>>> >>>>> public final class YesterdayDate extends UDF {
>>>>> >>>>>
>>>>> >>>>>
>>>>> >>>>>
>>>>> >>>>>                 public String evaluate(final String format) {
>>>>> >>>>>
>>>>> >>>>>                                 DateFormat dateFormat = new
>>>>> >>>>> SimpleDateFormat(format);
>>>>> >>>>>
>>>>> >>>>>                                 Calendar cal =
>>>>> Calendar.getInstance();
>>>>> >>>>>
>>>>> >>>>>                                 cal.add(Calendar.DATE, -1);
>>>>> >>>>>
>>>>> >>>>>                                 return
>>>>> >>>>> dateFormat.format(cal.getTime()).toString();
>>>>> >>>>>
>>>>> >>>>>                 }
>>>>> >>>>>
>>>>> >>>>> }
>>>>> >>>>>
>>>>> >>>>>
>>>>> >>>>>
>>>>> >>>>>
>>>>> >>>>>
>>>>> >>>>> So whenever I try to run the query like below by adding the jar
>>>>> to
>>>>> >>>>> classpath and creating the temporary function yesterdaydate, I
>>>>> always get
>>>>> >>>>> zero result back-
>>>>> >>>>>
>>>>> >>>>>
>>>>> >>>>>
>>>>> >>>>> hive> create temporary function yesterdaydate as
>>>>> >>>>> 'com.example.hive.udf.YesterdayDate';
>>>>> >>>>>
>>>>> >>>>> OK
>>>>> >>>>>
>>>>> >>>>> Time taken: 0.512 seconds
>>>>> >>>>>
>>>>> >>>>>
>>>>> >>>>>
>>>>> >>>>> Below is the query I am running-
>>>>> >>>>>
>>>>> >>>>>
>>>>> >>>>>
>>>>> >>>>> hive> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd')
>>>>> LIMIT
>>>>> >>>>> 10;
>>>>> >>>>>
>>>>> >>>>> OK
>>>>> >>>>>
>>>>> >>>>>
>>>>> >>>>>
>>>>> >>>>> And I always get zero result back but the data is there in that
>>>>> table
>>>>> >>>>> for Aug 5th.
>>>>> >>>>>
>>>>> >>>>>
>>>>> >>>>>
>>>>> >>>>> What wrong I am doing? Any suggestions will be appreciated.
>>>>> >>>>>
>>>>> >>>>>
>>>>> >>>>>
>>>>> >>>>>
>>>>> >>>>>
>>>>> >>>>> NOTE:- As I am working with Hive 0.6 so it doesn’t support
>>>>> variable
>>>>> >>>>> substitution thing, so I cannot use hiveconf here and the above
>>>>> table has
>>>>> >>>>> been partitioned on dt(date) column.
>>>>> >>>>
>>>>> >>>>
>>>>> >>>
>>>>> >>
>>>>> >
>>>>>
>>>>
>>>>
>>>
>>
>

Re: Custom UserDefinedFunction in Hive

Posted by Techy Teck <co...@gmail.com>.
Then that means I don't need to create that userdefinedfunction right?



On Tue, Aug 7, 2012 at 11:32 AM, Jan Dolinár <do...@gmail.com> wrote:

> Hi Jamal,
>
> date is standard linux/unix tool, see the manual page:
> http://linux.die.net/man/1/date.
>
> The $(...) tells the shell to execute the command and insert it's output
> into the string. So in this case it will execute command
> date -d -1day +%Y%m%d
> which returns yesterday date in the format you need.
>
> Jan
>
>
> On Tue, Aug 7, 2012 at 8:24 PM, Raihan Jamal <ja...@gmail.com>wrote:
>
>> Yes it supports -e option, but in your query what is date?
>>
>> hive -e "CREATE TEMPORARY FUNCTION yesterdaydate
>> AS 'com.example.hive.udf.YesterdayDate';
>> SELECT * FROM REALTIME where dt=$(*date* -d -1day +%Y%m%d) LIMIT 10;"
>>
>>
>>
>> *Raihan Jamal*
>>
>>
>>
>> On Tue, Aug 7, 2012 at 11:18 AM, Jan Dolinár <do...@gmail.com> wrote:
>>
>>> By the way, even without hiveconf, you can run hive from shell like this
>>> to achieve what you want using shell capabilities:
>>>
>>> hive -e "CREATE TEMPORARY FUNCTION yesterdaydate
>>> AS 'com.example.hive.udf.YesterdayDate';
>>> SELECT * FROM REALTIME where dt=$(date -d -1day +%Y%m%d) LIMIT 10;"
>>>
>>> At least if hive 6.0 supports -e option, I don't have where to check
>>> that.
>>>
>>> Jan
>>>
>>>
>>> On Tue, Aug 7, 2012 at 8:11 PM, Vijay <te...@gmail.com> wrote:
>>>
>>>> Given the implementation of the UDF, I don't think hive would be able
>>>> to use partition pruning. Especially the version you're using. I'd
>>>> really recommend upgrading to a later version that has the hiveconf
>>>> support. That can save a lot of trouble rather than trying to get
>>>> things working on 0.6
>>>>
>>>> On Tue, Aug 7, 2012 at 11:07 AM, Raihan Jamal <ja...@gmail.com>
>>>> wrote:
>>>> > Hi Jan,
>>>> >
>>>> > I have date in different format also, so that is the reason I was
>>>> thinking
>>>> > to do by this approach. How can I make sure this will work on the
>>>> selected
>>>> > partition only and it will not scan the entire table. I will add your
>>>> > suggestion in my UDF as deterministic thing.
>>>> >
>>>> > My simple question here is- How to get the Yesterdays date which I
>>>> can use
>>>> > on the Date Partition I cannot use hiveconf here as I am working with
>>>> Hive
>>>> > 0.6
>>>> >
>>>> >
>>>> >
>>>> >
>>>> > Raihan Jamal
>>>> >
>>>> >
>>>> >
>>>> > On Tue, Aug 7, 2012 at 10:37 AM, Jan Dolinár <do...@gmail.com>
>>>> wrote:
>>>> >>
>>>> >> I'm afraid that  he query
>>>> >>
>>>> >> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd') LIMIT 10;
>>>> >>
>>>> >> will scan entire table, because the functions is evaluated at
>>>> runtime, so
>>>> >> Hive doesn't know what the value is when it decides which files to
>>>> scan. I
>>>> >> am not 100% sure though, you should try it.
>>>> >>
>>>> >> Also, you might want to try to add annotation to your UDF saying
>>>> that the
>>>> >> function is deterministic:
>>>> >> @UDFType(deterministic=false)
>>>> >>
>>>> >> I think Hive might be clever enough to evaluate it early enough to
>>>> use the
>>>> >> partition pruning correctly, since it operates on constant
>>>> expression. But
>>>> >> again, I'm not really sure, maybe someone with deeper knowledge of
>>>> Hive
>>>> >> optimizations will tell us more. It is actually quite interesting
>>>> question.
>>>> >>
>>>> >> Another way to help Hive with the optimizations might be to skip
>>>> passing
>>>> >> the format string argument, if you have all dates in same format,
>>>> you can
>>>> >> call the function just like 'yesterdaydate()' and hardcode the
>>>> format in the
>>>> >> function.
>>>> >>
>>>> >> Jan
>>>> >>
>>>> >>
>>>> >> On Tue, Aug 7, 2012 at 7:20 PM, Raihan Jamal <ja...@gmail.com>
>>>> >> wrote:
>>>> >>>
>>>> >>> Hi Jan,
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >>> I figured that out, it is working fine for me now. The only
>>>> question I
>>>> >>> have is, if I am doing like this-
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >>> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd') LIMIT 10;
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >>> Then the above query will be evaluated as below right?
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >>> SELECT * FROM REALTIME where dt= ‘20120806’ LIMIT 10;
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >>> So that means it will look for data in the corresponding dt
>>>> partition
>>>> >>> (20120806) only right as above table is partitioned on dt column ?
>>>> And it
>>>> >>> will not scan the whole table right?
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >>> Raihan Jamal
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >>> On Mon, Aug 6, 2012 at 10:56 PM, Jan Dolinár <do...@gmail.com>
>>>> wrote:
>>>> >>>>
>>>> >>>> Hi Jamal,
>>>> >>>>
>>>> >>>> Check if the function really returns what it should and that your
>>>> data
>>>> >>>> are really in yyyyMMdd format. You can do this by simple query
>>>> like this:
>>>> >>>>
>>>> >>>> SELECT dt, yesterdaydate('yyyyMMdd') FROM REALTIME LIMIT 1;
>>>> >>>>
>>>> >>>> I don't see anything wrong with the function itself, it works well
>>>> for
>>>> >>>> me (although I tested it in hive 0.7.1). The only thing I would
>>>> change about
>>>> >>>> it would be to optimize it by calling 'new' only at the time of
>>>> construction
>>>> >>>> and reusing the object when the function is called, but that
>>>> should not
>>>> >>>> affect the functionality at all.
>>>> >>>>
>>>> >>>> Best regards,
>>>> >>>> Jan
>>>> >>>>
>>>> >>>>
>>>> >>>>
>>>> >>>>
>>>> >>>> On Tue, Aug 7, 2012 at 3:39 AM, Raihan Jamal <
>>>> jamalraihan@gmail.com>
>>>> >>>> wrote:
>>>> >>>>>
>>>> >>>>> Problem
>>>> >>>>>
>>>> >>>>> I created the below UserDefinedFunction to get the yesterday's
>>>> day in
>>>> >>>>> the format I wanted as I will be passing the format into this
>>>> below method
>>>> >>>>> from the query.
>>>> >>>>>
>>>> >>>>>
>>>> >>>>>
>>>> >>>>> public final class YesterdayDate extends UDF {
>>>> >>>>>
>>>> >>>>>
>>>> >>>>>
>>>> >>>>>                 public String evaluate(final String format) {
>>>> >>>>>
>>>> >>>>>                                 DateFormat dateFormat = new
>>>> >>>>> SimpleDateFormat(format);
>>>> >>>>>
>>>> >>>>>                                 Calendar cal =
>>>> Calendar.getInstance();
>>>> >>>>>
>>>> >>>>>                                 cal.add(Calendar.DATE, -1);
>>>> >>>>>
>>>> >>>>>                                 return
>>>> >>>>> dateFormat.format(cal.getTime()).toString();
>>>> >>>>>
>>>> >>>>>                 }
>>>> >>>>>
>>>> >>>>> }
>>>> >>>>>
>>>> >>>>>
>>>> >>>>>
>>>> >>>>>
>>>> >>>>>
>>>> >>>>> So whenever I try to run the query like below by adding the jar to
>>>> >>>>> classpath and creating the temporary function yesterdaydate, I
>>>> always get
>>>> >>>>> zero result back-
>>>> >>>>>
>>>> >>>>>
>>>> >>>>>
>>>> >>>>> hive> create temporary function yesterdaydate as
>>>> >>>>> 'com.example.hive.udf.YesterdayDate';
>>>> >>>>>
>>>> >>>>> OK
>>>> >>>>>
>>>> >>>>> Time taken: 0.512 seconds
>>>> >>>>>
>>>> >>>>>
>>>> >>>>>
>>>> >>>>> Below is the query I am running-
>>>> >>>>>
>>>> >>>>>
>>>> >>>>>
>>>> >>>>> hive> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd')
>>>> LIMIT
>>>> >>>>> 10;
>>>> >>>>>
>>>> >>>>> OK
>>>> >>>>>
>>>> >>>>>
>>>> >>>>>
>>>> >>>>> And I always get zero result back but the data is there in that
>>>> table
>>>> >>>>> for Aug 5th.
>>>> >>>>>
>>>> >>>>>
>>>> >>>>>
>>>> >>>>> What wrong I am doing? Any suggestions will be appreciated.
>>>> >>>>>
>>>> >>>>>
>>>> >>>>>
>>>> >>>>>
>>>> >>>>>
>>>> >>>>> NOTE:- As I am working with Hive 0.6 so it doesn’t support
>>>> variable
>>>> >>>>> substitution thing, so I cannot use hiveconf here and the above
>>>> table has
>>>> >>>>> been partitioned on dt(date) column.
>>>> >>>>
>>>> >>>>
>>>> >>>
>>>> >>
>>>> >
>>>>
>>>
>>>
>>
>

Re: Custom UserDefinedFunction in Hive

Posted by Jan Dolinár <do...@gmail.com>.
Hi Jamal,

date is standard linux/unix tool, see the manual page:
http://linux.die.net/man/1/date.

The $(...) tells the shell to execute the command and insert it's output
into the string. So in this case it will execute command
date -d -1day +%Y%m%d
which returns yesterday date in the format you need.

Jan


On Tue, Aug 7, 2012 at 8:24 PM, Raihan Jamal <ja...@gmail.com> wrote:

> Yes it supports -e option, but in your query what is date?
>
> hive -e "CREATE TEMPORARY FUNCTION yesterdaydate
> AS 'com.example.hive.udf.YesterdayDate';
> SELECT * FROM REALTIME where dt=$(*date* -d -1day +%Y%m%d) LIMIT 10;"
>
>
>
> *Raihan Jamal*
>
>
>
> On Tue, Aug 7, 2012 at 11:18 AM, Jan Dolinár <do...@gmail.com> wrote:
>
>> By the way, even without hiveconf, you can run hive from shell like this
>> to achieve what you want using shell capabilities:
>>
>> hive -e "CREATE TEMPORARY FUNCTION yesterdaydate
>> AS 'com.example.hive.udf.YesterdayDate';
>> SELECT * FROM REALTIME where dt=$(date -d -1day +%Y%m%d) LIMIT 10;"
>>
>> At least if hive 6.0 supports -e option, I don't have where to check that.
>>
>> Jan
>>
>>
>> On Tue, Aug 7, 2012 at 8:11 PM, Vijay <te...@gmail.com> wrote:
>>
>>> Given the implementation of the UDF, I don't think hive would be able
>>> to use partition pruning. Especially the version you're using. I'd
>>> really recommend upgrading to a later version that has the hiveconf
>>> support. That can save a lot of trouble rather than trying to get
>>> things working on 0.6
>>>
>>> On Tue, Aug 7, 2012 at 11:07 AM, Raihan Jamal <ja...@gmail.com>
>>> wrote:
>>> > Hi Jan,
>>> >
>>> > I have date in different format also, so that is the reason I was
>>> thinking
>>> > to do by this approach. How can I make sure this will work on the
>>> selected
>>> > partition only and it will not scan the entire table. I will add your
>>> > suggestion in my UDF as deterministic thing.
>>> >
>>> > My simple question here is- How to get the Yesterdays date which I can
>>> use
>>> > on the Date Partition I cannot use hiveconf here as I am working with
>>> Hive
>>> > 0.6
>>> >
>>> >
>>> >
>>> >
>>> > Raihan Jamal
>>> >
>>> >
>>> >
>>> > On Tue, Aug 7, 2012 at 10:37 AM, Jan Dolinár <do...@gmail.com>
>>> wrote:
>>> >>
>>> >> I'm afraid that  he query
>>> >>
>>> >> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd') LIMIT 10;
>>> >>
>>> >> will scan entire table, because the functions is evaluated at
>>> runtime, so
>>> >> Hive doesn't know what the value is when it decides which files to
>>> scan. I
>>> >> am not 100% sure though, you should try it.
>>> >>
>>> >> Also, you might want to try to add annotation to your UDF saying that
>>> the
>>> >> function is deterministic:
>>> >> @UDFType(deterministic=false)
>>> >>
>>> >> I think Hive might be clever enough to evaluate it early enough to
>>> use the
>>> >> partition pruning correctly, since it operates on constant
>>> expression. But
>>> >> again, I'm not really sure, maybe someone with deeper knowledge of
>>> Hive
>>> >> optimizations will tell us more. It is actually quite interesting
>>> question.
>>> >>
>>> >> Another way to help Hive with the optimizations might be to skip
>>> passing
>>> >> the format string argument, if you have all dates in same format, you
>>> can
>>> >> call the function just like 'yesterdaydate()' and hardcode the format
>>> in the
>>> >> function.
>>> >>
>>> >> Jan
>>> >>
>>> >>
>>> >> On Tue, Aug 7, 2012 at 7:20 PM, Raihan Jamal <ja...@gmail.com>
>>> >> wrote:
>>> >>>
>>> >>> Hi Jan,
>>> >>>
>>> >>>
>>> >>>
>>> >>> I figured that out, it is working fine for me now. The only question
>>> I
>>> >>> have is, if I am doing like this-
>>> >>>
>>> >>>
>>> >>>
>>> >>> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd') LIMIT 10;
>>> >>>
>>> >>>
>>> >>>
>>> >>> Then the above query will be evaluated as below right?
>>> >>>
>>> >>>
>>> >>>
>>> >>> SELECT * FROM REALTIME where dt= ‘20120806’ LIMIT 10;
>>> >>>
>>> >>>
>>> >>>
>>> >>> So that means it will look for data in the corresponding dt partition
>>> >>> (20120806) only right as above table is partitioned on dt column ?
>>> And it
>>> >>> will not scan the whole table right?
>>> >>>
>>> >>>
>>> >>>
>>> >>>
>>> >>> Raihan Jamal
>>> >>>
>>> >>>
>>> >>>
>>> >>> On Mon, Aug 6, 2012 at 10:56 PM, Jan Dolinár <do...@gmail.com>
>>> wrote:
>>> >>>>
>>> >>>> Hi Jamal,
>>> >>>>
>>> >>>> Check if the function really returns what it should and that your
>>> data
>>> >>>> are really in yyyyMMdd format. You can do this by simple query like
>>> this:
>>> >>>>
>>> >>>> SELECT dt, yesterdaydate('yyyyMMdd') FROM REALTIME LIMIT 1;
>>> >>>>
>>> >>>> I don't see anything wrong with the function itself, it works well
>>> for
>>> >>>> me (although I tested it in hive 0.7.1). The only thing I would
>>> change about
>>> >>>> it would be to optimize it by calling 'new' only at the time of
>>> construction
>>> >>>> and reusing the object when the function is called, but that should
>>> not
>>> >>>> affect the functionality at all.
>>> >>>>
>>> >>>> Best regards,
>>> >>>> Jan
>>> >>>>
>>> >>>>
>>> >>>>
>>> >>>>
>>> >>>> On Tue, Aug 7, 2012 at 3:39 AM, Raihan Jamal <jamalraihan@gmail.com
>>> >
>>> >>>> wrote:
>>> >>>>>
>>> >>>>> Problem
>>> >>>>>
>>> >>>>> I created the below UserDefinedFunction to get the yesterday's day
>>> in
>>> >>>>> the format I wanted as I will be passing the format into this
>>> below method
>>> >>>>> from the query.
>>> >>>>>
>>> >>>>>
>>> >>>>>
>>> >>>>> public final class YesterdayDate extends UDF {
>>> >>>>>
>>> >>>>>
>>> >>>>>
>>> >>>>>                 public String evaluate(final String format) {
>>> >>>>>
>>> >>>>>                                 DateFormat dateFormat = new
>>> >>>>> SimpleDateFormat(format);
>>> >>>>>
>>> >>>>>                                 Calendar cal =
>>> Calendar.getInstance();
>>> >>>>>
>>> >>>>>                                 cal.add(Calendar.DATE, -1);
>>> >>>>>
>>> >>>>>                                 return
>>> >>>>> dateFormat.format(cal.getTime()).toString();
>>> >>>>>
>>> >>>>>                 }
>>> >>>>>
>>> >>>>> }
>>> >>>>>
>>> >>>>>
>>> >>>>>
>>> >>>>>
>>> >>>>>
>>> >>>>> So whenever I try to run the query like below by adding the jar to
>>> >>>>> classpath and creating the temporary function yesterdaydate, I
>>> always get
>>> >>>>> zero result back-
>>> >>>>>
>>> >>>>>
>>> >>>>>
>>> >>>>> hive> create temporary function yesterdaydate as
>>> >>>>> 'com.example.hive.udf.YesterdayDate';
>>> >>>>>
>>> >>>>> OK
>>> >>>>>
>>> >>>>> Time taken: 0.512 seconds
>>> >>>>>
>>> >>>>>
>>> >>>>>
>>> >>>>> Below is the query I am running-
>>> >>>>>
>>> >>>>>
>>> >>>>>
>>> >>>>> hive> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd')
>>> LIMIT
>>> >>>>> 10;
>>> >>>>>
>>> >>>>> OK
>>> >>>>>
>>> >>>>>
>>> >>>>>
>>> >>>>> And I always get zero result back but the data is there in that
>>> table
>>> >>>>> for Aug 5th.
>>> >>>>>
>>> >>>>>
>>> >>>>>
>>> >>>>> What wrong I am doing? Any suggestions will be appreciated.
>>> >>>>>
>>> >>>>>
>>> >>>>>
>>> >>>>>
>>> >>>>>
>>> >>>>> NOTE:- As I am working with Hive 0.6 so it doesn’t support variable
>>> >>>>> substitution thing, so I cannot use hiveconf here and the above
>>> table has
>>> >>>>> been partitioned on dt(date) column.
>>> >>>>
>>> >>>>
>>> >>>
>>> >>
>>> >
>>>
>>
>>
>

Re: Custom UserDefinedFunction in Hive

Posted by Raihan Jamal <ja...@gmail.com>.
Yes it supports -e option, but in your query what is date?

hive -e "CREATE TEMPORARY FUNCTION yesterdaydate
AS 'com.example.hive.udf.YesterdayDate';
SELECT * FROM REALTIME where dt=$(*date* -d -1day +%Y%m%d) LIMIT 10;"



*Raihan Jamal*



On Tue, Aug 7, 2012 at 11:18 AM, Jan Dolinár <do...@gmail.com> wrote:

> By the way, even without hiveconf, you can run hive from shell like this
> to achieve what you want using shell capabilities:
>
> hive -e "CREATE TEMPORARY FUNCTION yesterdaydate
> AS 'com.example.hive.udf.YesterdayDate';
> SELECT * FROM REALTIME where dt=$(date -d -1day +%Y%m%d) LIMIT 10;"
>
> At least if hive 6.0 supports -e option, I don't have where to check that.
>
> Jan
>
>
> On Tue, Aug 7, 2012 at 8:11 PM, Vijay <te...@gmail.com> wrote:
>
>> Given the implementation of the UDF, I don't think hive would be able
>> to use partition pruning. Especially the version you're using. I'd
>> really recommend upgrading to a later version that has the hiveconf
>> support. That can save a lot of trouble rather than trying to get
>> things working on 0.6
>>
>> On Tue, Aug 7, 2012 at 11:07 AM, Raihan Jamal <ja...@gmail.com>
>> wrote:
>> > Hi Jan,
>> >
>> > I have date in different format also, so that is the reason I was
>> thinking
>> > to do by this approach. How can I make sure this will work on the
>> selected
>> > partition only and it will not scan the entire table. I will add your
>> > suggestion in my UDF as deterministic thing.
>> >
>> > My simple question here is- How to get the Yesterdays date which I can
>> use
>> > on the Date Partition I cannot use hiveconf here as I am working with
>> Hive
>> > 0.6
>> >
>> >
>> >
>> >
>> > Raihan Jamal
>> >
>> >
>> >
>> > On Tue, Aug 7, 2012 at 10:37 AM, Jan Dolinár <do...@gmail.com>
>> wrote:
>> >>
>> >> I'm afraid that  he query
>> >>
>> >> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd') LIMIT 10;
>> >>
>> >> will scan entire table, because the functions is evaluated at runtime,
>> so
>> >> Hive doesn't know what the value is when it decides which files to
>> scan. I
>> >> am not 100% sure though, you should try it.
>> >>
>> >> Also, you might want to try to add annotation to your UDF saying that
>> the
>> >> function is deterministic:
>> >> @UDFType(deterministic=false)
>> >>
>> >> I think Hive might be clever enough to evaluate it early enough to use
>> the
>> >> partition pruning correctly, since it operates on constant expression.
>> But
>> >> again, I'm not really sure, maybe someone with deeper knowledge of Hive
>> >> optimizations will tell us more. It is actually quite interesting
>> question.
>> >>
>> >> Another way to help Hive with the optimizations might be to skip
>> passing
>> >> the format string argument, if you have all dates in same format, you
>> can
>> >> call the function just like 'yesterdaydate()' and hardcode the format
>> in the
>> >> function.
>> >>
>> >> Jan
>> >>
>> >>
>> >> On Tue, Aug 7, 2012 at 7:20 PM, Raihan Jamal <ja...@gmail.com>
>> >> wrote:
>> >>>
>> >>> Hi Jan,
>> >>>
>> >>>
>> >>>
>> >>> I figured that out, it is working fine for me now. The only question I
>> >>> have is, if I am doing like this-
>> >>>
>> >>>
>> >>>
>> >>> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd') LIMIT 10;
>> >>>
>> >>>
>> >>>
>> >>> Then the above query will be evaluated as below right?
>> >>>
>> >>>
>> >>>
>> >>> SELECT * FROM REALTIME where dt= ‘20120806’ LIMIT 10;
>> >>>
>> >>>
>> >>>
>> >>> So that means it will look for data in the corresponding dt partition
>> >>> (20120806) only right as above table is partitioned on dt column ?
>> And it
>> >>> will not scan the whole table right?
>> >>>
>> >>>
>> >>>
>> >>>
>> >>> Raihan Jamal
>> >>>
>> >>>
>> >>>
>> >>> On Mon, Aug 6, 2012 at 10:56 PM, Jan Dolinár <do...@gmail.com>
>> wrote:
>> >>>>
>> >>>> Hi Jamal,
>> >>>>
>> >>>> Check if the function really returns what it should and that your
>> data
>> >>>> are really in yyyyMMdd format. You can do this by simple query like
>> this:
>> >>>>
>> >>>> SELECT dt, yesterdaydate('yyyyMMdd') FROM REALTIME LIMIT 1;
>> >>>>
>> >>>> I don't see anything wrong with the function itself, it works well
>> for
>> >>>> me (although I tested it in hive 0.7.1). The only thing I would
>> change about
>> >>>> it would be to optimize it by calling 'new' only at the time of
>> construction
>> >>>> and reusing the object when the function is called, but that should
>> not
>> >>>> affect the functionality at all.
>> >>>>
>> >>>> Best regards,
>> >>>> Jan
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>> On Tue, Aug 7, 2012 at 3:39 AM, Raihan Jamal <ja...@gmail.com>
>> >>>> wrote:
>> >>>>>
>> >>>>> Problem
>> >>>>>
>> >>>>> I created the below UserDefinedFunction to get the yesterday's day
>> in
>> >>>>> the format I wanted as I will be passing the format into this below
>> method
>> >>>>> from the query.
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>> public final class YesterdayDate extends UDF {
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>>                 public String evaluate(final String format) {
>> >>>>>
>> >>>>>                                 DateFormat dateFormat = new
>> >>>>> SimpleDateFormat(format);
>> >>>>>
>> >>>>>                                 Calendar cal =
>> Calendar.getInstance();
>> >>>>>
>> >>>>>                                 cal.add(Calendar.DATE, -1);
>> >>>>>
>> >>>>>                                 return
>> >>>>> dateFormat.format(cal.getTime()).toString();
>> >>>>>
>> >>>>>                 }
>> >>>>>
>> >>>>> }
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>> So whenever I try to run the query like below by adding the jar to
>> >>>>> classpath and creating the temporary function yesterdaydate, I
>> always get
>> >>>>> zero result back-
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>> hive> create temporary function yesterdaydate as
>> >>>>> 'com.example.hive.udf.YesterdayDate';
>> >>>>>
>> >>>>> OK
>> >>>>>
>> >>>>> Time taken: 0.512 seconds
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>> Below is the query I am running-
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>> hive> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd')
>> LIMIT
>> >>>>> 10;
>> >>>>>
>> >>>>> OK
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>> And I always get zero result back but the data is there in that
>> table
>> >>>>> for Aug 5th.
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>> What wrong I am doing? Any suggestions will be appreciated.
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>> NOTE:- As I am working with Hive 0.6 so it doesn’t support variable
>> >>>>> substitution thing, so I cannot use hiveconf here and the above
>> table has
>> >>>>> been partitioned on dt(date) column.
>> >>>>
>> >>>>
>> >>>
>> >>
>> >
>>
>
>

Re: Custom UserDefinedFunction in Hive

Posted by Jan Dolinár <do...@gmail.com>.
By the way, even without hiveconf, you can run hive from shell like this to
achieve what you want using shell capabilities:

hive -e "CREATE TEMPORARY FUNCTION yesterdaydate
AS 'com.example.hive.udf.YesterdayDate';
SELECT * FROM REALTIME where dt=$(date -d -1day +%Y%m%d) LIMIT 10;"

At least if hive 6.0 supports -e option, I don't have where to check that.

Jan


On Tue, Aug 7, 2012 at 8:11 PM, Vijay <te...@gmail.com> wrote:

> Given the implementation of the UDF, I don't think hive would be able
> to use partition pruning. Especially the version you're using. I'd
> really recommend upgrading to a later version that has the hiveconf
> support. That can save a lot of trouble rather than trying to get
> things working on 0.6
>
> On Tue, Aug 7, 2012 at 11:07 AM, Raihan Jamal <ja...@gmail.com>
> wrote:
> > Hi Jan,
> >
> > I have date in different format also, so that is the reason I was
> thinking
> > to do by this approach. How can I make sure this will work on the
> selected
> > partition only and it will not scan the entire table. I will add your
> > suggestion in my UDF as deterministic thing.
> >
> > My simple question here is- How to get the Yesterdays date which I can
> use
> > on the Date Partition I cannot use hiveconf here as I am working with
> Hive
> > 0.6
> >
> >
> >
> >
> > Raihan Jamal
> >
> >
> >
> > On Tue, Aug 7, 2012 at 10:37 AM, Jan Dolinár <do...@gmail.com>
> wrote:
> >>
> >> I'm afraid that  he query
> >>
> >> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd') LIMIT 10;
> >>
> >> will scan entire table, because the functions is evaluated at runtime,
> so
> >> Hive doesn't know what the value is when it decides which files to
> scan. I
> >> am not 100% sure though, you should try it.
> >>
> >> Also, you might want to try to add annotation to your UDF saying that
> the
> >> function is deterministic:
> >> @UDFType(deterministic=false)
> >>
> >> I think Hive might be clever enough to evaluate it early enough to use
> the
> >> partition pruning correctly, since it operates on constant expression.
> But
> >> again, I'm not really sure, maybe someone with deeper knowledge of Hive
> >> optimizations will tell us more. It is actually quite interesting
> question.
> >>
> >> Another way to help Hive with the optimizations might be to skip passing
> >> the format string argument, if you have all dates in same format, you
> can
> >> call the function just like 'yesterdaydate()' and hardcode the format
> in the
> >> function.
> >>
> >> Jan
> >>
> >>
> >> On Tue, Aug 7, 2012 at 7:20 PM, Raihan Jamal <ja...@gmail.com>
> >> wrote:
> >>>
> >>> Hi Jan,
> >>>
> >>>
> >>>
> >>> I figured that out, it is working fine for me now. The only question I
> >>> have is, if I am doing like this-
> >>>
> >>>
> >>>
> >>> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd') LIMIT 10;
> >>>
> >>>
> >>>
> >>> Then the above query will be evaluated as below right?
> >>>
> >>>
> >>>
> >>> SELECT * FROM REALTIME where dt= ‘20120806’ LIMIT 10;
> >>>
> >>>
> >>>
> >>> So that means it will look for data in the corresponding dt partition
> >>> (20120806) only right as above table is partitioned on dt column ? And
> it
> >>> will not scan the whole table right?
> >>>
> >>>
> >>>
> >>>
> >>> Raihan Jamal
> >>>
> >>>
> >>>
> >>> On Mon, Aug 6, 2012 at 10:56 PM, Jan Dolinár <do...@gmail.com>
> wrote:
> >>>>
> >>>> Hi Jamal,
> >>>>
> >>>> Check if the function really returns what it should and that your data
> >>>> are really in yyyyMMdd format. You can do this by simple query like
> this:
> >>>>
> >>>> SELECT dt, yesterdaydate('yyyyMMdd') FROM REALTIME LIMIT 1;
> >>>>
> >>>> I don't see anything wrong with the function itself, it works well for
> >>>> me (although I tested it in hive 0.7.1). The only thing I would
> change about
> >>>> it would be to optimize it by calling 'new' only at the time of
> construction
> >>>> and reusing the object when the function is called, but that should
> not
> >>>> affect the functionality at all.
> >>>>
> >>>> Best regards,
> >>>> Jan
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> On Tue, Aug 7, 2012 at 3:39 AM, Raihan Jamal <ja...@gmail.com>
> >>>> wrote:
> >>>>>
> >>>>> Problem
> >>>>>
> >>>>> I created the below UserDefinedFunction to get the yesterday's day in
> >>>>> the format I wanted as I will be passing the format into this below
> method
> >>>>> from the query.
> >>>>>
> >>>>>
> >>>>>
> >>>>> public final class YesterdayDate extends UDF {
> >>>>>
> >>>>>
> >>>>>
> >>>>>                 public String evaluate(final String format) {
> >>>>>
> >>>>>                                 DateFormat dateFormat = new
> >>>>> SimpleDateFormat(format);
> >>>>>
> >>>>>                                 Calendar cal =
> Calendar.getInstance();
> >>>>>
> >>>>>                                 cal.add(Calendar.DATE, -1);
> >>>>>
> >>>>>                                 return
> >>>>> dateFormat.format(cal.getTime()).toString();
> >>>>>
> >>>>>                 }
> >>>>>
> >>>>> }
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> So whenever I try to run the query like below by adding the jar to
> >>>>> classpath and creating the temporary function yesterdaydate, I
> always get
> >>>>> zero result back-
> >>>>>
> >>>>>
> >>>>>
> >>>>> hive> create temporary function yesterdaydate as
> >>>>> 'com.example.hive.udf.YesterdayDate';
> >>>>>
> >>>>> OK
> >>>>>
> >>>>> Time taken: 0.512 seconds
> >>>>>
> >>>>>
> >>>>>
> >>>>> Below is the query I am running-
> >>>>>
> >>>>>
> >>>>>
> >>>>> hive> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd')
> LIMIT
> >>>>> 10;
> >>>>>
> >>>>> OK
> >>>>>
> >>>>>
> >>>>>
> >>>>> And I always get zero result back but the data is there in that table
> >>>>> for Aug 5th.
> >>>>>
> >>>>>
> >>>>>
> >>>>> What wrong I am doing? Any suggestions will be appreciated.
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> NOTE:- As I am working with Hive 0.6 so it doesn’t support variable
> >>>>> substitution thing, so I cannot use hiveconf here and the above
> table has
> >>>>> been partitioned on dt(date) column.
> >>>>
> >>>>
> >>>
> >>
> >
>

Re: Custom UserDefinedFunction in Hive

Posted by Vijay <te...@gmail.com>.
You actually don't need hive on the whole cluster. That's the beauty
of it. You only need it on the client machine where you're submitting
hive jobs. Of course the metadata store does need to be upgraded for
newer versions so that might still be a problem.

On Tue, Aug 7, 2012 at 11:26 AM, Raihan Jamal <ja...@gmail.com> wrote:
> Hi Vijay,
>
> Thanks for the suggestion,  If upgrading to Hive was under my control then I
> would have done for sure, but I am working in a company and they are running
> Hive 0.6 on all the cluster, And I told them to upgrade the Hive version but
> they said it will take few months for them to do this. And I don't know why
> they are saying like this, so that is the reason I was doing like this.
>
> Any suggestions will be appreciated to make this thing work
>
>
> Raihan Jamal
>
>
>
> On Tue, Aug 7, 2012 at 11:11 AM, Vijay <te...@gmail.com> wrote:
>>
>> Given the implementation of the UDF, I don't think hive would be able
>> to use partition pruning. Especially the version you're using. I'd
>> really recommend upgrading to a later version that has the hiveconf
>> support. That can save a lot of trouble rather than trying to get
>> things working on 0.6
>>
>> On Tue, Aug 7, 2012 at 11:07 AM, Raihan Jamal <ja...@gmail.com>
>> wrote:
>> > Hi Jan,
>> >
>> > I have date in different format also, so that is the reason I was
>> > thinking
>> > to do by this approach. How can I make sure this will work on the
>> > selected
>> > partition only and it will not scan the entire table. I will add your
>> > suggestion in my UDF as deterministic thing.
>> >
>> > My simple question here is- How to get the Yesterdays date which I can
>> > use
>> > on the Date Partition I cannot use hiveconf here as I am working with
>> > Hive
>> > 0.6
>> >
>> >
>> >
>> >
>> > Raihan Jamal
>> >
>> >
>> >
>> > On Tue, Aug 7, 2012 at 10:37 AM, Jan Dolinár <do...@gmail.com>
>> > wrote:
>> >>
>> >> I'm afraid that  he query
>> >>
>> >> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd') LIMIT 10;
>> >>
>> >> will scan entire table, because the functions is evaluated at runtime,
>> >> so
>> >> Hive doesn't know what the value is when it decides which files to
>> >> scan. I
>> >> am not 100% sure though, you should try it.
>> >>
>> >> Also, you might want to try to add annotation to your UDF saying that
>> >> the
>> >> function is deterministic:
>> >> @UDFType(deterministic=false)
>> >>
>> >> I think Hive might be clever enough to evaluate it early enough to use
>> >> the
>> >> partition pruning correctly, since it operates on constant expression.
>> >> But
>> >> again, I'm not really sure, maybe someone with deeper knowledge of Hive
>> >> optimizations will tell us more. It is actually quite interesting
>> >> question.
>> >>
>> >> Another way to help Hive with the optimizations might be to skip
>> >> passing
>> >> the format string argument, if you have all dates in same format, you
>> >> can
>> >> call the function just like 'yesterdaydate()' and hardcode the format
>> >> in the
>> >> function.
>> >>
>> >> Jan
>> >>
>> >>
>> >> On Tue, Aug 7, 2012 at 7:20 PM, Raihan Jamal <ja...@gmail.com>
>> >> wrote:
>> >>>
>> >>> Hi Jan,
>> >>>
>> >>>
>> >>>
>> >>> I figured that out, it is working fine for me now. The only question I
>> >>> have is, if I am doing like this-
>> >>>
>> >>>
>> >>>
>> >>> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd') LIMIT 10;
>> >>>
>> >>>
>> >>>
>> >>> Then the above query will be evaluated as below right?
>> >>>
>> >>>
>> >>>
>> >>> SELECT * FROM REALTIME where dt= ‘20120806’ LIMIT 10;
>> >>>
>> >>>
>> >>>
>> >>> So that means it will look for data in the corresponding dt partition
>> >>> (20120806) only right as above table is partitioned on dt column ? And
>> >>> it
>> >>> will not scan the whole table right?
>> >>>
>> >>>
>> >>>
>> >>>
>> >>> Raihan Jamal
>> >>>
>> >>>
>> >>>
>> >>> On Mon, Aug 6, 2012 at 10:56 PM, Jan Dolinár <do...@gmail.com>
>> >>> wrote:
>> >>>>
>> >>>> Hi Jamal,
>> >>>>
>> >>>> Check if the function really returns what it should and that your
>> >>>> data
>> >>>> are really in yyyyMMdd format. You can do this by simple query like
>> >>>> this:
>> >>>>
>> >>>> SELECT dt, yesterdaydate('yyyyMMdd') FROM REALTIME LIMIT 1;
>> >>>>
>> >>>> I don't see anything wrong with the function itself, it works well
>> >>>> for
>> >>>> me (although I tested it in hive 0.7.1). The only thing I would
>> >>>> change about
>> >>>> it would be to optimize it by calling 'new' only at the time of
>> >>>> construction
>> >>>> and reusing the object when the function is called, but that should
>> >>>> not
>> >>>> affect the functionality at all.
>> >>>>
>> >>>> Best regards,
>> >>>> Jan
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>> On Tue, Aug 7, 2012 at 3:39 AM, Raihan Jamal <ja...@gmail.com>
>> >>>> wrote:
>> >>>>>
>> >>>>> Problem
>> >>>>>
>> >>>>> I created the below UserDefinedFunction to get the yesterday's day
>> >>>>> in
>> >>>>> the format I wanted as I will be passing the format into this below
>> >>>>> method
>> >>>>> from the query.
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>> public final class YesterdayDate extends UDF {
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>>                 public String evaluate(final String format) {
>> >>>>>
>> >>>>>                                 DateFormat dateFormat = new
>> >>>>> SimpleDateFormat(format);
>> >>>>>
>> >>>>>                                 Calendar cal =
>> >>>>> Calendar.getInstance();
>> >>>>>
>> >>>>>                                 cal.add(Calendar.DATE, -1);
>> >>>>>
>> >>>>>                                 return
>> >>>>> dateFormat.format(cal.getTime()).toString();
>> >>>>>
>> >>>>>                 }
>> >>>>>
>> >>>>> }
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>> So whenever I try to run the query like below by adding the jar to
>> >>>>> classpath and creating the temporary function yesterdaydate, I
>> >>>>> always get
>> >>>>> zero result back-
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>> hive> create temporary function yesterdaydate as
>> >>>>> 'com.example.hive.udf.YesterdayDate';
>> >>>>>
>> >>>>> OK
>> >>>>>
>> >>>>> Time taken: 0.512 seconds
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>> Below is the query I am running-
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>> hive> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd')
>> >>>>> LIMIT
>> >>>>> 10;
>> >>>>>
>> >>>>> OK
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>> And I always get zero result back but the data is there in that
>> >>>>> table
>> >>>>> for Aug 5th.
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>> What wrong I am doing? Any suggestions will be appreciated.
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>> NOTE:- As I am working with Hive 0.6 so it doesn’t support variable
>> >>>>> substitution thing, so I cannot use hiveconf here and the above
>> >>>>> table has
>> >>>>> been partitioned on dt(date) column.
>> >>>>
>> >>>>
>> >>>
>> >>
>> >
>
>

Re: Custom UserDefinedFunction in Hive

Posted by Raihan Jamal <ja...@gmail.com>.
Hi Vijay,

Thanks for the suggestion,  If upgrading to Hive was under my control then
I would have done for sure, but I am working in a company and they are
running Hive 0.6 on all the cluster, And I told them to upgrade the Hive
version but they said it will take few months for them to do this. And I
don't know why they are saying like this, so that is the reason I was doing
like this.

Any suggestions will be appreciated to make this thing work


*Raihan Jamal*



On Tue, Aug 7, 2012 at 11:11 AM, Vijay <te...@gmail.com> wrote:

> Given the implementation of the UDF, I don't think hive would be able
> to use partition pruning. Especially the version you're using. I'd
> really recommend upgrading to a later version that has the hiveconf
> support. That can save a lot of trouble rather than trying to get
> things working on 0.6
>
> On Tue, Aug 7, 2012 at 11:07 AM, Raihan Jamal <ja...@gmail.com>
> wrote:
> > Hi Jan,
> >
> > I have date in different format also, so that is the reason I was
> thinking
> > to do by this approach. How can I make sure this will work on the
> selected
> > partition only and it will not scan the entire table. I will add your
> > suggestion in my UDF as deterministic thing.
> >
> > My simple question here is- How to get the Yesterdays date which I can
> use
> > on the Date Partition I cannot use hiveconf here as I am working with
> Hive
> > 0.6
> >
> >
> >
> >
> > Raihan Jamal
> >
> >
> >
> > On Tue, Aug 7, 2012 at 10:37 AM, Jan Dolinár <do...@gmail.com>
> wrote:
> >>
> >> I'm afraid that  he query
> >>
> >> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd') LIMIT 10;
> >>
> >> will scan entire table, because the functions is evaluated at runtime,
> so
> >> Hive doesn't know what the value is when it decides which files to
> scan. I
> >> am not 100% sure though, you should try it.
> >>
> >> Also, you might want to try to add annotation to your UDF saying that
> the
> >> function is deterministic:
> >> @UDFType(deterministic=false)
> >>
> >> I think Hive might be clever enough to evaluate it early enough to use
> the
> >> partition pruning correctly, since it operates on constant expression.
> But
> >> again, I'm not really sure, maybe someone with deeper knowledge of Hive
> >> optimizations will tell us more. It is actually quite interesting
> question.
> >>
> >> Another way to help Hive with the optimizations might be to skip passing
> >> the format string argument, if you have all dates in same format, you
> can
> >> call the function just like 'yesterdaydate()' and hardcode the format
> in the
> >> function.
> >>
> >> Jan
> >>
> >>
> >> On Tue, Aug 7, 2012 at 7:20 PM, Raihan Jamal <ja...@gmail.com>
> >> wrote:
> >>>
> >>> Hi Jan,
> >>>
> >>>
> >>>
> >>> I figured that out, it is working fine for me now. The only question I
> >>> have is, if I am doing like this-
> >>>
> >>>
> >>>
> >>> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd') LIMIT 10;
> >>>
> >>>
> >>>
> >>> Then the above query will be evaluated as below right?
> >>>
> >>>
> >>>
> >>> SELECT * FROM REALTIME where dt= ‘20120806’ LIMIT 10;
> >>>
> >>>
> >>>
> >>> So that means it will look for data in the corresponding dt partition
> >>> (20120806) only right as above table is partitioned on dt column ? And
> it
> >>> will not scan the whole table right?
> >>>
> >>>
> >>>
> >>>
> >>> Raihan Jamal
> >>>
> >>>
> >>>
> >>> On Mon, Aug 6, 2012 at 10:56 PM, Jan Dolinár <do...@gmail.com>
> wrote:
> >>>>
> >>>> Hi Jamal,
> >>>>
> >>>> Check if the function really returns what it should and that your data
> >>>> are really in yyyyMMdd format. You can do this by simple query like
> this:
> >>>>
> >>>> SELECT dt, yesterdaydate('yyyyMMdd') FROM REALTIME LIMIT 1;
> >>>>
> >>>> I don't see anything wrong with the function itself, it works well for
> >>>> me (although I tested it in hive 0.7.1). The only thing I would
> change about
> >>>> it would be to optimize it by calling 'new' only at the time of
> construction
> >>>> and reusing the object when the function is called, but that should
> not
> >>>> affect the functionality at all.
> >>>>
> >>>> Best regards,
> >>>> Jan
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> On Tue, Aug 7, 2012 at 3:39 AM, Raihan Jamal <ja...@gmail.com>
> >>>> wrote:
> >>>>>
> >>>>> Problem
> >>>>>
> >>>>> I created the below UserDefinedFunction to get the yesterday's day in
> >>>>> the format I wanted as I will be passing the format into this below
> method
> >>>>> from the query.
> >>>>>
> >>>>>
> >>>>>
> >>>>> public final class YesterdayDate extends UDF {
> >>>>>
> >>>>>
> >>>>>
> >>>>>                 public String evaluate(final String format) {
> >>>>>
> >>>>>                                 DateFormat dateFormat = new
> >>>>> SimpleDateFormat(format);
> >>>>>
> >>>>>                                 Calendar cal =
> Calendar.getInstance();
> >>>>>
> >>>>>                                 cal.add(Calendar.DATE, -1);
> >>>>>
> >>>>>                                 return
> >>>>> dateFormat.format(cal.getTime()).toString();
> >>>>>
> >>>>>                 }
> >>>>>
> >>>>> }
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> So whenever I try to run the query like below by adding the jar to
> >>>>> classpath and creating the temporary function yesterdaydate, I
> always get
> >>>>> zero result back-
> >>>>>
> >>>>>
> >>>>>
> >>>>> hive> create temporary function yesterdaydate as
> >>>>> 'com.example.hive.udf.YesterdayDate';
> >>>>>
> >>>>> OK
> >>>>>
> >>>>> Time taken: 0.512 seconds
> >>>>>
> >>>>>
> >>>>>
> >>>>> Below is the query I am running-
> >>>>>
> >>>>>
> >>>>>
> >>>>> hive> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd')
> LIMIT
> >>>>> 10;
> >>>>>
> >>>>> OK
> >>>>>
> >>>>>
> >>>>>
> >>>>> And I always get zero result back but the data is there in that table
> >>>>> for Aug 5th.
> >>>>>
> >>>>>
> >>>>>
> >>>>> What wrong I am doing? Any suggestions will be appreciated.
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> NOTE:- As I am working with Hive 0.6 so it doesn’t support variable
> >>>>> substitution thing, so I cannot use hiveconf here and the above
> table has
> >>>>> been partitioned on dt(date) column.
> >>>>
> >>>>
> >>>
> >>
> >
>

Re: Custom UserDefinedFunction in Hive

Posted by Vijay <te...@gmail.com>.
Given the implementation of the UDF, I don't think hive would be able
to use partition pruning. Especially the version you're using. I'd
really recommend upgrading to a later version that has the hiveconf
support. That can save a lot of trouble rather than trying to get
things working on 0.6

On Tue, Aug 7, 2012 at 11:07 AM, Raihan Jamal <ja...@gmail.com> wrote:
> Hi Jan,
>
> I have date in different format also, so that is the reason I was thinking
> to do by this approach. How can I make sure this will work on the selected
> partition only and it will not scan the entire table. I will add your
> suggestion in my UDF as deterministic thing.
>
> My simple question here is- How to get the Yesterdays date which I can use
> on the Date Partition I cannot use hiveconf here as I am working with Hive
> 0.6
>
>
>
>
> Raihan Jamal
>
>
>
> On Tue, Aug 7, 2012 at 10:37 AM, Jan Dolinár <do...@gmail.com> wrote:
>>
>> I'm afraid that  he query
>>
>> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd') LIMIT 10;
>>
>> will scan entire table, because the functions is evaluated at runtime, so
>> Hive doesn't know what the value is when it decides which files to scan. I
>> am not 100% sure though, you should try it.
>>
>> Also, you might want to try to add annotation to your UDF saying that the
>> function is deterministic:
>> @UDFType(deterministic=false)
>>
>> I think Hive might be clever enough to evaluate it early enough to use the
>> partition pruning correctly, since it operates on constant expression. But
>> again, I'm not really sure, maybe someone with deeper knowledge of Hive
>> optimizations will tell us more. It is actually quite interesting question.
>>
>> Another way to help Hive with the optimizations might be to skip passing
>> the format string argument, if you have all dates in same format, you can
>> call the function just like 'yesterdaydate()' and hardcode the format in the
>> function.
>>
>> Jan
>>
>>
>> On Tue, Aug 7, 2012 at 7:20 PM, Raihan Jamal <ja...@gmail.com>
>> wrote:
>>>
>>> Hi Jan,
>>>
>>>
>>>
>>> I figured that out, it is working fine for me now. The only question I
>>> have is, if I am doing like this-
>>>
>>>
>>>
>>> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd') LIMIT 10;
>>>
>>>
>>>
>>> Then the above query will be evaluated as below right?
>>>
>>>
>>>
>>> SELECT * FROM REALTIME where dt= ‘20120806’ LIMIT 10;
>>>
>>>
>>>
>>> So that means it will look for data in the corresponding dt partition
>>> (20120806) only right as above table is partitioned on dt column ? And it
>>> will not scan the whole table right?
>>>
>>>
>>>
>>>
>>> Raihan Jamal
>>>
>>>
>>>
>>> On Mon, Aug 6, 2012 at 10:56 PM, Jan Dolinár <do...@gmail.com> wrote:
>>>>
>>>> Hi Jamal,
>>>>
>>>> Check if the function really returns what it should and that your data
>>>> are really in yyyyMMdd format. You can do this by simple query like this:
>>>>
>>>> SELECT dt, yesterdaydate('yyyyMMdd') FROM REALTIME LIMIT 1;
>>>>
>>>> I don't see anything wrong with the function itself, it works well for
>>>> me (although I tested it in hive 0.7.1). The only thing I would change about
>>>> it would be to optimize it by calling 'new' only at the time of construction
>>>> and reusing the object when the function is called, but that should not
>>>> affect the functionality at all.
>>>>
>>>> Best regards,
>>>> Jan
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, Aug 7, 2012 at 3:39 AM, Raihan Jamal <ja...@gmail.com>
>>>> wrote:
>>>>>
>>>>> Problem
>>>>>
>>>>> I created the below UserDefinedFunction to get the yesterday's day in
>>>>> the format I wanted as I will be passing the format into this below method
>>>>> from the query.
>>>>>
>>>>>
>>>>>
>>>>> public final class YesterdayDate extends UDF {
>>>>>
>>>>>
>>>>>
>>>>>                 public String evaluate(final String format) {
>>>>>
>>>>>                                 DateFormat dateFormat = new
>>>>> SimpleDateFormat(format);
>>>>>
>>>>>                                 Calendar cal = Calendar.getInstance();
>>>>>
>>>>>                                 cal.add(Calendar.DATE, -1);
>>>>>
>>>>>                                 return
>>>>> dateFormat.format(cal.getTime()).toString();
>>>>>
>>>>>                 }
>>>>>
>>>>> }
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> So whenever I try to run the query like below by adding the jar to
>>>>> classpath and creating the temporary function yesterdaydate, I always get
>>>>> zero result back-
>>>>>
>>>>>
>>>>>
>>>>> hive> create temporary function yesterdaydate as
>>>>> 'com.example.hive.udf.YesterdayDate';
>>>>>
>>>>> OK
>>>>>
>>>>> Time taken: 0.512 seconds
>>>>>
>>>>>
>>>>>
>>>>> Below is the query I am running-
>>>>>
>>>>>
>>>>>
>>>>> hive> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd') LIMIT
>>>>> 10;
>>>>>
>>>>> OK
>>>>>
>>>>>
>>>>>
>>>>> And I always get zero result back but the data is there in that table
>>>>> for Aug 5th.
>>>>>
>>>>>
>>>>>
>>>>> What wrong I am doing? Any suggestions will be appreciated.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> NOTE:- As I am working with Hive 0.6 so it doesn’t support variable
>>>>> substitution thing, so I cannot use hiveconf here and the above table has
>>>>> been partitioned on dt(date) column.
>>>>
>>>>
>>>
>>
>

Re: Custom UserDefinedFunction in Hive

Posted by Raihan Jamal <ja...@gmail.com>.
Hi Jan,

I have date in different format also, so that is the reason I was thinking
to do by this approach. How can I make sure this will work on the selected
partition only and it will not scan the entire table. I will add your
suggestion in my UDF as deterministic thing.

My simple question here is- How to get the Yesterdays date which I can use
on the Date Partition I cannot use hiveconf here as I am working with Hive
0.6




*Raihan Jamal*



On Tue, Aug 7, 2012 at 10:37 AM, Jan Dolinár <do...@gmail.com> wrote:

> I'm afraid that  he query
>
> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd') LIMIT 10;
>
> will scan entire table, because the functions is evaluated at runtime, so
> Hive doesn't know what the value is when it decides which files to scan. I
> am not 100% sure though, you should try it.
>
> Also, you might want to try to add annotation to your UDF saying that the
> function is deterministic:
> @*UDFType(deterministic=false*)
>
> I think Hive might be clever enough to evaluate it early enough to use the
> partition pruning correctly, since it operates on constant expression. But
> again, I'm not really sure, maybe someone with deeper knowledge of Hive
> optimizations will tell us more. It is actually quite interesting question.
>
> Another way to help Hive with the optimizations might be to skip passing
> the format string argument, if you have all dates in same format, you can
> call the function just like 'yesterdaydate()' and hardcode the format in
> the function.
>
> Jan
>
>
> On Tue, Aug 7, 2012 at 7:20 PM, Raihan Jamal <ja...@gmail.com>wrote:
>
>> Hi Jan,
>>
>>
>> I figured that out, it is working fine for me now. The only question I
>> have is, if I am doing like this-
>>
>>
>>
>> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd') LIMIT 10;
>>
>>
>>
>> Then the above query will be evaluated as below right?
>>
>>
>>
>> SELECT * FROM REALTIME where dt= ‘20120806’ LIMIT 10;
>>
>>
>>
>> So that means it will look for data in the corresponding dt partition *(20120806)
>> *only right as above table is partitioned on dt column ? And it will not
>> scan the whole table right?**
>>
>>
>>
>> *Raihan Jamal*
>>
>>
>>
>> On Mon, Aug 6, 2012 at 10:56 PM, Jan Dolinár <do...@gmail.com> wrote:
>>
>>> Hi Jamal,
>>>
>>> Check if the function really returns what it should and that your data
>>> are really in yyyyMMdd format. You can do this by simple query like this:
>>>
>>> SELECT dt, yesterdaydate('yyyyMMdd') FROM REALTIME LIMIT 1;
>>>
>>> I don't see anything wrong with the function itself, it works well for
>>> me (although I tested it in hive 0.7.1). The only thing I would change
>>> about it would be to optimize it by calling 'new' only at the time of
>>> construction and reusing the object when the function is called, but that
>>> should not affect the functionality at all.
>>>
>>> Best regards,
>>> Jan
>>>
>>>
>>>
>>>
>>> On Tue, Aug 7, 2012 at 3:39 AM, Raihan Jamal <ja...@gmail.com>wrote:
>>>
>>>> *Problem*
>>>>
>>>> I created the below UserDefinedFunction to get the yesterday's day in
>>>> the format I wanted as I will be passing the format into this below method
>>>> from the query.
>>>>
>>>>
>>>>
>>>> *public final class YesterdayDate extends UDF {*
>>>>
>>>> * *
>>>>
>>>> *                public String evaluate(final String format) { *
>>>>
>>>> *                                DateFormat dateFormat = new
>>>> SimpleDateFormat(format); *
>>>>
>>>> *                                Calendar cal = Calendar.getInstance();
>>>> *
>>>>
>>>> *                                cal.add(Calendar.DATE, -1);     *
>>>>
>>>> *                                return
>>>> dateFormat.format(cal.getTime()).toString(); *
>>>>
>>>> *                } *
>>>>
>>>> *}*
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> So whenever I try to run the query like below by adding the jar to
>>>> classpath and creating the temporary function yesterdaydate, I always get
>>>> zero result back-
>>>>
>>>>
>>>>
>>>> hive> create temporary function *yesterdaydate* as
>>>> 'com.example.hive.udf.YesterdayDate';
>>>>
>>>> OK
>>>>
>>>> Time taken: 0.512 seconds
>>>>
>>>>
>>>>
>>>> Below is the query I am running-
>>>>
>>>>
>>>>
>>>> *hive> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd')
>>>> LIMIT 10;*
>>>>
>>>> *OK*
>>>>
>>>> * *
>>>>
>>>> And I always get zero result back but the data is there in that table
>>>> for Aug 5th.**
>>>>
>>>>
>>>>
>>>> What wrong I am doing? Any suggestions will be appreciated.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> NOTE:- As I am working with Hive 0.6 so it doesn’t support variable
>>>> substitution thing, so I cannot use hiveconf here and the above table has
>>>> been partitioned on dt(date) column.**
>>>>
>>>
>>>
>>
>

Re: Custom UserDefinedFunction in Hive

Posted by Jan Dolinár <do...@gmail.com>.
Oops, sorry I made a copy&paste mistake :) The annotation should read
@*UDFType(deterministic=true*)

Jan

On Tue, Aug 7, 2012 at 7:37 PM, Jan Dolinár <do...@gmail.com> wrote:

> I'm afraid that  he query
>
> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd') LIMIT 10;
>
> will scan entire table, because the functions is evaluated at runtime, so
> Hive doesn't know what the value is when it decides which files to scan. I
> am not 100% sure though, you should try it.
>
> Also, you might want to try to add annotation to your UDF saying that the
> function is deterministic:
> @*UDFType(deterministic=false*)
>
> I think Hive might be clever enough to evaluate it early enough to use the
> partition pruning correctly, since it operates on constant expression. But
> again, I'm not really sure, maybe someone with deeper knowledge of Hive
> optimizations will tell us more. It is actually quite interesting question.
>
> Another way to help Hive with the optimizations might be to skip passing
> the format string argument, if you have all dates in same format, you can
> call the function just like 'yesterdaydate()' and hardcode the format in
> the function.
>
> Jan
>
>
> On Tue, Aug 7, 2012 at 7:20 PM, Raihan Jamal <ja...@gmail.com>wrote:
>
>> Hi Jan,
>>
>>
>> I figured that out, it is working fine for me now. The only question I
>> have is, if I am doing like this-
>>
>>
>>
>> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd') LIMIT 10;
>>
>>
>>
>> Then the above query will be evaluated as below right?
>>
>>
>>
>> SELECT * FROM REALTIME where dt= ‘20120806’ LIMIT 10;
>>
>>
>>
>> So that means it will look for data in the corresponding dt partition *(20120806)
>> *only right as above table is partitioned on dt column ? And it will not
>> scan the whole table right?**
>>
>>
>>
>> *Raihan Jamal*
>>
>>
>>
>> On Mon, Aug 6, 2012 at 10:56 PM, Jan Dolinár <do...@gmail.com> wrote:
>>
>>> Hi Jamal,
>>>
>>> Check if the function really returns what it should and that your data
>>> are really in yyyyMMdd format. You can do this by simple query like this:
>>>
>>> SELECT dt, yesterdaydate('yyyyMMdd') FROM REALTIME LIMIT 1;
>>>
>>> I don't see anything wrong with the function itself, it works well for
>>> me (although I tested it in hive 0.7.1). The only thing I would change
>>> about it would be to optimize it by calling 'new' only at the time of
>>> construction and reusing the object when the function is called, but that
>>> should not affect the functionality at all.
>>>
>>> Best regards,
>>> Jan
>>>
>>>
>>>
>>>
>>> On Tue, Aug 7, 2012 at 3:39 AM, Raihan Jamal <ja...@gmail.com>wrote:
>>>
>>>> *Problem*
>>>>
>>>> I created the below UserDefinedFunction to get the yesterday's day in
>>>> the format I wanted as I will be passing the format into this below method
>>>> from the query.
>>>>
>>>>
>>>>
>>>> *public final class YesterdayDate extends UDF {*
>>>>
>>>> * *
>>>>
>>>> *                public String evaluate(final String format) { *
>>>>
>>>> *                                DateFormat dateFormat = new
>>>> SimpleDateFormat(format); *
>>>>
>>>> *                                Calendar cal = Calendar.getInstance();
>>>> *
>>>>
>>>> *                                cal.add(Calendar.DATE, -1);     *
>>>>
>>>> *                                return
>>>> dateFormat.format(cal.getTime()).toString(); *
>>>>
>>>> *                } *
>>>>
>>>> *}*
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> So whenever I try to run the query like below by adding the jar to
>>>> classpath and creating the temporary function yesterdaydate, I always get
>>>> zero result back-
>>>>
>>>>
>>>>
>>>> hive> create temporary function *yesterdaydate* as
>>>> 'com.example.hive.udf.YesterdayDate';
>>>>
>>>> OK
>>>>
>>>> Time taken: 0.512 seconds
>>>>
>>>>
>>>>
>>>> Below is the query I am running-
>>>>
>>>>
>>>>
>>>> *hive> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd')
>>>> LIMIT 10;*
>>>>
>>>> *OK*
>>>>
>>>> * *
>>>>
>>>> And I always get zero result back but the data is there in that table
>>>> for Aug 5th.**
>>>>
>>>>
>>>>
>>>> What wrong I am doing? Any suggestions will be appreciated.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> NOTE:- As I am working with Hive 0.6 so it doesn’t support variable
>>>> substitution thing, so I cannot use hiveconf here and the above table has
>>>> been partitioned on dt(date) column.**
>>>>
>>>
>>>
>>
>

Re: Custom UserDefinedFunction in Hive

Posted by Raihan Jamal <ja...@gmail.com>.
Hi Jan,

I have date in different format also, so that is the reason I was thinking
to do by this approach. How can I make sure this will work on the selected
partition only and it will not scan the entire table. I will add your
suggestion in my UDF as deterministic thing.

My simple question here is- How to get the Yesterdays date which I can use
on the Date Partition I cannot use hiveconf here as I am working with Hive
0.6




*Raihan Jamal*



On Tue, Aug 7, 2012 at 10:37 AM, Jan Dolinár <do...@gmail.com> wrote:

> I'm afraid that  he query
>
> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd') LIMIT 10;
>
> will scan entire table, because the functions is evaluated at runtime, so
> Hive doesn't know what the value is when it decides which files to scan. I
> am not 100% sure though, you should try it.
>
> Also, you might want to try to add annotation to your UDF saying that the
> function is deterministic:
> @*UDFType(deterministic=false*)
>
> I think Hive might be clever enough to evaluate it early enough to use the
> partition pruning correctly, since it operates on constant expression. But
> again, I'm not really sure, maybe someone with deeper knowledge of Hive
> optimizations will tell us more. It is actually quite interesting question.
>
> Another way to help Hive with the optimizations might be to skip passing
> the format string argument, if you have all dates in same format, you can
> call the function just like 'yesterdaydate()' and hardcode the format in
> the function.
>
> Jan
>
>
> On Tue, Aug 7, 2012 at 7:20 PM, Raihan Jamal <ja...@gmail.com>wrote:
>
>> Hi Jan,
>>
>>
>> I figured that out, it is working fine for me now. The only question I
>> have is, if I am doing like this-
>>
>>
>>
>> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd') LIMIT 10;
>>
>>
>>
>> Then the above query will be evaluated as below right?
>>
>>
>>
>> SELECT * FROM REALTIME where dt= ‘20120806’ LIMIT 10;
>>
>>
>>
>> So that means it will look for data in the corresponding dt partition *(20120806)
>> *only right as above table is partitioned on dt column ? And it will not
>> scan the whole table right?**
>>
>>
>>
>> *Raihan Jamal*
>>
>>
>>
>> On Mon, Aug 6, 2012 at 10:56 PM, Jan Dolinár <do...@gmail.com> wrote:
>>
>>> Hi Jamal,
>>>
>>> Check if the function really returns what it should and that your data
>>> are really in yyyyMMdd format. You can do this by simple query like this:
>>>
>>> SELECT dt, yesterdaydate('yyyyMMdd') FROM REALTIME LIMIT 1;
>>>
>>> I don't see anything wrong with the function itself, it works well for
>>> me (although I tested it in hive 0.7.1). The only thing I would change
>>> about it would be to optimize it by calling 'new' only at the time of
>>> construction and reusing the object when the function is called, but that
>>> should not affect the functionality at all.
>>>
>>> Best regards,
>>> Jan
>>>
>>>
>>>
>>>
>>> On Tue, Aug 7, 2012 at 3:39 AM, Raihan Jamal <ja...@gmail.com>wrote:
>>>
>>>> *Problem*
>>>>
>>>> I created the below UserDefinedFunction to get the yesterday's day in
>>>> the format I wanted as I will be passing the format into this below method
>>>> from the query.
>>>>
>>>>
>>>>
>>>> *public final class YesterdayDate extends UDF {*
>>>>
>>>> * *
>>>>
>>>> *                public String evaluate(final String format) { *
>>>>
>>>> *                                DateFormat dateFormat = new
>>>> SimpleDateFormat(format); *
>>>>
>>>> *                                Calendar cal = Calendar.getInstance();
>>>> *
>>>>
>>>> *                                cal.add(Calendar.DATE, -1);     *
>>>>
>>>> *                                return
>>>> dateFormat.format(cal.getTime()).toString(); *
>>>>
>>>> *                } *
>>>>
>>>> *}*
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> So whenever I try to run the query like below by adding the jar to
>>>> classpath and creating the temporary function yesterdaydate, I always get
>>>> zero result back-
>>>>
>>>>
>>>>
>>>> hive> create temporary function *yesterdaydate* as
>>>> 'com.example.hive.udf.YesterdayDate';
>>>>
>>>> OK
>>>>
>>>> Time taken: 0.512 seconds
>>>>
>>>>
>>>>
>>>> Below is the query I am running-
>>>>
>>>>
>>>>
>>>> *hive> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd')
>>>> LIMIT 10;*
>>>>
>>>> *OK*
>>>>
>>>> * *
>>>>
>>>> And I always get zero result back but the data is there in that table
>>>> for Aug 5th.**
>>>>
>>>>
>>>>
>>>> What wrong I am doing? Any suggestions will be appreciated.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> NOTE:- As I am working with Hive 0.6 so it doesn’t support variable
>>>> substitution thing, so I cannot use hiveconf here and the above table has
>>>> been partitioned on dt(date) column.**
>>>>
>>>
>>>
>>
>

Re: Custom UserDefinedFunction in Hive

Posted by Jan Dolinár <do...@gmail.com>.
I'm afraid that  he query

SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd') LIMIT 10;

will scan entire table, because the functions is evaluated at runtime, so
Hive doesn't know what the value is when it decides which files to scan. I
am not 100% sure though, you should try it.

Also, you might want to try to add annotation to your UDF saying that the
function is deterministic:
@*UDFType(deterministic=false*)

I think Hive might be clever enough to evaluate it early enough to use the
partition pruning correctly, since it operates on constant expression. But
again, I'm not really sure, maybe someone with deeper knowledge of Hive
optimizations will tell us more. It is actually quite interesting question.

Another way to help Hive with the optimizations might be to skip passing
the format string argument, if you have all dates in same format, you can
call the function just like 'yesterdaydate()' and hardcode the format in
the function.

Jan


On Tue, Aug 7, 2012 at 7:20 PM, Raihan Jamal <ja...@gmail.com> wrote:

> Hi Jan,
>
>
> I figured that out, it is working fine for me now. The only question I
> have is, if I am doing like this-
>
>
>
> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd') LIMIT 10;
>
>
>
> Then the above query will be evaluated as below right?
>
>
>
> SELECT * FROM REALTIME where dt= ‘20120806’ LIMIT 10;
>
>
>
> So that means it will look for data in the corresponding dt partition *(20120806)
> *only right as above table is partitioned on dt column ? And it will not
> scan the whole table right?**
>
>
>
> *Raihan Jamal*
>
>
>
> On Mon, Aug 6, 2012 at 10:56 PM, Jan Dolinár <do...@gmail.com> wrote:
>
>> Hi Jamal,
>>
>> Check if the function really returns what it should and that your data
>> are really in yyyyMMdd format. You can do this by simple query like this:
>>
>> SELECT dt, yesterdaydate('yyyyMMdd') FROM REALTIME LIMIT 1;
>>
>> I don't see anything wrong with the function itself, it works well for me
>> (although I tested it in hive 0.7.1). The only thing I would change about
>> it would be to optimize it by calling 'new' only at the time of
>> construction and reusing the object when the function is called, but that
>> should not affect the functionality at all.
>>
>> Best regards,
>> Jan
>>
>>
>>
>>
>> On Tue, Aug 7, 2012 at 3:39 AM, Raihan Jamal <ja...@gmail.com>wrote:
>>
>>> *Problem*
>>>
>>> I created the below UserDefinedFunction to get the yesterday's day in
>>> the format I wanted as I will be passing the format into this below method
>>> from the query.
>>>
>>>
>>>
>>> *public final class YesterdayDate extends UDF {*
>>>
>>> * *
>>>
>>> *                public String evaluate(final String format) { *
>>>
>>> *                                DateFormat dateFormat = new
>>> SimpleDateFormat(format); *
>>>
>>> *                                Calendar cal = Calendar.getInstance();*
>>>
>>> *                                cal.add(Calendar.DATE, -1);     *
>>>
>>> *                                return
>>> dateFormat.format(cal.getTime()).toString(); *
>>>
>>> *                } *
>>>
>>> *}*
>>>
>>>
>>>
>>>
>>>
>>> So whenever I try to run the query like below by adding the jar to
>>> classpath and creating the temporary function yesterdaydate, I always get
>>> zero result back-
>>>
>>>
>>>
>>> hive> create temporary function *yesterdaydate* as
>>> 'com.example.hive.udf.YesterdayDate';
>>>
>>> OK
>>>
>>> Time taken: 0.512 seconds
>>>
>>>
>>>
>>> Below is the query I am running-
>>>
>>>
>>>
>>> *hive> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd') LIMIT
>>> 10;*
>>>
>>> *OK*
>>>
>>> * *
>>>
>>> And I always get zero result back but the data is there in that table
>>> for Aug 5th.**
>>>
>>>
>>>
>>> What wrong I am doing? Any suggestions will be appreciated.
>>>
>>>
>>>
>>>
>>>
>>> NOTE:- As I am working with Hive 0.6 so it doesn’t support variable
>>> substitution thing, so I cannot use hiveconf here and the above table has
>>> been partitioned on dt(date) column.**
>>>
>>
>>
>

RE: Custom UserDefinedFunction in Hive

Posted by Venkatesh Kavuluri <vk...@outlook.com>.
Actually as the custom UDF "yesterday()" mentioned below is NOT marked with the annotation @UDFType(deterministic = false), partition pruning should work in practice.  The PartitionPruner has a logic around this annotation to check if a generic UDF is deterministic or not and would skip partition pruning if it finds any non-deterministic function.   
http://svn.apache.org/repos/asf/hive/branches/branch-0.9/ql/src/java/org/apache/hadoop/hive/ql/optimizer/ppr/PartitionPruner.java
I guess this check is implemented to avoid inconsistencies in result set that might arise in scenarios like below where predicate is :part_col = f(time) and regular_col = f(time).
The expression involving "part_col" is evaluated at compile time and the expression involving "regular_col" is evaluated at run time and the function yesterday() might return different values if the query is executed around midnight.
Thanks,Venkatesh 

Date: Wed, 8 Aug 2012 03:49:56 -0700
From: bejoy_ks@yahoo.com
Subject: Re: Custom UserDefinedFunction in Hive
To: user@hive.apache.org

Hi Raihan
UDFs are evaluated at run time when the query is executed. But it is hive parser during query parse time decides the boundary of data to be used for the query, ie data from which all partitions has to be processed. Because of this the entire table will be scanned for your query.
 Regards,Bejoy KS
        From: Raihan Jamal <ja...@gmail.com>
 To: user@hive.apache.org 
Cc: dev@hive.apache.org 
 Sent: Tuesday, August 7, 2012 10:50 PM
 Subject: Re: Custom UserDefinedFunction in Hive
   

Hi Jan,

 

I figured that out, it is working fine for me now. The only question I have is, if I am doing like this-

 

SELECT * FROM REALTIME where dt=
yesterdaydate('yyyyMMdd') LIMIT 10;

 

Then the above query will be evaluated as below right?

 

SELECT * FROM REALTIME where dt= ‘20120806’
LIMIT 10;

 

So that means it will look for data in the corresponding dt
partition (20120806) only right as above table is partitioned on dt column ?
And it will not scan the whole table right?


Raihan Jamal





On Mon, Aug 6, 2012 at 10:56 PM, Jan Dolinár <do...@gmail.com> wrote:


Hi Jamal,
Check if the function really returns what it should and that your data are really in yyyyMMdd format. You can do this by simple query like this:
SELECT dt, yesterdaydate('yyyyMMdd') FROM REALTIME LIMIT 1;




I don't see anything wrong with the function itself, it works well for me (although I tested it in hive 0.7.1). The only thing I would change about it would be to optimize it by calling 'new' only at the time of construction and reusing the object when the function is called, but that should not affect the functionality at all.




Best regards,

Jan





On Tue, Aug 7, 2012 at 3:39 AM, Raihan Jamal <ja...@gmail.com> wrote:




Problem





I created the below UserDefinedFunction to get the yesterday's
day in the format I wanted as I will be passing the format into this below
method from the query.

 

public final class YesterdayDate extends UDF {

 

               
public String evaluate(final String format) { 

                               
DateFormat dateFormat = new SimpleDateFormat(format); 

                               
Calendar cal = Calendar.getInstance();

                               
cal.add(Calendar.DATE, -1);     

                               
return dateFormat.format(cal.getTime()).toString(); 

               
} 

}

 

 

So whenever I try to run the query like below by adding the jar
to classpath and creating the temporary function yesterdaydate, I always get
zero result back-

 

hive> create temporary function yesterdaydate as
'com.example.hive.udf.YesterdayDate';

OK

Time taken: 0.512 seconds

 

Below is the query I am running-

 

hive> SELECT * FROM REALTIME where dt=
yesterdaydate('yyyyMMdd') LIMIT 10;

OK

 

And I always get zero result back but the data is there in that
table for Aug 5th.

 

What wrong I am doing? Any suggestions will be appreciated.

 

 

NOTE:- As I am working with Hive 0.6 so it doesn’t support
variable substitution thing, so I cannot use hiveconf here and the above table
has been partitioned on dt(date) column.







     		 	   		  

Re: Custom UserDefinedFunction in Hive

Posted by Bejoy Ks <be...@yahoo.com>.
Hi Raihan

UDFs are evaluated at run time when the query is executed. But it is hive parser during query parse time decides the boundary of data to be used for the query, ie data from which all partitions has to be processed.Because of this the entire table will be scanned for your query.


 
Regards,
Bejoy KS


________________________________
 From: Raihan Jamal <ja...@gmail.com>
To: user@hive.apache.org 
Cc: dev@hive.apache.org 
Sent: Tuesday, August 7, 2012 10:50 PM
Subject: Re: Custom UserDefinedFunction in Hive
 

Hi Jan,
 I figured that out, it is working fine for me now. The only question I have is, if I am doing like this- 
 
SELECT * FROM REALTIME where dt=
yesterdaydate('yyyyMMdd') LIMIT 10;
 
Then the above query will be evaluated as below right?
 
SELECT * FROM REALTIME where dt= ‘20120806’
LIMIT 10;
 
So that means it will look for data in the corresponding dt
partition (20120806) only right as above table is partitioned on dt column ?
And it will not scan the whole table right?



Raihan Jamal



On Mon, Aug 6, 2012 at 10:56 PM, Jan Dolinár <do...@gmail.com> wrote:

Hi Jamal,
>
>
>Check if the function really returns what it should and that your data are really in yyyyMMdd format. You can do this by simple query like this:
>
>
>SELECT dt, yesterdaydate('yyyyMMdd') FROM REALTIME LIMIT 1;
>
>
>I don't see anything wrong with the function itself, it works well for me (although I tested it in hive 0.7.1). The only thing I would change about it would be to optimize it by calling 'new' only at the time of construction and reusing the object when the function is called, but that should not affect the functionality at all.
>
>
>Best regards,
>Jan
>
>
>
>
>
>
>
>On Tue, Aug 7, 2012 at 3:39 AM, Raihan Jamal <ja...@gmail.com> wrote:
>
>Problem
>>I created the below UserDefinedFunction to get the yesterday's
day in the format I wanted as I will be passing the format into this below
method from the query.
>> 
>>public final class YesterdayDate extends UDF {
>> 
>>               
public String evaluate(final String format) { 
>>                               
DateFormat dateFormat = new SimpleDateFormat(format); 
>>                               
Calendar cal = Calendar.getInstance();
>>                               
cal.add(Calendar.DATE, -1);     
>>                               
return dateFormat.format(cal.getTime()).toString(); 
>>               
} 
>>}
>> 
>> 
>>So whenever I try to run the query like below by adding the jar
to classpath and creating the temporary function yesterdaydate, I always get
zero result back-
>> 
>>hive> create temporary function yesterdaydate as
'com.example.hive.udf.YesterdayDate';
>>OK
>>Time taken: 0.512 seconds
>> 
>>Below is the query I am running-
>> 
>>hive> SELECT * FROM REALTIME where dt=
yesterdaydate('yyyyMMdd') LIMIT 10;
>>OK
>> 
>>And I always get zero result back but the data is there in that
table for Aug 5th.
>> 
>>What wrong I am doing? Any suggestions will be appreciated.
>> 
>> 
>>NOTE:- As I am working with Hive 0.6 so it doesn’t support
variable substitution thing, so I cannot use hiveconf here and the above table
has been partitioned on dt(date) column.
>

Re: Custom UserDefinedFunction in Hive

Posted by Jan Dolinár <do...@gmail.com>.
I'm afraid that  he query

SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd') LIMIT 10;

will scan entire table, because the functions is evaluated at runtime, so
Hive doesn't know what the value is when it decides which files to scan. I
am not 100% sure though, you should try it.

Also, you might want to try to add annotation to your UDF saying that the
function is deterministic:
@*UDFType(deterministic=false*)

I think Hive might be clever enough to evaluate it early enough to use the
partition pruning correctly, since it operates on constant expression. But
again, I'm not really sure, maybe someone with deeper knowledge of Hive
optimizations will tell us more. It is actually quite interesting question.

Another way to help Hive with the optimizations might be to skip passing
the format string argument, if you have all dates in same format, you can
call the function just like 'yesterdaydate()' and hardcode the format in
the function.

Jan


On Tue, Aug 7, 2012 at 7:20 PM, Raihan Jamal <ja...@gmail.com> wrote:

> Hi Jan,
>
>
> I figured that out, it is working fine for me now. The only question I
> have is, if I am doing like this-
>
>
>
> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd') LIMIT 10;
>
>
>
> Then the above query will be evaluated as below right?
>
>
>
> SELECT * FROM REALTIME where dt= ‘20120806’ LIMIT 10;
>
>
>
> So that means it will look for data in the corresponding dt partition *(20120806)
> *only right as above table is partitioned on dt column ? And it will not
> scan the whole table right?**
>
>
>
> *Raihan Jamal*
>
>
>
> On Mon, Aug 6, 2012 at 10:56 PM, Jan Dolinár <do...@gmail.com> wrote:
>
>> Hi Jamal,
>>
>> Check if the function really returns what it should and that your data
>> are really in yyyyMMdd format. You can do this by simple query like this:
>>
>> SELECT dt, yesterdaydate('yyyyMMdd') FROM REALTIME LIMIT 1;
>>
>> I don't see anything wrong with the function itself, it works well for me
>> (although I tested it in hive 0.7.1). The only thing I would change about
>> it would be to optimize it by calling 'new' only at the time of
>> construction and reusing the object when the function is called, but that
>> should not affect the functionality at all.
>>
>> Best regards,
>> Jan
>>
>>
>>
>>
>> On Tue, Aug 7, 2012 at 3:39 AM, Raihan Jamal <ja...@gmail.com>wrote:
>>
>>> *Problem*
>>>
>>> I created the below UserDefinedFunction to get the yesterday's day in
>>> the format I wanted as I will be passing the format into this below method
>>> from the query.
>>>
>>>
>>>
>>> *public final class YesterdayDate extends UDF {*
>>>
>>> * *
>>>
>>> *                public String evaluate(final String format) { *
>>>
>>> *                                DateFormat dateFormat = new
>>> SimpleDateFormat(format); *
>>>
>>> *                                Calendar cal = Calendar.getInstance();*
>>>
>>> *                                cal.add(Calendar.DATE, -1);     *
>>>
>>> *                                return
>>> dateFormat.format(cal.getTime()).toString(); *
>>>
>>> *                } *
>>>
>>> *}*
>>>
>>>
>>>
>>>
>>>
>>> So whenever I try to run the query like below by adding the jar to
>>> classpath and creating the temporary function yesterdaydate, I always get
>>> zero result back-
>>>
>>>
>>>
>>> hive> create temporary function *yesterdaydate* as
>>> 'com.example.hive.udf.YesterdayDate';
>>>
>>> OK
>>>
>>> Time taken: 0.512 seconds
>>>
>>>
>>>
>>> Below is the query I am running-
>>>
>>>
>>>
>>> *hive> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd') LIMIT
>>> 10;*
>>>
>>> *OK*
>>>
>>> * *
>>>
>>> And I always get zero result back but the data is there in that table
>>> for Aug 5th.**
>>>
>>>
>>>
>>> What wrong I am doing? Any suggestions will be appreciated.
>>>
>>>
>>>
>>>
>>>
>>> NOTE:- As I am working with Hive 0.6 so it doesn’t support variable
>>> substitution thing, so I cannot use hiveconf here and the above table has
>>> been partitioned on dt(date) column.**
>>>
>>
>>
>

Re: Custom UserDefinedFunction in Hive

Posted by "kulkarni.swarnim@gmail.com" <ku...@gmail.com>.
Have you tried using EXPLAIN[1] on your query? I usually like to use that
to get a better understanding of what my query is actually doing and
debugging at other times.

[1] https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Explain

On Tue, Aug 7, 2012 at 12:20 PM, Raihan Jamal <ja...@gmail.com> wrote:

> Hi Jan,
>
>
> I figured that out, it is working fine for me now. The only question I
> have is, if I am doing like this-
>
>
>
> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd') LIMIT 10;
>
>
>
> Then the above query will be evaluated as below right?
>
>
>
> SELECT * FROM REALTIME where dt= ‘20120806’ LIMIT 10;
>
>
>
> So that means it will look for data in the corresponding dt partition *(20120806)
> *only right as above table is partitioned on dt column ? And it will not
> scan the whole table right?**
>
>
>
> *Raihan Jamal*
>
>
>
> On Mon, Aug 6, 2012 at 10:56 PM, Jan Dolinár <do...@gmail.com> wrote:
>
>> Hi Jamal,
>>
>> Check if the function really returns what it should and that your data
>> are really in yyyyMMdd format. You can do this by simple query like this:
>>
>> SELECT dt, yesterdaydate('yyyyMMdd') FROM REALTIME LIMIT 1;
>>
>> I don't see anything wrong with the function itself, it works well for me
>> (although I tested it in hive 0.7.1). The only thing I would change about
>> it would be to optimize it by calling 'new' only at the time of
>> construction and reusing the object when the function is called, but that
>> should not affect the functionality at all.
>>
>> Best regards,
>> Jan
>>
>>
>>
>>
>> On Tue, Aug 7, 2012 at 3:39 AM, Raihan Jamal <ja...@gmail.com>wrote:
>>
>>> *Problem*
>>>
>>> I created the below UserDefinedFunction to get the yesterday's day in
>>> the format I wanted as I will be passing the format into this below method
>>> from the query.
>>>
>>>
>>>
>>> *public final class YesterdayDate extends UDF {*
>>>
>>> * *
>>>
>>> *                public String evaluate(final String format) { *
>>>
>>> *                                DateFormat dateFormat = new
>>> SimpleDateFormat(format); *
>>>
>>> *                                Calendar cal = Calendar.getInstance();*
>>>
>>> *                                cal.add(Calendar.DATE, -1);     *
>>>
>>> *                                return
>>> dateFormat.format(cal.getTime()).toString(); *
>>>
>>> *                } *
>>>
>>> *}*
>>>
>>>
>>>
>>>
>>>
>>> So whenever I try to run the query like below by adding the jar to
>>> classpath and creating the temporary function yesterdaydate, I always get
>>> zero result back-
>>>
>>>
>>>
>>> hive> create temporary function *yesterdaydate* as
>>> 'com.example.hive.udf.YesterdayDate';
>>>
>>> OK
>>>
>>> Time taken: 0.512 seconds
>>>
>>>
>>>
>>> Below is the query I am running-
>>>
>>>
>>>
>>> *hive> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd') LIMIT
>>> 10;*
>>>
>>> *OK*
>>>
>>> * *
>>>
>>> And I always get zero result back but the data is there in that table
>>> for Aug 5th.**
>>>
>>>
>>>
>>> What wrong I am doing? Any suggestions will be appreciated.
>>>
>>>
>>>
>>>
>>>
>>> NOTE:- As I am working with Hive 0.6 so it doesn’t support variable
>>> substitution thing, so I cannot use hiveconf here and the above table has
>>> been partitioned on dt(date) column.**
>>>
>>
>>
>


-- 
Swarnim

Re: Custom UserDefinedFunction in Hive

Posted by Raihan Jamal <ja...@gmail.com>.
Hi Jan,


I figured that out, it is working fine for me now. The only question I have
is, if I am doing like this-



SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd') LIMIT 10;



Then the above query will be evaluated as below right?



SELECT * FROM REALTIME where dt= ‘20120806’ LIMIT 10;



So that means it will look for data in the corresponding dt partition
*(20120806)
*only right as above table is partitioned on dt column ? And it will not
scan the whole table right?**



*Raihan Jamal*



On Mon, Aug 6, 2012 at 10:56 PM, Jan Dolinár <do...@gmail.com> wrote:

> Hi Jamal,
>
> Check if the function really returns what it should and that your data are
> really in yyyyMMdd format. You can do this by simple query like this:
>
> SELECT dt, yesterdaydate('yyyyMMdd') FROM REALTIME LIMIT 1;
>
> I don't see anything wrong with the function itself, it works well for me
> (although I tested it in hive 0.7.1). The only thing I would change about
> it would be to optimize it by calling 'new' only at the time of
> construction and reusing the object when the function is called, but that
> should not affect the functionality at all.
>
> Best regards,
> Jan
>
>
>
>
> On Tue, Aug 7, 2012 at 3:39 AM, Raihan Jamal <ja...@gmail.com>wrote:
>
>> *Problem*
>>
>> I created the below UserDefinedFunction to get the yesterday's day in the
>> format I wanted as I will be passing the format into this below method from
>> the query.
>>
>>
>>
>> *public final class YesterdayDate extends UDF {*
>>
>> * *
>>
>> *                public String evaluate(final String format) { *
>>
>> *                                DateFormat dateFormat = new
>> SimpleDateFormat(format); *
>>
>> *                                Calendar cal = Calendar.getInstance();*
>>
>> *                                cal.add(Calendar.DATE, -1);     *
>>
>> *                                return
>> dateFormat.format(cal.getTime()).toString(); *
>>
>> *                } *
>>
>> *}*
>>
>>
>>
>>
>>
>> So whenever I try to run the query like below by adding the jar to
>> classpath and creating the temporary function yesterdaydate, I always get
>> zero result back-
>>
>>
>>
>> hive> create temporary function *yesterdaydate* as
>> 'com.example.hive.udf.YesterdayDate';
>>
>> OK
>>
>> Time taken: 0.512 seconds
>>
>>
>>
>> Below is the query I am running-
>>
>>
>>
>> *hive> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd') LIMIT
>> 10;*
>>
>> *OK*
>>
>> * *
>>
>> And I always get zero result back but the data is there in that table for
>> Aug 5th.**
>>
>>
>>
>> What wrong I am doing? Any suggestions will be appreciated.
>>
>>
>>
>>
>>
>> NOTE:- As I am working with Hive 0.6 so it doesn’t support variable
>> substitution thing, so I cannot use hiveconf here and the above table has
>> been partitioned on dt(date) column.**
>>
>
>

Re: Custom UserDefinedFunction in Hive

Posted by Raihan Jamal <ja...@gmail.com>.
I tested that function using main and by printing it out and it works fine.
As I am trying to get the Yesterday's date.

I need my query to be like this as today's date is Aug 6th, so query should
be for Aug 5th. And this works fine for me.

*SELECT * FROM REALTIME where dt= '20120805' LIMIT 10;*

So Instead of doing the above way, I wanted to do it like below- And the
below query should give the same result as above query. And when I tried
doing this way, I get zero result back.

*SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd') LIMIT 10;*

So something is wrong the way I am doing it for sure?



*Raihan Jamal*



On Mon, Aug 6, 2012 at 10:56 PM, Jan Dolinár <do...@gmail.com> wrote:

> Hi Jamal,
>
> Check if the function really returns what it should and that your data are
> really in yyyyMMdd format. You can do this by simple query like this:
>
> SELECT dt, yesterdaydate('yyyyMMdd') FROM REALTIME LIMIT 1;
>
> I don't see anything wrong with the function itself, it works well for me
> (although I tested it in hive 0.7.1). The only thing I would change about
> it would be to optimize it by calling 'new' only at the time of
> construction and reusing the object when the function is called, but that
> should not affect the functionality at all.
>
> Best regards,
> Jan
>
>
>
>
> On Tue, Aug 7, 2012 at 3:39 AM, Raihan Jamal <ja...@gmail.com>wrote:
>
>> *Problem*
>>
>> I created the below UserDefinedFunction to get the yesterday's day in the
>> format I wanted as I will be passing the format into this below method from
>> the query.
>>
>>
>>
>> *public final class YesterdayDate extends UDF {*
>>
>> * *
>>
>> *                public String evaluate(final String format) { *
>>
>> *                                DateFormat dateFormat = new
>> SimpleDateFormat(format); *
>>
>> *                                Calendar cal = Calendar.getInstance();*
>>
>> *                                cal.add(Calendar.DATE, -1);     *
>>
>> *                                return
>> dateFormat.format(cal.getTime()).toString(); *
>>
>> *                } *
>>
>> *}*
>>
>>
>>
>>
>>
>> So whenever I try to run the query like below by adding the jar to
>> classpath and creating the temporary function yesterdaydate, I always get
>> zero result back-
>>
>>
>>
>> hive> create temporary function *yesterdaydate* as
>> 'com.example.hive.udf.YesterdayDate';
>>
>> OK
>>
>> Time taken: 0.512 seconds
>>
>>
>>
>> Below is the query I am running-
>>
>>
>>
>> *hive> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd') LIMIT
>> 10;*
>>
>> *OK*
>>
>> * *
>>
>> And I always get zero result back but the data is there in that table for
>> Aug 5th.**
>>
>>
>>
>> What wrong I am doing? Any suggestions will be appreciated.
>>
>>
>>
>>
>>
>> NOTE:- As I am working with Hive 0.6 so it doesn’t support variable
>> substitution thing, so I cannot use hiveconf here and the above table has
>> been partitioned on dt(date) column.**
>>
>
>

Re: Custom UserDefinedFunction in Hive

Posted by Raihan Jamal <ja...@gmail.com>.
Hi Jan,


I figured that out, it is working fine for me now. The only question I have
is, if I am doing like this-



SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd') LIMIT 10;



Then the above query will be evaluated as below right?



SELECT * FROM REALTIME where dt= ‘20120806’ LIMIT 10;



So that means it will look for data in the corresponding dt partition
*(20120806)
*only right as above table is partitioned on dt column ? And it will not
scan the whole table right?**



*Raihan Jamal*



On Mon, Aug 6, 2012 at 10:56 PM, Jan Dolinár <do...@gmail.com> wrote:

> Hi Jamal,
>
> Check if the function really returns what it should and that your data are
> really in yyyyMMdd format. You can do this by simple query like this:
>
> SELECT dt, yesterdaydate('yyyyMMdd') FROM REALTIME LIMIT 1;
>
> I don't see anything wrong with the function itself, it works well for me
> (although I tested it in hive 0.7.1). The only thing I would change about
> it would be to optimize it by calling 'new' only at the time of
> construction and reusing the object when the function is called, but that
> should not affect the functionality at all.
>
> Best regards,
> Jan
>
>
>
>
> On Tue, Aug 7, 2012 at 3:39 AM, Raihan Jamal <ja...@gmail.com>wrote:
>
>> *Problem*
>>
>> I created the below UserDefinedFunction to get the yesterday's day in the
>> format I wanted as I will be passing the format into this below method from
>> the query.
>>
>>
>>
>> *public final class YesterdayDate extends UDF {*
>>
>> * *
>>
>> *                public String evaluate(final String format) { *
>>
>> *                                DateFormat dateFormat = new
>> SimpleDateFormat(format); *
>>
>> *                                Calendar cal = Calendar.getInstance();*
>>
>> *                                cal.add(Calendar.DATE, -1);     *
>>
>> *                                return
>> dateFormat.format(cal.getTime()).toString(); *
>>
>> *                } *
>>
>> *}*
>>
>>
>>
>>
>>
>> So whenever I try to run the query like below by adding the jar to
>> classpath and creating the temporary function yesterdaydate, I always get
>> zero result back-
>>
>>
>>
>> hive> create temporary function *yesterdaydate* as
>> 'com.example.hive.udf.YesterdayDate';
>>
>> OK
>>
>> Time taken: 0.512 seconds
>>
>>
>>
>> Below is the query I am running-
>>
>>
>>
>> *hive> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd') LIMIT
>> 10;*
>>
>> *OK*
>>
>> * *
>>
>> And I always get zero result back but the data is there in that table for
>> Aug 5th.**
>>
>>
>>
>> What wrong I am doing? Any suggestions will be appreciated.
>>
>>
>>
>>
>>
>> NOTE:- As I am working with Hive 0.6 so it doesn’t support variable
>> substitution thing, so I cannot use hiveconf here and the above table has
>> been partitioned on dt(date) column.**
>>
>
>

Re: Custom UserDefinedFunction in Hive

Posted by Raihan Jamal <ja...@gmail.com>.
I tested that function using main and by printing it out and it works fine.
As I am trying to get the Yesterday's date.

I need my query to be like this as today's date is Aug 6th, so query should
be for Aug 5th. And this works fine for me.

*SELECT * FROM REALTIME where dt= '20120805' LIMIT 10;*

So Instead of doing the above way, I wanted to do it like below- And the
below query should give the same result as above query. And when I tried
doing this way, I get zero result back.

*SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd') LIMIT 10;*

So something is wrong the way I am doing it for sure?



*Raihan Jamal*



On Mon, Aug 6, 2012 at 10:56 PM, Jan Dolinár <do...@gmail.com> wrote:

> Hi Jamal,
>
> Check if the function really returns what it should and that your data are
> really in yyyyMMdd format. You can do this by simple query like this:
>
> SELECT dt, yesterdaydate('yyyyMMdd') FROM REALTIME LIMIT 1;
>
> I don't see anything wrong with the function itself, it works well for me
> (although I tested it in hive 0.7.1). The only thing I would change about
> it would be to optimize it by calling 'new' only at the time of
> construction and reusing the object when the function is called, but that
> should not affect the functionality at all.
>
> Best regards,
> Jan
>
>
>
>
> On Tue, Aug 7, 2012 at 3:39 AM, Raihan Jamal <ja...@gmail.com>wrote:
>
>> *Problem*
>>
>> I created the below UserDefinedFunction to get the yesterday's day in the
>> format I wanted as I will be passing the format into this below method from
>> the query.
>>
>>
>>
>> *public final class YesterdayDate extends UDF {*
>>
>> * *
>>
>> *                public String evaluate(final String format) { *
>>
>> *                                DateFormat dateFormat = new
>> SimpleDateFormat(format); *
>>
>> *                                Calendar cal = Calendar.getInstance();*
>>
>> *                                cal.add(Calendar.DATE, -1);     *
>>
>> *                                return
>> dateFormat.format(cal.getTime()).toString(); *
>>
>> *                } *
>>
>> *}*
>>
>>
>>
>>
>>
>> So whenever I try to run the query like below by adding the jar to
>> classpath and creating the temporary function yesterdaydate, I always get
>> zero result back-
>>
>>
>>
>> hive> create temporary function *yesterdaydate* as
>> 'com.example.hive.udf.YesterdayDate';
>>
>> OK
>>
>> Time taken: 0.512 seconds
>>
>>
>>
>> Below is the query I am running-
>>
>>
>>
>> *hive> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd') LIMIT
>> 10;*
>>
>> *OK*
>>
>> * *
>>
>> And I always get zero result back but the data is there in that table for
>> Aug 5th.**
>>
>>
>>
>> What wrong I am doing? Any suggestions will be appreciated.
>>
>>
>>
>>
>>
>> NOTE:- As I am working with Hive 0.6 so it doesn’t support variable
>> substitution thing, so I cannot use hiveconf here and the above table has
>> been partitioned on dt(date) column.**
>>
>
>

Re: Custom UserDefinedFunction in Hive

Posted by Jan Dolinár <do...@gmail.com>.
Hi Jamal,

Check if the function really returns what it should and that your data are
really in yyyyMMdd format. You can do this by simple query like this:

SELECT dt, yesterdaydate('yyyyMMdd') FROM REALTIME LIMIT 1;

I don't see anything wrong with the function itself, it works well for me
(although I tested it in hive 0.7.1). The only thing I would change about
it would be to optimize it by calling 'new' only at the time of
construction and reusing the object when the function is called, but that
should not affect the functionality at all.

Best regards,
Jan




On Tue, Aug 7, 2012 at 3:39 AM, Raihan Jamal <ja...@gmail.com> wrote:

> *Problem*
>
> I created the below UserDefinedFunction to get the yesterday's day in the
> format I wanted as I will be passing the format into this below method from
> the query.
>
>
>
> *public final class YesterdayDate extends UDF {*
>
> * *
>
> *                public String evaluate(final String format) { *
>
> *                                DateFormat dateFormat = new
> SimpleDateFormat(format); *
>
> *                                Calendar cal = Calendar.getInstance();*
>
> *                                cal.add(Calendar.DATE, -1);     *
>
> *                                return
> dateFormat.format(cal.getTime()).toString(); *
>
> *                } *
>
> *}*
>
>
>
>
>
> So whenever I try to run the query like below by adding the jar to
> classpath and creating the temporary function yesterdaydate, I always get
> zero result back-
>
>
>
> hive> create temporary function *yesterdaydate* as
> 'com.example.hive.udf.YesterdayDate';
>
> OK
>
> Time taken: 0.512 seconds
>
>
>
> Below is the query I am running-
>
>
>
> *hive> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd') LIMIT
> 10;*
>
> *OK*
>
> * *
>
> And I always get zero result back but the data is there in that table for
> Aug 5th.**
>
>
>
> What wrong I am doing? Any suggestions will be appreciated.
>
>
>
>
>
> NOTE:- As I am working with Hive 0.6 so it doesn’t support variable
> substitution thing, so I cannot use hiveconf here and the above table has
> been partitioned on dt(date) column.**
>

Re: Custom UserDefinedFunction in Hive

Posted by Jan Dolinár <do...@gmail.com>.
Hi Jamal,

Check if the function really returns what it should and that your data are
really in yyyyMMdd format. You can do this by simple query like this:

SELECT dt, yesterdaydate('yyyyMMdd') FROM REALTIME LIMIT 1;

I don't see anything wrong with the function itself, it works well for me
(although I tested it in hive 0.7.1). The only thing I would change about
it would be to optimize it by calling 'new' only at the time of
construction and reusing the object when the function is called, but that
should not affect the functionality at all.

Best regards,
Jan




On Tue, Aug 7, 2012 at 3:39 AM, Raihan Jamal <ja...@gmail.com> wrote:

> *Problem*
>
> I created the below UserDefinedFunction to get the yesterday's day in the
> format I wanted as I will be passing the format into this below method from
> the query.
>
>
>
> *public final class YesterdayDate extends UDF {*
>
> * *
>
> *                public String evaluate(final String format) { *
>
> *                                DateFormat dateFormat = new
> SimpleDateFormat(format); *
>
> *                                Calendar cal = Calendar.getInstance();*
>
> *                                cal.add(Calendar.DATE, -1);     *
>
> *                                return
> dateFormat.format(cal.getTime()).toString(); *
>
> *                } *
>
> *}*
>
>
>
>
>
> So whenever I try to run the query like below by adding the jar to
> classpath and creating the temporary function yesterdaydate, I always get
> zero result back-
>
>
>
> hive> create temporary function *yesterdaydate* as
> 'com.example.hive.udf.YesterdayDate';
>
> OK
>
> Time taken: 0.512 seconds
>
>
>
> Below is the query I am running-
>
>
>
> *hive> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd') LIMIT
> 10;*
>
> *OK*
>
> * *
>
> And I always get zero result back but the data is there in that table for
> Aug 5th.**
>
>
>
> What wrong I am doing? Any suggestions will be appreciated.
>
>
>
>
>
> NOTE:- As I am working with Hive 0.6 so it doesn’t support variable
> substitution thing, so I cannot use hiveconf here and the above table has
> been partitioned on dt(date) column.**
>