You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@hive.apache.org by Ramkumar <ra...@yahoo.com> on 2012/04/19 11:57:03 UTC

nested UDFs on Partition column

Hi,

I have a table partitioned by local_date.  When I write a query with

WHERE local_date = =date_add('2011-12-07',3) ,

hive executes the UDF ahead and looks only into the specific partitions.   But when the udf becomes more complex like

WHERE local_date = date_sub(to_date(from_unixtime(unix_timestamp())),3),

hive looks through all the partitions even though the above function  can very well be computed ahead of time and optimize the query.  Is this behaviour intentional ? And is there a workaround other than hardcoding the date or using a param?

Thanks,
Ramkumar

Re: nested UDFs on Partition column

Posted by Edward Capriolo <ed...@gmail.com>.

@Deterministic was added after many of the UDFs some that are
deterministic may not be marked as such.

On Tue, May 1, 2012 at 9:38 AM, Ramkumar <ra...@yahoo.com> wrote:
> Thanks. Yes, thats what the UDF writers seemed to have intended. The way
> this has been done is by using the 'deterministic' attribute in  UDFType
> annotation.
> See
> http://javasourcecode.org/html/open-source/hive/hive-0.7.1/org/apache/hadoop/hive/ql/udf/UDFUnixTimeStamp.java.html
>
> ________________________________
> From: Philip Tromans <ph...@gmail.com>
> To: user@hive.apache.org
> Sent: Thursday, April 19, 2012 4:38 PM
> Subject: Re: nested UDFs on Partition column
>
> In fact, it's just not a reasonable thing to do partition pruning on.
> Imagine a situation where you had:
>
> WHERE partition_column = f(unix_timestamp()) AND ordinary_column =
> f(unix_timestamp).
>
> The right hand side of the predicate has to be evaluated at map-time,
> whereas you're assuming that left hand side should be evaluated at
> compile time, which means you have two different values of
> unix_timestamp() floating around, which can only end badly.
>
> Cheers,
>
> Phil.
>
> On 19 April 2012 16:35, Philip Tromans <ph...@gmail.com> wrote:
>> I don't know what the state of Hive's partition pruning is, but I
>> would imagine that the problem is that the two example you're giving
>> are fundamentally different.
>>
>> 1) WHERE local_date = =date_add('2011-12-07',3) ,
>>
>> the udf is a function of some constants, so the constant gets
>> evaluated at compile time.
>>
>> 2) WHERE local_date =
>> date_sub(to_date(from_unixtime(unix_timestamp())),3),
>>
>> unix_timestamp() is not explicitly a constant, and so the expression
>> won't be simplified.
>>
>> I would imagine that the constant simplification code probably doesn't
>> know the difference between a partition column and a real column, and
>> so treats everything as a real column. If local_date wasn't a
>> partition column, then there's no reasonable way of simplifying that
>> predicate at compile time.
>>
>> Cheers,
>>
>> Phil.
>>
>> On 19 April 2012 11:50, Nitin Pawar <ni...@gmail.com> wrote:
>>> as per my understanding,
>>>
>>> In this case hive needs to look for all the partitions because it does
>>> not
>>> have the value before hand on the partition check and note the udfs are
>>> executed on the mapred and not on hive client side.
>>>
>>> I would suggest you write a hive query in a file and replace the
>>> partition
>>> value with a variable
>>> something like
>>>
>>> for partitionValue in values
>>>
>>>           hive $HIVEPARAMS -hiveconf  partition=$partition -e
>>> hivequery.hql
>>>
>>> and then in hivequery.sql you can refer the variable with
>>>
>>> where column_name = '${hiveconf:partition}'
>>>
>>> I may be wrong in interpreting the execution pattern of hivequery but
>>> this
>>> approach solved my problem
>>> Thanks,
>>> nitin
>>>
>>>
>>> On Thu, Apr 19, 2012 at 3:27 PM, Ramkumar <ra...@yahoo.com>
>>> wrote:
>>>>
>>>> Hi,
>>>>
>>>> I have a table partitioned by local_date.  When I write a query with
>>>>
>>>> WHERE local_date = =date_add('2011-12-07',3) ,
>>>>
>>>> hive executes the UDF ahead and looks only into the specific partitions.
>>>> But when the udf becomes more complex like
>>>>
>>>> WHERE local_date = date_sub(to_date(from_unixtime(unix_timestamp())),3),
>>>>
>>>> hive looks through all the partitions even though the above function
>>>> can
>>>> very well be computed ahead of time and optimize the query.  Is this
>>>> behaviour intentional ? And is there a workaround other than hardcoding
>>>> the
>>>> date or using a param?
>>>>
>>>> Thanks,
>>>> Ramkumar
>>>
>>>
>>>
>>>
>>> --
>>> Nitin Pawar
>>>
>
>

Re: nested UDFs on Partition column

Posted by Ramkumar <ra...@yahoo.com>.

Thanks. Yes, thats what the UDF writers seemed to have intended. The way this has been done is by using the 'deterministic' attribute in  UDFType annotation.
See 
http://javasourcecode.org/html/open-source/hive/hive-0.7.1/org/apache/hadoop/hive/ql/udf/UDFUnixTimeStamp.java.html



________________________________
 From: Philip Tromans <ph...@gmail.com>
To: user@hive.apache.org 
Sent: Thursday, April 19, 2012 4:38 PM
Subject: Re: nested UDFs on Partition column
 
In fact, it's just not a reasonable thing to do partition pruning on.
Imagine a situation where you had:

WHERE partition_column = f(unix_timestamp()) AND ordinary_column =
f(unix_timestamp).

The right hand side of the predicate has to be evaluated at map-time,
whereas you're assuming that left hand side should be evaluated at
compile time, which means you have two different values of
unix_timestamp() floating around, which can only end badly.

Cheers,

Phil.

On 19 April 2012 16:35, Philip Tromans <ph...@gmail.com> wrote:
> I don't know what the state of Hive's partition pruning is, but I
> would imagine that the problem is that the two example you're giving
> are fundamentally different.
>
> 1) WHERE local_date = =date_add('2011-12-07',3) ,
>
> the udf is a
 function of some constants, so the constant gets
> evaluated at compile time.
>
> 2) WHERE local_date = date_sub(to_date(from_unixtime(unix_timestamp())),3),
>
> unix_timestamp() is not explicitly a constant, and so the expression
> won't be simplified.
>
> I would imagine that the constant simplification code probably doesn't
> know the difference between a partition column and a real column, and
> so treats everything as a real column. If local_date wasn't a
> partition column, then there's no reasonable way of simplifying that
> predicate at compile time.
>
> Cheers,
>
> Phil.
>
> On 19 April 2012 11:50, Nitin Pawar <ni...@gmail.com> wrote:
>> as per my understanding,
>>
>> In this case hive needs to look for all the
 partitions because it does not
>> have the value before hand on the partition check and note the udfs are
>> executed on the mapred and not on hive client side.
>>
>> I would suggest you write a hive query in a file and replace the partition
>> value with a variable
>> something like
>>
>> for partitionValue in values
>>
>>           hive $HIVEPARAMS -hiveconf  partition=$partition -e hivequery.hql
>>
>> and then in hivequery.sql you can refer the variable with
>>
>> where column_name = '${hiveconf:partition}'
>>
>> I may be wrong in interpreting the execution pattern of hivequery but this
>> approach solved my problem
>> Thanks,
>> nitin
>>
>>
>> On Thu, Apr 19, 2012 at 3:27 PM, Ramkumar <ra...@yahoo.com>
>> wrote:
>>>
>>> Hi,
>>>
>>> I have a table partitioned by local_date.  When I write a query with
>>>
>>> WHERE local_date = =date_add('2011-12-07',3) ,
>>>
>>> hive executes the UDF ahead and looks only into the specific partitions.
>>> But when the udf becomes more complex like
>>>
>>> WHERE local_date = date_sub(to_date(from_unixtime(unix_timestamp())),3),
>>>
>>> hive looks through all the partitions even though the above function  can
>>> very well be computed ahead of time and optimize the query.  Is this
>>> behaviour intentional ? And is there a workaround other than hardcoding the
>>> date or using a
 param?
>>>
>>> Thanks,
>>> Ramkumar
>>
>>
>>
>>
>> --
>> Nitin Pawar
>>

Re: nested UDFs on Partition column

Posted by Philip Tromans <ph...@gmail.com>.

In fact, it's just not a reasonable thing to do partition pruning on.
Imagine a situation where you had:

WHERE partition_column = f(unix_timestamp()) AND ordinary_column =
f(unix_timestamp).

The right hand side of the predicate has to be evaluated at map-time,
whereas you're assuming that left hand side should be evaluated at
compile time, which means you have two different values of
unix_timestamp() floating around, which can only end badly.

Cheers,

Phil.

On 19 April 2012 16:35, Philip Tromans <ph...@gmail.com> wrote:
> I don't know what the state of Hive's partition pruning is, but I
> would imagine that the problem is that the two example you're giving
> are fundamentally different.
>
> 1) WHERE local_date = =date_add('2011-12-07',3) ,
>
> the udf is a function of some constants, so the constant gets
> evaluated at compile time.
>
> 2) WHERE local_date = date_sub(to_date(from_unixtime(unix_timestamp())),3),
>
> unix_timestamp() is not explicitly a constant, and so the expression
> won't be simplified.
>
> I would imagine that the constant simplification code probably doesn't
> know the difference between a partition column and a real column, and
> so treats everything as a real column. If local_date wasn't a
> partition column, then there's no reasonable way of simplifying that
> predicate at compile time.
>
> Cheers,
>
> Phil.
>
> On 19 April 2012 11:50, Nitin Pawar <ni...@gmail.com> wrote:
>> as per my understanding,
>>
>> In this case hive needs to look for all the partitions because it does not
>> have the value before hand on the partition check and note the udfs are
>> executed on the mapred and not on hive client side.
>>
>> I would suggest you write a hive query in a file and replace the partition
>> value with a variable
>> something like
>>
>> for partitionValue in values
>>
>>           hive $HIVEPARAMS -hiveconf  partition=$partition -e hivequery.hql
>>
>> and then in hivequery.sql you can refer the variable with
>>
>> where column_name = '${hiveconf:partition}'
>>
>> I may be wrong in interpreting the execution pattern of hivequery but this
>> approach solved my problem
>> Thanks,
>> nitin
>>
>>
>> On Thu, Apr 19, 2012 at 3:27 PM, Ramkumar <ra...@yahoo.com>
>> wrote:
>>>
>>> Hi,
>>>
>>> I have a table partitioned by local_date.  When I write a query with
>>>
>>> WHERE local_date = =date_add('2011-12-07',3) ,
>>>
>>> hive executes the UDF ahead and looks only into the specific partitions.
>>> But when the udf becomes more complex like
>>>
>>> WHERE local_date = date_sub(to_date(from_unixtime(unix_timestamp())),3),
>>>
>>> hive looks through all the partitions even though the above function  can
>>> very well be computed ahead of time and optimize the query.  Is this
>>> behaviour intentional ? And is there a workaround other than hardcoding the
>>> date or using a param?
>>>
>>> Thanks,
>>> Ramkumar
>>
>>
>>
>>
>> --
>> Nitin Pawar
>>

Re: nested UDFs on Partition column

Posted by Philip Tromans <ph...@gmail.com>.

I don't know what the state of Hive's partition pruning is, but I
would imagine that the problem is that the two example you're giving
are fundamentally different.

1) WHERE local_date = =date_add('2011-12-07',3) ,

the udf is a function of some constants, so the constant gets
evaluated at compile time.

2) WHERE local_date = date_sub(to_date(from_unixtime(unix_timestamp())),3),

unix_timestamp() is not explicitly a constant, and so the expression
won't be simplified.

I would imagine that the constant simplification code probably doesn't
know the difference between a partition column and a real column, and
so treats everything as a real column. If local_date wasn't a
partition column, then there's no reasonable way of simplifying that
predicate at compile time.

Cheers,

Phil.

On 19 April 2012 11:50, Nitin Pawar <ni...@gmail.com> wrote:
> as per my understanding,
>
> In this case hive needs to look for all the partitions because it does not
> have the value before hand on the partition check and note the udfs are
> executed on the mapred and not on hive client side.
>
> I would suggest you write a hive query in a file and replace the partition
> value with a variable
> something like
>
> for partitionValue in values
>
>           hive $HIVEPARAMS -hiveconf  partition=$partition -e hivequery.hql
>
> and then in hivequery.sql you can refer the variable with
>
> where column_name = '${hiveconf:partition}'
>
> I may be wrong in interpreting the execution pattern of hivequery but this
> approach solved my problem
> Thanks,
> nitin
>
>
> On Thu, Apr 19, 2012 at 3:27 PM, Ramkumar <ra...@yahoo.com>
> wrote:
>>
>> Hi,
>>
>> I have a table partitioned by local_date.  When I write a query with
>>
>> WHERE local_date = =date_add('2011-12-07',3) ,
>>
>> hive executes the UDF ahead and looks only into the specific partitions.
>> But when the udf becomes more complex like
>>
>> WHERE local_date = date_sub(to_date(from_unixtime(unix_timestamp())),3),
>>
>> hive looks through all the partitions even though the above function  can
>> very well be computed ahead of time and optimize the query.  Is this
>> behaviour intentional ? And is there a workaround other than hardcoding the
>> date or using a param?
>>
>> Thanks,
>> Ramkumar
>
>
>
>
> --
> Nitin Pawar
>

Re: nested UDFs on Partition column

Posted by Nitin Pawar <ni...@gmail.com>.

as per my understanding,

In this case hive needs to look for all the partitions because it does not
have the value before hand on the partition check and note the udfs are
executed on the mapred and not on hive client side.

I would suggest you write a hive query in a file and replace the partition
value with a variable
something like

for partitionValue in values

          hive $HIVEPARAMS -hiveconf  partition=$partition -e hivequery.hql

and then in hivequery.sql you can refer the variable with

where column_name = '${hiveconf:partition}'

I may be wrong in interpreting the execution pattern of hivequery but this
approach solved my problem
Thanks,
nitin

On Thu, Apr 19, 2012 at 3:27 PM, Ramkumar <ra...@yahoo.com>wrote:

> Hi,
>
> I have a table partitioned by local_date.  When I write a query with
>
> WHERE local_date = =date_add('2011-12-07',3) ,
>
> hive executes the UDF ahead and looks only into the specific partitions.
> But when the udf becomes more complex like
>
> WHERE local_date = date_sub(to_date(from_unixtime(unix_timestamp())),3),
>
> hive looks through all the partitions even though the above function  can
> very well be computed ahead of time and optimize the query.  Is this
> behaviour intentional ? And is there a workaround other than hardcoding the
> date or using a param?
>
> Thanks,
> Ramkumar
>



-- 
Nitin Pawar