You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hive.apache.org by Ramkumar <ra...@yahoo.com> on 2012/04/19 11:57:03 UTC
nested UDFs on Partition column
Hi,
I have a table partitioned by local_date. When I write a query with
WHERE local_date = =date_add('2011-12-07',3) ,
hive executes the UDF ahead and looks only into the specific partitions. But when the udf becomes more complex like
WHERE local_date = date_sub(to_date(from_unixtime(unix_timestamp())),3),
hive looks through all the partitions even though the above function can very well be computed ahead of time and optimize the query. Is this behaviour intentional ? And is there a workaround other than hardcoding the date or using a param?
Thanks,
Ramkumar
Re: nested UDFs on Partition column
Posted by Edward Capriolo <ed...@gmail.com>.
@Deterministic was added after many of the UDFs some that are
deterministic may not be marked as such.
On Tue, May 1, 2012 at 9:38 AM, Ramkumar <ra...@yahoo.com> wrote:
> Thanks. Yes, thats what the UDF writers seemed to have intended. The way
> this has been done is by using the 'deterministic' attribute in UDFType
> annotation.
> See
> http://javasourcecode.org/html/open-source/hive/hive-0.7.1/org/apache/hadoop/hive/ql/udf/UDFUnixTimeStamp.java.html
>
> ________________________________
> From: Philip Tromans <ph...@gmail.com>
> To: user@hive.apache.org
> Sent: Thursday, April 19, 2012 4:38 PM
> Subject: Re: nested UDFs on Partition column
>
> In fact, it's just not a reasonable thing to do partition pruning on.
> Imagine a situation where you had:
>
> WHERE partition_column = f(unix_timestamp()) AND ordinary_column =
> f(unix_timestamp).
>
> The right hand side of the predicate has to be evaluated at map-time,
> whereas you're assuming that left hand side should be evaluated at
> compile time, which means you have two different values of
> unix_timestamp() floating around, which can only end badly.
>
> Cheers,
>
> Phil.
>
> On 19 April 2012 16:35, Philip Tromans <ph...@gmail.com> wrote:
>> I don't know what the state of Hive's partition pruning is, but I
>> would imagine that the problem is that the two example you're giving
>> are fundamentally different.
>>
>> 1) WHERE local_date = =date_add('2011-12-07',3) ,
>>
>> the udf is a function of some constants, so the constant gets
>> evaluated at compile time.
>>
>> 2) WHERE local_date =
>> date_sub(to_date(from_unixtime(unix_timestamp())),3),
>>
>> unix_timestamp() is not explicitly a constant, and so the expression
>> won't be simplified.
>>
>> I would imagine that the constant simplification code probably doesn't
>> know the difference between a partition column and a real column, and
>> so treats everything as a real column. If local_date wasn't a
>> partition column, then there's no reasonable way of simplifying that
>> predicate at compile time.
>>
>> Cheers,
>>
>> Phil.
>>
>> On 19 April 2012 11:50, Nitin Pawar <ni...@gmail.com> wrote:
>>> as per my understanding,
>>>
>>> In this case hive needs to look for all the partitions because it does
>>> not
>>> have the value before hand on the partition check and note the udfs are
>>> executed on the mapred and not on hive client side.
>>>
>>> I would suggest you write a hive query in a file and replace the
>>> partition
>>> value with a variable
>>> something like
>>>
>>> for partitionValue in values
>>>
>>> hive $HIVEPARAMS -hiveconf partition=$partition -e
>>> hivequery.hql
>>>
>>> and then in hivequery.sql you can refer the variable with
>>>
>>> where column_name = '${hiveconf:partition}'
>>>
>>> I may be wrong in interpreting the execution pattern of hivequery but
>>> this
>>> approach solved my problem
>>> Thanks,
>>> nitin
>>>
>>>
>>> On Thu, Apr 19, 2012 at 3:27 PM, Ramkumar <ra...@yahoo.com>
>>> wrote:
>>>>
>>>> Hi,
>>>>
>>>> I have a table partitioned by local_date. When I write a query with
>>>>
>>>> WHERE local_date = =date_add('2011-12-07',3) ,
>>>>
>>>> hive executes the UDF ahead and looks only into the specific partitions.
>>>> But when the udf becomes more complex like
>>>>
>>>> WHERE local_date = date_sub(to_date(from_unixtime(unix_timestamp())),3),
>>>>
>>>> hive looks through all the partitions even though the above function
>>>> can
>>>> very well be computed ahead of time and optimize the query. Is this
>>>> behaviour intentional ? And is there a workaround other than hardcoding
>>>> the
>>>> date or using a param?
>>>>
>>>> Thanks,
>>>> Ramkumar
>>>
>>>
>>>
>>>
>>> --
>>> Nitin Pawar
>>>
>
>
Re: nested UDFs on Partition column
Posted by Ramkumar <ra...@yahoo.com>.
Thanks. Yes, thats what the UDF writers seemed to have intended. The way this has been done is by using the 'deterministic' attribute in UDFType annotation.
See
http://javasourcecode.org/html/open-source/hive/hive-0.7.1/org/apache/hadoop/hive/ql/udf/UDFUnixTimeStamp.java.html
________________________________
From: Philip Tromans <ph...@gmail.com>
To: user@hive.apache.org
Sent: Thursday, April 19, 2012 4:38 PM
Subject: Re: nested UDFs on Partition column
In fact, it's just not a reasonable thing to do partition pruning on.
Imagine a situation where you had:
WHERE partition_column = f(unix_timestamp()) AND ordinary_column =
f(unix_timestamp).
The right hand side of the predicate has to be evaluated at map-time,
whereas you're assuming that left hand side should be evaluated at
compile time, which means you have two different values of
unix_timestamp() floating around, which can only end badly.
Cheers,
Phil.
On 19 April 2012 16:35, Philip Tromans <ph...@gmail.com> wrote:
> I don't know what the state of Hive's partition pruning is, but I
> would imagine that the problem is that the two example you're giving
> are fundamentally different.
>
> 1) WHERE local_date = =date_add('2011-12-07',3) ,
>
> the udf is a
function of some constants, so the constant gets
> evaluated at compile time.
>
> 2) WHERE local_date = date_sub(to_date(from_unixtime(unix_timestamp())),3),
>
> unix_timestamp() is not explicitly a constant, and so the expression
> won't be simplified.
>
> I would imagine that the constant simplification code probably doesn't
> know the difference between a partition column and a real column, and
> so treats everything as a real column. If local_date wasn't a
> partition column, then there's no reasonable way of simplifying that
> predicate at compile time.
>
> Cheers,
>
> Phil.
>
> On 19 April 2012 11:50, Nitin Pawar <ni...@gmail.com> wrote:
>> as per my understanding,
>>
>> In this case hive needs to look for all the
partitions because it does not
>> have the value before hand on the partition check and note the udfs are
>> executed on the mapred and not on hive client side.
>>
>> I would suggest you write a hive query in a file and replace the partition
>> value with a variable
>> something like
>>
>> for partitionValue in values
>>
>> hive $HIVEPARAMS -hiveconf partition=$partition -e hivequery.hql
>>
>> and then in hivequery.sql you can refer the variable with
>>
>> where column_name = '${hiveconf:partition}'
>>
>> I may be wrong in interpreting the execution pattern of hivequery but this
>> approach solved my problem
>> Thanks,
>> nitin
>>
>>
>> On Thu, Apr 19, 2012 at 3:27 PM, Ramkumar <ra...@yahoo.com>
>> wrote:
>>>
>>> Hi,
>>>
>>> I have a table partitioned by local_date. When I write a query with
>>>
>>> WHERE local_date = =date_add('2011-12-07',3) ,
>>>
>>> hive executes the UDF ahead and looks only into the specific partitions.
>>> But when the udf becomes more complex like
>>>
>>> WHERE local_date = date_sub(to_date(from_unixtime(unix_timestamp())),3),
>>>
>>> hive looks through all the partitions even though the above function can
>>> very well be computed ahead of time and optimize the query. Is this
>>> behaviour intentional ? And is there a workaround other than hardcoding the
>>> date or using a
param?
>>>
>>> Thanks,
>>> Ramkumar
>>
>>
>>
>>
>> --
>> Nitin Pawar
>>
Re: nested UDFs on Partition column
Posted by Philip Tromans <ph...@gmail.com>.
In fact, it's just not a reasonable thing to do partition pruning on.
Imagine a situation where you had:
WHERE partition_column = f(unix_timestamp()) AND ordinary_column =
f(unix_timestamp).
The right hand side of the predicate has to be evaluated at map-time,
whereas you're assuming that left hand side should be evaluated at
compile time, which means you have two different values of
unix_timestamp() floating around, which can only end badly.
Cheers,
Phil.
On 19 April 2012 16:35, Philip Tromans <ph...@gmail.com> wrote:
> I don't know what the state of Hive's partition pruning is, but I
> would imagine that the problem is that the two example you're giving
> are fundamentally different.
>
> 1) WHERE local_date = =date_add('2011-12-07',3) ,
>
> the udf is a function of some constants, so the constant gets
> evaluated at compile time.
>
> 2) WHERE local_date = date_sub(to_date(from_unixtime(unix_timestamp())),3),
>
> unix_timestamp() is not explicitly a constant, and so the expression
> won't be simplified.
>
> I would imagine that the constant simplification code probably doesn't
> know the difference between a partition column and a real column, and
> so treats everything as a real column. If local_date wasn't a
> partition column, then there's no reasonable way of simplifying that
> predicate at compile time.
>
> Cheers,
>
> Phil.
>
> On 19 April 2012 11:50, Nitin Pawar <ni...@gmail.com> wrote:
>> as per my understanding,
>>
>> In this case hive needs to look for all the partitions because it does not
>> have the value before hand on the partition check and note the udfs are
>> executed on the mapred and not on hive client side.
>>
>> I would suggest you write a hive query in a file and replace the partition
>> value with a variable
>> something like
>>
>> for partitionValue in values
>>
>> hive $HIVEPARAMS -hiveconf partition=$partition -e hivequery.hql
>>
>> and then in hivequery.sql you can refer the variable with
>>
>> where column_name = '${hiveconf:partition}'
>>
>> I may be wrong in interpreting the execution pattern of hivequery but this
>> approach solved my problem
>> Thanks,
>> nitin
>>
>>
>> On Thu, Apr 19, 2012 at 3:27 PM, Ramkumar <ra...@yahoo.com>
>> wrote:
>>>
>>> Hi,
>>>
>>> I have a table partitioned by local_date. When I write a query with
>>>
>>> WHERE local_date = =date_add('2011-12-07',3) ,
>>>
>>> hive executes the UDF ahead and looks only into the specific partitions.
>>> But when the udf becomes more complex like
>>>
>>> WHERE local_date = date_sub(to_date(from_unixtime(unix_timestamp())),3),
>>>
>>> hive looks through all the partitions even though the above function can
>>> very well be computed ahead of time and optimize the query. Is this
>>> behaviour intentional ? And is there a workaround other than hardcoding the
>>> date or using a param?
>>>
>>> Thanks,
>>> Ramkumar
>>
>>
>>
>>
>> --
>> Nitin Pawar
>>
Re: nested UDFs on Partition column
Posted by Philip Tromans <ph...@gmail.com>.
I don't know what the state of Hive's partition pruning is, but I
would imagine that the problem is that the two example you're giving
are fundamentally different.
1) WHERE local_date = =date_add('2011-12-07',3) ,
the udf is a function of some constants, so the constant gets
evaluated at compile time.
2) WHERE local_date = date_sub(to_date(from_unixtime(unix_timestamp())),3),
unix_timestamp() is not explicitly a constant, and so the expression
won't be simplified.
I would imagine that the constant simplification code probably doesn't
know the difference between a partition column and a real column, and
so treats everything as a real column. If local_date wasn't a
partition column, then there's no reasonable way of simplifying that
predicate at compile time.
Cheers,
Phil.
On 19 April 2012 11:50, Nitin Pawar <ni...@gmail.com> wrote:
> as per my understanding,
>
> In this case hive needs to look for all the partitions because it does not
> have the value before hand on the partition check and note the udfs are
> executed on the mapred and not on hive client side.
>
> I would suggest you write a hive query in a file and replace the partition
> value with a variable
> something like
>
> for partitionValue in values
>
> hive $HIVEPARAMS -hiveconf partition=$partition -e hivequery.hql
>
> and then in hivequery.sql you can refer the variable with
>
> where column_name = '${hiveconf:partition}'
>
> I may be wrong in interpreting the execution pattern of hivequery but this
> approach solved my problem
> Thanks,
> nitin
>
>
> On Thu, Apr 19, 2012 at 3:27 PM, Ramkumar <ra...@yahoo.com>
> wrote:
>>
>> Hi,
>>
>> I have a table partitioned by local_date. When I write a query with
>>
>> WHERE local_date = =date_add('2011-12-07',3) ,
>>
>> hive executes the UDF ahead and looks only into the specific partitions.
>> But when the udf becomes more complex like
>>
>> WHERE local_date = date_sub(to_date(from_unixtime(unix_timestamp())),3),
>>
>> hive looks through all the partitions even though the above function can
>> very well be computed ahead of time and optimize the query. Is this
>> behaviour intentional ? And is there a workaround other than hardcoding the
>> date or using a param?
>>
>> Thanks,
>> Ramkumar
>
>
>
>
> --
> Nitin Pawar
>
Re: nested UDFs on Partition column
Posted by Nitin Pawar <ni...@gmail.com>.
as per my understanding,
In this case hive needs to look for all the partitions because it does not
have the value before hand on the partition check and note the udfs are
executed on the mapred and not on hive client side.
I would suggest you write a hive query in a file and replace the partition
value with a variable
something like
for partitionValue in values
hive $HIVEPARAMS -hiveconf partition=$partition -e hivequery.hql
and then in hivequery.sql you can refer the variable with
where column_name = '${hiveconf:partition}'
I may be wrong in interpreting the execution pattern of hivequery but this
approach solved my problem
Thanks,
nitin
On Thu, Apr 19, 2012 at 3:27 PM, Ramkumar <ra...@yahoo.com>wrote:
> Hi,
>
> I have a table partitioned by local_date. When I write a query with
>
> WHERE local_date = =date_add('2011-12-07',3) ,
>
> hive executes the UDF ahead and looks only into the specific partitions.
> But when the udf becomes more complex like
>
> WHERE local_date = date_sub(to_date(from_unixtime(unix_timestamp())),3),
>
> hive looks through all the partitions even though the above function can
> very well be computed ahead of time and optimize the query. Is this
> behaviour intentional ? And is there a workaround other than hardcoding the
> date or using a param?
>
> Thanks,
> Ramkumar
>
--
Nitin Pawar