You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@hive.apache.org by Abhishek Dubey <Ab...@Xoriant.Com> on 2016/08/01 09:07:33 UTC

Doubt on Hive Partitioning.

Hi All,

I have a very big table t with billions of rows and it is partitioned on a column p. Column p  has datatype text and values like '201601', '201602'...upto '201612'.
And, I am running a query like : Select columns from t where p='201604'.

My question is : Can there be a scenario/condition/probability that my query will do a complete table scan on t instead of only reading data for specified partition key. If yes, please put some light on those scenario.

I'm asking this because someone told me that there is a probability that the query will ignore the partitioning and do a complete table scan to fetch output.

Thanks & Regards,
Abhishek Dubey

Re: Doubt on Hive Partitioning.

Posted by Gopal Vijayaraghavan <go...@apache.org>.


> WHERE p IN (SELECT p FROM t2)


> here we could argue that Hive could optimize this by computing the sub
>query first, 
> and then do the partition pruning, but sadly I don't think this
>optimisation has been implemented yet

It is implemented already -
<https://issues.apache.org/jira/browse/HIVE-7826>

In Hive-1.x, the optimization doesn't kick in when the partition column
has a UDF wrapped around it.

In Hive-2.0, it does apply even if the partition column is wrapped with a
UDF.

"explain rewrite .... where p IN (Select p from t2);"

will show the rewrite which enables DPP.

> An example of non-deterministic function are rand() and unix_timestamp()
>because it is evaluated differently at each row

Yes, that is exactly right. Another case was TO_DATE() which in Hive-1.x
returned Strings and prevented the removal of partitions.

Cheers,
Gopal

Re: Doubt on Hive Partitioning.

Posted by Furcy Pin <fu...@flaminem.com>.

Hi Abhishek,

Yes, it can happen.

The only such scenarios I can think of are when you use a WHERE clause with
a non-constant clause.
As far as I know, partition only work on constant clauses, because it has
to evaluate them *before* starting the query in order to prune the
partitions.

For instance:

WHERE p = otherColumn
> here the predicate will depend on the row being read, thus all rows must
be read.
> if otherColumn is a partition, I don't think it work either

WHERE p IN (SELECT p FROM t2)
> here we could argue that Hive could optimize this by computing the sub
query first,
> and then do the partition pruning, but sadly I don't think this
optimisation has been implemented yet

WHERE f(p) = 'constant'
or
WHERE p = f('constant')

where f is a non-deterministic or non-stateful UDF.
An example of non-deterministic function are rand() and
unix_timestamp() because
it is evaluated differently at each row

So if you want today's partition, you should use instead current_date(),
which is deterministic,  since it takes the time of compilation of the
query.
It is only available since Hive 1.2.0 though.

You can know if a Hive UDF is deterministic and stateful by looking at the
class annotation UDFType in it's source code.
If you plan on writing your own UDF, don't forget to specifiy this
annotation as well.

hope this helps,

Furcy

On Mon, Aug 1, 2016 at 11:07 AM, Abhishek Dubey <Ab...@xoriant.com>
wrote:

> Hi All,
>
>
>
> I have a very big table *t* with billions of rows and it is partitioned
> on a column *p*. Column *p * has datatype text and values like ‘201601’,
> ‘201602’…upto ‘201612’.
>
> And, I am running a query like : *Select columns from t where p=’201604’.*
>
>
>
> My question is : Can there be a scenario/condition/probability that my
> query will do a complete table scan on *t* instead of only reading data
> for specified partition key. If yes, please put some light on those
> scenario.
>
>
>
> I’m asking this because someone told me that there is a probability that
> the query will ignore the partitioning and do a complete table scan to
> fetch output.
>
>
>
> *Thanks & Regards,*
> *Abhishek Dubey*
>
>
>

Re: Doubt on Hive Partitioning.

Posted by Jörn Franke <jo...@gmail.com>.

Partition pruning works also with older Hive version, but you have to put the filter in the join statement and not only in the where statement 

> On 02 Aug 2016, at 09:53, Furcy Pin <fu...@flaminem.com> wrote:
> 
> I'm using Hive 1.1 on MR and dynamic partition pruning does not seem to work.
> 
> Since MR is deprecated in 2.0, I assume we should not expect any future perf optimisation on this side.
> 
> It has been implemented for Hive on Spark, though.
> https://issues.apache.org/jira/browse/HIVE-9152
> 
> 
> 
> 
>> On Tue, Aug 2, 2016 at 3:45 AM, Qiuzhuang Lian <qi...@gmail.com> wrote:
>> Is this partition pruning fixed in MR too except for TEZ in newer hive version?
>> 
>> Regards,
>> Q
>> 
>>> On Mon, Aug 1, 2016 at 8:48 PM, Jörn Franke <jo...@gmail.com> wrote:
>>> It happens in old hive version of the filter is only in the where clause and NOT in the join clause. This should not happen in newer hive version. You can check it by executing explain dependency query. 
>>> 
>>>> On 01 Aug 2016, at 11:07, Abhishek Dubey <Ab...@Xoriant.Com> wrote:
>>>> 
>>>> Hi All,
>>>> 
>>>>  
>>>> 
>>>> I have a very big table t with billions of rows and it is partitioned on a column p. Column p  has datatype text and values like ‘201601’, ‘201602’…upto ‘201612’.
>>>> 
>>>> And, I am running a query like : Select columns from t where p=’201604’.
>>>> 
>>>>  
>>>> 
>>>> My question is : Can there be a scenario/condition/probability that my query will do a complete table scan on t instead of only reading data for specified partition key. If yes, please put some light on those scenario.
>>>> 
>>>>  
>>>> 
>>>> I’m asking this because someone told me that there is a probability that the query will ignore the partitioning and do a complete table scan to fetch output.
>>>> 
>>>>  
>>>> 
>>>> 
>>>> Thanks & Regards,
>>>> Abhishek Dubey
>

Re: Doubt on Hive Partitioning.

Posted by Furcy Pin <fu...@flaminem.com>.

I'm using Hive 1.1 on MR and dynamic partition pruning does not seem to
work.

Since MR is deprecated in 2.0, I assume we should not expect any future
perf optimisation on this side.

It has been implemented for Hive on Spark, though.
https://issues.apache.org/jira/browse/HIVE-9152




On Tue, Aug 2, 2016 at 3:45 AM, Qiuzhuang Lian <qi...@gmail.com>
wrote:

> Is this partition pruning fixed in MR too except for TEZ in newer hive
> version?
>
> Regards,
> Q
>
> On Mon, Aug 1, 2016 at 8:48 PM, Jörn Franke <jo...@gmail.com> wrote:
>
>> It happens in old hive version of the filter is only in the where clause
>> and NOT in the join clause. This should not happen in newer hive version.
>> You can check it by executing explain dependency query.
>>
>> On 01 Aug 2016, at 11:07, Abhishek Dubey <Abhishek.Dubey@Xoriant.Com
>> <Ab...@xoriant.com>> wrote:
>>
>> Hi All,
>>
>>
>>
>> I have a very big table *t* with billions of rows and it is partitioned
>> on a column *p*. Column *p * has datatype text and values like ‘201601’,
>> ‘201602’…upto ‘201612’.
>>
>> And, I am running a query like : *Select columns from t where
>> p=’201604’.*
>>
>>
>>
>> My question is : Can there be a scenario/condition/probability that my
>> query will do a complete table scan on *t* instead of only reading data
>> for specified partition key. If yes, please put some light on those
>> scenario.
>>
>>
>>
>> I’m asking this because someone told me that there is a probability that
>> the query will ignore the partitioning and do a complete table scan to
>> fetch output.
>>
>>
>>
>> *Thanks & Regards,*
>> *Abhishek Dubey*
>>
>>
>>
>>
>

Re: Doubt on Hive Partitioning.

Posted by Jörn Franke <jo...@gmail.com>.

I do not think so, but never tested it.

> On 02 Aug 2016, at 03:45, Qiuzhuang Lian <qi...@gmail.com> wrote:
> 
> Is this partition pruning fixed in MR too except for TEZ in newer hive version?
> 
> Regards,
> Q
> 
>> On Mon, Aug 1, 2016 at 8:48 PM, Jörn Franke <jo...@gmail.com> wrote:
>> It happens in old hive version of the filter is only in the where clause and NOT in the join clause. This should not happen in newer hive version. You can check it by executing explain dependency query. 
>> 
>>> On 01 Aug 2016, at 11:07, Abhishek Dubey <Ab...@Xoriant.Com> wrote:
>>> 
>>> Hi All,
>>> 
>>>  
>>> 
>>> I have a very big table t with billions of rows and it is partitioned on a column p. Column p  has datatype text and values like ‘201601’, ‘201602’…upto ‘201612’.
>>> 
>>> And, I am running a query like : Select columns from t where p=’201604’.
>>> 
>>>  
>>> 
>>> My question is : Can there be a scenario/condition/probability that my query will do a complete table scan on t instead of only reading data for specified partition key. If yes, please put some light on those scenario.
>>> 
>>>  
>>> 
>>> I’m asking this because someone told me that there is a probability that the query will ignore the partitioning and do a complete table scan to fetch output.
>>> 
>>>  
>>> 
>>> 
>>> Thanks & Regards,
>>> Abhishek Dubey
>

Re: Doubt on Hive Partitioning.

Posted by Qiuzhuang Lian <qi...@gmail.com>.

Is this partition pruning fixed in MR too except for TEZ in newer hive
version?

Regards,
Q

On Mon, Aug 1, 2016 at 8:48 PM, Jörn Franke <jo...@gmail.com> wrote:

> It happens in old hive version of the filter is only in the where clause
> and NOT in the join clause. This should not happen in newer hive version.
> You can check it by executing explain dependency query.
>
> On 01 Aug 2016, at 11:07, Abhishek Dubey <Abhishek.Dubey@Xoriant.Com
> <Ab...@xoriant.com>> wrote:
>
> Hi All,
>
>
>
> I have a very big table *t* with billions of rows and it is partitioned
> on a column *p*. Column *p * has datatype text and values like ‘201601’,
> ‘201602’…upto ‘201612’.
>
> And, I am running a query like : *Select columns from t where p=’201604’.*
>
>
>
> My question is : Can there be a scenario/condition/probability that my
> query will do a complete table scan on *t* instead of only reading data
> for specified partition key. If yes, please put some light on those
> scenario.
>
>
>
> I’m asking this because someone told me that there is a probability that
> the query will ignore the partitioning and do a complete table scan to
> fetch output.
>
>
>
> *Thanks & Regards,*
> *Abhishek Dubey*
>
>
>
>

Re: Doubt on Hive Partitioning.

Posted by Jörn Franke <jo...@gmail.com>.

It happens in old hive version of the filter is only in the where clause and NOT in the join clause. This should not happen in newer hive version. You can check it by executing explain dependency query. 

> On 01 Aug 2016, at 11:07, Abhishek Dubey <Ab...@Xoriant.Com> wrote:
> 
> Hi All,
>  
> I have a very big table t with billions of rows and it is partitioned on a column p. Column p  has datatype text and values like ‘201601’, ‘201602’…upto ‘201612’.
> And, I am running a query like : Select columns from t where p=’201604’.
>  
> My question is : Can there be a scenario/condition/probability that my query will do a complete table scan on t instead of only reading data for specified partition key. If yes, please put some light on those scenario.
>  
> I’m asking this because someone told me that there is a probability that the query will ignore the partitioning and do a complete table scan to fetch output.
>  
> Thanks & Regards,
> Abhishek Dubey
>