Posted to user@hive.apache.org by Bryan Jeffrey <br...@gmail.com> on 2014/02/23 04:03:06 UTC

ORC queries inefficient for sorted field

Hello.

I'm running Hadoop 2.2.0 and Hive 0.12.0.

I have an ORC table partitioned by 'range' and sorted by 'time'.  I want
to select the max(time) value from the table for a given set of
partitions.  I begin with a query that looks like the following:

select max(time) from my_table where range > 1234;

The information about the minimum/maximum values for a given integer column
is in the ORC metadata.  If I run '--orcfiledump' on the files in the
specified partition(s), I can see the following output:

Column 1: count: 233056 min: 1393123416 max: 1393123499 sum: 324675782247877
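For reference, output like the line above can be produced with the ORC file dump utility that ships with Hive (the file path below is a placeholder, not taken from the original message):

```
hive --orcfiledump /user/hive/warehouse/my_table/range=.../000000_0
```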

However, my queries do not seem to be using this information.  The query
I'm running ends up with several hundred mappers and takes a very long
time to run on the data.  Running an orcfiledump on the files themselves
(and simply pulling the values for column 1) is faster by several orders
of magnitude.

I have verified that 'hive.optimize.index.filter' and 'hive.optimize.ppd'
are set to 'true'.  What can I do to avoid processing actual records for
this particular query, instead using the ORC file metadata or metastore
metadata?

Regards,

Bryan Jeffrey

Re: ORC queries inefficient for sorted field

Posted by Prasanth Jayachandran <pj...@hortonworks.com>.
Hi Bryan

ORC indexes are used only for selecting stripes and row groups, not for answering queries directly.

You can enable the hive.compute.query.using.stats flag to answer queries using metadata. When this flag is enabled, the Hive metastore is checked to see whether column statistics exist for the required columns. If they do, certain queries like min, max, and count will be answered without ever scanning the table.

This is the JIRA that added the feature (it's available in Hive 0.13.0):
https://issues.apache.org/jira/browse/HIVE-5483
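A minimal sketch of what this could look like on 0.13.0, assuming column statistics have already been gathered (my_table, range, and time are from the original query; everything else is illustrative):

```
-- Enable metadata-only answers for min/max/count (Hive 0.13.0+)
SET hive.compute.query.using.stats=true;

-- Column statistics must exist in the metastore first;
-- this gathers them for the table's partitions.
ANALYZE TABLE my_table PARTITION (range) COMPUTE STATISTICS FOR COLUMNS;

-- With stats present, this can be answered from the metastore
-- without scanning any rows:
SELECT max(time) FROM my_table WHERE range > 1234;
```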

Thanks
Prasanth Jayachandran

