Posted to user@hive.apache.org by Bryan Jeffrey <br...@gmail.com> on 2014/02/23 04:03:06 UTC
ORC queries inefficient for sorted field
Hello.
I'm running Hadoop 2.2.0 and Hive 0.12.0.
I have an ORC table partitioned by 'range' and sorted by 'time'. I want
to select the max(time) value from the table for a given set of partitions.
I begin with a query that looks like the following:
select max(time) from my_table where range > 1234;
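For reference, a table with the layout described might be declared roughly as follows. This is a hypothetical sketch: the column types, the second column, and the bucket count are assumptions, not from the original message (Hive requires CLUSTERED BY in order to use SORTED BY):

```sql
-- Hypothetical DDL approximating the table described above.
CREATE TABLE my_table (
  time BIGINT,
  payload STRING
)
PARTITIONED BY (range INT)
CLUSTERED BY (time) SORTED BY (time ASC) INTO 32 BUCKETS
STORED AS ORC;
```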
The information about the minimum/maximum values for a given integer column
is in the ORC metadata. If I run '--orcfiledump' on the files in the
specified partition(s), I can see the following output:
Column 1: count: 233056 min: 1393123416 max: 1393123499 sum: 324675782247877
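For reference, per-file statistics like the line above can be printed with the ORC dump tool shipped with Hive; the file path below is a placeholder for an actual file in one of the partitions:

```sh
# Dump ORC metadata, including per-column min/max statistics,
# for one file in a partition (path is hypothetical).
hive --orcfiledump /path/to/warehouse/my_table/range=1235/000000_0
```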
However, my queries do not seem to be using the information. The query I'm
running ends up with several hundred mappers, and takes a very long time
to run on the data. Running an orcfiledump on the files themselves (and
simply pulling the values for column 1) is faster by several orders of
magnitude.
I have verified that 'hive.optimize.index.filter' and 'hive.optimize.ppd'
are set to 'true'. What can I do to avoid scanning actual records for
this particular query, and instead answer it from the ORC file metadata or
metastore metadata?
Regards,
Bryan Jeffrey
Re: ORC queries inefficient for sorted field
Posted by Prasanth Jayachandran <pj...@hortonworks.com>.
Hi Bryan
ORC indexes are used only to select which stripes and row groups to read; they are not used to answer queries directly.
You can enable the hive.compute.query.using.stats flag to answer queries using metadata. When this flag is enabled, the Hive metastore is checked to see if column statistics exist for the required columns. If column statistics exist, then certain aggregate queries (min, max, count, etc.) are answered without ever scanning the table.
This is the JIRA that added the feature (it is available in Hive 0.13.0):
https://issues.apache.org/jira/browse/HIVE-5483
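Concretely, the flow described above might look like the following. This is a sketch under assumptions: the table and column names are taken from the question, the table is assumed to already exist, and the ANALYZE step assumes Hive 0.13.0 or later with column statistics supported for the partitioned table:

```sql
-- Enable answering eligible aggregate queries from metastore statistics
-- (available in Hive 0.13.0+).
SET hive.compute.query.using.stats=true;

-- Populate column statistics in the metastore for the 'time' column.
ANALYZE TABLE my_table PARTITION (range) COMPUTE STATISTICS FOR COLUMNS time;

-- With statistics present, a query like this may be answered from
-- metadata without scanning the ORC data.
SELECT max(time) FROM my_table WHERE range > 1234;
```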
Thanks
Prasanth Jayachandran