Posted to user@hive.apache.org by Bryan Jeffrey <br...@gmail.com> on 2014/05/16 14:35:09 UTC

Query Using Stats

All,

I am executing the following query using Hadoop 2.2.0 and Hive 0.13.0.

/opt/hadoop/latest-hive/bin/beeline -u jdbc:hive2://server:10002/database
-n root --hiveconf hive.compute.query.using.stats=true -e "select
min(seconds), max(seconds), range from data where range > 1400204700 group
by range"

'range' above is our partition column.  I would expect this to return a
reasonably fast response by simply looking at the metadata for each file in a
given partition (maybe one mapper per range).  Instead we're seeing 140+
mappers, and the query takes a long time.

Here is the explain plan:

/opt/hadoop/latest-hive/bin/beeline -u jdbc:hive2://server:10002/database
-n root --hiveconf hive.compute.query.using.stats=true -e "explain select
min(seconds), max(seconds), range from data where range > 1400204700 group
by range"
scan complete in 4ms
Connecting to jdbc:hive2://server:10002/database
Connected to: Apache Hive (version 0.13.0)
Driver: Hive JDBC (version 0.13.0)
Transaction isolation: TRANSACTION_REPEATABLE_READ
+---------------------------------------------------------------------------------------------------------------------+
|                                                       Explain                                                       |
+---------------------------------------------------------------------------------------------------------------------+
| STAGE DEPENDENCIES:                                                                                                 |
|   Stage-1 is a root stage                                                                                           |
|   Stage-0 is a root stage                                                                                           |
|                                                                                                                     |
| STAGE PLANS:                                                                                                        |
|   Stage: Stage-1                                                                                                    |
|     Map Reduce                                                                                                      |
|       Map Operator Tree:                                                                                            |
|           TableScan                                                                                                 |
|             alias: data                                                                                             |
|             Statistics: Num rows: 4860251901 Data size: 38882015268 Basic stats: PARTIAL Column stats: NONE         |
|             Select Operator                                                                                         |
|               expressions: range (type: int), seconds (type: bigint)                                                |
|               outputColumnNames: range, seconds                                                                     |
|               Statistics: Num rows: 4860251901 Data size: 38882015268 Basic stats: PARTIAL Column stats: NONE       |
|               Group By Operator                                                                                     |
|                 aggregations: min(seconds), max(end_time_seconds)                                                   |
|                 keys: range (type: int)                                                                             |
|                 mode: hash                                                                                          |
|                 outputColumnNames: _col0, _col1, _col2                                                              |
|                 Statistics: Num rows: 4860251901 Data size: 38882015268 Basic stats: PARTIAL Column stats: NONE     |
|                 Reduce Output Operator                                                                              |
|                   key expressions: _col0 (type: int)                                                                |
|                   sort order: +                                                                                     |
|                   Map-reduce partition columns: _col0 (type: int)                                                   |
|                   Statistics: Num rows: 4860251901 Data size: 38882015268 Basic stats: COMPLETE Column stats: NONE  |
|                   value expressions: _col1 (type: bigint), _col2 (type: bigint)                                     |
|       Reduce Operator Tree:                                                                                         |
|         Group By Operator                                                                                           |
|           aggregations: min(VALUE._col0), max(VALUE._col1)                                                          |
|           keys: KEY._col0 (type: int)                                                                               |
|           mode: mergepartial                                                                                        |
|           outputColumnNames: _col0, _col1, _col2                                                                    |
|           Statistics: Num rows: 2430125950 Data size: 19441007630 Basic stats: COMPLETE Column stats: NONE          |
|           Select Operator                                                                                           |
|             expressions: _col1 (type: bigint), _col2 (type: bigint), _col0 (type: int)                              |
|             outputColumnNames: _col0, _col1, _col2                                                                  |
|             Statistics: Num rows: 2430125950 Data size: 19441007630 Basic stats: COMPLETE Column stats: NONE        |
|             File Output Operator                                                                                    |
|               compressed: false                                                                                     |
|               Statistics: Num rows: 2430125950 Data size: 19441007630 Basic stats: COMPLETE Column stats: NONE      |
|               table:                                                                                                |
|                   input format: org.apache.hadoop.mapred.TextInputFormat                                            |
|                   output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat                         |
|                   serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe                                         |
|                                                                                                                     |
|   Stage: Stage-0                                                                                                    |
|     Fetch Operator                                                                                                  |
|       limit: -1                                                                                                     |
|                                                                                                                     |
+---------------------------------------------------------------------------------------------------------------------+
50 rows selected (0.417 seconds)
Beeline version 0.13.0 by Apache Hive
Closing: 0: jdbc:hive2://viper:10002/intrepid

Can anyone enlighten me as to how this could be optimized?

Regards,

Bryan Jeffrey

Re: Query Using Stats

Posted by Bryan Jeffrey <br...@gmail.com>.
Prasanth,

I had the correct flag enabled (see the query in my original email). The issue
is that Hive does not appear to be using the partition stats for the
calculation. The table is an ORC table. The logs show that stats are being
computed, but they do not appear to be used when queries are run against the
table.

Regards,

Bryan Jeffrey
On May 16, 2014 8:19 PM, "Prasanth Jayachandran" <
pjayachandran@hortonworks.com> wrote:

> Bryan,
>
> The flag you are looking for is hive.compute.query.using.stats. By
> default this optimization is disabled. You might need to enable it to use
> it. Also the min/max/sum metadata are not looked up from the file but
> instead from metastore. Although file formats like ORC contains stats, they
> are not used to answer metadata only queries. Hive considers metastore as
> the only source of truth for answering such queries. You can look at this
> jira for further details https://issues.apache.org/jira/browse/HIVE-5483
>
> Thanks
> Prasanth Jayachandran
>

Re: Query Using Stats

Posted by Prasanth Jayachandran <pj...@hortonworks.com>.
Bryan,

The flag you are looking for is hive.compute.query.using.stats. By default this optimization is disabled, so you might need to enable it. Also, the min/max/sum metadata are not looked up from the files but from the metastore. Although file formats like ORC contain stats, they are not used to answer metadata-only queries; Hive considers the metastore the only source of truth for such queries. See this jira for further details: https://issues.apache.org/jira/browse/HIVE-5483
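A rough sketch of the idea (plain Python, not Hive internals, with made-up partition keys and values): once per-partition column stats exist in the metastore (e.g. after an ANALYZE TABLE ... COMPUTE STATISTICS FOR COLUMNS run), a min/max-per-partition query reduces to a dictionary lookup over metadata rather than a scan of the data files.

```python
# Hypothetical, simplified model of metastore-backed column stats:
# each partition key maps to precomputed min/max for the 'seconds' column.
partition_stats = {
    1400204400: {"min_seconds": 10, "max_seconds": 95},
    1400204700: {"min_seconds": 5,  "max_seconds": 80},
    1400205000: {"min_seconds": 20, "max_seconds": 99},
}

def min_max_by_partition(stats, lower_bound):
    """Answer the shape of Bryan's query -- min(seconds), max(seconds)
    grouped by range, where range > lower_bound -- using metadata only."""
    return {
        part: (s["min_seconds"], s["max_seconds"])
        for part, s in stats.items()
        if part > lower_bound  # partition predicate, evaluated on metadata
    }

# No data file is opened; the answer comes straight from the stats.
answer = min_max_by_partition(partition_stats, 1400204700)
```

This is only a model of what the metastore makes possible; actual behavior is governed by hive.compute.query.using.stats and the jira above.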

Thanks
Prasanth Jayachandran


-- 
CONFIDENTIALITY NOTICE
NOTICE: This message is intended for the use of the individual or entity to 
which it is addressed and may contain information that is confidential, 
privileged and exempt from disclosure under applicable law. If the reader 
of this message is not the intended recipient, you are hereby notified that 
any printing, copying, dissemination, distribution, disclosure or 
forwarding of this communication is strictly prohibited. If you have 
received this communication in error, please contact the sender immediately 
and delete it from your system. Thank You.

Re: Query Using Stats

Posted by Edward Capriolo <ed...@gmail.com>.
Hive does not know that the values of column `seconds` and partition
`range` are related.

Hive can only use the WHERE clause to remove partitions that do not match the
range criteria. The data inside a partition is not ordered in any way, so the
minimum and maximum seconds could be in any part of any file.
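Put another way (a toy sketch with invented data, not Hive code): the partition predicate prunes whole partitions, but without usable column stats every row in each surviving partition must still be read, because nothing bounds where the min and max values sit in the files.

```python
# Hypothetical toy model: partitions prune by key, but min/max over a
# non-partition column still requires scanning every surviving row.
partitions = {
    1400204400: [42, 7, 95],   # 'seconds' values, in no particular order
    1400205000: [99, 20, 61],
}

def query_min_max(parts, lower_bound):
    rows_scanned = 0
    result = {}
    for key, rows in parts.items():
        if key <= lower_bound:      # partition pruning via WHERE range > X
            continue
        rows_scanned += len(rows)   # every remaining row must be read
        result[key] = (min(rows), max(rows))
    return result, rows_scanned

result, scanned = query_min_max(partitions, 1400204700)
```

Here one partition is pruned outright, but the other is scanned in full even though only two numbers are ultimately reported, which is why the real query launches so many mappers.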

