Posted to user@hive.apache.org by Suma Shivaprasad <su...@gmail.com> on 2014/07/24 09:02:42 UTC

Column Stats with parquet

I am trying to enable column statistics usage with Parquet tables. Below are
the settings and the query I am executing. However, in the explain output,
Basic stats shows as COMPLETE but Column stats shows as NONE.
Can someone please explain what else I need to do to debug or fix this?

set hive.compute.query.using.stats=true;
set hive.stats.reliable=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;
set hive.cbo.enable=true;

analyze table user_table partition(dt='2014-06-01',hour='00') compute statistics;

explain select min(a), max(b), min(c) from user_table;

hive> explain select min(a), max(b), min(c) from user_table;
OK
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: user_table
            Statistics: Num rows: 55490383 Data size: 1831182639 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: a (type: double), b (type: double), c (type: int)
              outputColumnNames: a, b, c
              Statistics: Num rows: 55490383 Data size: 1831182639 Basic stats: COMPLETE Column stats: NONE
              Group By Operator
                aggregations: min(a), max(b), min(c)
                mode: hash
                outputColumnNames: _col0, _col1, _col2
                Statistics: Num rows: 1 Data size: 20 Basic stats: COMPLETE Column stats: NONE
                Reduce Output Operator
                  sort order:
                  Statistics: Num rows: 1 Data size: 20 Basic stats: COMPLETE Column stats: NONE
                  value expressions: _col0 (type: double), _col1 (type: double), _col2 (type: int)
      Reduce Operator Tree:
        Group By Operator
          aggregations: min(VALUE._col0), max(VALUE._col1), min(VALUE._col2)
          mode: mergepartial
          outputColumnNames: _col0, _col1, _col2
          Statistics: Num rows: 1 Data size: 20 Basic stats: COMPLETE Column stats: NONE
          Select Operator
            expressions: _col0 (type: double), _col1 (type: double), _col2 (type: int)
            outputColumnNames: _col0, _col1, _col2
            Statistics: Num rows: 1 Data size: 20 Basic stats: COMPLETE Column stats: NONE
            File Output Operator
              compressed: false
              Statistics: Num rows: 1 Data size: 20 Basic stats: COMPLETE Column stats: NONE
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
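
One thing that may be worth checking (an assumption, not something the plan
above confirms): an analyze statement without a "for columns" clause gathers
only basic statistics such as row count and data size. Hive gathers column
statistics through a separate "for columns" form of the statement. A minimal
sketch, reusing the table, partition, and column names from the query above:

```sql
-- Gather column-level statistics for the columns used in the query.
-- Assumes a, b, and c are the columns of interest, as in the explain above.
analyze table user_table partition(dt='2014-06-01',hour='00') compute statistics for columns a, b, c;

-- Re-run the plan and check whether Column stats still reads NONE.
explain select min(a), max(b), min(c) from user_table;
```

Whether the plan then reports column stats for a query over the whole table,
rather than only the analyzed partition, is untested here.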


Thanks

Fwd: Column Stats with parquet

Posted by Suma Shivaprasad <su...@gmail.com>.
