Posted to user@hive.apache.org by "Heller, Chris" <ch...@akamai.com> on 2013/11/06 15:02:52 UTC
Help me understand Hive indexing.
Hi,
I am new to Hive, and am trying to set up an index on a Hive table to
improve query performance.
I am presently using the CDH 4.2 Hadoop distribution, which ships with
Hive 0.10, so from what I have read, table index support should be
available.
What I am seeing, though, is that when I create a simple test table,
the time to run a query against it doesn't change whether or not an
index exists.
Now I am likely just misunderstanding how to make use of the Hive index,
so I am hoping to get some expert advice on what I must do to make use of
the index.
I have included a sample CLI session that documents what I have tried so
far.
Thank You,
Chris
==== BEGIN HIVE EXAMPLE ====
hive> create table foo stored as RCFILE as select uri from uris;
hive> explain select count(*) from foo where uri='http://www.example.com/';
ABSTRACT SYNTAX TREE:
  (TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME foo))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_FUNCTIONSTAR count))) (TOK_WHERE (= (TOK_TABLE_OR_COL uri) 'http://www.example.com/'))))

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        foo
          TableScan
            alias: foo
            Filter Operator
              predicate:
                  expr: (uri = 'http://www.example.com/')
                  type: boolean
              Select Operator
                Group By Operator
                  aggregations:
                        expr: count()
                  bucketGroup: false
                  mode: hash
                  outputColumnNames: _col0
                  Reduce Output Operator
                    sort order:
                    tag: -1
                    value expressions:
                          expr: _col0
                          type: bigint
      Reduce Operator Tree:
        Group By Operator
          aggregations:
                expr: count(VALUE._col0)
          bucketGroup: false
          mode: mergepartial
          outputColumnNames: _col0
          Select Operator
            expressions:
                  expr: _col0
                  type: bigint
            outputColumnNames: _col0
            File Output Operator
              compressed: true
              GlobalTableId: 0
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  Stage: Stage-0
    Fetch Operator
      limit: -1
hive> select count(*) from foo where uri='http://www.example.com/';
MapReduce Total cumulative CPU time: 6 minutes 13 seconds 270 msec
Ended Job = job_201311060137_0028
MapReduce Jobs Launched:
Job 0: Map: 17  Reduce: 1  Cumulative CPU: 373.27 sec  HDFS Read: 3801117400  HDFS Write: 17  SUCCESS
Total MapReduce CPU Time Spent: 6 minutes 13 seconds 270 msec
OK
11044586
Time taken: 42.107 seconds
hive> create index foo_idx on table foo (uri) as 'COMPACT' with deferred rebuild;
hive> alter index foo_idx on foo rebuild;
hive> set hive.optimize.index.filter = true;
hive> explain select count(*) from foo where uri='http://www.example.com/';
ABSTRACT SYNTAX TREE:
  (TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME foo))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_FUNCTIONSTAR count))) (TOK_WHERE (= (TOK_TABLE_OR_COL uri) 'http://www.example.com/'))))

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        foo
          TableScan
            alias: foo
            filterExpr:
                expr: (uri = 'http://www.example.com/')
                type: boolean
            Filter Operator
              predicate:
                  expr: (uri = 'http://www.example.com/')
                  type: boolean
              Select Operator
                Group By Operator
                  aggregations:
                        expr: count()
                  bucketGroup: false
                  mode: hash
                  outputColumnNames: _col0
                  Reduce Output Operator
                    sort order:
                    tag: -1
                    value expressions:
                          expr: _col0
                          type: bigint
      Reduce Operator Tree:
        Group By Operator
          aggregations:
                expr: count(VALUE._col0)
          bucketGroup: false
          mode: mergepartial
          outputColumnNames: _col0
          Select Operator
            expressions:
                  expr: _col0
                  type: bigint
            outputColumnNames: _col0
            File Output Operator
              compressed: true
              GlobalTableId: 0
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  Stage: Stage-0
    Fetch Operator
      limit: -1
hive> select count(*) from foo where uri='http://www.example.com/';
MapReduce Total cumulative CPU time: 6 minutes 4 seconds 730 msec
Ended Job = job_201311060137_0031
MapReduce Jobs Launched:
Job 0: Map: 17  Reduce: 1  Cumulative CPU: 364.73 sec  HDFS Read: 3801117400  HDFS Write: 10  SUCCESS
Total MapReduce CPU Time Spent: 6 minutes 4 seconds 730 msec
OK
0
Time taken: 42.112 seconds
==== END HIVE EXAMPLE ====
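
For reference, below is a sketch of the additional index-related settings described in the Hive 0.10 indexing documentation. Property names, defaults, and the generated index table name (`default__foo_foo_idx__`) are assumptions taken from the Hive wiki, not verified against CDH 4.2:

```sql
-- Automatic index use: besides hive.optimize.index.filter, the rewrite only
-- triggers when the expected input size exceeds
-- hive.optimize.index.filter.compact.minsize (default is roughly 5 GB per the
-- docs), which is larger than this ~3.8 GB table. Lowering it may help:
SET hive.optimize.index.filter=true;
SET hive.optimize.index.filter.compact.minsize=0;

-- Manual index use, per the wiki: export matching index entries from the
-- generated index table, then point Hive's input format at the result.
INSERT OVERWRITE DIRECTORY '/tmp/foo_idx_result'
SELECT `_bucketname`, `_offsets`
FROM default__foo_foo_idx__
WHERE uri = 'http://www.example.com/';

SET hive.index.compact.file=/tmp/foo_idx_result;
SET hive.input.format=org.apache.hadoop.hive.ql.index.compact.HiveCompactIndexInputFormat;
```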