You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@hive.apache.org by poornima kumarasinghe <po...@gmail.com> on 2016/12/30 15:01:45 UTC
Bitmap index implementation in Hive
Hi all,
I am researching on the effect of bitmap indexes on queries and I happened
to come across bitmap indexes on Hive. But I have a few areas that I need
help in understanding how the bitmaps are implemented and used in hive.
1) The bitmap column of the index table is always the same value
([1,2,4,8589934592,1,0]). How can the bitmap of two different values be the
same?
2) The bitmap index used query and the one that didn't use bitmap indexes
gave the following results for their explain statement. The blue colored
sections are the same steps. According to this explain statements, the
bitmap index usage part seems redundant. Can you please help me to clear
out these issues?
Regards,
Samadhi Poornima.
hive> EXPLAIN SELECT count(*) FROM truck_mileage_stage WHERE truckid =
"A30" AND driverid="A30";
OK
STAGE DEPENDENCIES:
Stage-7 is a root stage
Stage-4 depends on stages: Stage-7
Stage-2 depends on stages: Stage-4
Stage-1 depends on stages: Stage-2
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-7
Map Reduce Local Work
Alias -> Map Local Tables:
$hdt$_0:default__truck_mileage_stage_bittruck_index__
Fetch Operator
limit: -1
Alias -> Map Local Operator Tree:
$hdt$_0:default__truck_mileage_stage_bittruck_index__
TableScan
alias: default__truck_mileage_stage_bittruck_index__
filterExpr: ((truckid = 'A30') and _bucketname is not null and
_offset is not null) (type: boolean)
Filter Operator
predicate: ((truckid = 'A30') and _bucketname is not null and
_offset is not null) (type: boolean)
Select Operator
expressions: _bucketname (type: string), _offset (type:
bigint), _bitmaps (type: array<bigint>)
outputColumnNames: _col0, _col1, _col2
HashTable Sink Operator
keys:
0 _col0 (type: string), _col1 (type: bigint)
1 _col0 (type: string), _col1 (type: bigint)
Stage: Stage-4
Map Reduce
Map Operator Tree:
TableScan
alias: default__truck_mileage_stage_bitdriver_index__
filterExpr: ((driverid = 'A30') and _bucketname is not null and
_offset is not null) (type: boolean)
Filter Operator
predicate: ((driverid = 'A30') and _bucketname is not null
and _offset is not null) (type: boolean)
Select Operator
expressions: _bucketname (type: string), _offset (type:
bigint), _bitmaps (type: array<bigint>)
outputColumnNames: _col0, _col1, _col2
Map Join Operator
condition map:
Inner Join 0 to 1
keys:
0 _col0 (type: string), _col1 (type: bigint)
1 _col0 (type: string), _col1 (type: bigint)
outputColumnNames: _col0, _col2, _col4, _col5
Filter Operator
predicate: (not
EWAH_BITMAP_EMPTY(EWAH_BITMAP_AND(_col2,_col5))) (type: boolean)
Select Operator
expressions: _col0 (type: string), _col4 (type:
bigint)
outputColumnNames: _col0, _col4
Group By Operator
aggregations: collect_set(_col4)
keys: _col0 (type: string)
mode: hash
outputColumnNames: _col0, _col1
Reduce Output Operator
key expressions: _col0 (type: string)
sort order: +
Map-reduce partition columns: _col0 (type: string)
value expressions: _col1 (type: array<bigint>)
Local Work:
Map Reduce Local Work
Reduce Operator Tree:
Group By Operator
aggregations: collect_set(VALUE._col0)
keys: KEY._col0 (type: string)
mode: mergepartial
outputColumnNames: _col0, _col1
File Output Operator
compressed: false
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format:
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-2
Move Operator
files:
hdfs directory: true
destination:
file:/tmp/samadhik/392bef0d-0e26-448f-bdce-7621965b1ad0/hive_2016-12-30_09-56-12_455_2464238236026798959-1/-mr-10004
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: truck_mileage_stage
filterExpr: ((truckid = 'A30') and (driverid = 'A30')) (type:
boolean)
Statistics: Num rows: 1148 Data size: 229744 Basic stats:
COMPLETE Column stats: NONE
Filter Operator
predicate: ((truckid = 'A30') and (driverid = 'A30')) (type:
boolean)
Statistics: Num rows: 287 Data size: 57436 Basic stats:
COMPLETE Column stats: NONE
Select Operator
Statistics: Num rows: 287 Data size: 57436 Basic stats:
COMPLETE Column stats: NONE
Group By Operator
aggregations: count()
mode: hash
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 8 Basic stats:
COMPLETE Column stats: NONE
Reduce Output Operator
sort order:
Statistics: Num rows: 1 Data size: 8 Basic stats:
COMPLETE Column stats: NONE
value expressions: _col0 (type: bigint)
Reduce Operator Tree:
Group By Operator
aggregations: count(VALUE._col0)
mode: mergepartial
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column
stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE
Column stats: NONE
table:
input format:
org.apache.hadoop.mapred.SequenceFileInputFormat
output format:
org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
hive> explain SELECT count(*) FROM truck_mileage_stage WHERE truckid =
"A30" AND driverid="A30";
OK
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: truck_mileage_stage
filterExpr: ((truckid = 'A30') and (driverid = 'A30')) (type:
boolean)
Statistics: Num rows: 1148 Data size: 229744 Basic stats:
COMPLETE Column stats: NONE
Filter Operator
predicate: ((truckid = 'A30') and (driverid = 'A30')) (type:
boolean)
Statistics: Num rows: 287 Data size: 57436 Basic stats:
COMPLETE Column stats: NONE
Select Operator
Statistics: Num rows: 287 Data size: 57436 Basic stats:
COMPLETE Column stats: NONE
Group By Operator
aggregations: count()
mode: hash
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 8 Basic stats:
COMPLETE Column stats: NONE
Reduce Output Operator
sort order:
Statistics: Num rows: 1 Data size: 8 Basic stats:
COMPLETE Column stats: NONE
value expressions: _col0 (type: bigint)
Reduce Operator Tree:
Group By Operator
aggregations: count(VALUE._col0)
mode: mergepartial
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column
stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE
Column stats: NONE
table:
input format:
org.apache.hadoop.mapred.SequenceFileInputFormat
output format:
org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink