Posted to issues@hive.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2020/08/04 04:27:01 UTC
[jira] [Updated] (HIVE-21196) Support semijoin reduction on multiple column join
[ https://issues.apache.org/jira/browse/HIVE-21196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated HIVE-21196:
----------------------------------
Labels: pull-request-available (was: )
> Support semijoin reduction on multiple column join
> --------------------------------------------------
>
> Key: HIVE-21196
> URL: https://issues.apache.org/jira/browse/HIVE-21196
> Project: Hive
> Issue Type: Bug
> Reporter: Deepak Jaiswal
> Assignee: Stamatis Zampetakis
> Priority: Major
> Labels: pull-request-available
> Time Spent: 10m
> Remaining Estimate: 0h
>
> Currently, a query involving a join on multiple columns creates a separate semijoin edge for each key column, and each edge in turn creates its own bloom filter, as shown below:
> EXPLAIN select count(*) from srcpart_date_n7 join srcpart_small_n3 on (srcpart_date_n7.key = srcpart_small_n3.key1 and srcpart_date_n7.value = srcpart_small_n3.value1)
> {code:java}
> Map 1 <- Reducer 5 (BROADCAST_EDGE)
> Reducer 2 <- Map 1 (SIMPLE_EDGE), Map 4 (SIMPLE_EDGE)
> Reducer 3 <- Reducer 2 (CUSTOM_SIMPLE_EDGE)
> Reducer 5 <- Map 4 (CUSTOM_SIMPLE_EDGE)
> #### A masked pattern was here ####
> Vertices:
>   Map 1
>       Map Operator Tree:
>           TableScan
>             alias: srcpart_date_n7
>             filterExpr: (key is not null and value is not null and (key BETWEEN DynamicValue(RS_7_srcpart_small_n3_key1_min) AND DynamicValue(RS_7_srcpart_small_n3_key1_max) and in_bloom_filter(key, DynamicValue(RS_7_srcpart_small_n3_key1_bloom_filter)))) (type: boolean)
>             Statistics: Num rows: 2000 Data size: 356000 Basic stats: COMPLETE Column stats: COMPLETE
>             Filter Operator
>               predicate: ((key BETWEEN DynamicValue(RS_7_srcpart_small_n3_key1_min) AND DynamicValue(RS_7_srcpart_small_n3_key1_max) and in_bloom_filter(key, DynamicValue(RS_7_srcpart_small_n3_key1_bloom_filter))) and key is not null and value is not null) (type: boolean)
>               Statistics: Num rows: 2000 Data size: 356000 Basic stats: COMPLETE Column stats: COMPLETE
>               Select Operator
>                 expressions: key (type: string), value (type: string)
>                 outputColumnNames: _col0, _col1
>                 Statistics: Num rows: 2000 Data size: 356000 Basic stats: COMPLETE Column stats: COMPLETE
>                 Reduce Output Operator
>                   key expressions: _col0 (type: string), _col1 (type: string)
>                   sort order: ++
>                   Map-reduce partition columns: _col0 (type: string), _col1 (type: string)
>                   Statistics: Num rows: 2000 Data size: 356000 Basic stats: COMPLETE Column stats: COMPLETE
>       Execution mode: vectorized, llap
>       LLAP IO: all inputs
>   Map 4
>       Map Operator Tree:
>           TableScan
>             alias: srcpart_small_n3
>             filterExpr: (key1 is not null and value1 is not null) (type: boolean)
>             Statistics: Num rows: 20 Data size: 3560 Basic stats: PARTIAL Column stats: PARTIAL
>             Filter Operator
>               predicate: (key1 is not null and value1 is not null) (type: boolean)
>               Statistics: Num rows: 20 Data size: 3560 Basic stats: PARTIAL Column stats: PARTIAL
>               Select Operator
>                 expressions: key1 (type: string), value1 (type: string)
>                 outputColumnNames: _col0, _col1
>                 Statistics: Num rows: 20 Data size: 3560 Basic stats: PARTIAL Column stats: PARTIAL
>                 Reduce Output Operator
>                   key expressions: _col0 (type: string), _col1 (type: string)
>                   sort order: ++
>                   Map-reduce partition columns: _col0 (type: string), _col1 (type: string)
>                   Statistics: Num rows: 20 Data size: 3560 Basic stats: PARTIAL Column stats: PARTIAL
>                 Select Operator
>                   expressions: _col0 (type: string)
>                   outputColumnNames: _col0
>                   Statistics: Num rows: 20 Data size: 3560 Basic stats: PARTIAL Column stats: PARTIAL
>                   Group By Operator
>                     aggregations: min(_col0), max(_col0), bloom_filter(_col0, expectedEntries=20)
>                     mode: hash
>                     outputColumnNames: _col0, _col1, _col2
>                     Statistics: Num rows: 1 Data size: 730 Basic stats: PARTIAL Column stats: PARTIAL
>                     Reduce Output Operator
>                       sort order:
>                       Statistics: Num rows: 1 Data size: 730 Basic stats: PARTIAL Column stats: PARTIAL
>                       value expressions: _col0 (type: string), _col1 (type: string), _col2 (type: binary)
>       Execution mode: vectorized, llap
>       LLAP IO: all inputs
>   Reducer 2
>       Execution mode: llap
>       Reduce Operator Tree:
>         Merge Join Operator
>           condition map:
>                Inner Join 0 to 1
>           keys:
>             0 _col0 (type: string), _col1 (type: string)
>             1 _col0 (type: string), _col1 (type: string)
>           Statistics: Num rows: 2200 Data size: 391600 Basic stats: PARTIAL Column stats: NONE
>           Group By Operator
>             aggregations: count()
>             mode: hash
>             outputColumnNames: _col0
>             Statistics: Num rows: 1 Data size: 8 Basic stats: PARTIAL Column stats: NONE
>             Reduce Output Operator
>               sort order:
>               Statistics: Num rows: 1 Data size: 8 Basic stats: PARTIAL Column stats: NONE
>               value expressions: _col0 (type: bigint)
>   Reducer 3
>       Execution mode: vectorized, llap
>       Reduce Operator Tree:
>         Group By Operator
>           aggregations: count(VALUE._col0)
>           mode: mergepartial
>           outputColumnNames: _col0
>           Statistics: Num rows: 1 Data size: 8 Basic stats: PARTIAL Column stats: NONE
>           File Output Operator
>             compressed: false
>             Statistics: Num rows: 1 Data size: 8 Basic stats: PARTIAL Column stats: NONE
>             table:
>                 input format: org.apache.hadoop.mapred.SequenceFileInputFormat
>                 output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
>                 serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>   Reducer 5
>       Execution mode: vectorized, llap
>       Reduce Operator Tree:
>         Group By Operator
>           aggregations: min(VALUE._col0), max(VALUE._col1), bloom_filter(VALUE._col2, expectedEntries=20)
>           mode: final
>           outputColumnNames: _col0, _col1, _col2
>           Statistics: Num rows: 1 Data size: 730 Basic stats: PARTIAL Column stats: PARTIAL
>           Reduce Output Operator
>             sort order:
>             Statistics: Num rows: 1 Data size: 730 Basic stats: PARTIAL Column stats: PARTIAL
>             value expressions: _col0 (type: string), _col1 (type: string), _col2 (type: binary)
> {code}
> Instead, it should create a single semijoin branch for the join, with one bloom filter covering all of the key columns.
>
> The bloom filter implementation requires hashing all of the key columns together, converting the result to a long, and feeding that long to the bloom filter as input. This requires a new UDF, which will be called in both the bloom filter generation and lookup phases.
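>
> As a rough illustration of that idea, here is a minimal, self-contained Java sketch. Every name in it ({{MultiColumnBloomSketch}}, {{combinedHash}}, the FNV-1a hash, and the toy bloom filter) is an illustrative stand-in, not Hive's actual UDF or bloom filter API:
> {code:java}
> import java.nio.charset.StandardCharsets;
>
> /**
>  * Sketch only: combine several join-key columns into one long hash, then use
>  * that single hash both to build and to probe one bloom filter. Hive's real
>  * UDF and bloom filter classes differ; everything here is a stand-in.
>  */
> public class MultiColumnBloomSketch {
>
>     /** Simple FNV-1a 64-bit hash of a byte array (stand-in for Hive's hash). */
>     static long fnv1a64(byte[] bytes) {
>         long hash = 0xcbf29ce484222325L;
>         for (byte b : bytes) {
>             hash ^= (b & 0xffL);
>             hash *= 0x100000001b3L;
>         }
>         return hash;
>     }
>
>     /** Combine the hashes of all key columns into one order-sensitive long. */
>     static long combinedHash(String... keyColumns) {
>         long h = 17L;
>         for (String col : keyColumns) {
>             byte[] bytes = col == null ? new byte[0] : col.getBytes(StandardCharsets.UTF_8);
>             h = h * 31L + fnv1a64(bytes);
>         }
>         return h;
>     }
>
>     /** Toy bloom filter over longs, standing in for Hive's bloom filter. */
>     static class ToyBloomFilter {
>         private final long[] bits;
>
>         ToyBloomFilter(int numBits) {
>             bits = new long[(numBits + 63) / 64];
>         }
>
>         private int position(long hash, int i) {
>             long h = hash + (long) i * 0x9E3779B97F4A7C15L; // derive k positions
>             return (int) Long.remainderUnsigned(h, (long) bits.length * 64);
>         }
>
>         void addLong(long hash) {
>             for (int i = 0; i < 3; i++) {
>                 int pos = position(hash, i);
>                 bits[pos >>> 6] |= 1L << (pos & 63);
>             }
>         }
>
>         boolean testLong(long hash) {
>             for (int i = 0; i < 3; i++) {
>                 int pos = position(hash, i);
>                 if ((bits[pos >>> 6] & (1L << (pos & 63))) == 0) {
>                     return false;
>                 }
>             }
>             return true;
>         }
>     }
>
>     public static void main(String[] args) {
>         ToyBloomFilter bloom = new ToyBloomFilter(1024);
>         // Generation phase (small side): one filter fed with the combined hash.
>         bloom.addLong(combinedHash("k1", "v1"));
>         // Lookup phase (big side): the same combination produces the probe hash.
>         System.out.println(bloom.testLong(combinedHash("k1", "v1"))); // true
>         System.out.println(bloom.testLong(combinedHash("kX", "vX"))); // almost certainly false
>     }
> }
> {code}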
> The min and max aggregates will stay independent for each column, as they are today.
> A vectorized implementation of this UDF is also required.
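>
> For the vectorized path, the same combination can be computed column-by-column over a whole batch of rows. A rough sketch, assuming plain string arrays as stand-ins for Hive's actual VectorizedRowBatch column vectors:
> {code:java}
> import java.nio.charset.StandardCharsets;
>
> /** Sketch only: batch version of the combined hash from the previous example. */
> public class VectorizedCombinedHashSketch {
>
>     /** Same FNV-1a helper as in the previous sketch. */
>     static long fnv1a64(byte[] bytes) {
>         long hash = 0xcbf29ce484222325L;
>         for (byte b : bytes) {
>             hash ^= (b & 0xffL);
>             hash *= 0x100000001b3L;
>         }
>         return hash;
>     }
>
>     /** Combined hashes for a batch: one pass per key column over all rows. */
>     static long[] combinedHashBatch(String[][] keyColumns, int batchSize) {
>         long[] hashes = new long[batchSize];
>         java.util.Arrays.fill(hashes, 17L);
>         for (String[] column : keyColumns) {            // outer loop: columns
>             for (int row = 0; row < batchSize; row++) { // inner loop: rows
>                 byte[] bytes = column[row] == null
>                         ? new byte[0]
>                         : column[row].getBytes(StandardCharsets.UTF_8);
>                 hashes[row] = hashes[row] * 31L + fnv1a64(bytes);
>             }
>         }
>         return hashes;
>     }
>
>     public static void main(String[] args) {
>         String[] key = {"k1", "k2"};
>         String[] value = {"v1", "v2"};
>         long[] hashes = combinedHashBatch(new String[][]{key, value}, 2);
>         // Each row now has one long to feed to the shared bloom filter.
>         System.out.println(hashes[0] != hashes[1]); // almost certainly true
>     }
> }
> {code}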
--
This message was sent by Atlassian Jira
(v8.3.4#803005)