Posted to issues@hive.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2020/08/04 04:27:01 UTC
[jira] [Updated] (HIVE-21196) Support semijoin reduction on multiple column join
[ https://issues.apache.org/jira/browse/HIVE-21196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated HIVE-21196:
----------------------------------
Labels: pull-request-available (was: )
> Support semijoin reduction on multiple column join
> --------------------------------------------------
>
> Key: HIVE-21196
> URL: https://issues.apache.org/jira/browse/HIVE-21196
> Project: Hive
> Issue Type: Bug
> Reporter: Deepak Jaiswal
> Assignee: Stamatis Zampetakis
> Priority: Major
> Labels: pull-request-available
> Time Spent: 10m
> Remaining Estimate: 0h
>
> Currently, a query involving a join on multiple columns creates a separate semijoin edge for each key column, and each edge in turn creates its own bloom filter, as shown below:
> EXPLAIN select count(*) from srcpart_date_n7 join srcpart_small_n3 on (srcpart_date_n7.key = srcpart_small_n3.key1 and srcpart_date_n7.value = srcpart_small_n3.value1)
> {code:java}
> Map 1 <- Reducer 5 (BROADCAST_EDGE)
> Reducer 2 <- Map 1 (SIMPLE_EDGE), Map 4 (SIMPLE_EDGE)
> Reducer 3 <- Reducer 2 (CUSTOM_SIMPLE_EDGE)
> Reducer 5 <- Map 4 (CUSTOM_SIMPLE_EDGE)
> #### A masked pattern was here ####
> Vertices:
>   Map 1
>       Map Operator Tree:
>           TableScan
>             alias: srcpart_date_n7
>             filterExpr: (key is not null and value is not null and (key BETWEEN DynamicValue(RS_7_srcpart_small_n3_key1_min) AND DynamicValue(RS_7_srcpart_small_n3_key1_max) and in_bloom_filter(key, DynamicValue(RS_7_srcpart_small_n3_key1_bloom_filter)))) (type: boolean)
>             Statistics: Num rows: 2000 Data size: 356000 Basic stats: COMPLETE Column stats: COMPLETE
>             Filter Operator
>               predicate: ((key BETWEEN DynamicValue(RS_7_srcpart_small_n3_key1_min) AND DynamicValue(RS_7_srcpart_small_n3_key1_max) and in_bloom_filter(key, DynamicValue(RS_7_srcpart_small_n3_key1_bloom_filter))) and key is not null and value is not null) (type: boolean)
>               Statistics: Num rows: 2000 Data size: 356000 Basic stats: COMPLETE Column stats: COMPLETE
>               Select Operator
>                 expressions: key (type: string), value (type: string)
>                 outputColumnNames: _col0, _col1
>                 Statistics: Num rows: 2000 Data size: 356000 Basic stats: COMPLETE Column stats: COMPLETE
>                 Reduce Output Operator
>                   key expressions: _col0 (type: string), _col1 (type: string)
>                   sort order: ++
>                   Map-reduce partition columns: _col0 (type: string), _col1 (type: string)
>                   Statistics: Num rows: 2000 Data size: 356000 Basic stats: COMPLETE Column stats: COMPLETE
>       Execution mode: vectorized, llap
>       LLAP IO: all inputs
>   Map 4
>       Map Operator Tree:
>           TableScan
>             alias: srcpart_small_n3
>             filterExpr: (key1 is not null and value1 is not null) (type: boolean)
>             Statistics: Num rows: 20 Data size: 3560 Basic stats: PARTIAL Column stats: PARTIAL
>             Filter Operator
>               predicate: (key1 is not null and value1 is not null) (type: boolean)
>               Statistics: Num rows: 20 Data size: 3560 Basic stats: PARTIAL Column stats: PARTIAL
>               Select Operator
>                 expressions: key1 (type: string), value1 (type: string)
>                 outputColumnNames: _col0, _col1
>                 Statistics: Num rows: 20 Data size: 3560 Basic stats: PARTIAL Column stats: PARTIAL
>                 Reduce Output Operator
>                   key expressions: _col0 (type: string), _col1 (type: string)
>                   sort order: ++
>                   Map-reduce partition columns: _col0 (type: string), _col1 (type: string)
>                   Statistics: Num rows: 20 Data size: 3560 Basic stats: PARTIAL Column stats: PARTIAL
>                 Select Operator
>                   expressions: _col0 (type: string)
>                   outputColumnNames: _col0
>                   Statistics: Num rows: 20 Data size: 3560 Basic stats: PARTIAL Column stats: PARTIAL
>                   Group By Operator
>                     aggregations: min(_col0), max(_col0), bloom_filter(_col0, expectedEntries=20)
>                     mode: hash
>                     outputColumnNames: _col0, _col1, _col2
>                     Statistics: Num rows: 1 Data size: 730 Basic stats: PARTIAL Column stats: PARTIAL
>                     Reduce Output Operator
>                       sort order:
>                       Statistics: Num rows: 1 Data size: 730 Basic stats: PARTIAL Column stats: PARTIAL
>                       value expressions: _col0 (type: string), _col1 (type: string), _col2 (type: binary)
>       Execution mode: vectorized, llap
>       LLAP IO: all inputs
>   Reducer 2
>       Execution mode: llap
>       Reduce Operator Tree:
>         Merge Join Operator
>           condition map:
>                Inner Join 0 to 1
>           keys:
>             0 _col0 (type: string), _col1 (type: string)
>             1 _col0 (type: string), _col1 (type: string)
>           Statistics: Num rows: 2200 Data size: 391600 Basic stats: PARTIAL Column stats: NONE
>           Group By Operator
>             aggregations: count()
>             mode: hash
>             outputColumnNames: _col0
>             Statistics: Num rows: 1 Data size: 8 Basic stats: PARTIAL Column stats: NONE
>             Reduce Output Operator
>               sort order:
>               Statistics: Num rows: 1 Data size: 8 Basic stats: PARTIAL Column stats: NONE
>               value expressions: _col0 (type: bigint)
>   Reducer 3
>       Execution mode: vectorized, llap
>       Reduce Operator Tree:
>         Group By Operator
>           aggregations: count(VALUE._col0)
>           mode: mergepartial
>           outputColumnNames: _col0
>           Statistics: Num rows: 1 Data size: 8 Basic stats: PARTIAL Column stats: NONE
>           File Output Operator
>             compressed: false
>             Statistics: Num rows: 1 Data size: 8 Basic stats: PARTIAL Column stats: NONE
>             table:
>                 input format: org.apache.hadoop.mapred.SequenceFileInputFormat
>                 output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
>                 serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>   Reducer 5
>       Execution mode: vectorized, llap
>       Reduce Operator Tree:
>         Group By Operator
>           aggregations: min(VALUE._col0), max(VALUE._col1), bloom_filter(VALUE._col2, expectedEntries=20)
>           mode: final
>           outputColumnNames: _col0, _col1, _col2
>           Statistics: Num rows: 1 Data size: 730 Basic stats: PARTIAL Column stats: PARTIAL
>           Reduce Output Operator
>             sort order:
>             Statistics: Num rows: 1 Data size: 730 Basic stats: PARTIAL Column stats: PARTIAL
>             value expressions: _col0 (type: string), _col1 (type: string), _col2 (type: binary)
> {code}
> Instead, it should create a single semijoin branch for the join, with one bloom filter covering all of the key columns.
>
> The bloom filter implementation requires hashing all of the key columns together, converting the result to a long, and feeding that long to the bloom filter as input. This requires a new UDF, which will be called in both the bloom filter generation and lookup phases.
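>
> As a rough illustration of that idea, here is a minimal, self-contained Java sketch. Every name in it ({{MultiColumnBloomSketch}}, {{combinedHash}}, the FNV-1a hash, and the toy bloom filter) is an illustrative stand-in, not Hive's actual UDF or bloom filter API:
> {code:java}
> import java.nio.charset.StandardCharsets;
>
> /**
>  * Sketch only: combine several join-key columns into one long hash, then use
>  * that single hash both to build and to probe one bloom filter. Hive's real
>  * UDF and bloom filter classes differ; everything here is a stand-in.
>  */
> public class MultiColumnBloomSketch {
>
>     /** Simple FNV-1a 64-bit hash of a byte array (stand-in for Hive's hash). */
>     static long fnv1a64(byte[] bytes) {
>         long hash = 0xcbf29ce484222325L;
>         for (byte b : bytes) {
>             hash ^= (b & 0xffL);
>             hash *= 0x100000001b3L;
>         }
>         return hash;
>     }
>
>     /** Combine the hashes of all key columns into one order-sensitive long. */
>     static long combinedHash(String... keyColumns) {
>         long h = 17L;
>         for (String col : keyColumns) {
>             byte[] bytes = col == null ? new byte[0] : col.getBytes(StandardCharsets.UTF_8);
>             h = h * 31L + fnv1a64(bytes);
>         }
>         return h;
>     }
>
>     /** Toy bloom filter over longs, standing in for Hive's bloom filter. */
>     static class ToyBloomFilter {
>         private final long[] bits;
>
>         ToyBloomFilter(int numBits) {
>             bits = new long[(numBits + 63) / 64];
>         }
>
>         private int position(long hash, int i) {
>             long h = hash + (long) i * 0x9E3779B97F4A7C15L; // derive k positions
>             return (int) Long.remainderUnsigned(h, (long) bits.length * 64);
>         }
>
>         void addLong(long hash) {
>             for (int i = 0; i < 3; i++) {
>                 int pos = position(hash, i);
>                 bits[pos >>> 6] |= 1L << (pos & 63);
>             }
>         }
>
>         boolean testLong(long hash) {
>             for (int i = 0; i < 3; i++) {
>                 int pos = position(hash, i);
>                 if ((bits[pos >>> 6] & (1L << (pos & 63))) == 0) {
>                     return false;
>                 }
>             }
>             return true;
>         }
>     }
>
>     public static void main(String[] args) {
>         ToyBloomFilter bloom = new ToyBloomFilter(1024);
>         // Generation phase (small side): one filter fed with the combined hash.
>         bloom.addLong(combinedHash("k1", "v1"));
>         // Lookup phase (big side): the same combination produces the probe hash.
>         System.out.println(bloom.testLong(combinedHash("k1", "v1"))); // true
>         System.out.println(bloom.testLong(combinedHash("kX", "vX"))); // almost certainly false
>     }
> }
> {code}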
> The min and max aggregates will stay independent for each column, as they are today.
> A vectorized implementation of this UDF is also required.
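>
> For the vectorized path, the same combination can be computed column-by-column over a whole batch of rows. A rough sketch, assuming plain string arrays as stand-ins for Hive's actual VectorizedRowBatch column vectors:
> {code:java}
> import java.nio.charset.StandardCharsets;
>
> /** Sketch only: batch version of the combined hash from the previous example. */
> public class VectorizedCombinedHashSketch {
>
>     /** Same FNV-1a helper as in the previous sketch. */
>     static long fnv1a64(byte[] bytes) {
>         long hash = 0xcbf29ce484222325L;
>         for (byte b : bytes) {
>             hash ^= (b & 0xffL);
>             hash *= 0x100000001b3L;
>         }
>         return hash;
>     }
>
>     /** Combined hashes for a batch: one pass per key column over all rows. */
>     static long[] combinedHashBatch(String[][] keyColumns, int batchSize) {
>         long[] hashes = new long[batchSize];
>         java.util.Arrays.fill(hashes, 17L);
>         for (String[] column : keyColumns) {            // outer loop: columns
>             for (int row = 0; row < batchSize; row++) { // inner loop: rows
>                 byte[] bytes = column[row] == null
>                         ? new byte[0]
>                         : column[row].getBytes(StandardCharsets.UTF_8);
>                 hashes[row] = hashes[row] * 31L + fnv1a64(bytes);
>             }
>         }
>         return hashes;
>     }
>
>     public static void main(String[] args) {
>         String[] key = {"k1", "k2"};
>         String[] value = {"v1", "v2"};
>         long[] hashes = combinedHashBatch(new String[][]{key, value}, 2);
>         // Each row now has one long to feed to the shared bloom filter.
>         System.out.println(hashes[0] != hashes[1]); // almost certainly true
>     }
> }
> {code}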
--
This message was sent by Atlassian Jira
(v8.3.4#803005)