You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@hive.apache.org by "Alexander Pivovarov (JIRA)" <ji...@apache.org> on 2015/06/01 23:25:17 UTC
[jira] [Commented] (HIVE-10841) [WHERE col is not null] does not work sometimes for queries with many JOIN statements

    [ https://issues.apache.org/jira/browse/HIVE-10841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568042#comment-14568042 ] 

Alexander Pivovarov commented on HIVE-10841:
--------------------------------------------

I looked at the query plans for both cases - remove.duplicatefilters false and true
if remove.duplicatefilters is false then
- Filter Operator predicates are added to Reduce step (including the predicates for "acct" table)
- But Filter Operator predicates still not added to Map step for "acct" table.

I did some performance tests - looks like duplicating Filter Operator predicates to Reduce step does not affect the performance noticeably.
So, disabling hive.ppd.remove.duplicatefilters can be used as a good workaround until we find a way on how to fix query plan to include Filter Operator predicates on Map step for "acct" table.



> [WHERE col is not null] does not work sometimes for queries with many JOIN statements
> -------------------------------------------------------------------------------------
>
>                 Key: HIVE-10841
>                 URL: https://issues.apache.org/jira/browse/HIVE-10841
>             Project: Hive
>          Issue Type: Bug
>          Components: Query Planning, Query Processor
>    Affects Versions: 0.13.0, 0.14.0, 0.13.1, 1.2.0
>            Reporter: Alexander Pivovarov
>            Assignee: Alexander Pivovarov
>
> The result from the following SELECT query is 3 rows but it should be 1 row.
> I checked it in MySQL - it returned 1 row.
> To reproduce the issue in Hive
> 1. prepare tables
> {code}
> drop table if exists L;
> drop table if exists LA;
> drop table if exists FR;
> drop table if exists A;
> drop table if exists PI;
> drop table if exists acct;
> create table L as select 4436 id;
> create table LA as select 4436 loan_id, 4748 aid, 4415 pi_id;
> create table FR as select 4436 loan_id;
> create table A as select 4748 id;
> create table PI as select 4415 id;
> create table acct as select 4748 aid, 10 acc_n, 122 brn;
> insert into table acct values(4748, null, null);
> insert into table acct values(4748, null, null);
> {code}
> 2. run SELECT query
> {code}
> select
>   acct.ACC_N,
>   acct.brn
> FROM L
> JOIN LA ON L.id = LA.loan_id
> JOIN FR ON L.id = FR.loan_id
> JOIN A ON LA.aid = A.id
> JOIN PI ON PI.id = LA.pi_id
> JOIN acct ON A.id = acct.aid
> WHERE
>   L.id = 4436
>   and acct.brn is not null;
> {code}
> the result is 3 rows
> {code}
> 10	122
> NULL	NULL
> NULL	NULL
> {code}
> but it should be 1 row
> {code}
> 10	122
> {code}
> 2.1 "explain select ..." output for hive-1.3.0 MR
> {code}
> STAGE DEPENDENCIES:
>   Stage-12 is a root stage
>   Stage-9 depends on stages: Stage-12
>   Stage-0 depends on stages: Stage-9
> STAGE PLANS:
>   Stage: Stage-12
>     Map Reduce Local Work
>       Alias -> Map Local Tables:
>         a 
>           Fetch Operator
>             limit: -1
>         acct 
>           Fetch Operator
>             limit: -1
>         fr 
>           Fetch Operator
>             limit: -1
>         l 
>           Fetch Operator
>             limit: -1
>         pi 
>           Fetch Operator
>             limit: -1
>       Alias -> Map Local Operator Tree:
>         a 
>           TableScan
>             alias: a
>             Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: NONE
>             Filter Operator
>               predicate: id is not null (type: boolean)
>               Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: NONE
>               HashTable Sink Operator
>                 keys:
>                   0 _col5 (type: int)
>                   1 id (type: int)
>                   2 aid (type: int)
>         acct 
>           TableScan
>             alias: acct
>             Statistics: Num rows: 3 Data size: 31 Basic stats: COMPLETE Column stats: NONE
>             Filter Operator
>               predicate: aid is not null (type: boolean)
>               Statistics: Num rows: 2 Data size: 20 Basic stats: COMPLETE Column stats: NONE
>               HashTable Sink Operator
>                 keys:
>                   0 _col5 (type: int)
>                   1 id (type: int)
>                   2 aid (type: int)
>         fr 
>           TableScan
>             alias: fr
>             Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: NONE
>             Filter Operator
>               predicate: (loan_id = 4436) (type: boolean)
>               Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: NONE
>               HashTable Sink Operator
>                 keys:
>                   0 4436 (type: int)
>                   1 4436 (type: int)
>                   2 4436 (type: int)
>         l 
>           TableScan
>             alias: l
>             Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: NONE
>             Filter Operator
>               predicate: (id = 4436) (type: boolean)
>               Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: NONE
>               HashTable Sink Operator
>                 keys:
>                   0 4436 (type: int)
>                   1 4436 (type: int)
>                   2 4436 (type: int)
>         pi 
>           TableScan
>             alias: pi
>             Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: NONE
>             Filter Operator
>               predicate: id is not null (type: boolean)
>               Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: NONE
>               HashTable Sink Operator
>                 keys:
>                   0 _col6 (type: int)
>                   1 id (type: int)
>   Stage: Stage-9
>     Map Reduce
>       Map Operator Tree:
>           TableScan
>             alias: la
>             Statistics: Num rows: 1 Data size: 14 Basic stats: COMPLETE Column stats: NONE
>             Filter Operator
>               predicate: (((loan_id is not null and aid is not null) and pi_id is not null) and (loan_id = 4436)) (type: boolean)
>               Statistics: Num rows: 1 Data size: 14 Basic stats: COMPLETE Column stats: NONE
>               Map Join Operator
>                 condition map:
>                      Inner Join 0 to 1
>                      Inner Join 0 to 2
>                 keys:
>                   0 4436 (type: int)
>                   1 4436 (type: int)
>                   2 4436 (type: int)
>                 outputColumnNames: _col5, _col6
>                 Statistics: Num rows: 2 Data size: 8 Basic stats: COMPLETE Column stats: NONE
>                 Map Join Operator
>                   condition map:
>                        Inner Join 0 to 1
>                        Inner Join 1 to 2
>                   keys:
>                     0 _col5 (type: int)
>                     1 id (type: int)
>                     2 aid (type: int)
>                   outputColumnNames: _col6, _col19, _col20
>                   Statistics: Num rows: 4 Data size: 17 Basic stats: COMPLETE Column stats: NONE
>                   Map Join Operator
>                     condition map:
>                          Inner Join 0 to 1
>                     keys:
>                       0 _col6 (type: int)
>                       1 id (type: int)
>                     outputColumnNames: _col19, _col20
>                     Statistics: Num rows: 4 Data size: 18 Basic stats: COMPLETE Column stats: NONE
>                     Select Operator
>                       expressions: _col19 (type: int), _col20 (type: int)
>                       outputColumnNames: _col0, _col1
>                       Statistics: Num rows: 4 Data size: 18 Basic stats: COMPLETE Column stats: NONE
>                       File Output Operator
>                         compressed: false
>                         Statistics: Num rows: 4 Data size: 18 Basic stats: COMPLETE Column stats: NONE
>                         table:
>                             input format: org.apache.hadoop.mapred.TextInputFormat
>                             output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>                             serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>       Local Work:
>         Map Reduce Local Work
>   Stage: Stage-0
>     Fetch Operator
>       limit: -1
>       Processor Tree:
>         ListSink
> Time taken: 0.57 seconds, Fetched: 142 row(s)
> {code}
> 2.2. "explain select..." output for hive-0.13.1 Tez
> {code}
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-0 is a root stage
> STAGE PLANS:
>   Stage: Stage-1
>     Tez
>       Edges:
>         Reducer 2 <- Map 1 (SIMPLE_EDGE), Map 4 (SIMPLE_EDGE), Reducer 6 (SIMPLE_EDGE)
>         Reducer 3 <- Reducer 2 (SIMPLE_EDGE), Map 9 (SIMPLE_EDGE)
>         Reducer 6 <- Map 5 (SIMPLE_EDGE), Map 7 (SIMPLE_EDGE), Map 8 (SIMPLE_EDGE)
>       DagName: lcapp_20150528111717_06c57a5b-8dc6-4ce9-bce7-b9e0a7818fe4:1
>       Vertices:
>         Map 1 
>             Map Operator Tree:
>                 TableScan
>                   alias: acct
>                   Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: NONE
>                   Reduce Output Operator
>                     key expressions: aid (type: int)
>                     sort order: +
>                     Map-reduce partition columns: aid (type: int)
>                     Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: NONE
>                     value expressions: acc_n (type: int), brn (type: int)
>         Map 4 
>             Map Operator Tree:
>                 TableScan
>                   alias: a
>                   Statistics: Num rows: 46 Data size: 187 Basic stats: COMPLETE Column stats: NONE
>                   Reduce Output Operator
>                     key expressions: id (type: int)
>                     sort order: +
>                     Map-reduce partition columns: id (type: int)
>                     Statistics: Num rows: 46 Data size: 187 Basic stats: COMPLETE Column stats: NONE
>         Map 5 
>             Map Operator Tree:
>                 TableScan
>                   alias: la
>                   Statistics: Num rows: 28 Data size: 347 Basic stats: COMPLETE Column stats: NONE
>                   Filter Operator
>                     predicate: (loan_id = 4436) (type: boolean)
>                     Statistics: Num rows: 14 Data size: 173 Basic stats: COMPLETE Column stats: NONE
>                     Reduce Output Operator
>                       key expressions: loan_id (type: int)
>                       sort order: +
>                       Map-reduce partition columns: loan_id (type: int)
>                       Statistics: Num rows: 14 Data size: 173 Basic stats: COMPLETE Column stats: NONE
>                       value expressions: aid (type: int), pi_id (type: int)
>         Map 7 
>             Map Operator Tree:
>                 TableScan
>                   alias: fr
>                   Statistics: Num rows: 46 Data size: 187 Basic stats: COMPLETE Column stats: NONE
>                   Filter Operator
>                     predicate: (loan_id = 4436) (type: boolean)
>                     Statistics: Num rows: 23 Data size: 93 Basic stats: COMPLETE Column stats: NONE
>                     Reduce Output Operator
>                       key expressions: loan_id (type: int)
>                       sort order: +
>                       Map-reduce partition columns: loan_id (type: int)
>                       Statistics: Num rows: 23 Data size: 93 Basic stats: COMPLETE Column stats: NONE
>         Map 8 
>             Map Operator Tree:
>                 TableScan
>                   alias: l
>                   Statistics: Num rows: 46 Data size: 187 Basic stats: COMPLETE Column stats: NONE
>                   Filter Operator
>                     predicate: (id = 4436) (type: boolean)
>                     Statistics: Num rows: 23 Data size: 93 Basic stats: COMPLETE Column stats: NONE
>                     Reduce Output Operator
>                       key expressions: id (type: int)
>                       sort order: +
>                       Map-reduce partition columns: id (type: int)
>                       Statistics: Num rows: 23 Data size: 93 Basic stats: COMPLETE Column stats: NONE
>         Map 9 
>             Map Operator Tree:
>                 TableScan
>                   alias: pi
>                   Statistics: Num rows: 46 Data size: 187 Basic stats: COMPLETE Column stats: NONE
>                   Reduce Output Operator
>                     key expressions: id (type: int)
>                     sort order: +
>                     Map-reduce partition columns: id (type: int)
>                     Statistics: Num rows: 46 Data size: 187 Basic stats: COMPLETE Column stats: NONE
>         Reducer 2 
>             Reduce Operator Tree:
>               Join Operator
>                 condition map:
>                      Inner Join 0 to 1
>                      Inner Join 1 to 2
>                 condition expressions:
>                   0 {VALUE._col2}
>                   1 
>                   2 {VALUE._col1} {VALUE._col2}
>                 outputColumnNames: _col2, _col15, _col16
>                 Statistics: Num rows: 110 Data size: 448 Basic stats: COMPLETE Column stats: NONE
>                 Reduce Output Operator
>                   key expressions: _col2 (type: int)
>                   sort order: +
>                   Map-reduce partition columns: _col2 (type: int)
>                   Statistics: Num rows: 110 Data size: 448 Basic stats: COMPLETE Column stats: NONE
>                   value expressions: _col15 (type: int), _col16 (type: int)
>         Reducer 3 
>             Reduce Operator Tree:
>               Join Operator
>                 condition map:
>                      Inner Join 0 to 1
>                 condition expressions:
>                   0 {VALUE._col1} {VALUE._col2}
>                   1 
>                 outputColumnNames: _col1, _col2
>                 Statistics: Num rows: 121 Data size: 492 Basic stats: COMPLETE Column stats: NONE
>                 Select Operator
>                   expressions: _col1 (type: int), _col2 (type: int)
>                   outputColumnNames: _col0, _col1
>                   Statistics: Num rows: 121 Data size: 492 Basic stats: COMPLETE Column stats: NONE
>                   File Output Operator
>                     compressed: false
>                     Statistics: Num rows: 121 Data size: 492 Basic stats: COMPLETE Column stats: NONE
>                     table:
>                         input format: org.apache.hadoop.mapred.TextInputFormat
>                         output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>                         serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>         Reducer 6 
>             Reduce Operator Tree:
>               Join Operator
>                 condition map:
>                      Inner Join 0 to 1
>                      Inner Join 0 to 2
>                 condition expressions:
>                   0 
>                   1 {VALUE._col1} {VALUE._col2}
>                   2 
>                 outputColumnNames: _col4, _col5
>                 Statistics: Num rows: 50 Data size: 204 Basic stats: COMPLETE Column stats: NONE
>                 Reduce Output Operator
>                   key expressions: _col4 (type: int)
>                   sort order: +
>                   Map-reduce partition columns: _col4 (type: int)
>                   Statistics: Num rows: 50 Data size: 204 Basic stats: COMPLETE Column stats: NONE
>                   value expressions: _col5 (type: int)
>   Stage: Stage-0
>     Fetch Operator
>       limit: -1
> Time taken: 1.377 seconds, Fetched: 146 row(s)
> {code}
> 3. The workaround is to put "acct.brn is not null" to join condition
> {code}
> select
>   acct.ACC_N,
>   acct.brn
> FROM L
> JOIN LA ON L.id = LA.loan_id
> JOIN FR ON L.id = FR.loan_id
> JOIN A ON LA.aid = A.id
> JOIN PI ON PI.id = LA.pi_id
> JOIN acct ON A.id = acct.aid and acct.brn is not null
> WHERE
>   L.id = 4436;
> OK
> 10	122
> Time taken: 23.479 seconds, Fetched: 1 row(s)
> {code}
> I tried it on hive-1.3.0 (MR) and hive-0.13.1 (MR and Tez) - all combinations have the issue



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)