You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@hive.apache.org by "Alexander Pivovarov (JIRA)" <ji...@apache.org> on 2015/06/01 23:25:17 UTC
[jira] [Commented] (HIVE-10841) [WHERE col is not null] does not
work sometimes for queries with many JOIN statements
[ https://issues.apache.org/jira/browse/HIVE-10841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568042#comment-14568042 ]
Alexander Pivovarov commented on HIVE-10841:
--------------------------------------------
I looked at the query plans for both cases - remove.duplicatefilters false and true
if remove.duplicatefilters is false then
- Filter Operator predicates are added to Reduce step (including the predicates for "acct" table)
- But Filter Operator predicates still not added to Map step for "acct" table.
I did some performance tests - looks like duplicating Filter Operator predicates to Reduce step does not affect the performance noticeably.
So, disabling hive.ppd.remove.duplicatefilters can be used as a good workaround until we find a way on how to fix query plan to include Filter Operator predicates on Map step for "acct" table.
> [WHERE col is not null] does not work sometimes for queries with many JOIN statements
> -------------------------------------------------------------------------------------
>
> Key: HIVE-10841
> URL: https://issues.apache.org/jira/browse/HIVE-10841
> Project: Hive
> Issue Type: Bug
> Components: Query Planning, Query Processor
> Affects Versions: 0.13.0, 0.14.0, 0.13.1, 1.2.0
> Reporter: Alexander Pivovarov
> Assignee: Alexander Pivovarov
>
> The result from the following SELECT query is 3 rows but it should be 1 row.
> I checked it in MySQL - it returned 1 row.
> To reproduce the issue in Hive
> 1. prepare tables
> {code}
> drop table if exists L;
> drop table if exists LA;
> drop table if exists FR;
> drop table if exists A;
> drop table if exists PI;
> drop table if exists acct;
> create table L as select 4436 id;
> create table LA as select 4436 loan_id, 4748 aid, 4415 pi_id;
> create table FR as select 4436 loan_id;
> create table A as select 4748 id;
> create table PI as select 4415 id;
> create table acct as select 4748 aid, 10 acc_n, 122 brn;
> insert into table acct values(4748, null, null);
> insert into table acct values(4748, null, null);
> {code}
> 2. run SELECT query
> {code}
> select
> acct.ACC_N,
> acct.brn
> FROM L
> JOIN LA ON L.id = LA.loan_id
> JOIN FR ON L.id = FR.loan_id
> JOIN A ON LA.aid = A.id
> JOIN PI ON PI.id = LA.pi_id
> JOIN acct ON A.id = acct.aid
> WHERE
> L.id = 4436
> and acct.brn is not null;
> {code}
> the result is 3 rows
> {code}
> 10 122
> NULL NULL
> NULL NULL
> {code}
> but it should be 1 row
> {code}
> 10 122
> {code}
> 2.1 "explain select ..." output for hive-1.3.0 MR
> {code}
> STAGE DEPENDENCIES:
> Stage-12 is a root stage
> Stage-9 depends on stages: Stage-12
> Stage-0 depends on stages: Stage-9
> STAGE PLANS:
> Stage: Stage-12
> Map Reduce Local Work
> Alias -> Map Local Tables:
> a
> Fetch Operator
> limit: -1
> acct
> Fetch Operator
> limit: -1
> fr
> Fetch Operator
> limit: -1
> l
> Fetch Operator
> limit: -1
> pi
> Fetch Operator
> limit: -1
> Alias -> Map Local Operator Tree:
> a
> TableScan
> alias: a
> Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: NONE
> Filter Operator
> predicate: id is not null (type: boolean)
> Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: NONE
> HashTable Sink Operator
> keys:
> 0 _col5 (type: int)
> 1 id (type: int)
> 2 aid (type: int)
> acct
> TableScan
> alias: acct
> Statistics: Num rows: 3 Data size: 31 Basic stats: COMPLETE Column stats: NONE
> Filter Operator
> predicate: aid is not null (type: boolean)
> Statistics: Num rows: 2 Data size: 20 Basic stats: COMPLETE Column stats: NONE
> HashTable Sink Operator
> keys:
> 0 _col5 (type: int)
> 1 id (type: int)
> 2 aid (type: int)
> fr
> TableScan
> alias: fr
> Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: NONE
> Filter Operator
> predicate: (loan_id = 4436) (type: boolean)
> Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: NONE
> HashTable Sink Operator
> keys:
> 0 4436 (type: int)
> 1 4436 (type: int)
> 2 4436 (type: int)
> l
> TableScan
> alias: l
> Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: NONE
> Filter Operator
> predicate: (id = 4436) (type: boolean)
> Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: NONE
> HashTable Sink Operator
> keys:
> 0 4436 (type: int)
> 1 4436 (type: int)
> 2 4436 (type: int)
> pi
> TableScan
> alias: pi
> Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: NONE
> Filter Operator
> predicate: id is not null (type: boolean)
> Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: NONE
> HashTable Sink Operator
> keys:
> 0 _col6 (type: int)
> 1 id (type: int)
> Stage: Stage-9
> Map Reduce
> Map Operator Tree:
> TableScan
> alias: la
> Statistics: Num rows: 1 Data size: 14 Basic stats: COMPLETE Column stats: NONE
> Filter Operator
> predicate: (((loan_id is not null and aid is not null) and pi_id is not null) and (loan_id = 4436)) (type: boolean)
> Statistics: Num rows: 1 Data size: 14 Basic stats: COMPLETE Column stats: NONE
> Map Join Operator
> condition map:
> Inner Join 0 to 1
> Inner Join 0 to 2
> keys:
> 0 4436 (type: int)
> 1 4436 (type: int)
> 2 4436 (type: int)
> outputColumnNames: _col5, _col6
> Statistics: Num rows: 2 Data size: 8 Basic stats: COMPLETE Column stats: NONE
> Map Join Operator
> condition map:
> Inner Join 0 to 1
> Inner Join 1 to 2
> keys:
> 0 _col5 (type: int)
> 1 id (type: int)
> 2 aid (type: int)
> outputColumnNames: _col6, _col19, _col20
> Statistics: Num rows: 4 Data size: 17 Basic stats: COMPLETE Column stats: NONE
> Map Join Operator
> condition map:
> Inner Join 0 to 1
> keys:
> 0 _col6 (type: int)
> 1 id (type: int)
> outputColumnNames: _col19, _col20
> Statistics: Num rows: 4 Data size: 18 Basic stats: COMPLETE Column stats: NONE
> Select Operator
> expressions: _col19 (type: int), _col20 (type: int)
> outputColumnNames: _col0, _col1
> Statistics: Num rows: 4 Data size: 18 Basic stats: COMPLETE Column stats: NONE
> File Output Operator
> compressed: false
> Statistics: Num rows: 4 Data size: 18 Basic stats: COMPLETE Column stats: NONE
> table:
> input format: org.apache.hadoop.mapred.TextInputFormat
> output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
> serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> Local Work:
> Map Reduce Local Work
> Stage: Stage-0
> Fetch Operator
> limit: -1
> Processor Tree:
> ListSink
> Time taken: 0.57 seconds, Fetched: 142 row(s)
> {code}
> 2.2. "explain select..." output for hive-0.13.1 Tez
> {code}
> STAGE DEPENDENCIES:
> Stage-1 is a root stage
> Stage-0 is a root stage
> STAGE PLANS:
> Stage: Stage-1
> Tez
> Edges:
> Reducer 2 <- Map 1 (SIMPLE_EDGE), Map 4 (SIMPLE_EDGE), Reducer 6 (SIMPLE_EDGE)
> Reducer 3 <- Reducer 2 (SIMPLE_EDGE), Map 9 (SIMPLE_EDGE)
> Reducer 6 <- Map 5 (SIMPLE_EDGE), Map 7 (SIMPLE_EDGE), Map 8 (SIMPLE_EDGE)
> DagName: lcapp_20150528111717_06c57a5b-8dc6-4ce9-bce7-b9e0a7818fe4:1
> Vertices:
> Map 1
> Map Operator Tree:
> TableScan
> alias: acct
> Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: NONE
> Reduce Output Operator
> key expressions: aid (type: int)
> sort order: +
> Map-reduce partition columns: aid (type: int)
> Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: NONE
> value expressions: acc_n (type: int), brn (type: int)
> Map 4
> Map Operator Tree:
> TableScan
> alias: a
> Statistics: Num rows: 46 Data size: 187 Basic stats: COMPLETE Column stats: NONE
> Reduce Output Operator
> key expressions: id (type: int)
> sort order: +
> Map-reduce partition columns: id (type: int)
> Statistics: Num rows: 46 Data size: 187 Basic stats: COMPLETE Column stats: NONE
> Map 5
> Map Operator Tree:
> TableScan
> alias: la
> Statistics: Num rows: 28 Data size: 347 Basic stats: COMPLETE Column stats: NONE
> Filter Operator
> predicate: (loan_id = 4436) (type: boolean)
> Statistics: Num rows: 14 Data size: 173 Basic stats: COMPLETE Column stats: NONE
> Reduce Output Operator
> key expressions: loan_id (type: int)
> sort order: +
> Map-reduce partition columns: loan_id (type: int)
> Statistics: Num rows: 14 Data size: 173 Basic stats: COMPLETE Column stats: NONE
> value expressions: aid (type: int), pi_id (type: int)
> Map 7
> Map Operator Tree:
> TableScan
> alias: fr
> Statistics: Num rows: 46 Data size: 187 Basic stats: COMPLETE Column stats: NONE
> Filter Operator
> predicate: (loan_id = 4436) (type: boolean)
> Statistics: Num rows: 23 Data size: 93 Basic stats: COMPLETE Column stats: NONE
> Reduce Output Operator
> key expressions: loan_id (type: int)
> sort order: +
> Map-reduce partition columns: loan_id (type: int)
> Statistics: Num rows: 23 Data size: 93 Basic stats: COMPLETE Column stats: NONE
> Map 8
> Map Operator Tree:
> TableScan
> alias: l
> Statistics: Num rows: 46 Data size: 187 Basic stats: COMPLETE Column stats: NONE
> Filter Operator
> predicate: (id = 4436) (type: boolean)
> Statistics: Num rows: 23 Data size: 93 Basic stats: COMPLETE Column stats: NONE
> Reduce Output Operator
> key expressions: id (type: int)
> sort order: +
> Map-reduce partition columns: id (type: int)
> Statistics: Num rows: 23 Data size: 93 Basic stats: COMPLETE Column stats: NONE
> Map 9
> Map Operator Tree:
> TableScan
> alias: pi
> Statistics: Num rows: 46 Data size: 187 Basic stats: COMPLETE Column stats: NONE
> Reduce Output Operator
> key expressions: id (type: int)
> sort order: +
> Map-reduce partition columns: id (type: int)
> Statistics: Num rows: 46 Data size: 187 Basic stats: COMPLETE Column stats: NONE
> Reducer 2
> Reduce Operator Tree:
> Join Operator
> condition map:
> Inner Join 0 to 1
> Inner Join 1 to 2
> condition expressions:
> 0 {VALUE._col2}
> 1
> 2 {VALUE._col1} {VALUE._col2}
> outputColumnNames: _col2, _col15, _col16
> Statistics: Num rows: 110 Data size: 448 Basic stats: COMPLETE Column stats: NONE
> Reduce Output Operator
> key expressions: _col2 (type: int)
> sort order: +
> Map-reduce partition columns: _col2 (type: int)
> Statistics: Num rows: 110 Data size: 448 Basic stats: COMPLETE Column stats: NONE
> value expressions: _col15 (type: int), _col16 (type: int)
> Reducer 3
> Reduce Operator Tree:
> Join Operator
> condition map:
> Inner Join 0 to 1
> condition expressions:
> 0 {VALUE._col1} {VALUE._col2}
> 1
> outputColumnNames: _col1, _col2
> Statistics: Num rows: 121 Data size: 492 Basic stats: COMPLETE Column stats: NONE
> Select Operator
> expressions: _col1 (type: int), _col2 (type: int)
> outputColumnNames: _col0, _col1
> Statistics: Num rows: 121 Data size: 492 Basic stats: COMPLETE Column stats: NONE
> File Output Operator
> compressed: false
> Statistics: Num rows: 121 Data size: 492 Basic stats: COMPLETE Column stats: NONE
> table:
> input format: org.apache.hadoop.mapred.TextInputFormat
> output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
> serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> Reducer 6
> Reduce Operator Tree:
> Join Operator
> condition map:
> Inner Join 0 to 1
> Inner Join 0 to 2
> condition expressions:
> 0
> 1 {VALUE._col1} {VALUE._col2}
> 2
> outputColumnNames: _col4, _col5
> Statistics: Num rows: 50 Data size: 204 Basic stats: COMPLETE Column stats: NONE
> Reduce Output Operator
> key expressions: _col4 (type: int)
> sort order: +
> Map-reduce partition columns: _col4 (type: int)
> Statistics: Num rows: 50 Data size: 204 Basic stats: COMPLETE Column stats: NONE
> value expressions: _col5 (type: int)
> Stage: Stage-0
> Fetch Operator
> limit: -1
> Time taken: 1.377 seconds, Fetched: 146 row(s)
> {code}
> 3. The workaround is to put "acct.brn is not null" to join condition
> {code}
> select
> acct.ACC_N,
> acct.brn
> FROM L
> JOIN LA ON L.id = LA.loan_id
> JOIN FR ON L.id = FR.loan_id
> JOIN A ON LA.aid = A.id
> JOIN PI ON PI.id = LA.pi_id
> JOIN acct ON A.id = acct.aid and acct.brn is not null
> WHERE
> L.id = 4436;
> OK
> 10 122
> Time taken: 23.479 seconds, Fetched: 1 row(s)
> {code}
> I tried it on hive-1.3.0 (MR) and hive-0.13.1 (MR and Tez) - all combinations have the issue
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)