You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@hive.apache.org by "Igor Dvorzhak (Jira)" <ji...@apache.org> on 2020/02/14 17:14:00 UTC
[jira] [Updated] (HIVE-22894) Filter on subquery with GROUP BY
returns wrong column
[ https://issues.apache.org/jira/browse/HIVE-22894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Igor Dvorzhak updated HIVE-22894:
---------------------------------
Description:
Reproduction steps:
{code:java}
$ echo -e "02/11/20,C_A,C_A_B\n02/11/20,C_A,C_A_C" | hadoop fs -put - /user/hive/warehouse/test/data.csv
$ hive
> CREATE TABLE test(date_str STRING, category STRING, subcategory STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
OK
Time taken: 0.877 seconds
> SELECT DISTINCT category FROM (SELECT date_str, category, subcategory FROM test WHERE date_str='02/11/20' GROUP BY date_str, category, subcategory) AS t WHERE t.category='C_A';
OK
C_A_B
C_A_C
Time taken: 9.108 seconds, Fetched: 2 row(s)
> EXPLAIN SELECT DISTINCT category FROM (SELECT date_str, category, subcategory FROM test WHERE date_str='02/11/20' GROUP BY date_str, category, subcategory) AS t WHERE t.category='C_A';
OK
Plan optimized by CBO.Vertex dependency in root stage
Reducer 2 <- Map 1 (SIMPLE_EDGE)Stage-0
Fetch Operator
limit:-1
Stage-1
Reducer 2
File Output Operator [FS_12]
Group By Operator [GBY_10] (rows=1 width=38)
Output:["_col0"],keys:_col0
Group By Operator [GBY_5] (rows=1 width=38)
Output:["_col0"],keys:KEY._col0
<-Map 1 [SIMPLE_EDGE]
SHUFFLE [RS_4]
PartitionCols:_col0
Group By Operator [GBY_3] (rows=1 width=38)
Output:["_col0"],keys:subcategory
Select Operator [SEL_2] (rows=1 width=38)
Output:["subcategory"]
Filter Operator [FIL_13] (rows=1 width=38)
predicate:((date_str = '02/11/20') and (category = 'C_A'))
TableScan [TS_0] (rows=1 width=38)
default@test,test,Tbl:COMPLETE,Col:NONE,Output:["date_str","category","subcategory"]
Time taken: 0.21 seconds, Fetched: 27 row(s)
{code}
It works as expected with disabled CBO:
{code:java}
> SET hive.cbo.enable=false;
> SELECT DISTINCT category FROM (SELECT date_str, category, subcategory FROM test WHERE date_str='02/11/20' GROUP BY date_str, category, subcategory) AS t WHERE t.category='C_A';
OK
C_A
Time taken: 13.948 seconds, Fetched: 1 row(s)
> EXPLAIN SELECT DISTINCT category FROM (SELECT date_str, category, subcategory FROM test WHERE date_str='02/11/20' GROUP BY date_str, category, subcategory) AS t WHERE t.category='C_A';
OK
Vertex dependency in root stage
Reducer 2 <- Map 1 (SIMPLE_EDGE)
Reducer 3 <- Reducer 2 (SIMPLE_EDGE)Stage-0
Fetch Operator
limit:-1
Stage-1
Reducer 3
File Output Operator [FS_13]
Select Operator [SEL_12] (rows=1 width=38)
Output:["_col0"]
Group By Operator [GBY_11] (rows=1 width=38)
Output:["_col0"],keys:'C_A'
<-Reducer 2 [SIMPLE_EDGE]
SHUFFLE [RS_10]
PartitionCols:'C_A'
Group By Operator [GBY_9] (rows=1 width=38)
Output:["_col0"],keys:'C_A'
Select Operator [SEL_6] (rows=1 width=38)
Group By Operator [GBY_5] (rows=1 width=38)
Output:["_col0","_col1","_col2"],keys:'02/11/20', 'C_A', KEY._col2
<-Map 1 [SIMPLE_EDGE]
SHUFFLE [RS_4]
PartitionCols:'02/11/20', 'C_A', _col2
Group By Operator [GBY_3] (rows=1 width=38)
Output:["_col0","_col1","_col2"],keys:'02/11/20', 'C_A', subcategory
Select Operator [SEL_2] (rows=1 width=38)
Output:["subcategory"]
Filter Operator [FIL_14] (rows=1 width=38)
predicate:((date_str = '02/11/20') and (category = 'C_A'))
TableScan [TS_0] (rows=1 width=38)
default@test,test,Tbl:COMPLETE,Col:NONE,Output:["date_str","category","subcategory"]
Time taken: 0.065 seconds, Fetched: 34 row(s){code}
was:
Reproduction steps:
{code:java}
$ echo -e "02/11/20,C_A,C_A_B\n02/11/20,C_A,C_A_C" | hadoop fs -put - /user/hive/warehouse/test/data.csv
$ hive
> CREATE TABLE test(date_str STRING, category STRING, subcategory STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
OK
Time taken: 0.877 seconds
> SELECT DISTINCT category FROM (SELECT date_str, category, subcategory FROM test HERE date_str='02/11/20' GROUP BY date_str, category, subcategory) AS t WHERE t.category='C_A';
OK
C_A_B
C_A_C
Time taken: 9.108 seconds, Fetched: 2 row(s)
> EXPLAIN SELECT DISTINCT category FROM (SELECT date_str, category, subcategory FROM test WHERE date_str='02/11/20' GROUP BY date_str, category, subcategory) AS t WHERE t.category='C_A';
OK
Plan optimized by CBO.Vertex dependency in root stage
Reducer 2 <- Map 1 (SIMPLE_EDGE)Stage-0
Fetch Operator
limit:-1
Stage-1
Reducer 2
File Output Operator [FS_12]
Group By Operator [GBY_10] (rows=1 width=38)
Output:["_col0"],keys:_col0
Group By Operator [GBY_5] (rows=1 width=38)
Output:["_col0"],keys:KEY._col0
<-Map 1 [SIMPLE_EDGE]
SHUFFLE [RS_4]
PartitionCols:_col0
Group By Operator [GBY_3] (rows=1 width=38)
Output:["_col0"],keys:subcategory
Select Operator [SEL_2] (rows=1 width=38)
Output:["subcategory"]
Filter Operator [FIL_13] (rows=1 width=38)
predicate:((date_str = '02/11/20') and (category = 'C_A'))
TableScan [TS_0] (rows=1 width=38)
default@test,test,Tbl:COMPLETE,Col:NONE,Output:["date_str","category","subcategory"]
Time taken: 0.21 seconds, Fetched: 27 row(s)
{code}
It works as expected with disabled CBO:
{code:java}
> SET hive.cbo.enable=false;
> SELECT DISTINCT category FROM (SELECT date_str, category, subcategory FROM test WHERE date_str='02/11/20' GROUP BY date_str, category, subcategory) AS t WHERE t.category='C_A';
OK
C_A
Time taken: 13.948 seconds, Fetched: 1 row(s)
> EXPLAIN SELECT DISTINCT category FROM (SELECT date_str, category, subcategory FROM test WHERE date_str='02/11/20' GROUP BY date_str, category, subcategory) AS t WHERE t.category='C_A';
OK
Vertex dependency in root stage
Reducer 2 <- Map 1 (SIMPLE_EDGE)
Reducer 3 <- Reducer 2 (SIMPLE_EDGE)Stage-0
Fetch Operator
limit:-1
Stage-1
Reducer 3
File Output Operator [FS_13]
Select Operator [SEL_12] (rows=1 width=38)
Output:["_col0"]
Group By Operator [GBY_11] (rows=1 width=38)
Output:["_col0"],keys:'C_A'
<-Reducer 2 [SIMPLE_EDGE]
SHUFFLE [RS_10]
PartitionCols:'C_A'
Group By Operator [GBY_9] (rows=1 width=38)
Output:["_col0"],keys:'C_A'
Select Operator [SEL_6] (rows=1 width=38)
Group By Operator [GBY_5] (rows=1 width=38)
Output:["_col0","_col1","_col2"],keys:'02/11/20', 'C_A', KEY._col2
<-Map 1 [SIMPLE_EDGE]
SHUFFLE [RS_4]
PartitionCols:'02/11/20', 'C_A', _col2
Group By Operator [GBY_3] (rows=1 width=38)
Output:["_col0","_col1","_col2"],keys:'02/11/20', 'C_A', subcategory
Select Operator [SEL_2] (rows=1 width=38)
Output:["subcategory"]
Filter Operator [FIL_14] (rows=1 width=38)
predicate:((date_str = '02/11/20') and (category = 'C_A'))
TableScan [TS_0] (rows=1 width=38)
default@test,test,Tbl:COMPLETE,Col:NONE,Output:["date_str","category","subcategory"]
Time taken: 0.065 seconds, Fetched: 34 row(s){code}
> Filter on subquery with GROUP BY returns wrong column
> -----------------------------------------------------
>
> Key: HIVE-22894
> URL: https://issues.apache.org/jira/browse/HIVE-22894
> Project: Hive
> Issue Type: Bug
> Components: CBO
> Affects Versions: 2.3.6
> Reporter: Igor Dvorzhak
> Priority: Major
>
> Reproduction steps:
> {code:java}
> $ echo -e "02/11/20,C_A,C_A_B\n02/11/20,C_A,C_A_C" | hadoop fs -put - /user/hive/warehouse/test/data.csv
> $ hive
> > CREATE TABLE test(date_str STRING, category STRING, subcategory STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
> OK
> Time taken: 0.877 seconds
> > SELECT DISTINCT category FROM (SELECT date_str, category, subcategory FROM test WHERE date_str='02/11/20' GROUP BY date_str, category, subcategory) AS t WHERE t.category='C_A';
> OK
> C_A_B
> C_A_C
> Time taken: 9.108 seconds, Fetched: 2 row(s)
> > EXPLAIN SELECT DISTINCT category FROM (SELECT date_str, category, subcategory FROM test WHERE date_str='02/11/20' GROUP BY date_str, category, subcategory) AS t WHERE t.category='C_A';
> OK
> Plan optimized by CBO.Vertex dependency in root stage
> Reducer 2 <- Map 1 (SIMPLE_EDGE)Stage-0
> Fetch Operator
> limit:-1
> Stage-1
> Reducer 2
> File Output Operator [FS_12]
> Group By Operator [GBY_10] (rows=1 width=38)
> Output:["_col0"],keys:_col0
> Group By Operator [GBY_5] (rows=1 width=38)
> Output:["_col0"],keys:KEY._col0
> <-Map 1 [SIMPLE_EDGE]
> SHUFFLE [RS_4]
> PartitionCols:_col0
> Group By Operator [GBY_3] (rows=1 width=38)
> Output:["_col0"],keys:subcategory
> Select Operator [SEL_2] (rows=1 width=38)
> Output:["subcategory"]
> Filter Operator [FIL_13] (rows=1 width=38)
> predicate:((date_str = '02/11/20') and (category = 'C_A'))
> TableScan [TS_0] (rows=1 width=38)
> default@test,test,Tbl:COMPLETE,Col:NONE,Output:["date_str","category","subcategory"]
> Time taken: 0.21 seconds, Fetched: 27 row(s)
> {code}
>
> It works as expected with disabled CBO:
> {code:java}
> > SET hive.cbo.enable=false;
> > SELECT DISTINCT category FROM (SELECT date_str, category, subcategory FROM test WHERE date_str='02/11/20' GROUP BY date_str, category, subcategory) AS t WHERE t.category='C_A';
> OK
> C_A
> Time taken: 13.948 seconds, Fetched: 1 row(s)
> > EXPLAIN SELECT DISTINCT category FROM (SELECT date_str, category, subcategory FROM test WHERE date_str='02/11/20' GROUP BY date_str, category, subcategory) AS t WHERE t.category='C_A';
> OK
> Vertex dependency in root stage
> Reducer 2 <- Map 1 (SIMPLE_EDGE)
> Reducer 3 <- Reducer 2 (SIMPLE_EDGE)Stage-0
> Fetch Operator
> limit:-1
> Stage-1
> Reducer 3
> File Output Operator [FS_13]
> Select Operator [SEL_12] (rows=1 width=38)
> Output:["_col0"]
> Group By Operator [GBY_11] (rows=1 width=38)
> Output:["_col0"],keys:'C_A'
> <-Reducer 2 [SIMPLE_EDGE]
> SHUFFLE [RS_10]
> PartitionCols:'C_A'
> Group By Operator [GBY_9] (rows=1 width=38)
> Output:["_col0"],keys:'C_A'
> Select Operator [SEL_6] (rows=1 width=38)
> Group By Operator [GBY_5] (rows=1 width=38)
> Output:["_col0","_col1","_col2"],keys:'02/11/20', 'C_A', KEY._col2
> <-Map 1 [SIMPLE_EDGE]
> SHUFFLE [RS_4]
> PartitionCols:'02/11/20', 'C_A', _col2
> Group By Operator [GBY_3] (rows=1 width=38)
> Output:["_col0","_col1","_col2"],keys:'02/11/20', 'C_A', subcategory
> Select Operator [SEL_2] (rows=1 width=38)
> Output:["subcategory"]
> Filter Operator [FIL_14] (rows=1 width=38)
> predicate:((date_str = '02/11/20') and (category = 'C_A'))
> TableScan [TS_0] (rows=1 width=38)
> default@test,test,Tbl:COMPLETE,Col:NONE,Output:["date_str","category","subcategory"]
> Time taken: 0.065 seconds, Fetched: 34 row(s){code}
>
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)