You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@hive.apache.org by "Igor Dvorzhak (Jira)" <ji...@apache.org> on 2020/02/14 17:14:00 UTC

[jira] [Updated] (HIVE-22894) Filter on subquery with GROUP BY returns wrong column

     [ https://issues.apache.org/jira/browse/HIVE-22894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Igor Dvorzhak updated HIVE-22894:
---------------------------------
    Description: 
Reproduction steps:
{code:java}
$ echo -e "02/11/20,C_A,C_A_B\n02/11/20,C_A,C_A_C" | hadoop fs -put - /user/hive/warehouse/test/data.csv

$ hive

> CREATE TABLE test(date_str STRING, category STRING, subcategory STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
OK
Time taken: 0.877 seconds

> SELECT DISTINCT category FROM (SELECT date_str, category, subcategory FROM test WHERE date_str='02/11/20' GROUP BY date_str, category, subcategory) AS t WHERE t.category='C_A';
OK
C_A_B
C_A_C
Time taken: 9.108 seconds, Fetched: 2 row(s)

> EXPLAIN SELECT DISTINCT category FROM (SELECT date_str, category, subcategory FROM test WHERE date_str='02/11/20' GROUP BY date_str, category, subcategory) AS t WHERE t.category='C_A';
OK
Plan optimized by CBO.Vertex dependency in root stage
Reducer 2 <- Map 1 (SIMPLE_EDGE)Stage-0
  Fetch Operator
    limit:-1
    Stage-1
      Reducer 2
      File Output Operator [FS_12]
        Group By Operator [GBY_10] (rows=1 width=38)
          Output:["_col0"],keys:_col0
          Group By Operator [GBY_5] (rows=1 width=38)
            Output:["_col0"],keys:KEY._col0
          <-Map 1 [SIMPLE_EDGE]
            SHUFFLE [RS_4]
              PartitionCols:_col0
              Group By Operator [GBY_3] (rows=1 width=38)
                Output:["_col0"],keys:subcategory
                Select Operator [SEL_2] (rows=1 width=38)
                  Output:["subcategory"]
                  Filter Operator [FIL_13] (rows=1 width=38)
                    predicate:((date_str = '02/11/20') and (category = 'C_A'))
                    TableScan [TS_0] (rows=1 width=38)
                      default@test,test,Tbl:COMPLETE,Col:NONE,Output:["date_str","category","subcategory"]
Time taken: 0.21 seconds, Fetched: 27 row(s)
{code}
 

It works as expected with disabled CBO:
{code:java}
> SET hive.cbo.enable=false;

> SELECT DISTINCT category FROM (SELECT date_str, category, subcategory FROM test WHERE date_str='02/11/20' GROUP BY date_str, category, subcategory) AS t WHERE t.category='C_A';
OK
C_A
Time taken: 13.948 seconds, Fetched: 1 row(s)

> EXPLAIN SELECT DISTINCT category FROM (SELECT date_str, category, subcategory FROM test WHERE date_str='02/11/20' GROUP BY date_str, category, subcategory) AS t WHERE t.category='C_A';
OK
Vertex dependency in root stage
Reducer 2 <- Map 1 (SIMPLE_EDGE)
Reducer 3 <- Reducer 2 (SIMPLE_EDGE)Stage-0
  Fetch Operator
    limit:-1
    Stage-1
      Reducer 3
      File Output Operator [FS_13]
        Select Operator [SEL_12] (rows=1 width=38)
          Output:["_col0"]
          Group By Operator [GBY_11] (rows=1 width=38)
            Output:["_col0"],keys:'C_A'
          <-Reducer 2 [SIMPLE_EDGE]
            SHUFFLE [RS_10]
              PartitionCols:'C_A'
              Group By Operator [GBY_9] (rows=1 width=38)
                Output:["_col0"],keys:'C_A'
                Select Operator [SEL_6] (rows=1 width=38)
                  Group By Operator [GBY_5] (rows=1 width=38)
                    Output:["_col0","_col1","_col2"],keys:'02/11/20', 'C_A', KEY._col2
                  <-Map 1 [SIMPLE_EDGE]
                    SHUFFLE [RS_4]
                      PartitionCols:'02/11/20', 'C_A', _col2
                      Group By Operator [GBY_3] (rows=1 width=38)
                        Output:["_col0","_col1","_col2"],keys:'02/11/20', 'C_A', subcategory
                        Select Operator [SEL_2] (rows=1 width=38)
                          Output:["subcategory"]
                          Filter Operator [FIL_14] (rows=1 width=38)
                            predicate:((date_str = '02/11/20') and (category = 'C_A'))
                            TableScan [TS_0] (rows=1 width=38)
                              default@test,test,Tbl:COMPLETE,Col:NONE,Output:["date_str","category","subcategory"]
Time taken: 0.065 seconds, Fetched: 34 row(s){code}
 

 

  was:
Reproduction steps:
{code:java}
$ echo -e "02/11/20,C_A,C_A_B\n02/11/20,C_A,C_A_C" | hadoop fs -put - /user/hive/warehouse/test/data.csv

$ hive

> CREATE TABLE test(date_str STRING, category STRING, subcategory STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
OK
Time taken: 0.877 seconds

> SELECT DISTINCT category FROM (SELECT date_str, category, subcategory FROM test HERE date_str='02/11/20' GROUP BY date_str, category, subcategory) AS t WHERE t.category='C_A';
OK
C_A_B
C_A_C
Time taken: 9.108 seconds, Fetched: 2 row(s)

> EXPLAIN SELECT DISTINCT category FROM (SELECT date_str, category, subcategory FROM test WHERE date_str='02/11/20' GROUP BY date_str, category, subcategory) AS t WHERE t.category='C_A';
OK
Plan optimized by CBO.Vertex dependency in root stage
Reducer 2 <- Map 1 (SIMPLE_EDGE)Stage-0
  Fetch Operator
    limit:-1
    Stage-1
      Reducer 2
      File Output Operator [FS_12]
        Group By Operator [GBY_10] (rows=1 width=38)
          Output:["_col0"],keys:_col0
          Group By Operator [GBY_5] (rows=1 width=38)
            Output:["_col0"],keys:KEY._col0
          <-Map 1 [SIMPLE_EDGE]
            SHUFFLE [RS_4]
              PartitionCols:_col0
              Group By Operator [GBY_3] (rows=1 width=38)
                Output:["_col0"],keys:subcategory
                Select Operator [SEL_2] (rows=1 width=38)
                  Output:["subcategory"]
                  Filter Operator [FIL_13] (rows=1 width=38)
                    predicate:((date_str = '02/11/20') and (category = 'C_A'))
                    TableScan [TS_0] (rows=1 width=38)
                      default@test,test,Tbl:COMPLETE,Col:NONE,Output:["date_str","category","subcategory"]
Time taken: 0.21 seconds, Fetched: 27 row(s)
{code}
 

It works as expected with disabled CBO:

{code:java}
> SET hive.cbo.enable=false;

> SELECT DISTINCT category FROM (SELECT date_str, category, subcategory FROM test WHERE date_str='02/11/20' GROUP BY date_str, category, subcategory) AS t WHERE t.category='C_A';
OK
C_A
Time taken: 13.948 seconds, Fetched: 1 row(s)

> EXPLAIN SELECT DISTINCT category FROM (SELECT date_str, category, subcategory FROM test WHERE date_str='02/11/20' GROUP BY date_str, category, subcategory) AS t WHERE t.category='C_A';
OK
Vertex dependency in root stage
Reducer 2 <- Map 1 (SIMPLE_EDGE)
Reducer 3 <- Reducer 2 (SIMPLE_EDGE)Stage-0
  Fetch Operator
    limit:-1
    Stage-1
      Reducer 3
      File Output Operator [FS_13]
        Select Operator [SEL_12] (rows=1 width=38)
          Output:["_col0"]
          Group By Operator [GBY_11] (rows=1 width=38)
            Output:["_col0"],keys:'C_A'
          <-Reducer 2 [SIMPLE_EDGE]
            SHUFFLE [RS_10]
              PartitionCols:'C_A'
              Group By Operator [GBY_9] (rows=1 width=38)
                Output:["_col0"],keys:'C_A'
                Select Operator [SEL_6] (rows=1 width=38)
                  Group By Operator [GBY_5] (rows=1 width=38)
                    Output:["_col0","_col1","_col2"],keys:'02/11/20', 'C_A', KEY._col2
                  <-Map 1 [SIMPLE_EDGE]
                    SHUFFLE [RS_4]
                      PartitionCols:'02/11/20', 'C_A', _col2
                      Group By Operator [GBY_3] (rows=1 width=38)
                        Output:["_col0","_col1","_col2"],keys:'02/11/20', 'C_A', subcategory
                        Select Operator [SEL_2] (rows=1 width=38)
                          Output:["subcategory"]
                          Filter Operator [FIL_14] (rows=1 width=38)
                            predicate:((date_str = '02/11/20') and (category = 'C_A'))
                            TableScan [TS_0] (rows=1 width=38)
                              default@test,test,Tbl:COMPLETE,Col:NONE,Output:["date_str","category","subcategory"]
Time taken: 0.065 seconds, Fetched: 34 row(s){code}
 

 


> Filter on subquery with GROUP BY returns wrong column
> -----------------------------------------------------
>
>                 Key: HIVE-22894
>                 URL: https://issues.apache.org/jira/browse/HIVE-22894
>             Project: Hive
>          Issue Type: Bug
>          Components: CBO
>    Affects Versions: 2.3.6
>            Reporter: Igor Dvorzhak
>            Priority: Major
>
> Reproduction steps:
> {code:java}
> $ echo -e "02/11/20,C_A,C_A_B\n02/11/20,C_A,C_A_C" | hadoop fs -put - /user/hive/warehouse/test/data.csv
> $ hive
> > CREATE TABLE test(date_str STRING, category STRING, subcategory STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
> OK
> Time taken: 0.877 seconds
> > SELECT DISTINCT category FROM (SELECT date_str, category, subcategory FROM test WHERE date_str='02/11/20' GROUP BY date_str, category, subcategory) AS t WHERE t.category='C_A';
> OK
> C_A_B
> C_A_C
> Time taken: 9.108 seconds, Fetched: 2 row(s)
> > EXPLAIN SELECT DISTINCT category FROM (SELECT date_str, category, subcategory FROM test WHERE date_str='02/11/20' GROUP BY date_str, category, subcategory) AS t WHERE t.category='C_A';
> OK
> Plan optimized by CBO.Vertex dependency in root stage
> Reducer 2 <- Map 1 (SIMPLE_EDGE)Stage-0
>   Fetch Operator
>     limit:-1
>     Stage-1
>       Reducer 2
>       File Output Operator [FS_12]
>         Group By Operator [GBY_10] (rows=1 width=38)
>           Output:["_col0"],keys:_col0
>           Group By Operator [GBY_5] (rows=1 width=38)
>             Output:["_col0"],keys:KEY._col0
>           <-Map 1 [SIMPLE_EDGE]
>             SHUFFLE [RS_4]
>               PartitionCols:_col0
>               Group By Operator [GBY_3] (rows=1 width=38)
>                 Output:["_col0"],keys:subcategory
>                 Select Operator [SEL_2] (rows=1 width=38)
>                   Output:["subcategory"]
>                   Filter Operator [FIL_13] (rows=1 width=38)
>                     predicate:((date_str = '02/11/20') and (category = 'C_A'))
>                     TableScan [TS_0] (rows=1 width=38)
>                       default@test,test,Tbl:COMPLETE,Col:NONE,Output:["date_str","category","subcategory"]
> Time taken: 0.21 seconds, Fetched: 27 row(s)
> {code}
>  
> It works as expected with disabled CBO:
> {code:java}
> > SET hive.cbo.enable=false;
> > SELECT DISTINCT category FROM (SELECT date_str, category, subcategory FROM test WHERE date_str='02/11/20' GROUP BY date_str, category, subcategory) AS t WHERE t.category='C_A';
> OK
> C_A
> Time taken: 13.948 seconds, Fetched: 1 row(s)
> > EXPLAIN SELECT DISTINCT category FROM (SELECT date_str, category, subcategory FROM test WHERE date_str='02/11/20' GROUP BY date_str, category, subcategory) AS t WHERE t.category='C_A';
> OK
> Vertex dependency in root stage
> Reducer 2 <- Map 1 (SIMPLE_EDGE)
> Reducer 3 <- Reducer 2 (SIMPLE_EDGE)Stage-0
>   Fetch Operator
>     limit:-1
>     Stage-1
>       Reducer 3
>       File Output Operator [FS_13]
>         Select Operator [SEL_12] (rows=1 width=38)
>           Output:["_col0"]
>           Group By Operator [GBY_11] (rows=1 width=38)
>             Output:["_col0"],keys:'C_A'
>           <-Reducer 2 [SIMPLE_EDGE]
>             SHUFFLE [RS_10]
>               PartitionCols:'C_A'
>               Group By Operator [GBY_9] (rows=1 width=38)
>                 Output:["_col0"],keys:'C_A'
>                 Select Operator [SEL_6] (rows=1 width=38)
>                   Group By Operator [GBY_5] (rows=1 width=38)
>                     Output:["_col0","_col1","_col2"],keys:'02/11/20', 'C_A', KEY._col2
>                   <-Map 1 [SIMPLE_EDGE]
>                     SHUFFLE [RS_4]
>                       PartitionCols:'02/11/20', 'C_A', _col2
>                       Group By Operator [GBY_3] (rows=1 width=38)
>                         Output:["_col0","_col1","_col2"],keys:'02/11/20', 'C_A', subcategory
>                         Select Operator [SEL_2] (rows=1 width=38)
>                           Output:["subcategory"]
>                           Filter Operator [FIL_14] (rows=1 width=38)
>                             predicate:((date_str = '02/11/20') and (category = 'C_A'))
>                             TableScan [TS_0] (rows=1 width=38)
>                               default@test,test,Tbl:COMPLETE,Col:NONE,Output:["date_str","category","subcategory"]
> Time taken: 0.065 seconds, Fetched: 34 row(s){code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)