You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@hive.apache.org by "Eugene Chung (Jira)" <ji...@apache.org> on 2020/08/04 11:52:00 UTC
[jira] [Comment Edited] (HIVE-23954) count(*) with count(distinct) gives wrong results with hive.optimize.countdistinct=true

    [ https://issues.apache.org/jira/browse/HIVE-23954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17170730#comment-17170730 ] 

Eugene Chung edited comment on HIVE-23954 at 8/4/20, 11:51 AM:
---------------------------------------------------------------

[~kgyrtkirk] Yes, it fixes with set hive.optimize.reducededuplication=false.
{code:java}
set hive.optimize.reducededuplication=false;
+----------------------------------------------------+
|                      Explain                       |
+----------------------------------------------------+
| Plan optimized by CBO.                             |
|                                                    |
| Vertex dependency in root stage                    |
| Reducer 2 <- Map 1 (SIMPLE_EDGE)                   |
| Reducer 3 <- Reducer 2 (CUSTOM_SIMPLE_EDGE)        |
|                                                    |
| Stage-0                                            |
|   Fetch Operator                                   |
|     limit:-1                                       |
|     Stage-1                                        |
|       Reducer 3                                    |
|       File Output Operator [FS_7]                  |
|         Group By Operator [GBY_14] (rows=1 width=16) |
|           Output:["_col0","_col1"],aggregations:["count(VALUE._col0)","count(VALUE._col1)"] |
|         <-Reducer 2 [CUSTOM_SIMPLE_EDGE]           |
|           PARTITION_ONLY_SHUFFLE [RS_13]           |
|             Group By Operator [GBY_12] (rows=1 width=16) |
|               Output:["_col0","_col1"],aggregations:["count(_col1)","count(_col0)"] |
|               Group By Operator [GBY_11] (rows=343640771 width=4160) |
|                 Output:["_col0","_col1"],aggregations:["count(VALUE._col0)"],keys:KEY._col0 |
|               <-Map 1 [SIMPLE_EDGE]                |
|                 SHUFFLE [RS_10]                    |
|                   PartitionCols:_col0              |
|                   Group By Operator [GBY_9] (rows=343640771 width=4160) |
|                     Output:["_col0","_col1"],aggregations:["count()"],keys:mid |
|                     Select Operator [SEL_2] (rows=343640771 width=4160) |
|                       Output:["mid"]               |
|                       TableScan [TS_0] (rows=343640771 width=4160) |
|                         db1@table1,table1,Tbl:COMPLETE,Col:NONE,Output:["mid"] |
|                                                    |
+----------------------------------------------------+
{code}


was (Author: euigeun_chung):
[~kgyrtkirk] Yes, with set hive.optimize.reducededuplication=false.
{code:java}
set hive.optimize.reducededuplication=false;
+----------------------------------------------------+
|                      Explain                       |
+----------------------------------------------------+
| Plan optimized by CBO.                             |
|                                                    |
| Vertex dependency in root stage                    |
| Reducer 2 <- Map 1 (SIMPLE_EDGE)                   |
| Reducer 3 <- Reducer 2 (CUSTOM_SIMPLE_EDGE)        |
|                                                    |
| Stage-0                                            |
|   Fetch Operator                                   |
|     limit:-1                                       |
|     Stage-1                                        |
|       Reducer 3                                    |
|       File Output Operator [FS_7]                  |
|         Group By Operator [GBY_14] (rows=1 width=16) |
|           Output:["_col0","_col1"],aggregations:["count(VALUE._col0)","count(VALUE._col1)"] |
|         <-Reducer 2 [CUSTOM_SIMPLE_EDGE]           |
|           PARTITION_ONLY_SHUFFLE [RS_13]           |
|             Group By Operator [GBY_12] (rows=1 width=16) |
|               Output:["_col0","_col1"],aggregations:["count(_col1)","count(_col0)"] |
|               Group By Operator [GBY_11] (rows=343640771 width=4160) |
|                 Output:["_col0","_col1"],aggregations:["count(VALUE._col0)"],keys:KEY._col0 |
|               <-Map 1 [SIMPLE_EDGE]                |
|                 SHUFFLE [RS_10]                    |
|                   PartitionCols:_col0              |
|                   Group By Operator [GBY_9] (rows=343640771 width=4160) |
|                     Output:["_col0","_col1"],aggregations:["count()"],keys:mid |
|                     Select Operator [SEL_2] (rows=343640771 width=4160) |
|                       Output:["mid"]               |
|                       TableScan [TS_0] (rows=343640771 width=4160) |
|                         db1@table1,table1,Tbl:COMPLETE,Col:NONE,Output:["mid"] |
|                                                    |
+----------------------------------------------------+
{code}

> count(*) with count(distinct) gives wrong results with hive.optimize.countdistinct=true
> ---------------------------------------------------------------------------------------
>
>                 Key: HIVE-23954
>                 URL: https://issues.apache.org/jira/browse/HIVE-23954
>             Project: Hive
>          Issue Type: Bug
>          Components: Logical Optimizer
>    Affects Versions: 3.0.0, 3.1.0
>            Reporter: Eugene Chung
>            Priority: Major
>
> {code:java}
> select count(*), count(distinct mid) from db1.table1 where partitioned_column = '...'{code}
>  
> is not working properly when hive.optimize.countdistinct is true. By default, it's true for all 3.x versions.
> In the two plans below, the aggregations part in the Output of Group By Operator of Map 1 are different.
>  
> - hive.optimize.countdistinct=false
> {code:java}
> +----------------------------------------------------+
> |                      Explain                       |
> +----------------------------------------------------+
> | Plan optimized by CBO.                             |
> |                                                    |
> | Vertex dependency in root stage                    |
> | Reducer 2 <- Map 1 (SIMPLE_EDGE)                   |
> |                                                    |
> | Stage-0                                            |
> |   Fetch Operator                                   |
> |     limit:-1                                       |
> |     Stage-1                                        |
> |       Reducer 2                                    |
> |       File Output Operator [FS_7]                  |
> |         Group By Operator [GBY_5] (rows=1 width=24) |
> |           Output:["_col0","_col1"],aggregations:["count(VALUE._col0)","count(DISTINCT KEY._col0:0._col0)"] |
> |         <-Map 1 [SIMPLE_EDGE]                      |
> |           SHUFFLE [RS_4]                           |
> |             Group By Operator [GBY_3] (rows=343640771 width=4160) |
> |               Output:["_col0","_col1","_col2"],aggregations:["count()","count(DISTINCT mid)"],keys:mid |
> |               Select Operator [SEL_2] (rows=343640771 width=4160) |
> |                 Output:["mid"]                     |
> |                 TableScan [TS_0] (rows=343640771 width=4160) |
> |                   db1@table1,table1,Tbl:COMPLETE,Col:NONE,Output:["mid"] |
> |                                                    |
> +----------------------------------------------------+{code}
>  
> - hive.optimize.countdistinct=true
> {code:java}
> +----------------------------------------------------+
> |                      Explain                       |
> +----------------------------------------------------+
> | Plan optimized by CBO.                             |
> |                                                    |
> | Vertex dependency in root stage                    |
> | Reducer 2 <- Map 1 (SIMPLE_EDGE)                   |
> |                                                    |
> | Stage-0                                            |
> |   Fetch Operator                                   |
> |     limit:-1                                       |
> |     Stage-1                                        |
> |       Reducer 2                                    |
> |       File Output Operator [FS_7]                  |
> |         Group By Operator [GBY_14] (rows=1 width=16) |
> |           Output:["_col0","_col1"],aggregations:["count(_col1)","count(_col0)"] |
> |           Group By Operator [GBY_11] (rows=343640771 width=4160) |
> |             Output:["_col0","_col1"],aggregations:["count(VALUE._col0)"],keys:KEY._col0 |
> |           <-Map 1 [SIMPLE_EDGE]                    |
> |             SHUFFLE [RS_10]                        |
> |               PartitionCols:_col0                  |
> |               Group By Operator [GBY_9] (rows=343640771 width=4160) |
> |                 Output:["_col0","_col1"],aggregations:["count()"],keys:mid |
> |                 Select Operator [SEL_2] (rows=343640771 width=4160) |
> |                   Output:["mid"]                   |
> |                   TableScan [TS_0] (rows=343640771 width=4160) |
> |                     db1@table1,table1,Tbl:COMPLETE,Col:NONE,Output:["mid"] |
> |                                                    |
> +----------------------------------------------------+
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)