Posted to issues@hive.apache.org by "Eugene Chung (Jira)" <ji...@apache.org> on 2020/08/07 09:39:00 UTC
[jira] [Assigned] (HIVE-23954) count(*) with count(distinct) gives wrong results with hive.optimize.countdistinct=true
[ https://issues.apache.org/jira/browse/HIVE-23954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Eugene Chung reassigned HIVE-23954:
-----------------------------------
Assignee: Eugene Chung
> count(*) with count(distinct) gives wrong results with hive.optimize.countdistinct=true
> ---------------------------------------------------------------------------------------
>
> Key: HIVE-23954
> URL: https://issues.apache.org/jira/browse/HIVE-23954
> Project: Hive
> Issue Type: Bug
> Components: Logical Optimizer
> Affects Versions: 3.0.0, 3.1.0
> Reporter: Eugene Chung
> Assignee: Eugene Chung
> Priority: Major
> Attachments: HIVE-23954.01.patch
>
>
> {code:java}
> select count(*), count(distinct mid) from db1.table1 where partitioned_column = '...'{code}
>
> does not work correctly when hive.optimize.countdistinct is true, which is the default in all 3.x versions.
> In the two plans below, the aggregations in the Output of the Group By Operator of Map 1 differ.
>
> - hive.optimize.countdistinct=false
> {code:java}
> +----------------------------------------------------+
> | Explain |
> +----------------------------------------------------+
> | Plan optimized by CBO. |
> | |
> | Vertex dependency in root stage |
> | Reducer 2 <- Map 1 (SIMPLE_EDGE) |
> | |
> | Stage-0 |
> | Fetch Operator |
> | limit:-1 |
> | Stage-1 |
> | Reducer 2 |
> | File Output Operator [FS_7] |
> | Group By Operator [GBY_5] (rows=1 width=24) |
> | Output:["_col0","_col1"],aggregations:["count(VALUE._col0)","count(DISTINCT KEY._col0:0._col0)"] |
> | <-Map 1 [SIMPLE_EDGE] |
> | SHUFFLE [RS_4] |
> | Group By Operator [GBY_3] (rows=343640771 width=4160) |
> | Output:["_col0","_col1","_col2"],aggregations:["count()","count(DISTINCT mid)"],keys:mid |
> | Select Operator [SEL_2] (rows=343640771 width=4160) |
> | Output:["mid"] |
> | TableScan [TS_0] (rows=343640771 width=4160) |
> | db1@table1,table1,Tbl:COMPLETE,Col:NONE,Output:["mid"] |
> | |
> +----------------------------------------------------+{code}
>
> - hive.optimize.countdistinct=true
> {code:java}
> +----------------------------------------------------+
> | Explain |
> +----------------------------------------------------+
> | Plan optimized by CBO. |
> | |
> | Vertex dependency in root stage |
> | Reducer 2 <- Map 1 (SIMPLE_EDGE) |
> | |
> | Stage-0 |
> | Fetch Operator |
> | limit:-1 |
> | Stage-1 |
> | Reducer 2 |
> | File Output Operator [FS_7] |
> | Group By Operator [GBY_14] (rows=1 width=16) |
> | Output:["_col0","_col1"],aggregations:["count(_col1)","count(_col0)"] |
> | Group By Operator [GBY_11] (rows=343640771 width=4160) |
> | Output:["_col0","_col1"],aggregations:["count(VALUE._col0)"],keys:KEY._col0 |
> | <-Map 1 [SIMPLE_EDGE] |
> | SHUFFLE [RS_10] |
> | PartitionCols:_col0 |
> | Group By Operator [GBY_9] (rows=343640771 width=4160) |
> | Output:["_col0","_col1"],aggregations:["count()"],keys:mid |
> | Select Operator [SEL_2] (rows=343640771 width=4160) |
> | Output:["mid"] |
> | TableScan [TS_0] (rows=343640771 width=4160) |
> | db1@table1,table1,Tbl:COMPLETE,Col:NONE,Output:["mid"] |
> | |
> +----------------------------------------------------+
> {code}
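
The difference between the two plans can be sketched in a few lines of Python. This is only an illustration of one reading of the EXPLAIN output above, not Hive code and not the content of HIVE-23954.01.patch: it assumes that in the rewritten plan GBY_9/GBY_11 produce one (mid, per-key count) row per key, and that GBY_14 then applies count(_col1) to those rows, which would count groups rather than add up the per-key counts. The function names, the sample data, and the sum-based variant are all hypothetical.

```python
from collections import Counter


def plan_countdistinct_false(mids):
    """Direct evaluation: count(*) and count(DISTINCT mid) over the raw rows."""
    total_rows = len(mids)
    distinct_mids = len({m for m in mids if m is not None})
    return total_rows, distinct_mids


def plan_countdistinct_true(mids):
    """Assumed reading of the rewritten plan (GBY_9 -> GBY_11 -> GBY_14)."""
    # GBY_9 / GBY_11: group by mid and keep a per-key row count (_col1).
    per_key = Counter(mids)
    # GBY_14: count(_col1) counts non-null per-key counts, i.e. one per group.
    count_col1 = sum(1 for c in per_key.values() if c is not None)
    # GBY_14: count(_col0) counts non-null keys, i.e. the distinct mids.
    count_col0 = sum(1 for k in per_key if k is not None)
    return count_col1, count_col0


def plan_with_sum(mids):
    """Hypothetical variant: add up the per-key counts instead of counting them."""
    per_key = Counter(mids)
    return sum(per_key.values()), sum(1 for k in per_key if k is not None)


# Hypothetical sample data: six rows, three distinct mids.
sample = ["a", "a", "b", "c", "c", "c"]
direct = plan_countdistinct_false(sample)
rewritten = plan_countdistinct_true(sample)
summed = plan_with_sum(sample)
```

Under these assumptions, the rewritten plan reports the number of groups for count(*) while the direct plan reports the number of rows, so the two only agree when every mid occurs exactly once.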
--
This message was sent by Atlassian Jira
(v8.3.4#803005)