You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@hive.apache.org by "Eugene Chung (Jira)" <ji...@apache.org> on 2020/08/04 11:52:00 UTC
[jira] [Comment Edited] (HIVE-23954) count(*) with count(distinct)
gives wrong results with hive.optimize.countdistinct=true
[ https://issues.apache.org/jira/browse/HIVE-23954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17170730#comment-17170730 ]
Eugene Chung edited comment on HIVE-23954 at 8/4/20, 11:51 AM:
---------------------------------------------------------------
[~kgyrtkirk] Yes, it fixes with set hive.optimize.reducededuplication=false.
{code:java}
set hive.optimize.reducededuplication=false;
+----------------------------------------------------+
| Explain |
+----------------------------------------------------+
| Plan optimized by CBO. |
| |
| Vertex dependency in root stage |
| Reducer 2 <- Map 1 (SIMPLE_EDGE) |
| Reducer 3 <- Reducer 2 (CUSTOM_SIMPLE_EDGE) |
| |
| Stage-0 |
| Fetch Operator |
| limit:-1 |
| Stage-1 |
| Reducer 3 |
| File Output Operator [FS_7] |
| Group By Operator [GBY_14] (rows=1 width=16) |
| Output:["_col0","_col1"],aggregations:["count(VALUE._col0)","count(VALUE._col1)"] |
| <-Reducer 2 [CUSTOM_SIMPLE_EDGE] |
| PARTITION_ONLY_SHUFFLE [RS_13] |
| Group By Operator [GBY_12] (rows=1 width=16) |
| Output:["_col0","_col1"],aggregations:["count(_col1)","count(_col0)"] |
| Group By Operator [GBY_11] (rows=343640771 width=4160) |
| Output:["_col0","_col1"],aggregations:["count(VALUE._col0)"],keys:KEY._col0 |
| <-Map 1 [SIMPLE_EDGE] |
| SHUFFLE [RS_10] |
| PartitionCols:_col0 |
| Group By Operator [GBY_9] (rows=343640771 width=4160) |
| Output:["_col0","_col1"],aggregations:["count()"],keys:mid |
| Select Operator [SEL_2] (rows=343640771 width=4160) |
| Output:["mid"] |
| TableScan [TS_0] (rows=343640771 width=4160) |
| db1@table1,table1,Tbl:COMPLETE,Col:NONE,Output:["mid"] |
| |
+----------------------------------------------------+
{code}
was (Author: euigeun_chung):
[~kgyrtkirk] Yes, with set hive.optimize.reducededuplication=false.
{code:java}
set hive.optimize.reducededuplication=false;
+----------------------------------------------------+
| Explain |
+----------------------------------------------------+
| Plan optimized by CBO. |
| |
| Vertex dependency in root stage |
| Reducer 2 <- Map 1 (SIMPLE_EDGE) |
| Reducer 3 <- Reducer 2 (CUSTOM_SIMPLE_EDGE) |
| |
| Stage-0 |
| Fetch Operator |
| limit:-1 |
| Stage-1 |
| Reducer 3 |
| File Output Operator [FS_7] |
| Group By Operator [GBY_14] (rows=1 width=16) |
| Output:["_col0","_col1"],aggregations:["count(VALUE._col0)","count(VALUE._col1)"] |
| <-Reducer 2 [CUSTOM_SIMPLE_EDGE] |
| PARTITION_ONLY_SHUFFLE [RS_13] |
| Group By Operator [GBY_12] (rows=1 width=16) |
| Output:["_col0","_col1"],aggregations:["count(_col1)","count(_col0)"] |
| Group By Operator [GBY_11] (rows=343640771 width=4160) |
| Output:["_col0","_col1"],aggregations:["count(VALUE._col0)"],keys:KEY._col0 |
| <-Map 1 [SIMPLE_EDGE] |
| SHUFFLE [RS_10] |
| PartitionCols:_col0 |
| Group By Operator [GBY_9] (rows=343640771 width=4160) |
| Output:["_col0","_col1"],aggregations:["count()"],keys:mid |
| Select Operator [SEL_2] (rows=343640771 width=4160) |
| Output:["mid"] |
| TableScan [TS_0] (rows=343640771 width=4160) |
| db1@table1,table1,Tbl:COMPLETE,Col:NONE,Output:["mid"] |
| |
+----------------------------------------------------+
{code}
> count(*) with count(distinct) gives wrong results with hive.optimize.countdistinct=true
> ---------------------------------------------------------------------------------------
>
> Key: HIVE-23954
> URL: https://issues.apache.org/jira/browse/HIVE-23954
> Project: Hive
> Issue Type: Bug
> Components: Logical Optimizer
> Affects Versions: 3.0.0, 3.1.0
> Reporter: Eugene Chung
> Priority: Major
>
> {code:java}
> select count(*), count(distinct mid) from db1.table1 where partitioned_column = '...'{code}
>
> is not working properly when hive.optimize.countdistinct is true. By default, it's true for all 3.x versions.
> In the two plans below, the aggregations part in the Output of Group By Operator of Map 1 are different.
>
> - hive.optimize.countdistinct=false
> {code:java}
> +----------------------------------------------------+
> | Explain |
> +----------------------------------------------------+
> | Plan optimized by CBO. |
> | |
> | Vertex dependency in root stage |
> | Reducer 2 <- Map 1 (SIMPLE_EDGE) |
> | |
> | Stage-0 |
> | Fetch Operator |
> | limit:-1 |
> | Stage-1 |
> | Reducer 2 |
> | File Output Operator [FS_7] |
> | Group By Operator [GBY_5] (rows=1 width=24) |
> | Output:["_col0","_col1"],aggregations:["count(VALUE._col0)","count(DISTINCT KEY._col0:0._col0)"] |
> | <-Map 1 [SIMPLE_EDGE] |
> | SHUFFLE [RS_4] |
> | Group By Operator [GBY_3] (rows=343640771 width=4160) |
> | Output:["_col0","_col1","_col2"],aggregations:["count()","count(DISTINCT mid)"],keys:mid |
> | Select Operator [SEL_2] (rows=343640771 width=4160) |
> | Output:["mid"] |
> | TableScan [TS_0] (rows=343640771 width=4160) |
> | db1@table1,table1,Tbl:COMPLETE,Col:NONE,Output:["mid"] |
> | |
> +----------------------------------------------------+{code}
>
> - hive.optimize.countdistinct=true
> {code:java}
> +----------------------------------------------------+
> | Explain |
> +----------------------------------------------------+
> | Plan optimized by CBO. |
> | |
> | Vertex dependency in root stage |
> | Reducer 2 <- Map 1 (SIMPLE_EDGE) |
> | |
> | Stage-0 |
> | Fetch Operator |
> | limit:-1 |
> | Stage-1 |
> | Reducer 2 |
> | File Output Operator [FS_7] |
> | Group By Operator [GBY_14] (rows=1 width=16) |
> | Output:["_col0","_col1"],aggregations:["count(_col1)","count(_col0)"] |
> | Group By Operator [GBY_11] (rows=343640771 width=4160) |
> | Output:["_col0","_col1"],aggregations:["count(VALUE._col0)"],keys:KEY._col0 |
> | <-Map 1 [SIMPLE_EDGE] |
> | SHUFFLE [RS_10] |
> | PartitionCols:_col0 |
> | Group By Operator [GBY_9] (rows=343640771 width=4160) |
> | Output:["_col0","_col1"],aggregations:["count()"],keys:mid |
> | Select Operator [SEL_2] (rows=343640771 width=4160) |
> | Output:["mid"] |
> | TableScan [TS_0] (rows=343640771 width=4160) |
> | db1@table1,table1,Tbl:COMPLETE,Col:NONE,Output:["mid"] |
> | |
> +----------------------------------------------------+
> {code}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)