You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hive.apache.org by Eugene Chung <bl...@gmail.com> on 2020/07/30 00:43:11 UTC
count(*) with count(distinct) gives different results between Hive
2.3.2 and Hive 3.1.2
Hi,
For the same query, for example,
select count(*), count(distinct mid)
from db1.table1
where log_date between '2020-07-20' and '2020-07-26';
both Hive 2.3.2 and Hive 3.1.2 give different results for the same input.
Note that db1.table1 is an ORC table and partitioned with the log_date
column.
Hive 2.3.2
+-----------+-----------+
| _c0 | _c1 |
+-----------+-----------+
| 60040555 | 14723258 |
+-----------+-----------+
Hive 3.1.2
+-----------+-----------+
| _c0 | _c1 |
+-----------+-----------+
| 14723259 | 14723258 |
+-----------+-----------+
Both plans are completely different, too.
Hive 2.3.2
+----------------------------------------------------+
| Explain |
+----------------------------------------------------+
| Plan optimized by CBO. |
| |
| Vertex dependency in root stage |
| Reducer 2 <- Map 1 (SIMPLE_EDGE) |
| |
| Stage-0 |
| Fetch Operator |
| limit:-1 |
| Stage-1 |
| Reducer 2 |
| File Output Operator [FS_7] |
| Group By Operator [GBY_5] (rows=1 width=24) |
|
Output:["_col0","_col1"],aggregations:["count(VALUE._col0)","count(DISTINCT
KEY._col0:0._col0)"] |
| <-Map 1 [SIMPLE_EDGE] |
| SHUFFLE [RS_4] |
| Group By Operator [GBY_3] (rows=1429329075 width=100) |
|
Output:["_col0","_col1","_col2"],aggregations:["count()","count(DISTINCT
mid)"],keys:mid |
| Select Operator [SEL_2] (rows=1429329075 width=100) |
| Output:["mid"] |
| TableScan [TS_0] (rows=1429329075 width=100) |
| db1@table1,table1,Tbl:COMPLETE,Col:NONE,Output:["mid"] |
| |
+----------------------------------------------------+
Hive 3.1.2
+----------------------------------------------------+
| Explain |
+----------------------------------------------------+
| Plan optimized by CBO. |
| |
| Vertex dependency in root stage |
| Reducer 2 <- Map 1 (SIMPLE_EDGE) |
| |
| Stage-0 |
| Fetch Operator |
| limit:-1 |
| Stage-1 |
| Reducer 2 |
| File Output Operator [FS_7] |
| Group By Operator [GBY_14] (rows=1 width=16) |
|
Output:["_col0","_col1"],aggregations:["count(_col1)","count(_col0)"] |
| Group By Operator [GBY_11] (rows=343640771 width=4160) |
|
Output:["_col0","_col1"],aggregations:["count(VALUE._col0)"],keys:KEY._col0
|
| <-Map 1 [SIMPLE_EDGE] |
| SHUFFLE [RS_10] |
| PartitionCols:_col0 |
| Group By Operator [GBY_9] (rows=343640771 width=4160) |
|
Output:["_col0","_col1"],aggregations:["count()"],keys:mid |
| Select Operator [SEL_2] (rows=343640771 width=4160) |
| Output:["mid"] |
| TableScan [TS_0] (rows=343640771 width=4160) |
|
japan_line_search_refine_log@daily_kpi_log,daily_kpi_log,Tbl:COMPLETE,Col:NONE,Output:["mid"]
|
| |
+----------------------------------------------------+
Any guess on the reason why they are different?
Best regards,
Eugene Chung (Korean : 정의근)
Re: count(*) with count(distinct) gives different results between
Hive 2.3.2 and Hive 3.1.2
Posted by Eugene Chung <bl...@gmail.com>.
I found that
hive.optimize.countdistinct=true;
is the problem, It looks like
https://issues.apache.org/jira/browse/HIVE-16654 made the side effect.
Best regards,
Eugene Chung (Korean : 정의근)