You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hive.apache.org by Eugene Chung <bl...@gmail.com> on 2020/07/30 00:43:11 UTC

count(*) with count(distinct) gives different results between Hive 2.3.2 and Hive 3.1.2

Hi,

For the same query, for example,

select count(*), count(distinct mid)
from db1.table1
where log_date between '2020-07-20' and '2020-07-26';


both Hive 2.3.2 and Hive 3.1.2 give different results for the same input.
Note that db1.table1 is an ORC table and partitioned with the log_date
column.

Hive 2.3.2
+-----------+-----------+
|    _c0    |    _c1    |
+-----------+-----------+
| 60040555  | 14723258  |
+-----------+-----------+

Hive 3.1.2
+-----------+-----------+
|    _c0    |    _c1    |
+-----------+-----------+
| 14723259  | 14723258  |
+-----------+-----------+


Both plans are completely different, too.

Hive 2.3.2
+----------------------------------------------------+
|                      Explain                       |
+----------------------------------------------------+
| Plan optimized by CBO.                             |
|                                                    |
| Vertex dependency in root stage                    |
| Reducer 2 <- Map 1 (SIMPLE_EDGE)                   |
|                                                    |
| Stage-0                                            |
|   Fetch Operator                                   |
|     limit:-1                                       |
|     Stage-1                                        |
|       Reducer 2                                    |
|       File Output Operator [FS_7]                  |
|         Group By Operator [GBY_5] (rows=1 width=24) |
|
Output:["_col0","_col1"],aggregations:["count(VALUE._col0)","count(DISTINCT
KEY._col0:0._col0)"] |
|         <-Map 1 [SIMPLE_EDGE]                      |
|           SHUFFLE [RS_4]                           |
|             Group By Operator [GBY_3] (rows=1429329075 width=100) |
|
Output:["_col0","_col1","_col2"],aggregations:["count()","count(DISTINCT
mid)"],keys:mid |
|               Select Operator [SEL_2] (rows=1429329075 width=100) |
|                 Output:["mid"]                     |
|                 TableScan [TS_0] (rows=1429329075 width=100) |
|                   db1@table1,table1,Tbl:COMPLETE,Col:NONE,Output:["mid"] |
|                                                    |
+----------------------------------------------------+


Hive 3.1.2
+----------------------------------------------------+
|                      Explain                       |
+----------------------------------------------------+
| Plan optimized by CBO.                             |
|                                                    |
| Vertex dependency in root stage                    |
| Reducer 2 <- Map 1 (SIMPLE_EDGE)                   |
|                                                    |
| Stage-0                                            |
|   Fetch Operator                                   |
|     limit:-1                                       |
|     Stage-1                                        |
|       Reducer 2                                    |
|       File Output Operator [FS_7]                  |
|         Group By Operator [GBY_14] (rows=1 width=16) |
|
Output:["_col0","_col1"],aggregations:["count(_col1)","count(_col0)"] |
|           Group By Operator [GBY_11] (rows=343640771 width=4160) |
|
Output:["_col0","_col1"],aggregations:["count(VALUE._col0)"],keys:KEY._col0
|
|           <-Map 1 [SIMPLE_EDGE]                    |
|             SHUFFLE [RS_10]                        |
|               PartitionCols:_col0                  |
|               Group By Operator [GBY_9] (rows=343640771 width=4160) |
|
Output:["_col0","_col1"],aggregations:["count()"],keys:mid |
|                 Select Operator [SEL_2] (rows=343640771 width=4160) |
|                   Output:["mid"]                   |
|                   TableScan [TS_0] (rows=343640771 width=4160) |
|
japan_line_search_refine_log@daily_kpi_log,daily_kpi_log,Tbl:COMPLETE,Col:NONE,Output:["mid"]
|
|                                                    |
+----------------------------------------------------+

Any guess on the reason why they are different?


Best regards,
Eugene Chung (Korean : 정의근)

Re: count(*) with count(distinct) gives different results between Hive 2.3.2 and Hive 3.1.2

Posted by Eugene Chung <bl...@gmail.com>.
I found that

hive.optimize.countdistinct=true;

is the problem, It looks like
https://issues.apache.org/jira/browse/HIVE-16654 made the side effect.

Best regards,
Eugene Chung (Korean : 정의근)