You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@hive.apache.org by "Nemon Lou (Jira)" <ji...@apache.org> on 2021/09/11 07:03:00 UTC

[jira] [Comment Edited] (HIVE-24579) Incorrect Result For Groupby With Limit

    [ https://issues.apache.org/jira/browse/HIVE-24579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17413481#comment-17413481 ] 

Nemon Lou edited comment on HIVE-24579 at 9/11/21, 7:02 AM:
------------------------------------------------------------

Thanks [~kkasa] for your attention.

This issue only happens on a customer's cluster, and i could not get the data.

This simplified reproduce step seems not match the customer's issue.

Here is the original issue(with table name changed):

 The query result is different for the same store_id when change limit 10 to limit 100
{code:sql}
SELECT store_id store_id_hive
, count(1) device_cnt_bound_30day
FROM db_name.table_name
WHERE i_rep_date <= 20201226
AND i_rep_date >= cast(from_unixtime(unix_timestamp('20201226','yyyyMMdd')-86400*29,'yyyyMMdd') as int)
AND nvl(is_curr_bound,1) = 1
group by store_id limit 10;
{code}
query plan :
  
{code:sql}
|                      Explain                       |
+----------------------------------------------------+
| Plan optimized by CBO.                             |
|                                                    |
| Vertex dependency in root stage                    |
| Reducer 2 <- Map 1 (SIMPLE_EDGE)                   |
|                                                    |
| Stage-0                                            |
|   Fetch Operator                                   |
|     limit:10                                       |
|     Stage-1                                        |
|       Reducer 2                                    |
|       File Output Operator [FS_8]                  |
|         Limit [LIM_7] (rows=10 width=39)           |
|           Number of rows:10                        |
|           Group By Operator [GBY_5] (rows=5618832 width=39) |
|             Output:["_col0","_col1"],aggregations:["count(VALUE._col0)"],keys:KEY._col0 |
|           <-Map 1 [SIMPLE_EDGE]                    |
|             SHUFFLE [RS_4]                         |
|               PartitionCols:_col0                  |
|               Group By Operator [GBY_3] (rows=11237665 width=39) |
|                 Output:["_col0","_col1"],aggregations:["count()"],keys:store_id |
|                 Select Operator [SEL_2] (rows=11237665 width=39) |
|                   Output:["store_id"]              |
|                   Filter Operator [FIL_9] (rows=11237665 width=39) |
|                     predicate:(NVL(is_curr_bound,1) = 1) |
|                     TableScan [TS_0] (rows=22475330 width=39) |
|                       db_name@table_name,table_name,Tbl:COMPLETE,Col:NONE,Output:["store_id","is_curr_bound"] |
{code}
 part of the extended plan:
{code:sql}
 |                         Reduce Output Operator     |
|                           key expressions: _col0 (type: string) |
|                           null sort order: a       |
|                           sort order: +            |
|                           Map-reduce partition columns: _col0 (type: string) |
|                           Statistics: Num rows: 11237665 Data size: 438268935 Basic stats: COMPLETE Column stats: NONE |
|                           tag: -1                  |
|                           TopN: 10                 |
|                           TopN Hash Memory Usage: 0.1 |
|                           value expressions: _col1 (type: bigint) |
|                           auto parallelism: true   |
{code}


was (Author: nemon):
Thanks [~kkasa] for your attention.

This issue only happens on a customer's cluster, and i could not get the data.

This simplified reproduce step seems not match the customer's issue.

Here is the original sql (with table name changed):

 {code:sql}
SELECT store_id store_id_hive
, count(1) device_cnt_bound_30day
FROM db_name.table_name
WHERE i_rep_date <= 20201226
AND i_rep_date >= cast(from_unixtime(unix_timestamp('20201226','yyyyMMdd')-86400*29,'yyyyMMdd') as int)
AND nvl(is_curr_bound,1) = 1
group by store_id limit 10;
{code}

query plan :
 {code:sql}
|                      Explain                       |
+----------------------------------------------------+
| Plan optimized by CBO.                             |
|                                                    |
| Vertex dependency in root stage                    |
| Reducer 2 <- Map 1 (SIMPLE_EDGE)                   |
|                                                    |
| Stage-0                                            |
|   Fetch Operator                                   |
|     limit:10                                       |
|     Stage-1                                        |
|       Reducer 2                                    |
|       File Output Operator [FS_8]                  |
|         Limit [LIM_7] (rows=10 width=39)           |
|           Number of rows:10                        |
|           Group By Operator [GBY_5] (rows=5618832 width=39) |
|             Output:["_col0","_col1"],aggregations:["count(VALUE._col0)"],keys:KEY._col0 |
|           <-Map 1 [SIMPLE_EDGE]                    |
|             SHUFFLE [RS_4]                         |
|               PartitionCols:_col0                  |
|               Group By Operator [GBY_3] (rows=11237665 width=39) |
|                 Output:["_col0","_col1"],aggregations:["count()"],keys:store_id |
|                 Select Operator [SEL_2] (rows=11237665 width=39) |
|                   Output:["store_id"]              |
|                   Filter Operator [FIL_9] (rows=11237665 width=39) |
|                     predicate:(NVL(is_curr_bound,1) = 1) |
|                     TableScan [TS_0] (rows=22475330 width=39) |
|                       db_name@table_name,table_name,Tbl:COMPLETE,Col:NONE,Output:["store_id","is_curr_bound"] |
{code}
 part of the extended plan:
{code:sql}
  Reduce Output Operator     |
                           key expressions: _col0 (type: string) |
                           null sort order: a       |
                           sort order: +            |
                           Map-reduce partition columns: _col0 (type: string) |
                           Statistics: Num rows: 11237665 Data size: 438268935 Basic stats: COMPLETE Column stats: NONE |
                           tag: -1                  |
                           TopN: 100                |
                           TopN Hash Memory Usage: 0.1 |
                           value expressions: _col1 (type: bigint) |
                           auto parallelism: true   |
{code}


> Incorrect Result For Groupby With Limit
> ---------------------------------------
>
>                 Key: HIVE-24579
>                 URL: https://issues.apache.org/jira/browse/HIVE-24579
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 2.3.7, 3.1.2, 4.0.0
>            Reporter: Nemon Lou
>            Priority: Major
>
> {code:sql}
> create table test(id int);
> explain extended select id,count(*) from test group by id limit 10;
> {code}
> There is an TopN unexpectly for map phase, which casues incorrect result.
> {code:sql}
> STAGE PLANS:
>   Stage: Stage-1
>     Tez
>       DagId: root_20210104141527_c599c0cd-ca2f-4c7d-a3cc-3a01d65c49a1:5
>       Edges:
>         Reducer 2 <- Map 1 (SIMPLE_EDGE)
>       DagName: root_20210104141527_c599c0cd-ca2f-4c7d-a3cc-3a01d65c49a1:5
>       Vertices:
>         Map 1 
>             Map Operator Tree:
>                 TableScan
>                   alias: test
>                   Statistics: Num rows: 1 Data size: 13500 Basic stats: COMPLETE Column stats: NONE
>                   GatherStats: false
>                   Select Operator
>                     expressions: id (type: int)
>                     outputColumnNames: id
>                     Statistics: Num rows: 1 Data size: 13500 Basic stats: COMPLETE Column stats: NONE
>                     Group By Operator
>                       aggregations: count()
>                       keys: id (type: int)
>                       mode: hash
>                       outputColumnNames: _col0, _col1
>                       Statistics: Num rows: 1 Data size: 13500 Basic stats: COMPLETE Column stats: NONE
>                       Reduce Output Operator
>                         key expressions: _col0 (type: int)
>                         null sort order: a
>                         sort order: +
>                         Map-reduce partition columns: _col0 (type: int)
>                         Statistics: Num rows: 1 Data size: 13500 Basic stats: COMPLETE Column stats: NONE
>                         tag: -1
>                         TopN: 10
>                         TopN Hash Memory Usage: 0.1
>                         value expressions: _col1 (type: bigint)
>                         auto parallelism: true
>             Execution mode: vectorized
>             Path -> Alias:
>               file:/user/hive/warehouse/test [test]
>             Path -> Partition:
>               file:/user/hive/warehouse/test 
>                 Partition
>                   base file name: test
>                   input format: org.apache.hadoop.mapred.TextInputFormat
>                   output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>                   properties:
>                     COLUMN_STATS_ACCURATE {"BASIC_STATS":"true","COLUMN_STATS":{"id":"true"}}
>                     bucket_count -1
>                     bucketing_version 2
>                     column.name.delimiter ,
>                     columns id
>                     columns.comments 
>                     columns.types int
>                     file.inputformat org.apache.hadoop.mapred.TextInputFormat
>                     file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>                     location file:/user/hive/warehouse/test
>                     name default.test
>                     numFiles 0
>                     numRows 0
>                     rawDataSize 0
>                     serialization.ddl struct test { i32 id}
>                     serialization.format 1
>                     serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>                     totalSize 0
>                     transient_lastDdlTime 1609730190
>                   serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>                 
>                     input format: org.apache.hadoop.mapred.TextInputFormat
>                     output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>                     properties:
>                       COLUMN_STATS_ACCURATE {"BASIC_STATS":"true","COLUMN_STATS":{"id":"true"}}
>                       bucket_count -1
>                       bucketing_version 2
>                       column.name.delimiter ,
>                       columns id
>                       columns.comments 
>                       columns.types int
>                       file.inputformat org.apache.hadoop.mapred.TextInputFormat
>                       file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>                       location file:/user/hive/warehouse/test
>                       name default.test
>                       numFiles 0
>                       numRows 0
>                       rawDataSize 0
>                       serialization.ddl struct test { i32 id}
>                       serialization.format 1
>                       serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>                       totalSize 0
>                       transient_lastDdlTime 1609730190
>                     serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>                     name: default.test
>                   name: default.test
>             Truncated Path -> Alias:
>               /test [test]
>         Reducer 2 
>             Execution mode: vectorized
>             Needs Tagging: false
>             Reduce Operator Tree:
>               Group By Operator
>                 aggregations: count(VALUE._col0)
>                 keys: KEY._col0 (type: int)
>                 mode: mergepartial
>                 outputColumnNames: _col0, _col1
>                 Statistics: Num rows: 1 Data size: 13500 Basic stats: COMPLETE Column stats: NONE
>                 Limit
>                   Number of rows: 10
>                   Statistics: Num rows: 1 Data size: 13500 Basic stats: COMPLETE Column stats: NONE
>                   File Output Operator
>                     compressed: false
>                     GlobalTableId: 0
>                     directory: file:/tmp/root/7160ea24-52b9-47c3-aafc-c9200263a1c6/hive_2021-01-04_14-15-27_601_190083924675700904-1/-mr-10001/.hive-staging_hive_2021-01-04_14-15-27_601_190083924675700904-1/-ext-10002
>                     NumFilesPerFileSink: 1
>                     Statistics: Num rows: 1 Data size: 13500 Basic stats: COMPLETE Column stats: NONE
>                     Stats Publishing Key Prefix: file:/tmp/root/7160ea24-52b9-47c3-aafc-c9200263a1c6/hive_2021-01-04_14-15-27_601_190083924675700904-1/-mr-10001/.hive-staging_hive_2021-01-04_14-15-27_601_190083924675700904-1/-ext-10002/
>                     table:
>                         input format: org.apache.hadoop.mapred.SequenceFileInputFormat
>                         output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
>                         properties:
>                           columns _col0,_col1
>                           columns.types int:bigint
>                           escape.delim \
>                           hive.serialization.extend.additional.nesting.levels true
>                           serialization.escape.crlf true
>                           serialization.format 1
>                           serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>                         serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>                     TotalFiles: 1
>                     GatherStats: false
>                     MultiFileSpray: false
>   Stage: Stage-0
>     Fetch Operator
>       limit: 10
>       Processor Tree:
>         ListSink
> Time taken: 0.102 seconds, Fetched: 143 row(s)
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)