You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@hive.apache.org by "Nemon Lou (Jira)" <ji...@apache.org> on 2021/09/11 07:03:00 UTC
[jira] [Comment Edited] (HIVE-24579) Incorrect Result For Groupby
With Limit
[ https://issues.apache.org/jira/browse/HIVE-24579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17413481#comment-17413481 ]
Nemon Lou edited comment on HIVE-24579 at 9/11/21, 7:02 AM:
------------------------------------------------------------
Thanks [~kkasa] for your attention.
This issue only happens on a customer's cluster, and i could not get the data.
This simplified reproduce step seems not match the customer's issue.
Here is the original issue(with table name changed):
The query result is different for the same store_id when change limit 10 to limit 100
{code:sql}
SELECT store_id store_id_hive
, count(1) device_cnt_bound_30day
FROM db_name.table_name
WHERE i_rep_date <= 20201226
AND i_rep_date >= cast(from_unixtime(unix_timestamp('20201226','yyyyMMdd')-86400*29,'yyyyMMdd') as int)
AND nvl(is_curr_bound,1) = 1
group by store_id limit 10;
{code}
query plan :
{code:sql}
| Explain |
+----------------------------------------------------+
| Plan optimized by CBO. |
| |
| Vertex dependency in root stage |
| Reducer 2 <- Map 1 (SIMPLE_EDGE) |
| |
| Stage-0 |
| Fetch Operator |
| limit:10 |
| Stage-1 |
| Reducer 2 |
| File Output Operator [FS_8] |
| Limit [LIM_7] (rows=10 width=39) |
| Number of rows:10 |
| Group By Operator [GBY_5] (rows=5618832 width=39) |
| Output:["_col0","_col1"],aggregations:["count(VALUE._col0)"],keys:KEY._col0 |
| <-Map 1 [SIMPLE_EDGE] |
| SHUFFLE [RS_4] |
| PartitionCols:_col0 |
| Group By Operator [GBY_3] (rows=11237665 width=39) |
| Output:["_col0","_col1"],aggregations:["count()"],keys:store_id |
| Select Operator [SEL_2] (rows=11237665 width=39) |
| Output:["store_id"] |
| Filter Operator [FIL_9] (rows=11237665 width=39) |
| predicate:(NVL(is_curr_bound,1) = 1) |
| TableScan [TS_0] (rows=22475330 width=39) |
| db_name@table_name,table_name,Tbl:COMPLETE,Col:NONE,Output:["store_id","is_curr_bound"] |
{code}
part of the extended plan:
{code:sql}
| Reduce Output Operator |
| key expressions: _col0 (type: string) |
| null sort order: a |
| sort order: + |
| Map-reduce partition columns: _col0 (type: string) |
| Statistics: Num rows: 11237665 Data size: 438268935 Basic stats: COMPLETE Column stats: NONE |
| tag: -1 |
| TopN: 10 |
| TopN Hash Memory Usage: 0.1 |
| value expressions: _col1 (type: bigint) |
| auto parallelism: true |
{code}
was (Author: nemon):
Thanks [~kkasa] for your attention.
This issue only happens on a customer's cluster, and i could not get the data.
This simplified reproduce step seems not match the customer's issue.
Here is the original sql (with table name changed):
{code:sql}
SELECT store_id store_id_hive
, count(1) device_cnt_bound_30day
FROM db_name.table_name
WHERE i_rep_date <= 20201226
AND i_rep_date >= cast(from_unixtime(unix_timestamp('20201226','yyyyMMdd')-86400*29,'yyyyMMdd') as int)
AND nvl(is_curr_bound,1) = 1
group by store_id limit 10;
{code}
query plan :
{code:sql}
| Explain |
+----------------------------------------------------+
| Plan optimized by CBO. |
| |
| Vertex dependency in root stage |
| Reducer 2 <- Map 1 (SIMPLE_EDGE) |
| |
| Stage-0 |
| Fetch Operator |
| limit:10 |
| Stage-1 |
| Reducer 2 |
| File Output Operator [FS_8] |
| Limit [LIM_7] (rows=10 width=39) |
| Number of rows:10 |
| Group By Operator [GBY_5] (rows=5618832 width=39) |
| Output:["_col0","_col1"],aggregations:["count(VALUE._col0)"],keys:KEY._col0 |
| <-Map 1 [SIMPLE_EDGE] |
| SHUFFLE [RS_4] |
| PartitionCols:_col0 |
| Group By Operator [GBY_3] (rows=11237665 width=39) |
| Output:["_col0","_col1"],aggregations:["count()"],keys:store_id |
| Select Operator [SEL_2] (rows=11237665 width=39) |
| Output:["store_id"] |
| Filter Operator [FIL_9] (rows=11237665 width=39) |
| predicate:(NVL(is_curr_bound,1) = 1) |
| TableScan [TS_0] (rows=22475330 width=39) |
| db_name@table_name,table_name,Tbl:COMPLETE,Col:NONE,Output:["store_id","is_curr_bound"] |
{code}
part of the extended plan:
{code:sql}
Reduce Output Operator |
key expressions: _col0 (type: string) |
null sort order: a |
sort order: + |
Map-reduce partition columns: _col0 (type: string) |
Statistics: Num rows: 11237665 Data size: 438268935 Basic stats: COMPLETE Column stats: NONE |
tag: -1 |
TopN: 100 |
TopN Hash Memory Usage: 0.1 |
value expressions: _col1 (type: bigint) |
auto parallelism: true |
{code}
> Incorrect Result For Groupby With Limit
> ---------------------------------------
>
> Key: HIVE-24579
> URL: https://issues.apache.org/jira/browse/HIVE-24579
> Project: Hive
> Issue Type: Bug
> Affects Versions: 2.3.7, 3.1.2, 4.0.0
> Reporter: Nemon Lou
> Priority: Major
>
> {code:sql}
> create table test(id int);
> explain extended select id,count(*) from test group by id limit 10;
> {code}
> There is an TopN unexpectly for map phase, which casues incorrect result.
> {code:sql}
> STAGE PLANS:
> Stage: Stage-1
> Tez
> DagId: root_20210104141527_c599c0cd-ca2f-4c7d-a3cc-3a01d65c49a1:5
> Edges:
> Reducer 2 <- Map 1 (SIMPLE_EDGE)
> DagName: root_20210104141527_c599c0cd-ca2f-4c7d-a3cc-3a01d65c49a1:5
> Vertices:
> Map 1
> Map Operator Tree:
> TableScan
> alias: test
> Statistics: Num rows: 1 Data size: 13500 Basic stats: COMPLETE Column stats: NONE
> GatherStats: false
> Select Operator
> expressions: id (type: int)
> outputColumnNames: id
> Statistics: Num rows: 1 Data size: 13500 Basic stats: COMPLETE Column stats: NONE
> Group By Operator
> aggregations: count()
> keys: id (type: int)
> mode: hash
> outputColumnNames: _col0, _col1
> Statistics: Num rows: 1 Data size: 13500 Basic stats: COMPLETE Column stats: NONE
> Reduce Output Operator
> key expressions: _col0 (type: int)
> null sort order: a
> sort order: +
> Map-reduce partition columns: _col0 (type: int)
> Statistics: Num rows: 1 Data size: 13500 Basic stats: COMPLETE Column stats: NONE
> tag: -1
> TopN: 10
> TopN Hash Memory Usage: 0.1
> value expressions: _col1 (type: bigint)
> auto parallelism: true
> Execution mode: vectorized
> Path -> Alias:
> file:/user/hive/warehouse/test [test]
> Path -> Partition:
> file:/user/hive/warehouse/test
> Partition
> base file name: test
> input format: org.apache.hadoop.mapred.TextInputFormat
> output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
> properties:
> COLUMN_STATS_ACCURATE {"BASIC_STATS":"true","COLUMN_STATS":{"id":"true"}}
> bucket_count -1
> bucketing_version 2
> column.name.delimiter ,
> columns id
> columns.comments
> columns.types int
> file.inputformat org.apache.hadoop.mapred.TextInputFormat
> file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
> location file:/user/hive/warehouse/test
> name default.test
> numFiles 0
> numRows 0
> rawDataSize 0
> serialization.ddl struct test { i32 id}
> serialization.format 1
> serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> totalSize 0
> transient_lastDdlTime 1609730190
> serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>
> input format: org.apache.hadoop.mapred.TextInputFormat
> output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
> properties:
> COLUMN_STATS_ACCURATE {"BASIC_STATS":"true","COLUMN_STATS":{"id":"true"}}
> bucket_count -1
> bucketing_version 2
> column.name.delimiter ,
> columns id
> columns.comments
> columns.types int
> file.inputformat org.apache.hadoop.mapred.TextInputFormat
> file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
> location file:/user/hive/warehouse/test
> name default.test
> numFiles 0
> numRows 0
> rawDataSize 0
> serialization.ddl struct test { i32 id}
> serialization.format 1
> serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> totalSize 0
> transient_lastDdlTime 1609730190
> serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> name: default.test
> name: default.test
> Truncated Path -> Alias:
> /test [test]
> Reducer 2
> Execution mode: vectorized
> Needs Tagging: false
> Reduce Operator Tree:
> Group By Operator
> aggregations: count(VALUE._col0)
> keys: KEY._col0 (type: int)
> mode: mergepartial
> outputColumnNames: _col0, _col1
> Statistics: Num rows: 1 Data size: 13500 Basic stats: COMPLETE Column stats: NONE
> Limit
> Number of rows: 10
> Statistics: Num rows: 1 Data size: 13500 Basic stats: COMPLETE Column stats: NONE
> File Output Operator
> compressed: false
> GlobalTableId: 0
> directory: file:/tmp/root/7160ea24-52b9-47c3-aafc-c9200263a1c6/hive_2021-01-04_14-15-27_601_190083924675700904-1/-mr-10001/.hive-staging_hive_2021-01-04_14-15-27_601_190083924675700904-1/-ext-10002
> NumFilesPerFileSink: 1
> Statistics: Num rows: 1 Data size: 13500 Basic stats: COMPLETE Column stats: NONE
> Stats Publishing Key Prefix: file:/tmp/root/7160ea24-52b9-47c3-aafc-c9200263a1c6/hive_2021-01-04_14-15-27_601_190083924675700904-1/-mr-10001/.hive-staging_hive_2021-01-04_14-15-27_601_190083924675700904-1/-ext-10002/
> table:
> input format: org.apache.hadoop.mapred.SequenceFileInputFormat
> output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
> properties:
> columns _col0,_col1
> columns.types int:bigint
> escape.delim \
> hive.serialization.extend.additional.nesting.levels true
> serialization.escape.crlf true
> serialization.format 1
> serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> TotalFiles: 1
> GatherStats: false
> MultiFileSpray: false
> Stage: Stage-0
> Fetch Operator
> limit: 10
> Processor Tree:
> ListSink
> Time taken: 0.102 seconds, Fetched: 143 row(s)
> {code}
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)