You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@hive.apache.org by "Krisztian Kasa (Jira)" <ji...@apache.org> on 2021/09/10 12:14:00 UTC
[jira] [Commented] (HIVE-24579) Incorrect Result For Groupby With Limit

    [ https://issues.apache.org/jira/browse/HIVE-24579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17413144#comment-17413144 ] 

Krisztian Kasa commented on HIVE-24579:
---------------------------------------

[~nemon]
The TopN is introduced because the Reduce Sink operator does sorting by the id column and there is also a limit 10 defined. So it broadcast only the first 10 rows to the reducer.

Could you please provide some test data which can be used to reproduce the incorrect result you mention in the summary?

> Incorrect Result For Groupby With Limit
> ---------------------------------------
>
>                 Key: HIVE-24579
>                 URL: https://issues.apache.org/jira/browse/HIVE-24579
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 2.3.7, 3.1.2, 4.0.0
>            Reporter: Nemon Lou
>            Priority: Major
>
> {code:sql}
> create table test(id int);
> explain extended select id,count(*) from test group by id limit 10;
> {code}
> There is an TopN unexpectly for map phase, which casues incorrect result.
> {code:sql}
> STAGE PLANS:
>   Stage: Stage-1
>     Tez
>       DagId: root_20210104141527_c599c0cd-ca2f-4c7d-a3cc-3a01d65c49a1:5
>       Edges:
>         Reducer 2 <- Map 1 (SIMPLE_EDGE)
>       DagName: root_20210104141527_c599c0cd-ca2f-4c7d-a3cc-3a01d65c49a1:5
>       Vertices:
>         Map 1 
>             Map Operator Tree:
>                 TableScan
>                   alias: test
>                   Statistics: Num rows: 1 Data size: 13500 Basic stats: COMPLETE Column stats: NONE
>                   GatherStats: false
>                   Select Operator
>                     expressions: id (type: int)
>                     outputColumnNames: id
>                     Statistics: Num rows: 1 Data size: 13500 Basic stats: COMPLETE Column stats: NONE
>                     Group By Operator
>                       aggregations: count()
>                       keys: id (type: int)
>                       mode: hash
>                       outputColumnNames: _col0, _col1
>                       Statistics: Num rows: 1 Data size: 13500 Basic stats: COMPLETE Column stats: NONE
>                       Reduce Output Operator
>                         key expressions: _col0 (type: int)
>                         null sort order: a
>                         sort order: +
>                         Map-reduce partition columns: _col0 (type: int)
>                         Statistics: Num rows: 1 Data size: 13500 Basic stats: COMPLETE Column stats: NONE
>                         tag: -1
>                         TopN: 10
>                         TopN Hash Memory Usage: 0.1
>                         value expressions: _col1 (type: bigint)
>                         auto parallelism: true
>             Execution mode: vectorized
>             Path -> Alias:
>               file:/user/hive/warehouse/test [test]
>             Path -> Partition:
>               file:/user/hive/warehouse/test 
>                 Partition
>                   base file name: test
>                   input format: org.apache.hadoop.mapred.TextInputFormat
>                   output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>                   properties:
>                     COLUMN_STATS_ACCURATE {"BASIC_STATS":"true","COLUMN_STATS":{"id":"true"}}
>                     bucket_count -1
>                     bucketing_version 2
>                     column.name.delimiter ,
>                     columns id
>                     columns.comments 
>                     columns.types int
>                     file.inputformat org.apache.hadoop.mapred.TextInputFormat
>                     file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>                     location file:/user/hive/warehouse/test
>                     name default.test
>                     numFiles 0
>                     numRows 0
>                     rawDataSize 0
>                     serialization.ddl struct test { i32 id}
>                     serialization.format 1
>                     serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>                     totalSize 0
>                     transient_lastDdlTime 1609730190
>                   serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>                 
>                     input format: org.apache.hadoop.mapred.TextInputFormat
>                     output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>                     properties:
>                       COLUMN_STATS_ACCURATE {"BASIC_STATS":"true","COLUMN_STATS":{"id":"true"}}
>                       bucket_count -1
>                       bucketing_version 2
>                       column.name.delimiter ,
>                       columns id
>                       columns.comments 
>                       columns.types int
>                       file.inputformat org.apache.hadoop.mapred.TextInputFormat
>                       file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>                       location file:/user/hive/warehouse/test
>                       name default.test
>                       numFiles 0
>                       numRows 0
>                       rawDataSize 0
>                       serialization.ddl struct test { i32 id}
>                       serialization.format 1
>                       serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>                       totalSize 0
>                       transient_lastDdlTime 1609730190
>                     serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>                     name: default.test
>                   name: default.test
>             Truncated Path -> Alias:
>               /test [test]
>         Reducer 2 
>             Execution mode: vectorized
>             Needs Tagging: false
>             Reduce Operator Tree:
>               Group By Operator
>                 aggregations: count(VALUE._col0)
>                 keys: KEY._col0 (type: int)
>                 mode: mergepartial
>                 outputColumnNames: _col0, _col1
>                 Statistics: Num rows: 1 Data size: 13500 Basic stats: COMPLETE Column stats: NONE
>                 Limit
>                   Number of rows: 10
>                   Statistics: Num rows: 1 Data size: 13500 Basic stats: COMPLETE Column stats: NONE
>                   File Output Operator
>                     compressed: false
>                     GlobalTableId: 0
>                     directory: file:/tmp/root/7160ea24-52b9-47c3-aafc-c9200263a1c6/hive_2021-01-04_14-15-27_601_190083924675700904-1/-mr-10001/.hive-staging_hive_2021-01-04_14-15-27_601_190083924675700904-1/-ext-10002
>                     NumFilesPerFileSink: 1
>                     Statistics: Num rows: 1 Data size: 13500 Basic stats: COMPLETE Column stats: NONE
>                     Stats Publishing Key Prefix: file:/tmp/root/7160ea24-52b9-47c3-aafc-c9200263a1c6/hive_2021-01-04_14-15-27_601_190083924675700904-1/-mr-10001/.hive-staging_hive_2021-01-04_14-15-27_601_190083924675700904-1/-ext-10002/
>                     table:
>                         input format: org.apache.hadoop.mapred.SequenceFileInputFormat
>                         output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
>                         properties:
>                           columns _col0,_col1
>                           columns.types int:bigint
>                           escape.delim \
>                           hive.serialization.extend.additional.nesting.levels true
>                           serialization.escape.crlf true
>                           serialization.format 1
>                           serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>                         serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>                     TotalFiles: 1
>                     GatherStats: false
>                     MultiFileSpray: false
>   Stage: Stage-0
>     Fetch Operator
>       limit: 10
>       Processor Tree:
>         ListSink
> Time taken: 0.102 seconds, Fetched: 143 row(s)
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)