Posted to issues@hive.apache.org by "Nemon Lou (Jira)" <ji...@apache.org> on 2021/01/04 03:44:00 UTC
[jira] [Commented] (HIVE-24579) Incorrect Result For Groupby With Limit

    [ https://issues.apache.org/jira/browse/HIVE-24579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17257932#comment-17257932 ] 

Nemon Lou commented on HIVE-24579:
----------------------------------

A workaround is to set hive.limit.pushdown.memory.usage=0.
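
A minimal sketch of applying the workaround per session, using the query from the report. It assumes the property can be overridden with SET in your deployment (that is, it is not restricted by the administrator).

{code:sql}
-- Disable the limit pushdown (TopN) optimization for the current session only.
set hive.limit.pushdown.memory.usage=0;

-- Re-check the plan: the map-side Reduce Output Operator is expected
-- to no longer list "TopN: 10".
explain extended select id,count(*) from test group by id limit 10;
{code}

The setting is session-scoped here so that other queries keep the default behaviour.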

 

> Incorrect Result For Groupby With Limit
> ---------------------------------------
>
>                 Key: HIVE-24579
>                 URL: https://issues.apache.org/jira/browse/HIVE-24579
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 2.3.7, 3.1.2, 4.0.0
>            Reporter: Nemon Lou
>            Priority: Critical
>
> {code:sql}
> create table test(id int);
> explain extended select id,count(*) from test group by id limit 10;
> {code}
> An unexpected TopN appears in the map phase, which causes incorrect results.
> {code:sql}
> STAGE PLANS:
>   Stage: Stage-1
>     Map Reduce
>       Map Operator Tree:
>           TableScan
>             alias: test
>             Statistics: Num rows: 337 Data size: 1350 Basic stats: COMPLETE Column stats: NONE
>             GatherStats: false
>             Select Operator
>               expressions: id (type: int)
>               outputColumnNames: id
>               Statistics: Num rows: 337 Data size: 1350 Basic stats: COMPLETE Column stats: NONE
>               Group By Operator
>                 aggregations: count()
>                 keys: id (type: int)
>                 mode: hash
>                 outputColumnNames: _col0, _col1
>                 Statistics: Num rows: 337 Data size: 1350 Basic stats: COMPLETE Column stats: NONE
>                 Reduce Output Operator
>                   key expressions: _col0 (type: int)
>                   null sort order: a
>                   sort order: +
>                   Map-reduce partition columns: _col0 (type: int)
>                   Statistics: Num rows: 337 Data size: 1350 Basic stats: COMPLETE Column stats: NONE
>                   tag: -1
>                   TopN: 10
>                   TopN Hash Memory Usage: 0.1
>                   value expressions: _col1 (type: bigint)
>                   auto parallelism: false
>       Path -> Alias:
>         file:/user/hive/warehouse/test [test]
>       Path -> Partition:
>         file:/user/hive/warehouse/test
>           Partition
>             base file name: test
>             input format: org.apache.hadoop.mapred.TextInputFormat
>             output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>             properties:
>               COLUMN_STATS_ACCURATE \{"BASIC_STATS":"true"}
>               bucket_count -1
>               column.name.delimiter ,
>               columns id
>               columns.comments
>               columns.types int
>               file.inputformat org.apache.hadoop.mapred.TextInputFormat
>               file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>               location file:/user/hive/warehouse/test
>               name default.test
>               numFiles 0
>               numRows 0
>               rawDataSize 0
>               serialization.ddl struct test \{ i32 id}
>               serialization.format 1
>               serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>               totalSize 0
>               transient_lastDdlTime 1609730036
>             serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>
>               input format: org.apache.hadoop.mapred.TextInputFormat
>               output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>               properties:
>                 COLUMN_STATS_ACCURATE \{"BASIC_STATS":"true"}
>                 bucket_count -1
>                 column.name.delimiter ,
>                 columns id
>                 columns.comments
>                 columns.types int
>                 file.inputformat org.apache.hadoop.mapred.TextInputFormat
>                 file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>                 location file:/user/hive/warehouse/test
>                 name default.test
>                 numFiles 0
>                 numRows 0
>                 rawDataSize 0
>                 serialization.ddl struct test \{ i32 id}
>                 serialization.format 1
>                 serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>                 totalSize 0
>                 transient_lastDdlTime 1609730036
>               serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>               name: default.test
>             name: default.test
>       Truncated Path -> Alias:
>         /test [test]
>       Needs Tagging: false
>       Reduce Operator Tree:
>         Group By Operator
>           aggregations: count(VALUE._col0)
>           keys: KEY._col0 (type: int)
>           mode: mergepartial
>           outputColumnNames: _col0, _col1
>           Statistics: Num rows: 168 Data size: 672 Basic stats: COMPLETE Column stats: NONE
>           Limit
>             Number of rows: 10
>             Statistics: Num rows: 10 Data size: 40 Basic stats: COMPLETE Column stats: NONE
>             File Output Operator
>               compressed: false
>               GlobalTableId: 0
>               directory: file:/tmp/root/bd08973b-b58c-4185-9072-c1891f67878d/hive_2021-01-04_11-14-01_745_4475755683092435506-1/-mr-10001/.hive-staging_hive_2021-01-04_11-14-01_745_4475755683092435506-1/-ext-10002
>               NumFilesPerFileSink: 1
>               Statistics: Num rows: 10 Data size: 40 Basic stats: COMPLETE Column stats: NONE
>               Stats Publishing Key Prefix: file:/tmp/root/bd08973b-b58c-4185-9072-c1891f67878d/hive_2021-01-04_11-14-01_745_4475755683092435506-1/-mr-10001/.hive-staging_hive_2021-01-04_11-14-01_745_4475755683092435506-1/-ext-10002/
>               table:
>                   input format: org.apache.hadoop.mapred.SequenceFileInputFormat
>                   output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
>                   properties:
>                     columns _col0,_col1
>                     columns.types int:bigint
>                     escape.delim \
>                     hive.serialization.extend.additional.nesting.levels true
>                     serialization.escape.crlf true
>                     serialization.format 1
>                     serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>                   serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>               TotalFiles: 1
>               GatherStats: false
>               MultiFileSpray: false
>   Stage: Stage-0
>     Fetch Operator
>       limit: 10
>       Processor Tree:
>         ListSink
> Time taken: 1.877 seconds, Fetched: 128 row(s)
> {code}
>  


