You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@hive.apache.org by "wangmeng (JIRA)" <ji...@apache.org> on 2015/06/09 11:02:01 UTC
[jira] [Commented] (HIVE-10971) count(*) with count(distinct) gives wrong results when hive.groupby.skewindata=true

    [ https://issues.apache.org/jira/browse/HIVE-10971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14578602#comment-14578602 ] 

wangmeng commented on HIVE-10971:
---------------------------------

{code}
hive> set  hive.groupby.skewindata=true;
hive> explain select l_returnflag,count(*),count(distinct  l_linestatus) from lineitem  group by l_returnflag  limit 10;
OK
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: lineitem
            Statistics: Num rows: 1008537518 Data size: 201707503616 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: l_returnflag (type: string), l_linestatus (type: string)
              outputColumnNames: l_returnflag, l_linestatus
              Statistics: Num rows: 1008537518 Data size: 201707503616 Basic stats: COMPLETE Column stats: NONE
              Group By Operator
                aggregations: count(), count(DISTINCT l_linestatus)
                keys: l_returnflag (type: string), l_linestatus (type: string)
                mode: hash
                outputColumnNames: _col0, _col1, _col2, _col3
                Statistics: Num rows: 1008537518 Data size: 201707503616 Basic stats: COMPLETE Column stats: NONE
                Reduce Output Operator
                  key expressions: _col0 (type: string), _col1 (type: string)
                  sort order: ++
                  Map-reduce partition columns: _col0 (type: string)
                  Statistics: Num rows: 1008537518 Data size: 201707503616 Basic stats: COMPLETE Column stats: NONE
                  value expressions: _col2 (type: bigint)
      Reduce Operator Tree:
        Group By Operator
          aggregations: count(VALUE._col0), count(DISTINCT KEY._col1:0._col0)
          keys: KEY._col0 (type: string)
          mode: complete
          outputColumnNames: _col0, _col1, _col2
          Statistics: Num rows: 504268759 Data size: 100853751808 Basic stats: COMPLETE Column stats: NONE
          Select Operator
            expressions: _col0 (type: string), _col1 (type: bigint), _col2 (type: bigint)
            outputColumnNames: _col0, _col1, _col2
            Statistics: Num rows: 504268759 Data size: 100853751808 Basic stats: COMPLETE Column stats: NONE
            Limit
              Number of rows: 10
              Statistics: Num rows: 10 Data size: 2000 Basic stats: COMPLETE Column stats: NONE
              File Output Operator
                compressed: true
                Statistics: Num rows: 10 Data size: 2000 Basic stats: COMPLETE Column stats: NONE
                table:
                    input format: org.apache.hadoop.mapred.TextInputFormat
                    output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                    serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: 10
{code}

When hive.groupby.skewindata=false, the Group By operator has mode "mergepartial", which gives the correct results.

> count(*) with count(distinct) gives wrong results when hive.groupby.skewindata=true
> -----------------------------------------------------------------------------------
>
>                 Key: HIVE-10971
>                 URL: https://issues.apache.org/jira/browse/HIVE-10971
>             Project: Hive
>          Issue Type: Bug
>          Components: Hive
>    Affects Versions: 1.2.0
>            Reporter: wangmeng
>            Assignee: wangmeng
>
> When hive.groupby.skewindata=true, the following query based on TPC-H gives wrong results:
> {code}
> set hive.groupby.skewindata=true;
> select l_returnflag, count(*), count(distinct l_linestatus)
> from lineitem
> group by l_returnflag
> limit 10;
> {code}
> The query plan shows that it generates only one MapReduce job instead of two theoretically, which is dictated by hive.groupby.skewindata=true.
> The problem arises only when {noformat}count(*){noformat} and {noformat}count(distinct){noformat} exist together.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)