Posted to dev@hive.apache.org by "Thejas M Nair (JIRA)" <ji...@apache.org> on 2013/10/01 03:04:24 UTC
[jira] [Updated] (HIVE-5357) ReduceSinkDeDuplication optimizer pick the wrong keys in pRS-cGBYm-cRS-cGBYr scenario when there are distinct keys in child GBY
[ https://issues.apache.org/jira/browse/HIVE-5357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Thejas M Nair updated HIVE-5357:
--------------------------------
Fix Version/s: 0.12.0 (was: 0.13.0)
> ReduceSinkDeDuplication optimizer pick the wrong keys in pRS-cGBYm-cRS-cGBYr scenario when there are distinct keys in child GBY
> -------------------------------------------------------------------------------------------------------------------------------
>
> Key: HIVE-5357
> URL: https://issues.apache.org/jira/browse/HIVE-5357
> Project: Hive
> Issue Type: Bug
> Components: Query Processor
> Affects Versions: 0.11.0
> Reporter: Chun Chen
> Assignee: Chun Chen
> Priority: Blocker
> Fix For: 0.12.0
>
> Attachments: HIVE-5357.patch
>
>
> Example:
> {code}
> select key, count(distinct value) from (select key, value from src group by key, value) t group by key;
> //result
> 0 0 NULL
> 10 10 NULL
> 100 100 NULL
> 103 103 NULL
> 104 104 NULL
> {code}
> The result is clearly wrong: the query selects two columns, but each output row has three values, and the last is NULL.
> Consider a simple group-by query with a distinct column:
> {code}
> explain select count(distinct value) from src group by key;
> {code}
> The plan is:
> {code}
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-0 is a root stage
> STAGE PLANS:
>   Stage: Stage-1
>     Map Reduce
>       Alias -> Map Operator Tree:
>         src
>           TableScan
>             alias: src
>             Select Operator
>               expressions:
>                     expr: key
>                     type: string
>                     expr: value
>                     type: string
>               outputColumnNames: key, value
>               Group By Operator
>                 aggregations:
>                       expr: count(DISTINCT value)
>                 bucketGroup: false
>                 keys:
>                       expr: key
>                       type: string
>                       expr: value
>                       type: string
>                 mode: hash
>                 outputColumnNames: _col0, _col1, _col2
>                 Reduce Output Operator
>                   key expressions:
>                         expr: _col0
>                         type: string
>                         expr: _col1
>                         type: string
>                   sort order: ++
>                   Map-reduce partition columns:
>                         expr: _col0
>                         type: string
>                   tag: -1
>                   value expressions:
>                         expr: _col2
>                         type: bigint
>       Reduce Operator Tree:
>         Group By Operator
>           aggregations:
>                 expr: count(DISTINCT KEY._col1:0._col0)
>           bucketGroup: false
>           keys:
>                 expr: KEY._col0
>                 type: string
>           mode: mergepartial
>           outputColumnNames: _col0, _col1
>           Select Operator
>             expressions:
>                   expr: _col1
>                   type: bigint
>             outputColumnNames: _col0
>             File Output Operator
>               compressed: false
>               GlobalTableId: 0
>               table:
>                   input format: org.apache.hadoop.mapred.TextInputFormat
>                   output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>                   serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>   Stage: Stage-0
>     Fetch Operator
>       limit: -1
> {code}
> The map-side GBY also adds the distinct columns (value in this case) to its key columns.
> When RSDedup optimizes a query involving a GBY with distinct keys and map-side aggregation is enabled, it currently assigns the map-side GBY's key columns to the reduce-side GBY. So, for the example shown at the beginning, once the plan is reduced to a single MR job, the second GBY on the reduce side uses both key and value as its key columns. The correct key column is key alone.
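The key mix-up described above can be illustrated with a small model. This is only a sketch of the reasoning, not Hive code and not the attached patch: the function names are hypothetical, and it assumes the map-side GBY's key list is simply the grouping keys followed by the distinct columns, as in the plan shown in the report.

```python
from typing import List, Sequence


def map_side_gby_keys(group_keys: Sequence[str], distinct_cols: Sequence[str]) -> List[str]:
    """Model of the map-side GBY key list: grouping keys, then distinct columns.

    Assumption: the distinct columns are appended after the grouping keys,
    matching the 'keys: key, value' section of the plan in the report.
    """
    return list(group_keys) + list(distinct_cols)


def reduce_side_keys_reported(map_keys: Sequence[str]) -> List[str]:
    """Behaviour described in the report: all map-side keys are reused as-is."""
    return list(map_keys)


def reduce_side_keys_expected(map_keys: Sequence[str], num_group_keys: int) -> List[str]:
    """Expected behaviour: only the leading grouping keys become reduce-side keys."""
    return list(map_keys[:num_group_keys])


if __name__ == "__main__":
    group_keys = ["key"]       # from 'group by key'
    distinct_cols = ["value"]  # from 'count(distinct value)'
    map_keys = map_side_gby_keys(group_keys, distinct_cols)
    print("map-side keys:        ", map_keys)
    print("reduce-side (reported):", reduce_side_keys_reported(map_keys))
    print("reduce-side (expected):", reduce_side_keys_expected(map_keys, len(group_keys)))
```

In this model the reported behaviour groups on both key and value on the reduce side, while the expected behaviour groups on key only, which is the difference the report points out.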