You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@hive.apache.org by "Rich Haase (JIRA)" <ji...@apache.org> on 2015/06/27 00:29:04 UTC
[jira] [Commented] (HIVE-11132) Queries using join and group by
produce incorrect output when hive.auto.convert.join=false and
hive.optimize.reducededuplication=true
[ https://issues.apache.org/jira/browse/HIVE-11132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14603705#comment-14603705 ]
Rich Haase commented on HIVE-11132:
-----------------------------------
Explain plan when hive.auto.convert.join=false and hive.optimize.reducededuplication=true:
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: mooo
Statistics: Num rows: 1511511 Data size: 3402058087 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: (((oppty_id is not null and oppty_line_id is not null) and (order_order_system <> 'sfdc_performance')) and (oppty_id = '0064000000CZbnWAAT')) (type: boolean)
Statistics: Num rows: 188939 Data size: 425257542 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: '0064000000CZbnWAAT' (type: string), oppty_line_id (type: string)
sort order: ++
Map-reduce partition columns: '0064000000CZbnWAAT' (type: string), oppty_line_id (type: string)
Statistics: Num rows: 188939 Data size: 425257542 Basic stats: COMPLETE Column stats: NONE
TableScan
alias: mooo_s
Statistics: Num rows: 1511511 Data size: 940228122 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: ((oppty_id is not null and oppty_line_id is not null) and (oppty_id = '0064000000CZbnWAAT')) (type: boolean)
Statistics: Num rows: 188939 Data size: 117528593 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: '0064000000CZbnWAAT' (type: string), oppty_line_id (type: string)
sort order: ++
Map-reduce partition columns: '0064000000CZbnWAAT' (type: string), oppty_line_id (type: string)
Statistics: Num rows: 188939 Data size: 117528593 Basic stats: COMPLETE Column stats: NONE
TableScan
alias: forecast
Statistics: Num rows: 29923099 Data size: 7723657280 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: ((oppty_id is not null and oppty_line_id is not null) and (oppty_id = '0064000000CZbnWAAT')) (type: boolean)
Statistics: Num rows: 3740387 Data size: 965457063 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: '0064000000CZbnWAAT' (type: string), oppty_line_id (type: string)
sort order: ++
Map-reduce partition columns: '0064000000CZbnWAAT' (type: string), oppty_line_id (type: string)
Statistics: Num rows: 3740387 Data size: 965457063 Basic stats: COMPLETE Column stats: NONE
TableScan
alias: split
Statistics: Num rows: 2072636 Data size: 524862652 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: ((oppty_id is not null and oppty_line_id is not null) and (oppty_id = '0064000000CZbnWAAT')) (type: boolean)
Statistics: Num rows: 259079 Data size: 65607704 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: '0064000000CZbnWAAT' (type: string), oppty_line_id (type: string)
sort order: ++
Map-reduce partition columns: '0064000000CZbnWAAT' (type: string), oppty_line_id (type: string)
Statistics: Num rows: 259079 Data size: 65607704 Basic stats: COMPLETE Column stats: NONE
Reduce Operator Tree:
Join Operator
condition map:
Inner Join 0 to 1
Inner Join 0 to 2
Inner Join 0 to 3
condition expressions:
0
1
2
3
Statistics: Num rows: 12343277 Data size: 3186008376 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: '0064000000CZbnWAAT' (type: string)
outputColumnNames: _col0
Statistics: Num rows: 12343277 Data size: 3186008376 Basic stats: COMPLETE Column stats: NONE
Group By Operator
aggregations: count()
keys: _col0 (type: string)
mode: complete
outputColumnNames: _col0, _col1
Statistics: Num rows: 6171638 Data size: 1593004058 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: _col0 (type: string), _col1 (type: bigint)
outputColumnNames: _col0, _col1
Statistics: Num rows: 6171638 Data size: 1593004058 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: true
Statistics: Num rows: 6171638 Data size: 1593004058 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
Explain plan when hive.auto.convert.join=true and hive.optimize.reducededuplication=true
STAGE DEPENDENCIES:
Stage-10 is a root stage , consists of Stage-1
Stage-1
Stage-2 depends on stages: Stage-1
Stage-0 depends on stages: Stage-2
STAGE PLANS:
Stage: Stage-10
Conditional Operator
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: mooo
Statistics: Num rows: 1511511 Data size: 3402058087 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: (((oppty_id is not null and oppty_line_id is not null) and (order_order_system <> 'sfdc_performance')) and (oppty_id = '0064000000CZbnWAAT')) (type: boolean)
Statistics: Num rows: 188939 Data size: 425257542 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: '0064000000CZbnWAAT' (type: string), oppty_line_id (type: string)
sort order: ++
Map-reduce partition columns: '0064000000CZbnWAAT' (type: string), oppty_line_id (type: string)
Statistics: Num rows: 188939 Data size: 425257542 Basic stats: COMPLETE Column stats: NONE
TableScan
alias: mooo_s
Statistics: Num rows: 1511511 Data size: 940228122 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: ((oppty_id is not null and oppty_line_id is not null) and (oppty_id = '0064000000CZbnWAAT')) (type: boolean)
Statistics: Num rows: 188939 Data size: 117528593 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: '0064000000CZbnWAAT' (type: string), oppty_line_id (type: string)
sort order: ++
Map-reduce partition columns: '0064000000CZbnWAAT' (type: string), oppty_line_id (type: string)
Statistics: Num rows: 188939 Data size: 117528593 Basic stats: COMPLETE Column stats: NONE
TableScan
alias: forecast
Statistics: Num rows: 29923099 Data size: 7723657280 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: ((oppty_id is not null and oppty_line_id is not null) and (oppty_id = '0064000000CZbnWAAT')) (type: boolean)
Statistics: Num rows: 3740387 Data size: 965457063 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: '0064000000CZbnWAAT' (type: string), oppty_line_id (type: string)
sort order: ++
Map-reduce partition columns: '0064000000CZbnWAAT' (type: string), oppty_line_id (type: string)
Statistics: Num rows: 3740387 Data size: 965457063 Basic stats: COMPLETE Column stats: NONE
TableScan
alias: split
Statistics: Num rows: 2072636 Data size: 524862652 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: ((oppty_id is not null and oppty_line_id is not null) and (oppty_id = '0064000000CZbnWAAT')) (type: boolean)
Statistics: Num rows: 259079 Data size: 65607704 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: '0064000000CZbnWAAT' (type: string), oppty_line_id (type: string)
sort order: ++
Map-reduce partition columns: '0064000000CZbnWAAT' (type: string), oppty_line_id (type: string)
Statistics: Num rows: 259079 Data size: 65607704 Basic stats: COMPLETE Column stats: NONE
Reduce Operator Tree:
Join Operator
condition map:
Inner Join 0 to 1
Inner Join 0 to 2
Inner Join 0 to 3
condition expressions:
0
1
2
3
Statistics: Num rows: 12343277 Data size: 3186008376 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: '0064000000CZbnWAAT' (type: string)
outputColumnNames: _col0
Statistics: Num rows: 12343277 Data size: 3186008376 Basic stats: COMPLETE Column stats: NONE
Group By Operator
aggregations: count()
keys: _col0 (type: string)
mode: hash
outputColumnNames: _col0, _col1
Statistics: Num rows: 12343277 Data size: 3186008376 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: true
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe
Stage: Stage-2
Map Reduce
Map Operator Tree:
TableScan
Reduce Output Operator
key expressions: _col0 (type: string)
sort order: +
Map-reduce partition columns: _col0 (type: string)
Statistics: Num rows: 12343277 Data size: 3186008376 Basic stats: COMPLETE Column stats: NONE
value expressions: _col1 (type: bigint)
Reduce Operator Tree:
Group By Operator
aggregations: count(VALUE._col0)
keys: KEY._col0 (type: string)
mode: mergepartial
outputColumnNames: _col0, _col1
Statistics: Num rows: 6171638 Data size: 1593004058 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: _col0 (type: string), _col1 (type: bigint)
outputColumnNames: _col0, _col1
Statistics: Num rows: 6171638 Data size: 1593004058 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: true
Statistics: Num rows: 6171638 Data size: 1593004058 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
> Queries using join and group by produce incorrect output when hive.auto.convert.join=false and hive.optimize.reducededuplication=true
> -------------------------------------------------------------------------------------------------------------------------------------
>
> Key: HIVE-11132
> URL: https://issues.apache.org/jira/browse/HIVE-11132
> Project: Hive
> Issue Type: Bug
> Affects Versions: 0.14.0
> Reporter: Rich Haase
>
> Queries using join and group by produce multiple output rows with the same key when hive.auto.convert.join=false and hive.optimize.reducededuplication=true. This interaction between configuration parameters is unexpected and should be well documented at the very least and should likely be considered a bug.
> e.g.
> hive> set hive.auto.convert.join = false;
> hive> set hive.optimize.reducededuplication = true;
> hive> SELECT foo.id, count(*) as factor
> > FROM foo
> > JOIN bar ON (foo.id = bar.id and foo.line_id = bar.line_id)
> > JOIN split ON (foo.id = split.id and foo.line_id = split.line_id)
> > JOIN forecast ON (foo.id = forecast.id AND foo.line_id = forecast.line_id)
> > WHERE foo.order != ‘blah’ AND foo.id = ‘XYZ'
> > GROUP BY foo.id;
> XYZ 79
> XYZ 74
> XYZ 297
> XYZ 66
> hive> set hive.auto.convert.join = true;
> hive> set hive.optimize.reducededuplication = true;
> hive> SELECT foo.id, count(*) as factor
> > FROM foo
> > JOIN bar ON (foo.id = bar.id and foo.line_id = bar.line_id)
> > JOIN split ON (foo.id = split.id and foo.line_id = split.line_id)
> > JOIN forecast ON (foo.id = forecast.id AND foo.line_id = forecast.line_id)
> > WHERE foo.order != ‘blah’ AND foo.id = ‘XYZ'
> > GROUP BY foo.id;
> XYZ 516
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)