Posted to issues@hive.apache.org by "Janaki Lahorani (JIRA)" <ji...@apache.org> on 2018/09/17 02:04:00 UTC
[jira] [Updated] (HIVE-20570) Union ALL with hive.optimize.union.remove=true has incorrect plan
[ https://issues.apache.org/jira/browse/HIVE-20570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Janaki Lahorani updated HIVE-20570:
-----------------------------------
Description:
When hive.optimize.union.remove=true and a UNION ALL query whose branches each use GROUP BY is run, the final fetch waits for only one of the branches instead of both.
Test Case:
{code}
create table if not exists test_table(column1 string, column2 int);
insert into test_table values('a',1),('b',2);
set hive.optimize.union.remove=true;
set mapred.input.dir.recursive=true;
explain
select column1 from test_table group by column1
union all
select column1 from test_table group by column1;
{code}
In the plan below, Stage-1 and Stage-2 correspond to the two branches of the UNION ALL. The final fetch operator (Stage-0) depends only on Stage-1, but it should depend on both stages.
Plan:
{code}
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-2 is a root stage
  *Stage-0 depends on stages: Stage-1*
STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: test_table
            Statistics: Num rows: 2 Data size: 6 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: column1 (type: string)
              outputColumnNames: column1
              Statistics: Num rows: 2 Data size: 6 Basic stats: COMPLETE Column stats: NONE
              Group By Operator
                keys: column1 (type: string)
                mode: hash
                outputColumnNames: _col0
                Statistics: Num rows: 2 Data size: 6 Basic stats: COMPLETE Column stats: NONE
                Reduce Output Operator
                  key expressions: _col0 (type: string)
                  sort order: +
                  Map-reduce partition columns: _col0 (type: string)
                  Statistics: Num rows: 2 Data size: 6 Basic stats: COMPLETE Column stats: NONE
      Execution mode: vectorized
      Reduce Operator Tree:
        Group By Operator
          keys: KEY._col0 (type: string)
          mode: mergepartial
          outputColumnNames: _col0
          Statistics: Num rows: 1 Data size: 3 Basic stats: COMPLETE Column stats: NONE
          File Output Operator
            compressed: false
            Statistics: Num rows: 1 Data size: 3 Basic stats: COMPLETE Column stats: NONE
            table:
                input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
  Stage: Stage-2
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: test_table
            Statistics: Num rows: 2 Data size: 6 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: column1 (type: string)
              outputColumnNames: column1
              Statistics: Num rows: 2 Data size: 6 Basic stats: COMPLETE Column stats: NONE
              Group By Operator
                keys: column1 (type: string)
                mode: hash
                outputColumnNames: _col0
                Statistics: Num rows: 2 Data size: 6 Basic stats: COMPLETE Column stats: NONE
                Reduce Output Operator
                  key expressions: _col0 (type: string)
                  sort order: +
                  Map-reduce partition columns: _col0 (type: string)
                  Statistics: Num rows: 2 Data size: 6 Basic stats: COMPLETE Column stats: NONE
      Execution mode: vectorized
      Reduce Operator Tree:
        Group By Operator
          keys: KEY._col0 (type: string)
          mode: mergepartial
          outputColumnNames: _col0
          Statistics: Num rows: 1 Data size: 3 Basic stats: COMPLETE Column stats: NONE
          File Output Operator
            compressed: false
            Statistics: Num rows: 1 Data size: 3 Basic stats: COMPLETE Column stats: NONE
            table:
                input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
{code}
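For comparison, a minimal sketch of the same test case with the optimization switched off (false is the default for hive.optimize.union.remove). This is only a suggested cross-check and not part of the original report. The expectation that Stage-0 then depends, directly or through an intermediate stage, on both branches is an assumption, and the resulting plan is not reproduced here.
{code}
-- Assumes test_table from the test case above already exists.
-- Disable the union-remove optimization and explain the same query.
set hive.optimize.union.remove=false;
explain
select column1 from test_table group by column1
union all
select column1 from test_table group by column1;
-- Compare the STAGE DEPENDENCIES section of this output with the plan above.
{code}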
> Union ALL with hive.optimize.union.remove=true has incorrect plan
> -----------------------------------------------------------------
>
> Key: HIVE-20570
> URL: https://issues.apache.org/jira/browse/HIVE-20570
> Project: Hive
> Issue Type: Bug
> Reporter: Janaki Lahorani
> Assignee: Janaki Lahorani
> Priority: Major