Posted to dev@hive.apache.org by "Ted Xu (JIRA)" <ji...@apache.org> on 2010/05/12 10:42:41 UTC
[jira] Created: (HIVE-1342) Predicate push down get error result
when sub-queries have the same alias name
Predicate push down get error result when sub-queries have the same alias name
-------------------------------------------------------------------------------
Key: HIVE-1342
URL: https://issues.apache.org/jira/browse/HIVE-1342
Project: Hadoop Hive
Issue Type: Bug
Components: Query Processor
Affects Versions: 0.5.0, 0.4.2
Reporter: Ted Xu
Priority: Critical
The query is over-optimized by PPD (predicate push down) when sub-queries have the same alias name. See the query:
-------------------------------
create table if not exists dm_fact_buyer_prd_info_d (
  category_id string
  ,gmv_trade_num int
  ,user_id int
)
PARTITIONED BY (ds int);
set hive.optimize.ppd=true;
set hive.map.aggr=true;
explain select category_id1,category_id2,assoc_idx
from (
  select
    category_id1
    , category_id2
    , count(distinct user_id) as assoc_idx
  from (
    select
      t1.category_id as category_id1
      , t2.category_id as category_id2
      , t1.user_id
    from (
      select category_id, user_id
      from dm_fact_buyer_prd_info_d
      group by category_id, user_id ) t1
    join (
      select category_id, user_id
      from dm_fact_buyer_prd_info_d
      group by category_id, user_id ) t2 on t1.user_id=t2.user_id
  ) t1
  group by category_id1, category_id2 ) t_o
where category_id1 <> category_id2
and assoc_idx > 2;
-----------------------------
The query above fails when executed, throwing the exception: "can not cast UDFOpNotEqual(Text, IntWritable) to UDFOpNotEqual(Text, Text)".
I ran explain on the query and the execution plan looks really weird (only Stage-1 is shown; see the highlighted predicate):
-------------------------------
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
t_o:t1:t1:dm_fact_buyer_prd_info_d
TableScan
alias: dm_fact_buyer_prd_info_d
Filter Operator
predicate:
expr: *(category_id <> user_id)*
type: boolean
Select Operator
expressions:
expr: category_id
type: string
expr: user_id
type: bigint
outputColumnNames: category_id, user_id
Group By Operator
keys:
expr: category_id
type: string
expr: user_id
type: bigint
mode: hash
outputColumnNames: _col0, _col1
Reduce Output Operator
key expressions:
expr: _col0
type: string
expr: _col1
type: bigint
sort order: ++
Map-reduce partition columns:
expr: _col0
type: string
expr: _col1
type: bigint
tag: -1
Reduce Operator Tree:
Group By Operator
keys:
expr: KEY._col0
type: string
expr: KEY._col1
type: bigint
mode: mergepartial
outputColumnNames: _col0, _col1
Select Operator
expressions:
expr: _col0
type: string
expr: _col1
type: bigint
outputColumnNames: _col0, _col1
File Output Operator
compressed: true
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
----------------------------------
If predicate push down is disabled (set hive.optimize.ppd=false), the error is gone; disabling map-side aggregation also makes the error go away.
*Changing the alias of subquery 't1' (either the inner one or the join result) also makes the bug disappear.*
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-1342) Predicate push down get error result
when sub-queries have the same alias name
Posted by "John Sichi (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HIVE-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910336#action_12910336 ]
John Sichi commented on HIVE-1342:
----------------------------------
Finally got back to this one. Let me provide some specific examples to better explain what I wrote.
First, latest trunk without any patch.
{noformat}
-- Q1.trunk: Without a nested select, the plan is correct for this query.
-- (we're not allowed to push filter down into null-generating side of outer join)
hive> explain
> SELECT a.foo as foo1, b.foo as foo2, b.bar
> FROM pokes a LEFT OUTER JOIN pokes2 b
> ON a.foo=b.foo
> WHERE b.bar=3;
OK
ABSTRACT SYNTAX TREE:
(TOK_QUERY (TOK_FROM (TOK_LEFTOUTERJOIN (TOK_TABREF pokes a) (TOK_TABREF pokes2 b) (= (. (TOK_TABLE_OR_COL a) foo) (. (TOK_TABLE_OR_COL b) foo)))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) foo) foo1) (TOK_SELEXPR (. (TOK_TABLE_OR_COL b) foo) foo2) (TOK_SELEXPR (. (TOK_TABLE_OR_COL b) bar))) (TOK_WHERE (= (. (TOK_TABLE_OR_COL b) bar) 3))))
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
a
TableScan
alias: a
Reduce Output Operator
key expressions:
expr: foo
type: int
sort order: +
Map-reduce partition columns:
expr: foo
type: int
tag: 0
value expressions:
expr: foo
type: int
b
TableScan
alias: b
Reduce Output Operator
key expressions:
expr: foo
type: int
sort order: +
Map-reduce partition columns:
expr: foo
type: int
tag: 1
value expressions:
expr: foo
type: int
expr: bar
type: string
Reduce Operator Tree:
Join Operator
condition map:
Left Outer Join0 to 1
condition expressions:
0 {VALUE._col0}
1 {VALUE._col0} {VALUE._col1}
handleSkewJoin: false
outputColumnNames: _col0, _col4, _col5
Filter Operator
predicate:
expr: (_col5 = 3)
type: boolean
Select Operator
expressions:
expr: _col0
type: int
expr: _col4
type: int
expr: _col5
type: string
outputColumnNames: _col0, _col1, _col2
File Output Operator
compressed: false
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Stage: Stage-0
Fetch Operator
limit: -1
-- Q2.trunk: For this equivalent query written using a nested select, the plan is incorrect.
-- (filter got pushed down when it shouldn't; note that in the wrapping select, a.bar should resolve to b.bar in the nested select)
hive> explain
> SELECT * FROM
> (SELECT a.foo as foo1, b.foo as foo2, b.bar
> FROM pokes a LEFT OUTER JOIN pokes2 b
> ON a.foo=b.foo) a
> WHERE a.bar=3;
OK
ABSTRACT SYNTAX TREE:
(TOK_QUERY (TOK_FROM (TOK_SUBQUERY (TOK_QUERY (TOK_FROM (TOK_LEFTOUTERJOIN (TOK_TABREF pokes a) (TOK_TABREF pokes2 b) (= (. (TOK_TABLE_OR_COL a) foo) (. (TOK_TABLE_OR_COL b) foo)))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) foo) foo1) (TOK_SELEXPR (. (TOK_TABLE_OR_COL b) foo) foo2) (TOK_SELEXPR (. (TOK_TABLE_OR_COL b) bar))))) a)) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR TOK_ALLCOLREF)) (TOK_WHERE (= (. (TOK_TABLE_OR_COL a) bar) 3))))
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
a:a
TableScan
alias: a
Reduce Output Operator
key expressions:
expr: foo
type: int
sort order: +
Map-reduce partition columns:
expr: foo
type: int
tag: 0
value expressions:
expr: foo
type: int
a:b
TableScan
alias: b
Filter Operator
predicate:
expr: (bar = 3)
type: boolean
Reduce Output Operator
key expressions:
expr: foo
type: int
sort order: +
Map-reduce partition columns:
expr: foo
type: int
tag: 1
value expressions:
expr: foo
type: int
expr: bar
type: string
Reduce Operator Tree:
Join Operator
condition map:
Left Outer Join0 to 1
condition expressions:
0 {VALUE._col0}
1 {VALUE._col0} {VALUE._col1}
handleSkewJoin: false
outputColumnNames: _col0, _col4, _col5
Select Operator
expressions:
expr: _col0
type: int
expr: _col4
type: int
expr: _col5
type: string
outputColumnNames: _col0, _col1, _col2
Filter Operator
predicate:
expr: (_col2 = 3)
type: boolean
Select Operator
expressions:
expr: _col0
type: int
expr: _col1
type: int
expr: _col2
type: string
outputColumnNames: _col0, _col1, _col2
File Output Operator
compressed: false
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Stage: Stage-0
Fetch Operator
limit: -1
-- Q3.trunk: However, for this semantically different case, the plan is correct.
-- (we're allowed to push the filter down for an inner join)
hive>
> explain
> SELECT * FROM
> (SELECT a.foo as foo1, b.foo as foo2, a.bar
> FROM pokes a JOIN pokes2 b
> ON a.foo=b.foo) a
> WHERE a.bar=3;
OK
ABSTRACT SYNTAX TREE:
(TOK_QUERY (TOK_FROM (TOK_SUBQUERY (TOK_QUERY (TOK_FROM (TOK_JOIN (TOK_TABREF pokes a) (TOK_TABREF pokes2 b) (= (. (TOK_TABLE_OR_COL a) foo) (. (TOK_TABLE_OR_COL b) foo)))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) foo) foo1) (TOK_SELEXPR (. (TOK_TABLE_OR_COL b) foo) foo2) (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) bar))))) a)) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR TOK_ALLCOLREF)) (TOK_WHERE (= (. (TOK_TABLE_OR_COL a) bar) 3))))
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
a:a
TableScan
alias: a
Filter Operator
predicate:
expr: (bar = 3)
type: boolean
Reduce Output Operator
key expressions:
expr: foo
type: int
sort order: +
Map-reduce partition columns:
expr: foo
type: int
tag: 0
value expressions:
expr: foo
type: int
expr: bar
type: string
a:b
TableScan
alias: b
Reduce Output Operator
key expressions:
expr: foo
type: int
sort order: +
Map-reduce partition columns:
expr: foo
type: int
tag: 1
value expressions:
expr: foo
type: int
Reduce Operator Tree:
Join Operator
condition map:
Inner Join 0 to 1
condition expressions:
0 {VALUE._col0} {VALUE._col1}
1 {VALUE._col0}
handleSkewJoin: false
outputColumnNames: _col0, _col1, _col4
Select Operator
expressions:
expr: _col0
type: int
expr: _col4
type: int
expr: _col1
type: string
outputColumnNames: _col0, _col1, _col2
Filter Operator
predicate:
expr: (_col2 = 3)
type: boolean
Select Operator
expressions:
expr: _col0
type: int
expr: _col1
type: int
expr: _col2
type: string
outputColumnNames: _col0, _col1, _col2
File Output Operator
compressed: false
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Stage: Stage-0
Fetch Operator
limit: -1
{noformat}
Now, repeating Q1/Q2/Q3 with the patch:
{noformat}
-- Q1.patch: this plan is good
-- (same result as Q1.trunk, as expected)
hive> explain
> SELECT a.foo as foo1, b.foo as foo2, b.bar
> FROM pokes a LEFT OUTER JOIN pokes2 b
> ON a.foo=b.foo
> WHERE b.bar=3;
OK
ABSTRACT SYNTAX TREE:
(TOK_QUERY (TOK_FROM (TOK_LEFTOUTERJOIN (TOK_TABREF pokes a) (TOK_TABREF pokes2 b) (= (. (TOK_TABLE_OR_COL a) foo) (. (TOK_TABLE_OR_COL b) foo)))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) foo) foo1) (TOK_SELEXPR (. (TOK_TABLE_OR_COL b) foo) foo2) (TOK_SELEXPR (. (TOK_TABLE_OR_COL b) bar))) (TOK_WHERE (= (. (TOK_TABLE_OR_COL b) bar) 3))))
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
a
TableScan
alias: a
Reduce Output Operator
key expressions:
expr: foo
type: int
sort order: +
Map-reduce partition columns:
expr: foo
type: int
tag: 0
value expressions:
expr: foo
type: int
b
TableScan
alias: b
Reduce Output Operator
key expressions:
expr: foo
type: int
sort order: +
Map-reduce partition columns:
expr: foo
type: int
tag: 1
value expressions:
expr: foo
type: int
expr: bar
type: string
Reduce Operator Tree:
Join Operator
condition map:
Left Outer Join0 to 1
condition expressions:
0 {VALUE._col0}
1 {VALUE._col0} {VALUE._col1}
handleSkewJoin: false
outputColumnNames: _col0, _col4, _col5
Filter Operator
predicate:
expr: (_col5 = 3)
type: boolean
Select Operator
expressions:
expr: _col0
type: int
expr: _col4
type: int
expr: _col5
type: string
outputColumnNames: _col0, _col1, _col2
File Output Operator
compressed: false
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Stage: Stage-0
Fetch Operator
limit: -1
-- Q2.patch: this time, the plan is good (no pushdown)
-- (the patch fixes the bug exhibited on trunk)
hive> explain
> SELECT * FROM
> (SELECT a.foo as foo1, b.foo as foo2, b.bar
> FROM pokes a LEFT OUTER JOIN pokes2 b
> ON a.foo=b.foo) a
> WHERE a.bar=3;
OK
ABSTRACT SYNTAX TREE:
(TOK_QUERY (TOK_FROM (TOK_SUBQUERY (TOK_QUERY (TOK_FROM (TOK_LEFTOUTERJOIN (TOK_TABREF pokes a) (TOK_TABREF pokes2 b) (= (. (TOK_TABLE_OR_COL a) foo) (. (TOK_TABLE_OR_COL b) foo)))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) foo) foo1) (TOK_SELEXPR (. (TOK_TABLE_OR_COL b) foo) foo2) (TOK_SELEXPR (. (TOK_TABLE_OR_COL b) bar))))) a)) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR TOK_ALLCOLREF)) (TOK_WHERE (= (. (TOK_TABLE_OR_COL a) bar) 3))))
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
a:a
TableScan
alias: a
Reduce Output Operator
key expressions:
expr: foo
type: int
sort order: +
Map-reduce partition columns:
expr: foo
type: int
tag: 0
value expressions:
expr: foo
type: int
a:b
TableScan
alias: b
Reduce Output Operator
key expressions:
expr: foo
type: int
sort order: +
Map-reduce partition columns:
expr: foo
type: int
tag: 1
value expressions:
expr: foo
type: int
expr: bar
type: string
Reduce Operator Tree:
Join Operator
condition map:
Left Outer Join0 to 1
condition expressions:
0 {VALUE._col0}
1 {VALUE._col0} {VALUE._col1}
handleSkewJoin: false
outputColumnNames: _col0, _col4, _col5
Select Operator
expressions:
expr: _col0
type: int
expr: _col4
type: int
expr: _col5
type: string
outputColumnNames: _col0, _col1, _col2
Filter Operator
predicate:
expr: (_col2 = 3)
type: boolean
Select Operator
expressions:
expr: _col0
type: int
expr: _col1
type: int
expr: _col2
type: string
outputColumnNames: _col0, _col1, _col2
File Output Operator
compressed: false
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Stage: Stage-0
Fetch Operator
limit: -1
-- Q3.patch: whoops, now the plan is valid but suboptimal since the filter pushdown did not happen
-- (whereas it did with trunk)
hive> explain
> SELECT * FROM
> (SELECT a.foo as foo1, b.foo as foo2, a.bar
> FROM pokes a JOIN pokes2 b
> ON a.foo=b.foo) a
> WHERE a.bar=3;
OK
ABSTRACT SYNTAX TREE:
(TOK_QUERY (TOK_FROM (TOK_SUBQUERY (TOK_QUERY (TOK_FROM (TOK_JOIN (TOK_TABREF pokes a) (TOK_TABREF pokes2 b) (= (. (TOK_TABLE_OR_COL a) foo) (. (TOK_TABLE_OR_COL b) foo)))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) foo) foo1) (TOK_SELEXPR (. (TOK_TABLE_OR_COL b) foo) foo2) (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) bar))))) a)) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR TOK_ALLCOLREF)) (TOK_WHERE (= (. (TOK_TABLE_OR_COL a) bar) 3))))
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
a:a
TableScan
alias: a
Reduce Output Operator
key expressions:
expr: foo
type: int
sort order: +
Map-reduce partition columns:
expr: foo
type: int
tag: 0
value expressions:
expr: foo
type: int
expr: bar
type: string
a:b
TableScan
alias: b
Reduce Output Operator
key expressions:
expr: foo
type: int
sort order: +
Map-reduce partition columns:
expr: foo
type: int
tag: 1
value expressions:
expr: foo
type: int
Reduce Operator Tree:
Join Operator
condition map:
Inner Join 0 to 1
condition expressions:
0 {VALUE._col0} {VALUE._col1}
1 {VALUE._col0}
handleSkewJoin: false
outputColumnNames: _col0, _col1, _col4
Select Operator
expressions:
expr: _col0
type: int
expr: _col4
type: int
expr: _col1
type: string
outputColumnNames: _col0, _col1, _col2
Filter Operator
predicate:
expr: (_col2 = 3)
type: boolean
Select Operator
expressions:
expr: _col0
type: int
expr: _col1
type: int
expr: _col2
type: string
outputColumnNames: _col0, _col1, _col2
File Output Operator
compressed: false
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Stage: Stage-0
Fetch Operator
limit: -1
{noformat}
So, we need a patch which takes care of Q2 while not causing a plan optimality regression for Q3.
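The difference between the Q2 and Q3 shapes can be reproduced outside Hive. The following is a minimal sketch, not Hive code: it uses Python's built-in sqlite3 module as a stand-in engine, the rows in pokes and pokes2 are invented, and bar is declared as an integer for simplicity. It compares filtering after the join with filtering one input before the join, in isolation, and does not attempt to model the Hive plans shown above.

```python
import sqlite3

# Stand-in data; the rows are invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE pokes  (foo INTEGER, bar INTEGER);
    CREATE TABLE pokes2 (foo INTEGER, bar INTEGER);
    INSERT INTO pokes  VALUES (1, 3), (2, 3), (3, 5);
    INSERT INTO pokes2 VALUES (1, 3), (2, 7);
""")


def rows(sql):
    """Run a query and return all result rows as a list of tuples."""
    return conn.execute(sql).fetchall()


# Q2 shape: the predicate refers to b, the null-generating side of the
# left outer join. First with the filter applied after the join ...
outer_after = rows("""
    SELECT a.foo, b.foo, b.bar
    FROM pokes a LEFT OUTER JOIN pokes2 b ON a.foo = b.foo
    WHERE b.bar = 3
    ORDER BY a.foo""")
# ... then with the filter applied to b before the join instead.
outer_pushed = rows("""
    SELECT a.foo, b.foo, b.bar
    FROM pokes a LEFT OUTER JOIN (SELECT * FROM pokes2 WHERE bar = 3) b
      ON a.foo = b.foo
    ORDER BY a.foo""")

# Q3 shape: inner join, predicate on a. Same two placements.
inner_after = rows("""
    SELECT a.foo, b.foo, a.bar
    FROM pokes a JOIN pokes2 b ON a.foo = b.foo
    WHERE a.bar = 3
    ORDER BY a.foo""")
inner_pushed = rows("""
    SELECT a.foo, b.foo, a.bar
    FROM (SELECT * FROM pokes WHERE bar = 3) a JOIN pokes2 b
      ON a.foo = b.foo
    ORDER BY a.foo""")

print(outer_after)                  # filter after the outer join
print(outer_pushed)                 # filter before the outer join
print(inner_after == inner_pushed)  # inner join: placement does not matter
```

With this data the pre-join filter on the outer join yields extra NULL-extended rows that the post-join filter would have rejected, while the two inner-join variants return the same rows.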
> Predicate push down get error result when sub-queries have the same alias name
> -------------------------------------------------------------------------------
>
> Key: HIVE-1342
> URL: https://issues.apache.org/jira/browse/HIVE-1342
> Project: Hadoop Hive
> Issue Type: Bug
> Components: Query Processor
> Affects Versions: 0.6.0
> Reporter: Ted Xu
> Assignee: Ted Xu
> Priority: Critical
> Fix For: 0.7.0
>
> Attachments: cmd.hql, explain, ppd_same_alias_1.patch, ppd_same_alias_2.patch
[jira] Updated: (HIVE-1342) Predicate push down get error result
when sub-queries have the same alias name
Posted by "John Sichi (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HIVE-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
John Sichi updated HIVE-1342:
-----------------------------
Status: Open (was: Patch Available)
Assignee: Ted Xu
Hi Ted,
I don't think this patch is general enough. To really fix the problem, it will be necessary to dig into it deeper and find out where we are currently using unscoped aliases (e.g. t2) where we should instead be using scoped aliases (e.g. t_o:t1:t2:dm_fact_buyer_prd_info_d). I suspect that if you get to the root of the problem, the fix will take care of HIVE-1395 too.
Also, when submitting a patch, please run diff from the hive trunk directory (not from a subdirectory such as ql).
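The unscoped versus scoped alias distinction drawn above can be sketched as follows. This is a hypothetical Python model, not Hive's optimizer code: the entry list, the function names, and the use of a dictionary are assumptions made for illustration. Only the alias names and the colon-separated scoped form come from the reported query and its plan.

```python
# Hypothetical model of the subqueries in the reported query.
# Each entry is (enclosing scope path, alias, output columns).
entries = [
    (("t_o",), "t1", ["category_id1", "category_id2", "user_id"]),  # join result
    (("t_o", "t1"), "t1", ["category_id", "user_id"]),              # inner group-by
    (("t_o", "t1"), "t2", ["category_id", "user_id"]),              # inner group-by
]


def build_unscoped(items):
    """Key by bare alias; a later 't1' silently overwrites an earlier one."""
    table = {}
    for _scope, alias, cols in items:
        table[alias] = cols
    return table


def build_scoped(items):
    """Key by the full path, e.g. 't_o:t1:t2', so nested aliases stay distinct."""
    return {":".join(scope + (alias,)): cols for scope, alias, cols in items}


unscoped = build_unscoped(entries)
scoped = build_scoped(entries)

print(sorted(unscoped))  # only two keys survive for three subqueries
print(sorted(scoped))    # three distinct keys
```

In the unscoped table, a lookup of t1 returns the columns of the inner subquery even when the outer join result was meant, which is the kind of ambiguity the comment describes.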
> Predicate push down get error result when sub-queries have the same alias name
> -------------------------------------------------------------------------------
>
> Key: HIVE-1342
> URL: https://issues.apache.org/jira/browse/HIVE-1342
> Project: Hadoop Hive
> Issue Type: Bug
> Components: Query Processor
> Affects Versions: 0.6.0
> Reporter: Ted Xu
> Assignee: Ted Xu
> Priority: Critical
> Fix For: 0.6.0
>
> Attachments: cmd.hql, explain, ppd_same_alias_1.patch
[jira] Commented: (HIVE-1342) Predicate push down get error result
when sub-queries have the same alias name
Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HIVE-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882203#action_12882203 ]
He Yongqiang commented on HIVE-1342:
------------------------------------
It may be better to throw an error when we see a duplicate alias name.
> Predicate push down get error result when sub-queries have the same alias name
> -------------------------------------------------------------------------------
>
> Key: HIVE-1342
> URL: https://issues.apache.org/jira/browse/HIVE-1342
> Project: Hadoop Hive
> Issue Type: Bug
> Components: Query Processor
> Affects Versions: 0.6.0
> Reporter: Ted Xu
> Priority: Critical
> Fix For: 0.6.0
>
> Attachments: cmd.hql, explain, ppd_same_alias_1.patch
[jira] Commented: (HIVE-1342) Predicate push down get error result
when sub-queries have the same alias name
Posted by "Ted Xu (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HIVE-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12885101#action_12885101 ]
Ted Xu commented on HIVE-1342:
------------------------------
The patch does not simply disable PPD when it encounters the special case (nested select over join). It prevents replicated table resolution.
I tried the query above and it seems fine with the patch; that is, the predicate can be pushed into the subquery. The explain result is shown below:
{code}
STAGE PLANS:
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
z:a
TableScan
alias: a
Reduce Output Operator
key expressions:
expr: foo
type: string
sort order: +
Map-reduce partition columns:
expr: foo
type: string
tag: 0
value expressions:
expr: foo
type: string
expr: bar
type: string
z:b
TableScan
alias: b
Filter Operator
predicate:
expr: (UDFToDouble(foo) = UDFToDouble(3))
type: boolean
Reduce Output Operator
key expressions:
expr: foo
type: string
sort order: +
Map-reduce partition columns:
expr: foo
type: string
tag: 1
value expressions:
expr: foo
type: string
Reduce Operator Tree:
Join Operator
condition map:
Left Outer Join0 to 1
condition expressions:
0 {VALUE._col0} {VALUE._col1}
1 {VALUE._col0}
outputColumnNames: _col0, _col1, _col2
Select Operator
expressions:
expr: _col0
type: string
expr: _col2
type: string
expr: _col1
type: string
outputColumnNames: _col0, _col1, _col2
Filter Operator
predicate:
expr: (UDFToDouble(_col2) = UDFToDouble(3))
type: boolean
Select Operator
expressions:
expr: _col0
type: string
expr: _col1
type: string
expr: _col2
type: string
outputColumnNames: _col0, _col1, _col2
File Output Operator
compressed: false
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Stage: Stage-0
Fetch Operator
limit: -1
{code}
I think the trunk version cannot push the predicate into the subquery because it performs a replicated table resolve and therefore cannot find any table suitable for that predicate, not because PPD is purposely disabled.
> Predicate push down get error result when sub-queries have the same alias name
> -------------------------------------------------------------------------------
>
> Key: HIVE-1342
> URL: https://issues.apache.org/jira/browse/HIVE-1342
> Project: Hadoop Hive
> Issue Type: Bug
> Components: Query Processor
> Affects Versions: 0.6.0
> Reporter: Ted Xu
> Assignee: Ted Xu
> Priority: Critical
> Fix For: 0.6.0
>
> Attachments: cmd.hql, explain, ppd_same_alias_1.patch, ppd_same_alias_2.patch
[jira] Updated: (HIVE-1342) Predicate push down get error result
when sub-queries have the same alias name
Posted by "Ted Xu (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HIVE-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ted Xu updated HIVE-1342:
-------------------------
Attachment: ppd_same_alias_2.patch
Hi John,
Thank you for reviewing the patch. I have updated it to address HIVE-1395.
I dug into the code and found that the same alias in different subqueries can be ambiguous only when PPD is processing a CommonJoinOperator, so I just added a special case in PPD for CommonJoinOperator.
As you mentioned above, adding a namespace to RowResolver or OpParseContext could also fix it, but I think we had better keep their implementation simple.
> Predicate push down get error result when sub-queries have the same alias name
> -------------------------------------------------------------------------------
>
> Key: HIVE-1342
> URL: https://issues.apache.org/jira/browse/HIVE-1342
> Project: Hadoop Hive
> Issue Type: Bug
> Components: Query Processor
> Affects Versions: 0.6.0
> Reporter: Ted Xu
> Assignee: Ted Xu
> Priority: Critical
> Fix For: 0.6.0
>
> Attachments: cmd.hql, explain, ppd_same_alias_1.patch, ppd_same_alias_2.patch
[jira] Updated: (HIVE-1342) Predicate push down get error result
when sub-queries have the same alias name
Posted by "Ted Xu (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HIVE-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ted Xu updated HIVE-1342:
-------------------------
Status: Patch Available (was: Open)
Affects Version/s: 0.6.0
(was: 0.5.0)
(was: 0.4.2)
Fix Version/s: 0.6.0
> Predicate push down get error result when sub-queries have the same alias name
> -------------------------------------------------------------------------------
>
> Key: HIVE-1342
> URL: https://issues.apache.org/jira/browse/HIVE-1342
> Project: Hadoop Hive
> Issue Type: Bug
> Components: Query Processor
> Affects Versions: 0.6.0
> Reporter: Ted Xu
> Priority: Critical
> Fix For: 0.6.0
>
> Attachments: cmd.hql, explain, ppd_same_alias_1.patch
[jira] Commented: (HIVE-1342) Predicate push down get error result
when sub-queries have the same alias name
Posted by "John Sichi (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HIVE-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910339#action_12910339 ]
John Sichi commented on HIVE-1342:
----------------------------------
For easy copy/paste into CLI, here are the three queries by themselves.
{noformat}
-- Q1
explain
SELECT a.foo as foo1, b.foo as foo2, b.bar
FROM pokes a LEFT OUTER JOIN pokes2 b
ON a.foo=b.foo
WHERE b.bar=3;
-- Q2
explain
SELECT * FROM
(SELECT a.foo as foo1, b.foo as foo2, b.bar
FROM pokes a LEFT OUTER JOIN pokes2 b
ON a.foo=b.foo) a
WHERE a.bar=3;
-- Q3
explain
SELECT * FROM
(SELECT a.foo as foo1, b.foo as foo2, a.bar
FROM pokes a JOIN pokes2 b
ON a.foo=b.foo) a
WHERE a.bar=3;
{noformat}
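To run these queries, the tables pokes and pokes2 have to exist. A minimal setup sketch follows. The column types are an assumption (an int column foo and a string column bar in both tables), since the table definitions are not given here; adjust them to match the real tables.
{noformat}
-- Assumed schema for illustration only.
CREATE TABLE IF NOT EXISTS pokes (foo INT, bar STRING);
CREATE TABLE IF NOT EXISTS pokes2 (foo INT, bar STRING);
-- Predicate pushdown must be enabled for the behavior under discussion.
set hive.optimize.ppd=true;
{noformat}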
> Predicate push down get error result when sub-queries have the same alias name
> -------------------------------------------------------------------------------
>
> Key: HIVE-1342
> URL: https://issues.apache.org/jira/browse/HIVE-1342
> Project: Hadoop Hive
> Issue Type: Bug
> Components: Query Processor
> Affects Versions: 0.6.0
> Reporter: Ted Xu
> Assignee: Ted Xu
> Priority: Critical
> Fix For: 0.7.0
>
> Attachments: cmd.hql, explain, ppd_same_alias_1.patch, ppd_same_alias_2.patch
[jira] Updated: (HIVE-1342) Predicate push down get error result
when sub-queries have the same alias name
Posted by "Ted Xu (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HIVE-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ted Xu updated HIVE-1342:
-------------------------
Attachment: cmd.hql
explain
> Predicate push down get error result when sub-queries have the same alias name
> -------------------------------------------------------------------------------
>
> Key: HIVE-1342
> URL: https://issues.apache.org/jira/browse/HIVE-1342
> Project: Hadoop Hive
> Issue Type: Bug
> Components: Query Processor
> Affects Versions: 0.4.2, 0.5.0
> Reporter: Ted Xu
> Priority: Critical
> Attachments: cmd.hql, explain
[jira] Commented: (HIVE-1342) Predicate push down get error result
when sub-queries have the same alias name
Posted by "John Sichi (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HIVE-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882236#action_12882236 ]
John Sichi commented on HIVE-1342:
----------------------------------
Commentary on duplicate aliases in HIVE-1395.
> Predicate push down get error result when sub-queries have the same alias name
> -------------------------------------------------------------------------------
>
> Key: HIVE-1342
> URL: https://issues.apache.org/jira/browse/HIVE-1342
> Project: Hadoop Hive
> Issue Type: Bug
> Components: Query Processor
> Affects Versions: 0.6.0
> Reporter: Ted Xu
> Priority: Critical
> Fix For: 0.6.0
>
> Attachments: cmd.hql, explain, ppd_same_alias_1.patch
[jira] Updated: (HIVE-1342) Predicate push down get error result
when sub-queries have the same alias name
Posted by "Ted Xu (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HIVE-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ted Xu updated HIVE-1342:
-------------------------
Attachment: ppd_same_alias_1.patch
I think PPD is unnecessarily resolving table aliases when it encounters a CommonJoinOperator.
I have attached a patch that fixes it. Please review.
> Predicate push down get error result when sub-queries have the same alias name
> -------------------------------------------------------------------------------
>
> Key: HIVE-1342
> URL: https://issues.apache.org/jira/browse/HIVE-1342
> Project: Hadoop Hive
> Issue Type: Bug
> Components: Query Processor
> Affects Versions: 0.4.2, 0.5.0
> Reporter: Ted Xu
> Priority: Critical
> Attachments: cmd.hql, explain, ppd_same_alias_1.patch
[jira] Updated: (HIVE-1342) Predicate push down get error result
when sub-queries have the same alias name
Posted by "Carl Steinbach (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HIVE-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Carl Steinbach updated HIVE-1342:
---------------------------------
Fix Version/s: 0.7.0
(was: 0.6.0)
> Predicate push down get error result when sub-queries have the same alias name
> -------------------------------------------------------------------------------
>
> Key: HIVE-1342
> URL: https://issues.apache.org/jira/browse/HIVE-1342
> Project: Hadoop Hive
> Issue Type: Bug
> Components: Query Processor
> Affects Versions: 0.6.0
> Reporter: Ted Xu
> Assignee: Ted Xu
> Priority: Critical
> Fix For: 0.7.0
>
> Attachments: cmd.hql, explain, ppd_same_alias_1.patch, ppd_same_alias_2.patch
[jira] Commented: (HIVE-1342) Predicate push down get error result
when sub-queries have the same alias name
Posted by "John Sichi (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HIVE-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884796#action_12884796 ]
John Sichi commented on HIVE-1342:
----------------------------------
Hmmm....I looked into this one some more. Let me summarize what I found.
On trunk as it is today (without this patch), predicates do not (in general) get pushed down when we have a nested select over a join. For example:
{noformat}
explain
SELECT * FROM (
SELECT a.foo as foo1, b.foo as foo2, a.bar
FROM pokes a LEFT OUTER JOIN pokes2 b
ON a.foo=b.foo) z
WHERE bar=3;
...
STAGE PLANS:
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
z:a
TableScan
alias: a
Reduce Output Operator
key expressions:
expr: foo
type: int
sort order: +
Map-reduce partition columns:
expr: foo
type: int
tag: 0
value expressions:
expr: foo
type: int
expr: bar
type: string
z:b
TableScan
alias: b
Reduce Output Operator
key expressions:
expr: foo
type: int
sort order: +
Map-reduce partition columns:
expr: foo
type: int
tag: 1
value expressions:
expr: foo
type: int
Reduce Operator Tree:
Join Operator
condition map:
Left Outer Join0 to 1
condition expressions:
0 {VALUE._col0} {VALUE._col1}
1 {VALUE._col0}
handleSkewJoin: false
outputColumnNames: _col0, _col1, _col2
Select Operator
expressions:
expr: _col0
type: int
expr: _col2
type: int
expr: _col1
type: string
outputColumnNames: _col0, _col1, _col2
Filter Operator
predicate:
expr: (_col2 = 3)
type: boolean
Select Operator
expressions:
expr: _col0
type: int
expr: _col1
type: int
expr: _col2
type: string
outputColumnNames: _col0, _col1, _col2
File Output Operator
compressed: false
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Stage: Stage-0
Fetch Operator
limit: -1
{noformat}
However, it does kick in (sometimes correctly, sometimes incorrectly) in the special case where aliases are reused. For example, it happens to work correctly for a query like this:
{noformat}
explain
SELECT * FROM (
SELECT a.foo as foo1, b.foo as foo2, a.bar
FROM pokes a LEFT OUTER JOIN pokes2 b
ON a.foo=b.foo) a
WHERE a.bar=3;
{noformat}
But in cases like the original ones in the bug reports, it gets applied incorrectly.
Ted's patch attempts to limit the damage by uniformly preventing the optimization from applying for the pattern of nested select over join (regardless of whether aliases have been reused).
If this is the best we can do for 0.6, then we'll have to live with that and then open another issue for correcting the real problem so that we can get full optimization (particularly for views).
I don't think it's a question of keeping the implementation simple; the patch as is does not fix the optimization, it just disables it.
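Until the real problem is corrected, the observations in the issue description suggest two possible workarounds: give the join result an alias that differs from the aliases used inside it, or turn predicate pushdown off for the session. Both are sketched below using the query from the description. The alias name t_join is made up for illustration, and neither variant has been verified against a particular build.
{noformat}
-- Workaround 1: rename the join-result alias (originally t1) so that it
-- no longer collides with the inner subquery alias t1.
select category_id1, category_id2, assoc_idx
from (
  select category_id1
       , category_id2
       , count(distinct user_id) as assoc_idx
  from (
    select t1.category_id as category_id1
         , t2.category_id as category_id2
         , t1.user_id
    from (
      select category_id, user_id
      from dm_fact_buyer_prd_info_d
      group by category_id, user_id ) t1
    join (
      select category_id, user_id
      from dm_fact_buyer_prd_info_d
      group by category_id, user_id ) t2 on t1.user_id = t2.user_id
  ) t_join
  group by category_id1, category_id2 ) t_o
where category_id1 <> category_id2
  and assoc_idx > 2;

-- Workaround 2: keep the original aliases and disable predicate pushdown.
set hive.optimize.ppd=false;
{noformat}
The first variant keeps the optimizer enabled for the rest of the session; the second trades the optimization away for every query until the setting is restored.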
> Predicate push down get error result when sub-queries have the same alias name
> -------------------------------------------------------------------------------
>
> Key: HIVE-1342
> URL: https://issues.apache.org/jira/browse/HIVE-1342
> Project: Hadoop Hive
> Issue Type: Bug
> Components: Query Processor
> Affects Versions: 0.6.0
> Reporter: Ted Xu
> Assignee: Ted Xu
> Priority: Critical
> Fix For: 0.6.0
>
> Attachments: cmd.hql, explain, ppd_same_alias_1.patch, ppd_same_alias_2.patch