You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@hive.apache.org by "John Sichi (JIRA)" <ji...@apache.org> on 2010/09/16 23:49:33 UTC
[jira] Commented: (HIVE-1342) Predicate push down get error result
when sub-queries have the same alias name
[ https://issues.apache.org/jira/browse/HIVE-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910336#action_12910336 ]
John Sichi commented on HIVE-1342:
----------------------------------
Finallly got back to this one. Let me provide some specific examples to better explain what I wrote.
First, latest trunk without any patch.
{noformat}
-- Q1.trunk: Without a nested select, the plan is correct for this query.
-- (we're not allowed to push filter down into null-generating side of outer join)
hive> explain
> SELECT a.foo as foo1, b.foo as foo2, b.bar
> FROM pokes a LEFT OUTER JOIN pokes2 b
> ON a.foo=b.foo
> WHERE b.bar=3;
OK
ABSTRACT SYNTAX TREE:
(TOK_QUERY (TOK_FROM (TOK_LEFTOUTERJOIN (TOK_TABREF pokes a) (TOK_TABREF pokes2 b) (= (. (TOK_TABLE_OR_COL a) foo) (. (TOK_TABLE_OR_COL b) foo)))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) foo) foo1) (TOK_SELEXPR (. (TOK_TABLE_OR_COL b) foo) foo2) (TOK_SELEXPR (. (TOK_TABLE_OR_COL b) bar))) (TOK_WHERE (= (. (TOK_TABLE_OR_COL b) bar) 3))))
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
a
TableScan
alias: a
Reduce Output Operator
key expressions:
expr: foo
type: int
sort order: +
Map-reduce partition columns:
expr: foo
type: int
tag: 0
value expressions:
expr: foo
type: int
b
TableScan
alias: b
Reduce Output Operator
key expressions:
expr: foo
type: int
sort order: +
Map-reduce partition columns:
expr: foo
type: int
tag: 1
value expressions:
expr: foo
type: int
expr: bar
type: string
Reduce Operator Tree:
Join Operator
condition map:
Left Outer Join0 to 1
condition expressions:
0 {VALUE._col0}
1 {VALUE._col0} {VALUE._col1}
handleSkewJoin: false
outputColumnNames: _col0, _col4, _col5
Filter Operator
predicate:
expr: (_col5 = 3)
type: boolean
Select Operator
expressions:
expr: _col0
type: int
expr: _col4
type: int
expr: _col5
type: string
outputColumnNames: _col0, _col1, _col2
File Output Operator
compressed: false
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Stage: Stage-0
Fetch Operator
limit: -1
-- Q2.trunk: For this equivalent query written using a nested select, the plan is incorrect.
-- (filter got pushed down when it shouldn't; note that in the wrapping select, a.bar should resolve to b.bar in the nested select)
hive> explain
> SELECT * FROM
> (SELECT a.foo as foo1, b.foo as foo2, b.bar
> FROM pokes a LEFT OUTER JOIN pokes2 b
> ON a.foo=b.foo) a
> WHERE a.bar=3;
OK
ABSTRACT SYNTAX TREE:
(TOK_QUERY (TOK_FROM (TOK_SUBQUERY (TOK_QUERY (TOK_FROM (TOK_LEFTOUTERJOIN (TOK_TABREF pokes a) (TOK_TABREF pokes2 b) (= (. (TOK_TABLE_OR_COL a) foo) (. (TOK_TABLE_OR_COL b) foo)))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) foo) foo1) (TOK_SELEXPR (. (TOK_TABLE_OR_COL b) foo) foo2) (TOK_SELEXPR (. (TOK_TABLE_OR_COL b) bar))))) a)) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR TOK_ALLCOLREF)) (TOK_WHERE (= (. (TOK_TABLE_OR_COL a) bar) 3))))
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
a:a
TableScan
alias: a
Reduce Output Operator
key expressions:
expr: foo
type: int
sort order: +
Map-reduce partition columns:
expr: foo
type: int
tag: 0
value expressions:
expr: foo
type: int
a:b
TableScan
alias: b
Filter Operator
predicate:
expr: (bar = 3)
type: boolean
Reduce Output Operator
key expressions:
expr: foo
type: int
sort order: +
Map-reduce partition columns:
expr: foo
type: int
tag: 1
value expressions:
expr: foo
type: int
expr: bar
type: string
Reduce Operator Tree:
Join Operator
condition map:
Left Outer Join0 to 1
condition expressions:
0 {VALUE._col0}
1 {VALUE._col0} {VALUE._col1}
handleSkewJoin: false
outputColumnNames: _col0, _col4, _col5
Select Operator
expressions:
expr: _col0
type: int
expr: _col4
type: int
expr: _col5
type: string
outputColumnNames: _col0, _col1, _col2
Filter Operator
predicate:
expr: (_col2 = 3)
type: boolean
Select Operator
expressions:
expr: _col0
type: int
expr: _col1
type: int
expr: _col2
type: string
outputColumnNames: _col0, _col1, _col2
File Output Operator
compressed: false
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Stage: Stage-0
Fetch Operator
limit: -1
-- Q3.trunk: However, for this semantically different case, the plan is correct.
-- (we're allowed to push the filter down for an inner join)
hive>
> explain
> SELECT * FROM
> (SELECT a.foo as foo1, b.foo as foo2, a.bar
> FROM pokes a JOIN pokes2 b
> ON a.foo=b.foo) a
> WHERE a.bar=3;
OK
ABSTRACT SYNTAX TREE:
(TOK_QUERY (TOK_FROM (TOK_SUBQUERY (TOK_QUERY (TOK_FROM (TOK_JOIN (TOK_TABREF pokes a) (TOK_TABREF pokes2 b) (= (. (TOK_TABLE_OR_COL a) foo) (. (TOK_TABLE_OR_COL b) foo)))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) foo) foo1) (TOK_SELEXPR (. (TOK_TABLE_OR_COL b) foo) foo2) (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) bar))))) a)) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR TOK_ALLCOLREF)) (TOK_WHERE (= (. (TOK_TABLE_OR_COL a) bar) 3))))
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
a:a
TableScan
alias: a
Filter Operator
predicate:
expr: (bar = 3)
type: boolean
Reduce Output Operator
key expressions:
expr: foo
type: int
sort order: +
Map-reduce partition columns:
expr: foo
type: int
tag: 0
value expressions:
expr: foo
type: int
expr: bar
type: string
a:b
TableScan
alias: b
Reduce Output Operator
key expressions:
expr: foo
type: int
sort order: +
Map-reduce partition columns:
expr: foo
type: int
tag: 1
value expressions:
expr: foo
type: int
Reduce Operator Tree:
Join Operator
condition map:
Inner Join 0 to 1
condition expressions:
0 {VALUE._col0} {VALUE._col1}
1 {VALUE._col0}
handleSkewJoin: false
outputColumnNames: _col0, _col1, _col4
Select Operator
expressions:
expr: _col0
type: int
expr: _col4
type: int
expr: _col1
type: string
outputColumnNames: _col0, _col1, _col2
Filter Operator
predicate:
expr: (_col2 = 3)
type: boolean
Select Operator
expressions:
expr: _col0
type: int
expr: _col1
type: int
expr: _col2
type: string
outputColumnNames: _col0, _col1, _col2
File Output Operator
compressed: false
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Stage: Stage-0
Fetch Operator
limit: -1
{noformat}
Now, repeating Q1/Q2/Q3 with the patch:
{noformat}
-- Q1.patch: this plan is good
-- (same result as Q1.trunk, as expected)
hive> explain
> SELECT a.foo as foo1, b.foo as foo2, b.bar
> FROM pokes a LEFT OUTER JOIN pokes2 b
> ON a.foo=b.foo
> WHERE b.bar=3;
OK
ABSTRACT SYNTAX TREE:
(TOK_QUERY (TOK_FROM (TOK_LEFTOUTERJOIN (TOK_TABREF pokes a) (TOK_TABREF pokes2 b) (= (. (TOK_TABLE_OR_COL a) foo) (. (TOK_TABLE_OR_COL b) foo)))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) foo) foo1) (TOK_SELEXPR (. (TOK_TABLE_OR_COL b) foo) foo2) (TOK_SELEXPR (. (TOK_TABLE_OR_COL b) bar))) (TOK_WHERE (= (. (TOK_TABLE_OR_COL b) bar) 3))))
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
a
TableScan
alias: a
Reduce Output Operator
key expressions:
expr: foo
type: int
sort order: +
Map-reduce partition columns:
expr: foo
type: int
tag: 0
value expressions:
expr: foo
type: int
b
TableScan
alias: b
Reduce Output Operator
key expressions:
expr: foo
type: int
sort order: +
Map-reduce partition columns:
expr: foo
type: int
tag: 1
value expressions:
expr: foo
type: int
expr: bar
type: string
Reduce Operator Tree:
Join Operator
condition map:
Left Outer Join0 to 1
condition expressions:
0 {VALUE._col0}
1 {VALUE._col0} {VALUE._col1}
handleSkewJoin: false
outputColumnNames: _col0, _col4, _col5
Filter Operator
predicate:
expr: (_col5 = 3)
type: boolean
Select Operator
expressions:
expr: _col0
type: int
expr: _col4
type: int
expr: _col5
type: string
outputColumnNames: _col0, _col1, _col2
File Output Operator
compressed: false
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Stage: Stage-0
Fetch Operator
limit: -1
-- Q2.patch: this time, the plan is good (no pushdown)
-- (the patch fixes the bug exhibited on trunk)
hive> explain
> SELECT * FROM
> (SELECT a.foo as foo1, b.foo as foo2, b.bar
> FROM pokes a LEFT OUTER JOIN pokes2 b
> ON a.foo=b.foo) a
> WHERE a.bar=3;
OK
ABSTRACT SYNTAX TREE:
(TOK_QUERY (TOK_FROM (TOK_SUBQUERY (TOK_QUERY (TOK_FROM (TOK_LEFTOUTERJOIN (TOK_TABREF pokes a) (TOK_TABREF pokes2 b) (= (. (TOK_TABLE_OR_COL a) foo) (. (TOK_TABLE_OR_COL b) foo)))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) foo) foo1) (TOK_SELEXPR (. (TOK_TABLE_OR_COL b) foo) foo2) (TOK_SELEXPR (. (TOK_TABLE_OR_COL b) bar))))) a)) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR TOK_ALLCOLREF)) (TOK_WHERE (= (. (TOK_TABLE_OR_COL a) bar) 3))))
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
a:a
TableScan
alias: a
Reduce Output Operator
key expressions:
expr: foo
type: int
sort order: +
Map-reduce partition columns:
expr: foo
type: int
tag: 0
value expressions:
expr: foo
type: int
a:b
TableScan
alias: b
Reduce Output Operator
key expressions:
expr: foo
type: int
sort order: +
Map-reduce partition columns:
expr: foo
type: int
tag: 1
value expressions:
expr: foo
type: int
expr: bar
type: string
Reduce Operator Tree:
Join Operator
condition map:
Left Outer Join0 to 1
condition expressions:
0 {VALUE._col0}
1 {VALUE._col0} {VALUE._col1}
handleSkewJoin: false
outputColumnNames: _col0, _col4, _col5
Select Operator
expressions:
expr: _col0
type: int
expr: _col4
type: int
expr: _col5
type: string
outputColumnNames: _col0, _col1, _col2
Filter Operator
predicate:
expr: (_col2 = 3)
type: boolean
Select Operator
expressions:
expr: _col0
type: int
expr: _col1
type: int
expr: _col2
type: string
outputColumnNames: _col0, _col1, _col2
File Output Operator
compressed: false
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Stage: Stage-0
Fetch Operator
limit: -1
-- Q3.patch: whoops, now the plan is valid but suboptimal since the filter pushdown did not happen
-- (whereas it did with trunk)
hive> explain
> SELECT * FROM
> (SELECT a.foo as foo1, b.foo as foo2, a.bar
> FROM pokes a JOIN pokes2 b
> ON a.foo=b.foo) a
> WHERE a.bar=3;
OK
ABSTRACT SYNTAX TREE:
(TOK_QUERY (TOK_FROM (TOK_SUBQUERY (TOK_QUERY (TOK_FROM (TOK_JOIN (TOK_TABREF pokes a) (TOK_TABREF pokes2 b) (= (. (TOK_TABLE_OR_COL a) foo) (. (TOK_TABLE_OR_COL b) foo)))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) foo) foo1) (TOK_SELEXPR (. (TOK_TABLE_OR_COL b) foo) foo2) (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) bar))))) a)) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR TOK_ALLCOLREF)) (TOK_WHERE (= (. (TOK_TABLE_OR_COL a) bar) 3))))
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
a:a
TableScan
alias: a
Reduce Output Operator
key expressions:
expr: foo
type: int
sort order: +
Map-reduce partition columns:
expr: foo
type: int
tag: 0
value expressions:
expr: foo
type: int
expr: bar
type: string
a:b
TableScan
alias: b
Reduce Output Operator
key expressions:
expr: foo
type: int
sort order: +
Map-reduce partition columns:
expr: foo
type: int
tag: 1
value expressions:
expr: foo
type: int
Reduce Operator Tree:
Join Operator
condition map:
Inner Join 0 to 1
condition expressions:
0 {VALUE._col0} {VALUE._col1}
1 {VALUE._col0}
handleSkewJoin: false
outputColumnNames: _col0, _col1, _col4
Select Operator
expressions:
expr: _col0
type: int
expr: _col4
type: int
expr: _col1
type: string
outputColumnNames: _col0, _col1, _col2
Filter Operator
predicate:
expr: (_col2 = 3)
type: boolean
Select Operator
expressions:
expr: _col0
type: int
expr: _col1
type: int
expr: _col2
type: string
outputColumnNames: _col0, _col1, _col2
File Output Operator
compressed: false
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Stage: Stage-0
Fetch Operator
limit: -1
{noformat}
So, we need a patch which takes care of Q2 while not causing a plan optimality regression for Q3.
> Predicate push down get error result when sub-queries have the same alias name
> -------------------------------------------------------------------------------
>
> Key: HIVE-1342
> URL: https://issues.apache.org/jira/browse/HIVE-1342
> Project: Hadoop Hive
> Issue Type: Bug
> Components: Query Processor
> Affects Versions: 0.6.0
> Reporter: Ted Xu
> Assignee: Ted Xu
> Priority: Critical
> Fix For: 0.7.0
>
> Attachments: cmd.hql, explain, ppd_same_alias_1.patch, ppd_same_alias_2.patch
>
>
> Query is over-optimized by PPD when sub-queries have the same alias name, see the query:
> -------------------------------
> create table if not exists dm_fact_buyer_prd_info_d (
> category_id string
> ,gmv_trade_num int
> ,user_id int
> )
> PARTITIONED BY (ds int);
> set hive.optimize.ppd=true;
> set hive.map.aggr=true;
> explain select category_id1,category_id2,assoc_idx
> from (
> select
> category_id1
> , category_id2
> , count(distinct user_id) as assoc_idx
> from (
> select
> t1.category_id as category_id1
> , t2.category_id as category_id2
> , t1.user_id
> from (
> select category_id, user_id
> from dm_fact_buyer_prd_info_d
> group by category_id, user_id ) t1
> join (
> select category_id, user_id
> from dm_fact_buyer_prd_info_d
> group by category_id, user_id ) t2 on t1.user_id=t2.user_id
> ) t1
> group by category_id1, category_id2 ) t_o
> where category_id1 <> category_id2
> and assoc_idx > 2;
> -----------------------------
> The query above will fail when execute, throwing exception: "can not cast UDFOpNotEqual(Text, IntWritable) to UDFOpNotEqual(Text, Text)".
> I explained the query and the execute plan looks really wired ( only Stage-1, see the highlighted predicate):
> -------------------------------
> Stage: Stage-1
> Map Reduce
> Alias -> Map Operator Tree:
> t_o:t1:t1:dm_fact_buyer_prd_info_d
> TableScan
> alias: dm_fact_buyer_prd_info_d
> Filter Operator
> predicate:
> expr: *(category_id <> user_id)*
> type: boolean
> Select Operator
> expressions:
> expr: category_id
> type: string
> expr: user_id
> type: bigint
> outputColumnNames: category_id, user_id
> Group By Operator
> keys:
> expr: category_id
> type: string
> expr: user_id
> type: bigint
> mode: hash
> outputColumnNames: _col0, _col1
> Reduce Output Operator
> key expressions:
> expr: _col0
> type: string
> expr: _col1
> type: bigint
> sort order: ++
> Map-reduce partition columns:
> expr: _col0
> type: string
> expr: _col1
> type: bigint
> tag: -1
> Reduce Operator Tree:
> Group By Operator
> keys:
> expr: KEY._col0
> type: string
> expr: KEY._col1
> type: bigint
> mode: mergepartial
> outputColumnNames: _col0, _col1
> Select Operator
> expressions:
> expr: _col0
> type: string
> expr: _col1
> type: bigint
> outputColumnNames: _col0, _col1
> File Output Operator
> compressed: true
> GlobalTableId: 0
> table:
> input format: org.apache.hadoop.mapred.SequenceFileInputFormat
> output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
> ----------------------------------
> If disabling predicate push down (set hive.optimize.ppd=true), the error is gone; I tried disabling map side aggregate, the error is gone,too.
> *Changing the alias of subquery 't1' (either the inner one or the join result), the bug disappears, too.*
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.