You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@hive.apache.org by "Ted Xu (JIRA)" <ji...@apache.org> on 2010/05/12 10:42:41 UTC

[jira] Created: (HIVE-1342) Predicate push down get error result when sub-queries have the same alias name

Predicate push down get error result when sub-queries have the same alias name 
-------------------------------------------------------------------------------

                 Key: HIVE-1342
                 URL: https://issues.apache.org/jira/browse/HIVE-1342
             Project: Hadoop Hive
          Issue Type: Bug
          Components: Query Processor
    Affects Versions: 0.5.0, 0.4.2
            Reporter: Ted Xu
            Priority: Critical


Query is over-optimized by PPD when sub-queries have the same alias name, see the query:

-------------------------------
create table if not exists dm_fact_buyer_prd_info_d (
		category_id string
		,gmv_trade_num  int
		,user_id    int
		)
PARTITIONED BY (ds int);

set hive.optimize.ppd=true;
set hive.map.aggr=true;

explain select category_id1,category_id2,assoc_idx
from (
		select 
			category_id1
			, category_id2
			, count(distinct user_id) as assoc_idx
		from (
			select 
				t1.category_id as category_id1
				, t2.category_id as category_id2
				, t1.user_id
			from (
				select category_id, user_id
				from dm_fact_buyer_prd_info_d
				group by category_id, user_id ) t1
			join (
				select category_id, user_id
				from dm_fact_buyer_prd_info_d
				group by category_id, user_id ) t2 on t1.user_id=t2.user_id 
			) t1
			group by category_id1, category_id2 ) t_o
			where category_id1 <> category_id2
			and assoc_idx > 2;

-----------------------------
The query above will fail when execute, throwing exception: "can not cast UDFOpNotEqual(Text, IntWritable) to UDFOpNotEqual(Text, Text)". 

I explained the query and the execute plan looks really wired ( only Stage-1, see the highlighted predicate):

-------------------------------
Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        t_o:t1:t1:dm_fact_buyer_prd_info_d 
          TableScan
            alias: dm_fact_buyer_prd_info_d
            Filter Operator
              predicate:
                  expr: *(category_id <> user_id)*
                  type: boolean
              Select Operator
                expressions:
                      expr: category_id
                      type: string
                      expr: user_id
                      type: bigint
                outputColumnNames: category_id, user_id
                Group By Operator
                  keys:
                        expr: category_id
                        type: string
                        expr: user_id
                        type: bigint
                  mode: hash
                  outputColumnNames: _col0, _col1
                  Reduce Output Operator
                    key expressions:
                          expr: _col0
                          type: string
                          expr: _col1
                          type: bigint
                    sort order: ++
                    Map-reduce partition columns:
                          expr: _col0
                          type: string
                          expr: _col1
                          type: bigint
                    tag: -1
      Reduce Operator Tree:
        Group By Operator
          keys:
                expr: KEY._col0
                type: string
                expr: KEY._col1
                type: bigint
          mode: mergepartial
          outputColumnNames: _col0, _col1
          Select Operator
            expressions:
                  expr: _col0
                  type: string
                  expr: _col1
                  type: bigint
            outputColumnNames: _col0, _col1
            File Output Operator
              compressed: true
              GlobalTableId: 0
              table:
                  input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
 ----------------------------------

If disabling predicate push down (set hive.optimize.ppd=true), the error is gone; I tried disabling map side aggregate, the error is gone,too. 

*Changing the alias of subquery 't1' (either the inner one or the join result), the bug disappears, too.*


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-1342) Predicate push down get error result when sub-queries have the same alias name

Posted by "John Sichi (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910336#action_12910336 ] 

John Sichi commented on HIVE-1342:
----------------------------------

Finallly got back to this one.  Let me provide some specific examples to better explain what I wrote.

First, latest trunk without any patch.

{noformat}
-- Q1.trunk:  Without a nested select, the plan is correct for this query.
-- (we're not allowed to push filter down into null-generating side of outer join)
hive> explain
    > SELECT a.foo as foo1, b.foo as foo2, b.bar
    > FROM pokes a LEFT OUTER JOIN pokes2 b
    > ON a.foo=b.foo
    > WHERE b.bar=3;
OK
ABSTRACT SYNTAX TREE:
  (TOK_QUERY (TOK_FROM (TOK_LEFTOUTERJOIN (TOK_TABREF pokes a) (TOK_TABREF pokes2 b) (= (. (TOK_TABLE_OR_COL a) foo) (. (TOK_TABLE_OR_COL b) foo)))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) foo) foo1) (TOK_SELEXPR (. (TOK_TABLE_OR_COL b) foo) foo2) (TOK_SELEXPR (. (TOK_TABLE_OR_COL b) bar))) (TOK_WHERE (= (. (TOK_TABLE_OR_COL b) bar) 3))))

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        a 
          TableScan
            alias: a
            Reduce Output Operator
              key expressions:
                    expr: foo
                    type: int
              sort order: +
              Map-reduce partition columns:
                    expr: foo
                    type: int
              tag: 0
              value expressions:
                    expr: foo
                    type: int
        b 
          TableScan
            alias: b
            Reduce Output Operator
              key expressions:
                    expr: foo
                    type: int
              sort order: +
              Map-reduce partition columns:
                    expr: foo
                    type: int
              tag: 1
              value expressions:
                    expr: foo
                    type: int
                    expr: bar
                    type: string
      Reduce Operator Tree:
        Join Operator
          condition map:
               Left Outer Join0 to 1
          condition expressions:
            0 {VALUE._col0}
            1 {VALUE._col0} {VALUE._col1}
          handleSkewJoin: false
          outputColumnNames: _col0, _col4, _col5
          Filter Operator
            predicate:
                expr: (_col5 = 3)
                type: boolean
            Select Operator
              expressions:
                    expr: _col0
                    type: int
                    expr: _col4
                    type: int
                    expr: _col5
                    type: string
              outputColumnNames: _col0, _col1, _col2
              File Output Operator
                compressed: false
                GlobalTableId: 0
                table:
                    input format: org.apache.hadoop.mapred.TextInputFormat
                    output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  Stage: Stage-0
    Fetch Operator
      limit: -1

-- Q2.trunk:  For this equivalent query written using a nested select, the plan is incorrect.
-- (filter got pushed down when it shouldn't; note that in the wrapping select, a.bar should resolve to b.bar in the nested select)
hive> explain
    > SELECT * FROM
    >     (SELECT a.foo as foo1, b.foo as foo2, b.bar
    >     FROM pokes a LEFT OUTER JOIN pokes2 b
    >     ON a.foo=b.foo) a
    > WHERE a.bar=3;
OK
ABSTRACT SYNTAX TREE:
  (TOK_QUERY (TOK_FROM (TOK_SUBQUERY (TOK_QUERY (TOK_FROM (TOK_LEFTOUTERJOIN (TOK_TABREF pokes a) (TOK_TABREF pokes2 b) (= (. (TOK_TABLE_OR_COL a) foo) (. (TOK_TABLE_OR_COL b) foo)))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) foo) foo1) (TOK_SELEXPR (. (TOK_TABLE_OR_COL b) foo) foo2) (TOK_SELEXPR (. (TOK_TABLE_OR_COL b) bar))))) a)) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR TOK_ALLCOLREF)) (TOK_WHERE (= (. (TOK_TABLE_OR_COL a) bar) 3))))

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        a:a 
          TableScan
            alias: a
            Reduce Output Operator
              key expressions:
                    expr: foo
                    type: int
              sort order: +
              Map-reduce partition columns:
                    expr: foo
                    type: int
              tag: 0
              value expressions:
                    expr: foo
                    type: int
        a:b 
          TableScan
            alias: b
            Filter Operator
              predicate:
                  expr: (bar = 3)
                  type: boolean
              Reduce Output Operator
                key expressions:
                      expr: foo
                      type: int
                sort order: +
                Map-reduce partition columns:
                      expr: foo
                      type: int
                tag: 1
                value expressions:
                      expr: foo
                      type: int
                      expr: bar
                      type: string
      Reduce Operator Tree:
        Join Operator
          condition map:
               Left Outer Join0 to 1
          condition expressions:
            0 {VALUE._col0}
            1 {VALUE._col0} {VALUE._col1}
          handleSkewJoin: false
          outputColumnNames: _col0, _col4, _col5
          Select Operator
            expressions:
                  expr: _col0
                  type: int
                  expr: _col4
                  type: int
                  expr: _col5
                  type: string
            outputColumnNames: _col0, _col1, _col2
            Filter Operator
              predicate:
                  expr: (_col2 = 3)
                  type: boolean
              Select Operator
                expressions:
                      expr: _col0
                      type: int
                      expr: _col1
                      type: int
                      expr: _col2
                      type: string
                outputColumnNames: _col0, _col1, _col2
                File Output Operator
                  compressed: false
                  GlobalTableId: 0
                  table:
                      input format: org.apache.hadoop.mapred.TextInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  Stage: Stage-0
    Fetch Operator
      limit: -1

-- Q3.trunk:  However, for this semantically different case, the plan is correct.
-- (we're allowed to push the filter down for an inner join)
hive> 
    > explain
    > SELECT * FROM
    >     (SELECT a.foo as foo1, b.foo as foo2, a.bar
    >     FROM pokes a JOIN pokes2 b
    >     ON a.foo=b.foo) a
    > WHERE a.bar=3;
OK
ABSTRACT SYNTAX TREE:
  (TOK_QUERY (TOK_FROM (TOK_SUBQUERY (TOK_QUERY (TOK_FROM (TOK_JOIN (TOK_TABREF pokes a) (TOK_TABREF pokes2 b) (= (. (TOK_TABLE_OR_COL a) foo) (. (TOK_TABLE_OR_COL b) foo)))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) foo) foo1) (TOK_SELEXPR (. (TOK_TABLE_OR_COL b) foo) foo2) (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) bar))))) a)) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR TOK_ALLCOLREF)) (TOK_WHERE (= (. (TOK_TABLE_OR_COL a) bar) 3))))

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        a:a 
          TableScan
            alias: a
            Filter Operator
              predicate:
                  expr: (bar = 3)
                  type: boolean
              Reduce Output Operator
                key expressions:
                      expr: foo
                      type: int
                sort order: +
                Map-reduce partition columns:
                      expr: foo
                      type: int
                tag: 0
                value expressions:
                      expr: foo
                      type: int
                      expr: bar
                      type: string
        a:b 
          TableScan
            alias: b
            Reduce Output Operator
              key expressions:
                    expr: foo
                    type: int
              sort order: +
              Map-reduce partition columns:
                    expr: foo
                    type: int
              tag: 1
              value expressions:
                    expr: foo
                    type: int
      Reduce Operator Tree:
        Join Operator
          condition map:
               Inner Join 0 to 1
          condition expressions:
            0 {VALUE._col0} {VALUE._col1}
            1 {VALUE._col0}
          handleSkewJoin: false
          outputColumnNames: _col0, _col1, _col4
          Select Operator
            expressions:
                  expr: _col0
                  type: int
                  expr: _col4
                  type: int
                  expr: _col1
                  type: string
            outputColumnNames: _col0, _col1, _col2
            Filter Operator
              predicate:
                  expr: (_col2 = 3)
                  type: boolean
              Select Operator
                expressions:
                      expr: _col0
                      type: int
                      expr: _col1
                      type: int
                      expr: _col2
                      type: string
                outputColumnNames: _col0, _col1, _col2
                File Output Operator
                  compressed: false
                  GlobalTableId: 0
                  table:
                      input format: org.apache.hadoop.mapred.TextInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  Stage: Stage-0
    Fetch Operator
      limit: -1
{noformat}

Now, repeating Q1/Q2/Q3 with the patch:

{noformat}

-- Q1.patch:  this plan is good
-- (same result as Q1.trunk, as expected)
hive> explain
    > SELECT a.foo as foo1, b.foo as foo2, b.bar
    > FROM pokes a LEFT OUTER JOIN pokes2 b
    > ON a.foo=b.foo
    > WHERE b.bar=3;
OK
ABSTRACT SYNTAX TREE:
  (TOK_QUERY (TOK_FROM (TOK_LEFTOUTERJOIN (TOK_TABREF pokes a) (TOK_TABREF pokes2 b) (= (. (TOK_TABLE_OR_COL a) foo) (. (TOK_TABLE_OR_COL b) foo)))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) foo) foo1) (TOK_SELEXPR (. (TOK_TABLE_OR_COL b) foo) foo2) (TOK_SELEXPR (. (TOK_TABLE_OR_COL b) bar))) (TOK_WHERE (= (. (TOK_TABLE_OR_COL b) bar) 3))))

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        a 
          TableScan
            alias: a
            Reduce Output Operator
              key expressions:
                    expr: foo
                    type: int
              sort order: +
              Map-reduce partition columns:
                    expr: foo
                    type: int
              tag: 0
              value expressions:
                    expr: foo
                    type: int
        b 
          TableScan
            alias: b
            Reduce Output Operator
              key expressions:
                    expr: foo
                    type: int
              sort order: +
              Map-reduce partition columns:
                    expr: foo
                    type: int
              tag: 1
              value expressions:
                    expr: foo
                    type: int
                    expr: bar
                    type: string
      Reduce Operator Tree:
        Join Operator
          condition map:
               Left Outer Join0 to 1
          condition expressions:
            0 {VALUE._col0}
            1 {VALUE._col0} {VALUE._col1}
          handleSkewJoin: false
          outputColumnNames: _col0, _col4, _col5
          Filter Operator
            predicate:
                expr: (_col5 = 3)
                type: boolean
            Select Operator
              expressions:
                    expr: _col0
                    type: int
                    expr: _col4
                    type: int
                    expr: _col5
                    type: string
              outputColumnNames: _col0, _col1, _col2
              File Output Operator
                compressed: false
                GlobalTableId: 0
                table:
                    input format: org.apache.hadoop.mapred.TextInputFormat
                    output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  Stage: Stage-0
    Fetch Operator
      limit: -1

-- Q2.patch:  this time, the plan is good (no pushdown)
-- (the patch fixes the bug exhibited on trunk)
hive> explain
    > SELECT * FROM
    >     (SELECT a.foo as foo1, b.foo as foo2, b.bar
    >     FROM pokes a LEFT OUTER JOIN pokes2 b
    >     ON a.foo=b.foo) a
    > WHERE a.bar=3;
OK
ABSTRACT SYNTAX TREE:
  (TOK_QUERY (TOK_FROM (TOK_SUBQUERY (TOK_QUERY (TOK_FROM (TOK_LEFTOUTERJOIN (TOK_TABREF pokes a) (TOK_TABREF pokes2 b) (= (. (TOK_TABLE_OR_COL a) foo) (. (TOK_TABLE_OR_COL b) foo)))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) foo) foo1) (TOK_SELEXPR (. (TOK_TABLE_OR_COL b) foo) foo2) (TOK_SELEXPR (. (TOK_TABLE_OR_COL b) bar))))) a)) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR TOK_ALLCOLREF)) (TOK_WHERE (= (. (TOK_TABLE_OR_COL a) bar) 3))))

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        a:a 
          TableScan
            alias: a
            Reduce Output Operator
              key expressions:
                    expr: foo
                    type: int
              sort order: +
              Map-reduce partition columns:
                    expr: foo
                    type: int
              tag: 0
              value expressions:
                    expr: foo
                    type: int
        a:b 
          TableScan
            alias: b
            Reduce Output Operator
              key expressions:
                    expr: foo
                    type: int
              sort order: +
              Map-reduce partition columns:
                    expr: foo
                    type: int
              tag: 1
              value expressions:
                    expr: foo
                    type: int
                    expr: bar
                    type: string
      Reduce Operator Tree:
        Join Operator
          condition map:
               Left Outer Join0 to 1
          condition expressions:
            0 {VALUE._col0}
            1 {VALUE._col0} {VALUE._col1}
          handleSkewJoin: false
          outputColumnNames: _col0, _col4, _col5
          Select Operator
            expressions:
                  expr: _col0
                  type: int
                  expr: _col4
                  type: int
                  expr: _col5
                  type: string
            outputColumnNames: _col0, _col1, _col2
            Filter Operator
              predicate:
                  expr: (_col2 = 3)
                  type: boolean
              Select Operator
                expressions:
                      expr: _col0
                      type: int
                      expr: _col1
                      type: int
                      expr: _col2
                      type: string
                outputColumnNames: _col0, _col1, _col2
                File Output Operator
                  compressed: false
                  GlobalTableId: 0
                  table:
                      input format: org.apache.hadoop.mapred.TextInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  Stage: Stage-0
    Fetch Operator
      limit: -1

-- Q3.patch:  whoops, now the plan is valid but suboptimal since the filter pushdown did not happen
-- (whereas it did with trunk)
hive> explain
    > SELECT * FROM
    >     (SELECT a.foo as foo1, b.foo as foo2, a.bar
    >     FROM pokes a JOIN pokes2 b
    >     ON a.foo=b.foo) a
    > WHERE a.bar=3;
OK
ABSTRACT SYNTAX TREE:
  (TOK_QUERY (TOK_FROM (TOK_SUBQUERY (TOK_QUERY (TOK_FROM (TOK_JOIN (TOK_TABREF pokes a) (TOK_TABREF pokes2 b) (= (. (TOK_TABLE_OR_COL a) foo) (. (TOK_TABLE_OR_COL b) foo)))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) foo) foo1) (TOK_SELEXPR (. (TOK_TABLE_OR_COL b) foo) foo2) (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) bar))))) a)) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR TOK_ALLCOLREF)) (TOK_WHERE (= (. (TOK_TABLE_OR_COL a) bar) 3))))

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        a:a 
          TableScan
            alias: a
            Reduce Output Operator
              key expressions:
                    expr: foo
                    type: int
              sort order: +
              Map-reduce partition columns:
                    expr: foo
                    type: int
              tag: 0
              value expressions:
                    expr: foo
                    type: int
                    expr: bar
                    type: string
        a:b 
          TableScan
            alias: b
            Reduce Output Operator
              key expressions:
                    expr: foo
                    type: int
              sort order: +
              Map-reduce partition columns:
                    expr: foo
                    type: int
              tag: 1
              value expressions:
                    expr: foo
                    type: int
      Reduce Operator Tree:
        Join Operator
          condition map:
               Inner Join 0 to 1
          condition expressions:
            0 {VALUE._col0} {VALUE._col1}
            1 {VALUE._col0}
          handleSkewJoin: false
          outputColumnNames: _col0, _col1, _col4
          Select Operator
            expressions:
                  expr: _col0
                  type: int
                  expr: _col4
                  type: int
                  expr: _col1
                  type: string
            outputColumnNames: _col0, _col1, _col2
            Filter Operator
              predicate:
                  expr: (_col2 = 3)
                  type: boolean
              Select Operator
                expressions:
                      expr: _col0
                      type: int
                      expr: _col1
                      type: int
                      expr: _col2
                      type: string
                outputColumnNames: _col0, _col1, _col2
                File Output Operator
                  compressed: false
                  GlobalTableId: 0
                  table:
                      input format: org.apache.hadoop.mapred.TextInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  Stage: Stage-0
    Fetch Operator
      limit: -1
{noformat}

So, we need a patch which takes care of Q2 while not causing a plan optimality regression for Q3.


> Predicate push down get error result when sub-queries have the same alias name 
> -------------------------------------------------------------------------------
>
>                 Key: HIVE-1342
>                 URL: https://issues.apache.org/jira/browse/HIVE-1342
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>    Affects Versions: 0.6.0
>            Reporter: Ted Xu
>            Assignee: Ted Xu
>            Priority: Critical
>             Fix For: 0.7.0
>
>         Attachments: cmd.hql, explain, ppd_same_alias_1.patch, ppd_same_alias_2.patch
>
>
> Query is over-optimized by PPD when sub-queries have the same alias name, see the query:
> -------------------------------
> create table if not exists dm_fact_buyer_prd_info_d (
> 		category_id string
> 		,gmv_trade_num  int
> 		,user_id    int
> 		)
> PARTITIONED BY (ds int);
> set hive.optimize.ppd=true;
> set hive.map.aggr=true;
> explain select category_id1,category_id2,assoc_idx
> from (
> 		select 
> 			category_id1
> 			, category_id2
> 			, count(distinct user_id) as assoc_idx
> 		from (
> 			select 
> 				t1.category_id as category_id1
> 				, t2.category_id as category_id2
> 				, t1.user_id
> 			from (
> 				select category_id, user_id
> 				from dm_fact_buyer_prd_info_d
> 				group by category_id, user_id ) t1
> 			join (
> 				select category_id, user_id
> 				from dm_fact_buyer_prd_info_d
> 				group by category_id, user_id ) t2 on t1.user_id=t2.user_id 
> 			) t1
> 			group by category_id1, category_id2 ) t_o
> 			where category_id1 <> category_id2
> 			and assoc_idx > 2;
> -----------------------------
> The query above will fail when execute, throwing exception: "can not cast UDFOpNotEqual(Text, IntWritable) to UDFOpNotEqual(Text, Text)". 
> I explained the query and the execute plan looks really wired ( only Stage-1, see the highlighted predicate):
> -------------------------------
> Stage: Stage-1
>     Map Reduce
>       Alias -> Map Operator Tree:
>         t_o:t1:t1:dm_fact_buyer_prd_info_d 
>           TableScan
>             alias: dm_fact_buyer_prd_info_d
>             Filter Operator
>               predicate:
>                   expr: *(category_id <> user_id)*
>                   type: boolean
>               Select Operator
>                 expressions:
>                       expr: category_id
>                       type: string
>                       expr: user_id
>                       type: bigint
>                 outputColumnNames: category_id, user_id
>                 Group By Operator
>                   keys:
>                         expr: category_id
>                         type: string
>                         expr: user_id
>                         type: bigint
>                   mode: hash
>                   outputColumnNames: _col0, _col1
>                   Reduce Output Operator
>                     key expressions:
>                           expr: _col0
>                           type: string
>                           expr: _col1
>                           type: bigint
>                     sort order: ++
>                     Map-reduce partition columns:
>                           expr: _col0
>                           type: string
>                           expr: _col1
>                           type: bigint
>                     tag: -1
>       Reduce Operator Tree:
>         Group By Operator
>           keys:
>                 expr: KEY._col0
>                 type: string
>                 expr: KEY._col1
>                 type: bigint
>           mode: mergepartial
>           outputColumnNames: _col0, _col1
>           Select Operator
>             expressions:
>                   expr: _col0
>                   type: string
>                   expr: _col1
>                   type: bigint
>             outputColumnNames: _col0, _col1
>             File Output Operator
>               compressed: true
>               GlobalTableId: 0
>               table:
>                   input format: org.apache.hadoop.mapred.SequenceFileInputFormat
>                   output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
>  ----------------------------------
> If disabling predicate push down (set hive.optimize.ppd=true), the error is gone; I tried disabling map side aggregate, the error is gone,too. 
> *Changing the alias of subquery 't1' (either the inner one or the join result), the bug disappears, too.*

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HIVE-1342) Predicate push down get error result when sub-queries have the same alias name

Posted by "John Sichi (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Sichi updated HIVE-1342:
-----------------------------

      Status: Open  (was: Patch Available)
    Assignee: Ted Xu

Hi Ted,

I don't think this patch is general enough.  To really fix the problem, it will be necessary to dig into it deeper and find out where we are currently using unscoped aliases (e.g. t2) where we should instead be using scoped aliases (e.g. t_o:t1:t2:dm_fact_buyer_prd_info_d).  I suspect that if you get to the root of the problem, the fix will take care of HIVE-1395 too.

Also, when submitting a patch, please run diff from the hive trunk directory (not from a subdirectory such as ql).


> Predicate push down get error result when sub-queries have the same alias name 
> -------------------------------------------------------------------------------
>
>                 Key: HIVE-1342
>                 URL: https://issues.apache.org/jira/browse/HIVE-1342
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>    Affects Versions: 0.6.0
>            Reporter: Ted Xu
>            Assignee: Ted Xu
>            Priority: Critical
>             Fix For: 0.6.0
>
>         Attachments: cmd.hql, explain, ppd_same_alias_1.patch
>
>
> Query is over-optimized by PPD when sub-queries have the same alias name, see the query:
> -------------------------------
> create table if not exists dm_fact_buyer_prd_info_d (
> 		category_id string
> 		,gmv_trade_num  int
> 		,user_id    int
> 		)
> PARTITIONED BY (ds int);
> set hive.optimize.ppd=true;
> set hive.map.aggr=true;
> explain select category_id1,category_id2,assoc_idx
> from (
> 		select 
> 			category_id1
> 			, category_id2
> 			, count(distinct user_id) as assoc_idx
> 		from (
> 			select 
> 				t1.category_id as category_id1
> 				, t2.category_id as category_id2
> 				, t1.user_id
> 			from (
> 				select category_id, user_id
> 				from dm_fact_buyer_prd_info_d
> 				group by category_id, user_id ) t1
> 			join (
> 				select category_id, user_id
> 				from dm_fact_buyer_prd_info_d
> 				group by category_id, user_id ) t2 on t1.user_id=t2.user_id 
> 			) t1
> 			group by category_id1, category_id2 ) t_o
> 			where category_id1 <> category_id2
> 			and assoc_idx > 2;
> -----------------------------
> The query above will fail when execute, throwing exception: "can not cast UDFOpNotEqual(Text, IntWritable) to UDFOpNotEqual(Text, Text)". 
> I explained the query and the execute plan looks really wired ( only Stage-1, see the highlighted predicate):
> -------------------------------
> Stage: Stage-1
>     Map Reduce
>       Alias -> Map Operator Tree:
>         t_o:t1:t1:dm_fact_buyer_prd_info_d 
>           TableScan
>             alias: dm_fact_buyer_prd_info_d
>             Filter Operator
>               predicate:
>                   expr: *(category_id <> user_id)*
>                   type: boolean
>               Select Operator
>                 expressions:
>                       expr: category_id
>                       type: string
>                       expr: user_id
>                       type: bigint
>                 outputColumnNames: category_id, user_id
>                 Group By Operator
>                   keys:
>                         expr: category_id
>                         type: string
>                         expr: user_id
>                         type: bigint
>                   mode: hash
>                   outputColumnNames: _col0, _col1
>                   Reduce Output Operator
>                     key expressions:
>                           expr: _col0
>                           type: string
>                           expr: _col1
>                           type: bigint
>                     sort order: ++
>                     Map-reduce partition columns:
>                           expr: _col0
>                           type: string
>                           expr: _col1
>                           type: bigint
>                     tag: -1
>       Reduce Operator Tree:
>         Group By Operator
>           keys:
>                 expr: KEY._col0
>                 type: string
>                 expr: KEY._col1
>                 type: bigint
>           mode: mergepartial
>           outputColumnNames: _col0, _col1
>           Select Operator
>             expressions:
>                   expr: _col0
>                   type: string
>                   expr: _col1
>                   type: bigint
>             outputColumnNames: _col0, _col1
>             File Output Operator
>               compressed: true
>               GlobalTableId: 0
>               table:
>                   input format: org.apache.hadoop.mapred.SequenceFileInputFormat
>                   output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
>  ----------------------------------
> If disabling predicate push down (set hive.optimize.ppd=true), the error is gone; I tried disabling map side aggregate, the error is gone,too. 
> *Changing the alias of subquery 't1' (either the inner one or the join result), the bug disappears, too.*

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-1342) Predicate push down get error result when sub-queries have the same alias name

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882203#action_12882203 ] 

He Yongqiang commented on HIVE-1342:
------------------------------------

It maybe better if we throw an error message when we see duplicate alias name.

> Predicate push down get error result when sub-queries have the same alias name 
> -------------------------------------------------------------------------------
>
>                 Key: HIVE-1342
>                 URL: https://issues.apache.org/jira/browse/HIVE-1342
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>    Affects Versions: 0.6.0
>            Reporter: Ted Xu
>            Priority: Critical
>             Fix For: 0.6.0
>
>         Attachments: cmd.hql, explain, ppd_same_alias_1.patch
>
>
> Query is over-optimized by PPD when sub-queries have the same alias name, see the query:
> -------------------------------
> create table if not exists dm_fact_buyer_prd_info_d (
> 		category_id string
> 		,gmv_trade_num  int
> 		,user_id    int
> 		)
> PARTITIONED BY (ds int);
> set hive.optimize.ppd=true;
> set hive.map.aggr=true;
> explain select category_id1,category_id2,assoc_idx
> from (
> 		select 
> 			category_id1
> 			, category_id2
> 			, count(distinct user_id) as assoc_idx
> 		from (
> 			select 
> 				t1.category_id as category_id1
> 				, t2.category_id as category_id2
> 				, t1.user_id
> 			from (
> 				select category_id, user_id
> 				from dm_fact_buyer_prd_info_d
> 				group by category_id, user_id ) t1
> 			join (
> 				select category_id, user_id
> 				from dm_fact_buyer_prd_info_d
> 				group by category_id, user_id ) t2 on t1.user_id=t2.user_id 
> 			) t1
> 			group by category_id1, category_id2 ) t_o
> 			where category_id1 <> category_id2
> 			and assoc_idx > 2;
> -----------------------------
> The query above will fail when execute, throwing exception: "can not cast UDFOpNotEqual(Text, IntWritable) to UDFOpNotEqual(Text, Text)". 
> I explained the query and the execute plan looks really wired ( only Stage-1, see the highlighted predicate):
> -------------------------------
> Stage: Stage-1
>     Map Reduce
>       Alias -> Map Operator Tree:
>         t_o:t1:t1:dm_fact_buyer_prd_info_d 
>           TableScan
>             alias: dm_fact_buyer_prd_info_d
>             Filter Operator
>               predicate:
>                   expr: *(category_id <> user_id)*
>                   type: boolean
>               Select Operator
>                 expressions:
>                       expr: category_id
>                       type: string
>                       expr: user_id
>                       type: bigint
>                 outputColumnNames: category_id, user_id
>                 Group By Operator
>                   keys:
>                         expr: category_id
>                         type: string
>                         expr: user_id
>                         type: bigint
>                   mode: hash
>                   outputColumnNames: _col0, _col1
>                   Reduce Output Operator
>                     key expressions:
>                           expr: _col0
>                           type: string
>                           expr: _col1
>                           type: bigint
>                     sort order: ++
>                     Map-reduce partition columns:
>                           expr: _col0
>                           type: string
>                           expr: _col1
>                           type: bigint
>                     tag: -1
>       Reduce Operator Tree:
>         Group By Operator
>           keys:
>                 expr: KEY._col0
>                 type: string
>                 expr: KEY._col1
>                 type: bigint
>           mode: mergepartial
>           outputColumnNames: _col0, _col1
>           Select Operator
>             expressions:
>                   expr: _col0
>                   type: string
>                   expr: _col1
>                   type: bigint
>             outputColumnNames: _col0, _col1
>             File Output Operator
>               compressed: true
>               GlobalTableId: 0
>               table:
>                   input format: org.apache.hadoop.mapred.SequenceFileInputFormat
>                   output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
>  ----------------------------------
> If disabling predicate push down (set hive.optimize.ppd=true), the error is gone; I tried disabling map side aggregate, the error is gone,too. 
> *Changing the alias of subquery 't1' (either the inner one or the join result), the bug disappears, too.*

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-1342) Predicate push down get error result when sub-queries have the same alias name

Posted by "Ted Xu (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12885101#action_12885101 ] 

Ted Xu commented on HIVE-1342:
------------------------------

The patch is not simply disables PPD, when encountered the special case (nested select over join) . It prevents replicated table resolve.

I tried the query above and it seems fine with the patch, that is, the predicate can be pushed into the subquery. The explain result is shown below:

{code}
STAGE PLANS:

  Stage: Stage-1
    Map Reduce

      Alias -> Map Operator Tree:

        z:a 

          TableScan

            alias: a

            Reduce Output Operator

              key expressions:

                    expr: foo

                    type: string

              sort order: +

              Map-reduce partition columns:

                    expr: foo

                    type: string

              tag: 0

              value expressions:

                    expr: foo

                    type: string

                    expr: bar

                    type: string

        z:b 

          TableScan

            alias: b

            Filter Operator

              predicate:

                  expr: (UDFToDouble(foo) = UDFToDouble(3))

                  type: boolean

              Reduce Output Operator

                key expressions:

                      expr: foo

                      type: string

                sort order: +

                Map-reduce partition columns:

                      expr: foo

                      type: string

                tag: 1

                value expressions:

                      expr: foo

                      type: string

      Reduce Operator Tree:

        Join Operator

          condition map:

               Left Outer Join0 to 1

          condition expressions:

            0 {VALUE._col0} {VALUE._col1}

            1 {VALUE._col0}

          outputColumnNames: _col0, _col1, _col2

          Select Operator

            expressions:

                  expr: _col0

                  type: string

                  expr: _col2

                  type: string

                  expr: _col1

                  type: string

            outputColumnNames: _col0, _col1, _col2

            Filter Operator

              predicate:

                  expr: (UDFToDouble(_col2) = UDFToDouble(3))

                  type: boolean

              Select Operator

                expressions:

                      expr: _col0

                      type: string

                      expr: _col1

                      type: string

                      expr: _col2

                      type: string

                outputColumnNames: _col0, _col1, _col2

                File Output Operator

                  compressed: false

                  GlobalTableId: 0

                  table:

                      input format: org.apache.hadoop.mapred.TextInputFormat

                      output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat



  Stage: Stage-0
    Fetch Operator

      limit: -1
{code}

I think the reason why trunk version cannot push predicate into the subquery is that it did a replicated table resolve therefore can't find any table suitable for that predicate, not disabling PPD purposely.


> Predicate push down get error result when sub-queries have the same alias name 
> -------------------------------------------------------------------------------
>
>                 Key: HIVE-1342
>                 URL: https://issues.apache.org/jira/browse/HIVE-1342
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>    Affects Versions: 0.6.0
>            Reporter: Ted Xu
>            Assignee: Ted Xu
>            Priority: Critical
>             Fix For: 0.6.0
>
>         Attachments: cmd.hql, explain, ppd_same_alias_1.patch, ppd_same_alias_2.patch
>
>
> Query is over-optimized by PPD when sub-queries have the same alias name, see the query:
> -------------------------------
> create table if not exists dm_fact_buyer_prd_info_d (
> 		category_id string
> 		,gmv_trade_num  int
> 		,user_id    int
> 		)
> PARTITIONED BY (ds int);
> set hive.optimize.ppd=true;
> set hive.map.aggr=true;
> explain select category_id1,category_id2,assoc_idx
> from (
> 		select 
> 			category_id1
> 			, category_id2
> 			, count(distinct user_id) as assoc_idx
> 		from (
> 			select 
> 				t1.category_id as category_id1
> 				, t2.category_id as category_id2
> 				, t1.user_id
> 			from (
> 				select category_id, user_id
> 				from dm_fact_buyer_prd_info_d
> 				group by category_id, user_id ) t1
> 			join (
> 				select category_id, user_id
> 				from dm_fact_buyer_prd_info_d
> 				group by category_id, user_id ) t2 on t1.user_id=t2.user_id 
> 			) t1
> 			group by category_id1, category_id2 ) t_o
> 			where category_id1 <> category_id2
> 			and assoc_idx > 2;
> -----------------------------
> The query above will fail when execute, throwing exception: "can not cast UDFOpNotEqual(Text, IntWritable) to UDFOpNotEqual(Text, Text)". 
> I explained the query and the execute plan looks really wired ( only Stage-1, see the highlighted predicate):
> -------------------------------
> Stage: Stage-1
>     Map Reduce
>       Alias -> Map Operator Tree:
>         t_o:t1:t1:dm_fact_buyer_prd_info_d 
>           TableScan
>             alias: dm_fact_buyer_prd_info_d
>             Filter Operator
>               predicate:
>                   expr: *(category_id <> user_id)*
>                   type: boolean
>               Select Operator
>                 expressions:
>                       expr: category_id
>                       type: string
>                       expr: user_id
>                       type: bigint
>                 outputColumnNames: category_id, user_id
>                 Group By Operator
>                   keys:
>                         expr: category_id
>                         type: string
>                         expr: user_id
>                         type: bigint
>                   mode: hash
>                   outputColumnNames: _col0, _col1
>                   Reduce Output Operator
>                     key expressions:
>                           expr: _col0
>                           type: string
>                           expr: _col1
>                           type: bigint
>                     sort order: ++
>                     Map-reduce partition columns:
>                           expr: _col0
>                           type: string
>                           expr: _col1
>                           type: bigint
>                     tag: -1
>       Reduce Operator Tree:
>         Group By Operator
>           keys:
>                 expr: KEY._col0
>                 type: string
>                 expr: KEY._col1
>                 type: bigint
>           mode: mergepartial
>           outputColumnNames: _col0, _col1
>           Select Operator
>             expressions:
>                   expr: _col0
>                   type: string
>                   expr: _col1
>                   type: bigint
>             outputColumnNames: _col0, _col1
>             File Output Operator
>               compressed: true
>               GlobalTableId: 0
>               table:
>                   input format: org.apache.hadoop.mapred.SequenceFileInputFormat
>                   output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
>  ----------------------------------
> If disabling predicate push down (set hive.optimize.ppd=true), the error is gone; I tried disabling map side aggregate, the error is gone,too. 
> *Changing the alias of subquery 't1' (either the inner one or the join result), the bug disappears, too.*

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HIVE-1342) Predicate push down get error result when sub-queries have the same alias name

Posted by "Ted Xu (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Xu updated HIVE-1342:
-------------------------

    Attachment: ppd_same_alias_2.patch

Hi John,

Thank you for reviewing the patch. I updated the patch to solve HIVE-1395.

I dig into the code and find that the same alias in different subqueries can be ambiguous only if PPD is parsing CommonJoinOperator, so I just add some special case in PPD for CommonJoinOperator.

As you mentioned above, adding namespace to RowResolver or OpParseContext can also fix it, but I think we better keep their implementation simple.

> Predicate push down get error result when sub-queries have the same alias name 
> -------------------------------------------------------------------------------
>
>                 Key: HIVE-1342
>                 URL: https://issues.apache.org/jira/browse/HIVE-1342
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>    Affects Versions: 0.6.0
>            Reporter: Ted Xu
>            Assignee: Ted Xu
>            Priority: Critical
>             Fix For: 0.6.0
>
>         Attachments: cmd.hql, explain, ppd_same_alias_1.patch, ppd_same_alias_2.patch
>
>
> Query is over-optimized by PPD when sub-queries have the same alias name, see the query:
> -------------------------------
> create table if not exists dm_fact_buyer_prd_info_d (
> 		category_id string
> 		,gmv_trade_num  int
> 		,user_id    int
> 		)
> PARTITIONED BY (ds int);
> set hive.optimize.ppd=true;
> set hive.map.aggr=true;
> explain select category_id1,category_id2,assoc_idx
> from (
> 		select 
> 			category_id1
> 			, category_id2
> 			, count(distinct user_id) as assoc_idx
> 		from (
> 			select 
> 				t1.category_id as category_id1
> 				, t2.category_id as category_id2
> 				, t1.user_id
> 			from (
> 				select category_id, user_id
> 				from dm_fact_buyer_prd_info_d
> 				group by category_id, user_id ) t1
> 			join (
> 				select category_id, user_id
> 				from dm_fact_buyer_prd_info_d
> 				group by category_id, user_id ) t2 on t1.user_id=t2.user_id 
> 			) t1
> 			group by category_id1, category_id2 ) t_o
> 			where category_id1 <> category_id2
> 			and assoc_idx > 2;
> -----------------------------
> The query above will fail when execute, throwing exception: "can not cast UDFOpNotEqual(Text, IntWritable) to UDFOpNotEqual(Text, Text)". 
> I explained the query and the execute plan looks really wired ( only Stage-1, see the highlighted predicate):
> -------------------------------
> Stage: Stage-1
>     Map Reduce
>       Alias -> Map Operator Tree:
>         t_o:t1:t1:dm_fact_buyer_prd_info_d 
>           TableScan
>             alias: dm_fact_buyer_prd_info_d
>             Filter Operator
>               predicate:
>                   expr: *(category_id <> user_id)*
>                   type: boolean
>               Select Operator
>                 expressions:
>                       expr: category_id
>                       type: string
>                       expr: user_id
>                       type: bigint
>                 outputColumnNames: category_id, user_id
>                 Group By Operator
>                   keys:
>                         expr: category_id
>                         type: string
>                         expr: user_id
>                         type: bigint
>                   mode: hash
>                   outputColumnNames: _col0, _col1
>                   Reduce Output Operator
>                     key expressions:
>                           expr: _col0
>                           type: string
>                           expr: _col1
>                           type: bigint
>                     sort order: ++
>                     Map-reduce partition columns:
>                           expr: _col0
>                           type: string
>                           expr: _col1
>                           type: bigint
>                     tag: -1
>       Reduce Operator Tree:
>         Group By Operator
>           keys:
>                 expr: KEY._col0
>                 type: string
>                 expr: KEY._col1
>                 type: bigint
>           mode: mergepartial
>           outputColumnNames: _col0, _col1
>           Select Operator
>             expressions:
>                   expr: _col0
>                   type: string
>                   expr: _col1
>                   type: bigint
>             outputColumnNames: _col0, _col1
>             File Output Operator
>               compressed: true
>               GlobalTableId: 0
>               table:
>                   input format: org.apache.hadoop.mapred.SequenceFileInputFormat
>                   output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
>  ----------------------------------
> If disabling predicate push down (set hive.optimize.ppd=true), the error is gone; I tried disabling map side aggregate, the error is gone,too. 
> *Changing the alias of subquery 't1' (either the inner one or the join result), the bug disappears, too.*

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HIVE-1342) Predicate push down get error result when sub-queries have the same alias name

Posted by "Ted Xu (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Xu updated HIVE-1342:
-------------------------

               Status: Patch Available  (was: Open)
    Affects Version/s: 0.6.0
                           (was: 0.5.0)
                           (was: 0.4.2)
        Fix Version/s: 0.6.0

> Predicate push down get error result when sub-queries have the same alias name 
> -------------------------------------------------------------------------------
>
>                 Key: HIVE-1342
>                 URL: https://issues.apache.org/jira/browse/HIVE-1342
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>    Affects Versions: 0.6.0
>            Reporter: Ted Xu
>            Priority: Critical
>             Fix For: 0.6.0
>
>         Attachments: cmd.hql, explain, ppd_same_alias_1.patch
>
>
> Query is over-optimized by PPD when sub-queries have the same alias name, see the query:
> -------------------------------
> create table if not exists dm_fact_buyer_prd_info_d (
> 		category_id string
> 		,gmv_trade_num  int
> 		,user_id    int
> 		)
> PARTITIONED BY (ds int);
> set hive.optimize.ppd=true;
> set hive.map.aggr=true;
> explain select category_id1,category_id2,assoc_idx
> from (
> 		select 
> 			category_id1
> 			, category_id2
> 			, count(distinct user_id) as assoc_idx
> 		from (
> 			select 
> 				t1.category_id as category_id1
> 				, t2.category_id as category_id2
> 				, t1.user_id
> 			from (
> 				select category_id, user_id
> 				from dm_fact_buyer_prd_info_d
> 				group by category_id, user_id ) t1
> 			join (
> 				select category_id, user_id
> 				from dm_fact_buyer_prd_info_d
> 				group by category_id, user_id ) t2 on t1.user_id=t2.user_id 
> 			) t1
> 			group by category_id1, category_id2 ) t_o
> 			where category_id1 <> category_id2
> 			and assoc_idx > 2;
> -----------------------------
> The query above will fail when execute, throwing exception: "can not cast UDFOpNotEqual(Text, IntWritable) to UDFOpNotEqual(Text, Text)". 
> I explained the query and the execute plan looks really wired ( only Stage-1, see the highlighted predicate):
> -------------------------------
> Stage: Stage-1
>     Map Reduce
>       Alias -> Map Operator Tree:
>         t_o:t1:t1:dm_fact_buyer_prd_info_d 
>           TableScan
>             alias: dm_fact_buyer_prd_info_d
>             Filter Operator
>               predicate:
>                   expr: *(category_id <> user_id)*
>                   type: boolean
>               Select Operator
>                 expressions:
>                       expr: category_id
>                       type: string
>                       expr: user_id
>                       type: bigint
>                 outputColumnNames: category_id, user_id
>                 Group By Operator
>                   keys:
>                         expr: category_id
>                         type: string
>                         expr: user_id
>                         type: bigint
>                   mode: hash
>                   outputColumnNames: _col0, _col1
>                   Reduce Output Operator
>                     key expressions:
>                           expr: _col0
>                           type: string
>                           expr: _col1
>                           type: bigint
>                     sort order: ++
>                     Map-reduce partition columns:
>                           expr: _col0
>                           type: string
>                           expr: _col1
>                           type: bigint
>                     tag: -1
>       Reduce Operator Tree:
>         Group By Operator
>           keys:
>                 expr: KEY._col0
>                 type: string
>                 expr: KEY._col1
>                 type: bigint
>           mode: mergepartial
>           outputColumnNames: _col0, _col1
>           Select Operator
>             expressions:
>                   expr: _col0
>                   type: string
>                   expr: _col1
>                   type: bigint
>             outputColumnNames: _col0, _col1
>             File Output Operator
>               compressed: true
>               GlobalTableId: 0
>               table:
>                   input format: org.apache.hadoop.mapred.SequenceFileInputFormat
>                   output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
>  ----------------------------------
> If disabling predicate push down (set hive.optimize.ppd=true), the error is gone; I tried disabling map side aggregate, the error is gone,too. 
> *Changing the alias of subquery 't1' (either the inner one or the join result), the bug disappears, too.*

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-1342) Predicate push down get error result when sub-queries have the same alias name

Posted by "John Sichi (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910339#action_12910339 ] 

John Sichi commented on HIVE-1342:
----------------------------------

For easy copy/paste into CLI, here are the three queries by themselves.

{noformat}
-- Q1
explain
SELECT a.foo as foo1, b.foo as foo2, b.bar
FROM pokes a LEFT OUTER JOIN pokes2 b
ON a.foo=b.foo
WHERE b.bar=3;

-- Q2
explain
SELECT * FROM
    (SELECT a.foo as foo1, b.foo as foo2, b.bar
    FROM pokes a LEFT OUTER JOIN pokes2 b
    ON a.foo=b.foo) a
WHERE a.bar=3;

-- Q3
explain
SELECT * FROM
    (SELECT a.foo as foo1, b.foo as foo2, a.bar
    FROM pokes a JOIN pokes2 b
    ON a.foo=b.foo) a
WHERE a.bar=3;
{noformat}


> Predicate push down get error result when sub-queries have the same alias name 
> -------------------------------------------------------------------------------
>
>                 Key: HIVE-1342
>                 URL: https://issues.apache.org/jira/browse/HIVE-1342
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>    Affects Versions: 0.6.0
>            Reporter: Ted Xu
>            Assignee: Ted Xu
>            Priority: Critical
>             Fix For: 0.7.0
>
>         Attachments: cmd.hql, explain, ppd_same_alias_1.patch, ppd_same_alias_2.patch
>
>
> Query is over-optimized by PPD when sub-queries have the same alias name, see the query:
> -------------------------------
> create table if not exists dm_fact_buyer_prd_info_d (
> 		category_id string
> 		,gmv_trade_num  int
> 		,user_id    int
> 		)
> PARTITIONED BY (ds int);
> set hive.optimize.ppd=true;
> set hive.map.aggr=true;
> explain select category_id1,category_id2,assoc_idx
> from (
> 		select 
> 			category_id1
> 			, category_id2
> 			, count(distinct user_id) as assoc_idx
> 		from (
> 			select 
> 				t1.category_id as category_id1
> 				, t2.category_id as category_id2
> 				, t1.user_id
> 			from (
> 				select category_id, user_id
> 				from dm_fact_buyer_prd_info_d
> 				group by category_id, user_id ) t1
> 			join (
> 				select category_id, user_id
> 				from dm_fact_buyer_prd_info_d
> 				group by category_id, user_id ) t2 on t1.user_id=t2.user_id 
> 			) t1
> 			group by category_id1, category_id2 ) t_o
> 			where category_id1 <> category_id2
> 			and assoc_idx > 2;
> -----------------------------
> The query above will fail when execute, throwing exception: "can not cast UDFOpNotEqual(Text, IntWritable) to UDFOpNotEqual(Text, Text)". 
> I explained the query and the execute plan looks really wired ( only Stage-1, see the highlighted predicate):
> -------------------------------
> Stage: Stage-1
>     Map Reduce
>       Alias -> Map Operator Tree:
>         t_o:t1:t1:dm_fact_buyer_prd_info_d 
>           TableScan
>             alias: dm_fact_buyer_prd_info_d
>             Filter Operator
>               predicate:
>                   expr: *(category_id <> user_id)*
>                   type: boolean
>               Select Operator
>                 expressions:
>                       expr: category_id
>                       type: string
>                       expr: user_id
>                       type: bigint
>                 outputColumnNames: category_id, user_id
>                 Group By Operator
>                   keys:
>                         expr: category_id
>                         type: string
>                         expr: user_id
>                         type: bigint
>                   mode: hash
>                   outputColumnNames: _col0, _col1
>                   Reduce Output Operator
>                     key expressions:
>                           expr: _col0
>                           type: string
>                           expr: _col1
>                           type: bigint
>                     sort order: ++
>                     Map-reduce partition columns:
>                           expr: _col0
>                           type: string
>                           expr: _col1
>                           type: bigint
>                     tag: -1
>       Reduce Operator Tree:
>         Group By Operator
>           keys:
>                 expr: KEY._col0
>                 type: string
>                 expr: KEY._col1
>                 type: bigint
>           mode: mergepartial
>           outputColumnNames: _col0, _col1
>           Select Operator
>             expressions:
>                   expr: _col0
>                   type: string
>                   expr: _col1
>                   type: bigint
>             outputColumnNames: _col0, _col1
>             File Output Operator
>               compressed: true
>               GlobalTableId: 0
>               table:
>                   input format: org.apache.hadoop.mapred.SequenceFileInputFormat
>                   output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
>  ----------------------------------
> If disabling predicate push down (set hive.optimize.ppd=true), the error is gone; I tried disabling map side aggregate, the error is gone,too. 
> *Changing the alias of subquery 't1' (either the inner one or the join result), the bug disappears, too.*

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HIVE-1342) Predicate push down get error result when sub-queries have the same alias name

Posted by "Ted Xu (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Xu updated HIVE-1342:
-------------------------

    Attachment: cmd.hql
                explain

> Predicate push down get error result when sub-queries have the same alias name 
> -------------------------------------------------------------------------------
>
>                 Key: HIVE-1342
>                 URL: https://issues.apache.org/jira/browse/HIVE-1342
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>    Affects Versions: 0.4.2, 0.5.0
>            Reporter: Ted Xu
>            Priority: Critical
>         Attachments: cmd.hql, explain
>
>
> Query is over-optimized by PPD when sub-queries have the same alias name, see the query:
> -------------------------------
> create table if not exists dm_fact_buyer_prd_info_d (
> 		category_id string
> 		,gmv_trade_num  int
> 		,user_id    int
> 		)
> PARTITIONED BY (ds int);
> set hive.optimize.ppd=true;
> set hive.map.aggr=true;
> explain select category_id1,category_id2,assoc_idx
> from (
> 		select 
> 			category_id1
> 			, category_id2
> 			, count(distinct user_id) as assoc_idx
> 		from (
> 			select 
> 				t1.category_id as category_id1
> 				, t2.category_id as category_id2
> 				, t1.user_id
> 			from (
> 				select category_id, user_id
> 				from dm_fact_buyer_prd_info_d
> 				group by category_id, user_id ) t1
> 			join (
> 				select category_id, user_id
> 				from dm_fact_buyer_prd_info_d
> 				group by category_id, user_id ) t2 on t1.user_id=t2.user_id 
> 			) t1
> 			group by category_id1, category_id2 ) t_o
> 			where category_id1 <> category_id2
> 			and assoc_idx > 2;
> -----------------------------
> The query above will fail when execute, throwing exception: "can not cast UDFOpNotEqual(Text, IntWritable) to UDFOpNotEqual(Text, Text)". 
> I explained the query and the execute plan looks really wired ( only Stage-1, see the highlighted predicate):
> -------------------------------
> Stage: Stage-1
>     Map Reduce
>       Alias -> Map Operator Tree:
>         t_o:t1:t1:dm_fact_buyer_prd_info_d 
>           TableScan
>             alias: dm_fact_buyer_prd_info_d
>             Filter Operator
>               predicate:
>                   expr: *(category_id <> user_id)*
>                   type: boolean
>               Select Operator
>                 expressions:
>                       expr: category_id
>                       type: string
>                       expr: user_id
>                       type: bigint
>                 outputColumnNames: category_id, user_id
>                 Group By Operator
>                   keys:
>                         expr: category_id
>                         type: string
>                         expr: user_id
>                         type: bigint
>                   mode: hash
>                   outputColumnNames: _col0, _col1
>                   Reduce Output Operator
>                     key expressions:
>                           expr: _col0
>                           type: string
>                           expr: _col1
>                           type: bigint
>                     sort order: ++
>                     Map-reduce partition columns:
>                           expr: _col0
>                           type: string
>                           expr: _col1
>                           type: bigint
>                     tag: -1
>       Reduce Operator Tree:
>         Group By Operator
>           keys:
>                 expr: KEY._col0
>                 type: string
>                 expr: KEY._col1
>                 type: bigint
>           mode: mergepartial
>           outputColumnNames: _col0, _col1
>           Select Operator
>             expressions:
>                   expr: _col0
>                   type: string
>                   expr: _col1
>                   type: bigint
>             outputColumnNames: _col0, _col1
>             File Output Operator
>               compressed: true
>               GlobalTableId: 0
>               table:
>                   input format: org.apache.hadoop.mapred.SequenceFileInputFormat
>                   output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
>  ----------------------------------
> If disabling predicate push down (set hive.optimize.ppd=true), the error is gone; I tried disabling map side aggregate, the error is gone,too. 
> *Changing the alias of subquery 't1' (either the inner one or the join result), the bug disappears, too.*

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-1342) Predicate push down get error result when sub-queries have the same alias name

Posted by "John Sichi (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882236#action_12882236 ] 

John Sichi commented on HIVE-1342:
----------------------------------

Commentary on duplicate aliases in HIVE-1395.


> Predicate push down get error result when sub-queries have the same alias name 
> -------------------------------------------------------------------------------
>
>                 Key: HIVE-1342
>                 URL: https://issues.apache.org/jira/browse/HIVE-1342
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>    Affects Versions: 0.6.0
>            Reporter: Ted Xu
>            Priority: Critical
>             Fix For: 0.6.0
>
>         Attachments: cmd.hql, explain, ppd_same_alias_1.patch
>
>
> Query is over-optimized by PPD when sub-queries have the same alias name, see the query:
> -------------------------------
> create table if not exists dm_fact_buyer_prd_info_d (
> 		category_id string
> 		,gmv_trade_num  int
> 		,user_id    int
> 		)
> PARTITIONED BY (ds int);
> set hive.optimize.ppd=true;
> set hive.map.aggr=true;
> explain select category_id1,category_id2,assoc_idx
> from (
> 		select 
> 			category_id1
> 			, category_id2
> 			, count(distinct user_id) as assoc_idx
> 		from (
> 			select 
> 				t1.category_id as category_id1
> 				, t2.category_id as category_id2
> 				, t1.user_id
> 			from (
> 				select category_id, user_id
> 				from dm_fact_buyer_prd_info_d
> 				group by category_id, user_id ) t1
> 			join (
> 				select category_id, user_id
> 				from dm_fact_buyer_prd_info_d
> 				group by category_id, user_id ) t2 on t1.user_id=t2.user_id 
> 			) t1
> 			group by category_id1, category_id2 ) t_o
> 			where category_id1 <> category_id2
> 			and assoc_idx > 2;
> -----------------------------
> The query above will fail when execute, throwing exception: "can not cast UDFOpNotEqual(Text, IntWritable) to UDFOpNotEqual(Text, Text)". 
> I explained the query and the execute plan looks really wired ( only Stage-1, see the highlighted predicate):
> -------------------------------
> Stage: Stage-1
>     Map Reduce
>       Alias -> Map Operator Tree:
>         t_o:t1:t1:dm_fact_buyer_prd_info_d 
>           TableScan
>             alias: dm_fact_buyer_prd_info_d
>             Filter Operator
>               predicate:
>                   expr: *(category_id <> user_id)*
>                   type: boolean
>               Select Operator
>                 expressions:
>                       expr: category_id
>                       type: string
>                       expr: user_id
>                       type: bigint
>                 outputColumnNames: category_id, user_id
>                 Group By Operator
>                   keys:
>                         expr: category_id
>                         type: string
>                         expr: user_id
>                         type: bigint
>                   mode: hash
>                   outputColumnNames: _col0, _col1
>                   Reduce Output Operator
>                     key expressions:
>                           expr: _col0
>                           type: string
>                           expr: _col1
>                           type: bigint
>                     sort order: ++
>                     Map-reduce partition columns:
>                           expr: _col0
>                           type: string
>                           expr: _col1
>                           type: bigint
>                     tag: -1
>       Reduce Operator Tree:
>         Group By Operator
>           keys:
>                 expr: KEY._col0
>                 type: string
>                 expr: KEY._col1
>                 type: bigint
>           mode: mergepartial
>           outputColumnNames: _col0, _col1
>           Select Operator
>             expressions:
>                   expr: _col0
>                   type: string
>                   expr: _col1
>                   type: bigint
>             outputColumnNames: _col0, _col1
>             File Output Operator
>               compressed: true
>               GlobalTableId: 0
>               table:
>                   input format: org.apache.hadoop.mapred.SequenceFileInputFormat
>                   output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
>  ----------------------------------
> If disabling predicate push down (set hive.optimize.ppd=true), the error is gone; I tried disabling map side aggregate, the error is gone,too. 
> *Changing the alias of subquery 't1' (either the inner one or the join result), the bug disappears, too.*

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HIVE-1342) Predicate push down get error result when sub-queries have the same alias name

Posted by "Ted Xu (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Xu updated HIVE-1342:
-------------------------

    Attachment: ppd_same_alias_1.patch

I think PPD is unnecessarily resolving table aliases when encountered CommonJoinOperator. 
I attached a patch fixing it. Please have a review.

> Predicate push down get error result when sub-queries have the same alias name 
> -------------------------------------------------------------------------------
>
>                 Key: HIVE-1342
>                 URL: https://issues.apache.org/jira/browse/HIVE-1342
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>    Affects Versions: 0.4.2, 0.5.0
>            Reporter: Ted Xu
>            Priority: Critical
>         Attachments: cmd.hql, explain, ppd_same_alias_1.patch
>
>
> Query is over-optimized by PPD when sub-queries have the same alias name, see the query:
> -------------------------------
> create table if not exists dm_fact_buyer_prd_info_d (
> 		category_id string
> 		,gmv_trade_num  int
> 		,user_id    int
> 		)
> PARTITIONED BY (ds int);
> set hive.optimize.ppd=true;
> set hive.map.aggr=true;
> explain select category_id1,category_id2,assoc_idx
> from (
> 		select 
> 			category_id1
> 			, category_id2
> 			, count(distinct user_id) as assoc_idx
> 		from (
> 			select 
> 				t1.category_id as category_id1
> 				, t2.category_id as category_id2
> 				, t1.user_id
> 			from (
> 				select category_id, user_id
> 				from dm_fact_buyer_prd_info_d
> 				group by category_id, user_id ) t1
> 			join (
> 				select category_id, user_id
> 				from dm_fact_buyer_prd_info_d
> 				group by category_id, user_id ) t2 on t1.user_id=t2.user_id 
> 			) t1
> 			group by category_id1, category_id2 ) t_o
> 			where category_id1 <> category_id2
> 			and assoc_idx > 2;
> -----------------------------
> The query above will fail when execute, throwing exception: "can not cast UDFOpNotEqual(Text, IntWritable) to UDFOpNotEqual(Text, Text)". 
> I explained the query and the execute plan looks really wired ( only Stage-1, see the highlighted predicate):
> -------------------------------
> Stage: Stage-1
>     Map Reduce
>       Alias -> Map Operator Tree:
>         t_o:t1:t1:dm_fact_buyer_prd_info_d 
>           TableScan
>             alias: dm_fact_buyer_prd_info_d
>             Filter Operator
>               predicate:
>                   expr: *(category_id <> user_id)*
>                   type: boolean
>               Select Operator
>                 expressions:
>                       expr: category_id
>                       type: string
>                       expr: user_id
>                       type: bigint
>                 outputColumnNames: category_id, user_id
>                 Group By Operator
>                   keys:
>                         expr: category_id
>                         type: string
>                         expr: user_id
>                         type: bigint
>                   mode: hash
>                   outputColumnNames: _col0, _col1
>                   Reduce Output Operator
>                     key expressions:
>                           expr: _col0
>                           type: string
>                           expr: _col1
>                           type: bigint
>                     sort order: ++
>                     Map-reduce partition columns:
>                           expr: _col0
>                           type: string
>                           expr: _col1
>                           type: bigint
>                     tag: -1
>       Reduce Operator Tree:
>         Group By Operator
>           keys:
>                 expr: KEY._col0
>                 type: string
>                 expr: KEY._col1
>                 type: bigint
>           mode: mergepartial
>           outputColumnNames: _col0, _col1
>           Select Operator
>             expressions:
>                   expr: _col0
>                   type: string
>                   expr: _col1
>                   type: bigint
>             outputColumnNames: _col0, _col1
>             File Output Operator
>               compressed: true
>               GlobalTableId: 0
>               table:
>                   input format: org.apache.hadoop.mapred.SequenceFileInputFormat
>                   output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
>  ----------------------------------
> If disabling predicate push down (set hive.optimize.ppd=true), the error is gone; I tried disabling map side aggregate, the error is gone,too. 
> *Changing the alias of subquery 't1' (either the inner one or the join result), the bug disappears, too.*

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HIVE-1342) Predicate push down get error result when sub-queries have the same alias name

Posted by "Carl Steinbach (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Carl Steinbach updated HIVE-1342:
---------------------------------

    Fix Version/s: 0.7.0
                       (was: 0.6.0)

> Predicate push down get error result when sub-queries have the same alias name 
> -------------------------------------------------------------------------------
>
>                 Key: HIVE-1342
>                 URL: https://issues.apache.org/jira/browse/HIVE-1342
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>    Affects Versions: 0.6.0
>            Reporter: Ted Xu
>            Assignee: Ted Xu
>            Priority: Critical
>             Fix For: 0.7.0
>
>         Attachments: cmd.hql, explain, ppd_same_alias_1.patch, ppd_same_alias_2.patch
>
>
> Query is over-optimized by PPD when sub-queries have the same alias name, see the query:
> -------------------------------
> create table if not exists dm_fact_buyer_prd_info_d (
> 		category_id string
> 		,gmv_trade_num  int
> 		,user_id    int
> 		)
> PARTITIONED BY (ds int);
> set hive.optimize.ppd=true;
> set hive.map.aggr=true;
> explain select category_id1,category_id2,assoc_idx
> from (
> 		select 
> 			category_id1
> 			, category_id2
> 			, count(distinct user_id) as assoc_idx
> 		from (
> 			select 
> 				t1.category_id as category_id1
> 				, t2.category_id as category_id2
> 				, t1.user_id
> 			from (
> 				select category_id, user_id
> 				from dm_fact_buyer_prd_info_d
> 				group by category_id, user_id ) t1
> 			join (
> 				select category_id, user_id
> 				from dm_fact_buyer_prd_info_d
> 				group by category_id, user_id ) t2 on t1.user_id=t2.user_id 
> 			) t1
> 			group by category_id1, category_id2 ) t_o
> 			where category_id1 <> category_id2
> 			and assoc_idx > 2;
> -----------------------------
> The query above will fail when execute, throwing exception: "can not cast UDFOpNotEqual(Text, IntWritable) to UDFOpNotEqual(Text, Text)". 
> I explained the query and the execute plan looks really wired ( only Stage-1, see the highlighted predicate):
> -------------------------------
> Stage: Stage-1
>     Map Reduce
>       Alias -> Map Operator Tree:
>         t_o:t1:t1:dm_fact_buyer_prd_info_d 
>           TableScan
>             alias: dm_fact_buyer_prd_info_d
>             Filter Operator
>               predicate:
>                   expr: *(category_id <> user_id)*
>                   type: boolean
>               Select Operator
>                 expressions:
>                       expr: category_id
>                       type: string
>                       expr: user_id
>                       type: bigint
>                 outputColumnNames: category_id, user_id
>                 Group By Operator
>                   keys:
>                         expr: category_id
>                         type: string
>                         expr: user_id
>                         type: bigint
>                   mode: hash
>                   outputColumnNames: _col0, _col1
>                   Reduce Output Operator
>                     key expressions:
>                           expr: _col0
>                           type: string
>                           expr: _col1
>                           type: bigint
>                     sort order: ++
>                     Map-reduce partition columns:
>                           expr: _col0
>                           type: string
>                           expr: _col1
>                           type: bigint
>                     tag: -1
>       Reduce Operator Tree:
>         Group By Operator
>           keys:
>                 expr: KEY._col0
>                 type: string
>                 expr: KEY._col1
>                 type: bigint
>           mode: mergepartial
>           outputColumnNames: _col0, _col1
>           Select Operator
>             expressions:
>                   expr: _col0
>                   type: string
>                   expr: _col1
>                   type: bigint
>             outputColumnNames: _col0, _col1
>             File Output Operator
>               compressed: true
>               GlobalTableId: 0
>               table:
>                   input format: org.apache.hadoop.mapred.SequenceFileInputFormat
>                   output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
>  ----------------------------------
> If disabling predicate push down (set hive.optimize.ppd=true), the error is gone; I tried disabling map side aggregate, the error is gone,too. 
> *Changing the alias of subquery 't1' (either the inner one or the join result), the bug disappears, too.*

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-1342) Predicate push down get error result when sub-queries have the same alias name

Posted by "John Sichi (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884796#action_12884796 ] 

John Sichi commented on HIVE-1342:
----------------------------------

Hmmm....I looked into this one some more.  Let me summarize what I found.

On trunk as it is today (without this patch), predicate pushdown does not (in general) get optimized when we have a nested select with a join.  For example:

{noformat}
explain
SELECT * FROM (
    SELECT a.foo as foo1, b.foo as foo2, a.bar
    FROM pokes a LEFT OUTER JOIN pokes2 b
    ON a.foo=b.foo) z
WHERE bar=3;

...

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        z:a 
          TableScan
            alias: a
            Reduce Output Operator
              key expressions:
                    expr: foo
                    type: int
              sort order: +
              Map-reduce partition columns:
                    expr: foo
                    type: int
              tag: 0
              value expressions:
                    expr: foo
                    type: int
                    expr: bar
                    type: string
        z:b 
          TableScan
            alias: b
            Reduce Output Operator
              key expressions:
                    expr: foo
                    type: int
              sort order: +
              Map-reduce partition columns:
                    expr: foo
                    type: int
              tag: 1
              value expressions:
                    expr: foo
                    type: int
      Reduce Operator Tree:
        Join Operator
          condition map:
               Left Outer Join0 to 1
          condition expressions:
            0 {VALUE._col0} {VALUE._col1}
            1 {VALUE._col0}
          handleSkewJoin: false
          outputColumnNames: _col0, _col1, _col2
          Select Operator
            expressions:
                  expr: _col0
                  type: int
                  expr: _col2
                  type: int
                  expr: _col1
                  type: string
            outputColumnNames: _col0, _col1, _col2
            Filter Operator
              predicate:
                  expr: (_col2 = 3)
                  type: boolean
              Select Operator
                expressions:
                      expr: _col0
                      type: int
                      expr: _col1
                      type: int
                      expr: _col2
                      type: string
                outputColumnNames: _col0, _col1, _col2
                File Output Operator
                  compressed: false
                  GlobalTableId: 0
                  table:
                      input format: org.apache.hadoop.mapred.TextInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  Stage: Stage-0
    Fetch Operator
      limit: -1

{noformat}

However, it does kick in (sometimes correctly, sometimes incorrectly) in the special case where aliases are reused.  For example, it happens to work correctly for a query like this:

{noformat}
explain
SELECT * FROM (
    SELECT a.foo as foo1, b.foo as foo2, a.bar
    FROM pokes a LEFT OUTER JOIN pokes2 b
    ON a.foo=b.foo) a
WHERE a.bar=3;
{noformat}

But in cases like the original ones in the bug reports, it gets applied incorrectly.

Ted's patch attempts to limit the damage by uniformly preventing the optimization from applying for the pattern of nested select over join (regardless of whether aliases have been reused).

If this is the best we can do for 0.6, then we'll have to live with that and then open another issue for correcting the real problem so that we can get full optimization (particularly for views).

I don't think it's a question of keeping the implementation simple; the patch as is does not fix the optimization, it just disables it.


> Predicate push down get error result when sub-queries have the same alias name 
> -------------------------------------------------------------------------------
>
>                 Key: HIVE-1342
>                 URL: https://issues.apache.org/jira/browse/HIVE-1342
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>    Affects Versions: 0.6.0
>            Reporter: Ted Xu
>            Assignee: Ted Xu
>            Priority: Critical
>             Fix For: 0.6.0
>
>         Attachments: cmd.hql, explain, ppd_same_alias_1.patch, ppd_same_alias_2.patch
>
>
> Query is over-optimized by PPD when sub-queries have the same alias name, see the query:
> -------------------------------
> create table if not exists dm_fact_buyer_prd_info_d (
> 		category_id string
> 		,gmv_trade_num  int
> 		,user_id    int
> 		)
> PARTITIONED BY (ds int);
> set hive.optimize.ppd=true;
> set hive.map.aggr=true;
> explain select category_id1,category_id2,assoc_idx
> from (
> 		select 
> 			category_id1
> 			, category_id2
> 			, count(distinct user_id) as assoc_idx
> 		from (
> 			select 
> 				t1.category_id as category_id1
> 				, t2.category_id as category_id2
> 				, t1.user_id
> 			from (
> 				select category_id, user_id
> 				from dm_fact_buyer_prd_info_d
> 				group by category_id, user_id ) t1
> 			join (
> 				select category_id, user_id
> 				from dm_fact_buyer_prd_info_d
> 				group by category_id, user_id ) t2 on t1.user_id=t2.user_id 
> 			) t1
> 			group by category_id1, category_id2 ) t_o
> 			where category_id1 <> category_id2
> 			and assoc_idx > 2;
> -----------------------------
> The query above will fail when execute, throwing exception: "can not cast UDFOpNotEqual(Text, IntWritable) to UDFOpNotEqual(Text, Text)". 
> I explained the query and the execute plan looks really wired ( only Stage-1, see the highlighted predicate):
> -------------------------------
> Stage: Stage-1
>     Map Reduce
>       Alias -> Map Operator Tree:
>         t_o:t1:t1:dm_fact_buyer_prd_info_d 
>           TableScan
>             alias: dm_fact_buyer_prd_info_d
>             Filter Operator
>               predicate:
>                   expr: *(category_id <> user_id)*
>                   type: boolean
>               Select Operator
>                 expressions:
>                       expr: category_id
>                       type: string
>                       expr: user_id
>                       type: bigint
>                 outputColumnNames: category_id, user_id
>                 Group By Operator
>                   keys:
>                         expr: category_id
>                         type: string
>                         expr: user_id
>                         type: bigint
>                   mode: hash
>                   outputColumnNames: _col0, _col1
>                   Reduce Output Operator
>                     key expressions:
>                           expr: _col0
>                           type: string
>                           expr: _col1
>                           type: bigint
>                     sort order: ++
>                     Map-reduce partition columns:
>                           expr: _col0
>                           type: string
>                           expr: _col1
>                           type: bigint
>                     tag: -1
>       Reduce Operator Tree:
>         Group By Operator
>           keys:
>                 expr: KEY._col0
>                 type: string
>                 expr: KEY._col1
>                 type: bigint
>           mode: mergepartial
>           outputColumnNames: _col0, _col1
>           Select Operator
>             expressions:
>                   expr: _col0
>                   type: string
>                   expr: _col1
>                   type: bigint
>             outputColumnNames: _col0, _col1
>             File Output Operator
>               compressed: true
>               GlobalTableId: 0
>               table:
>                   input format: org.apache.hadoop.mapred.SequenceFileInputFormat
>                   output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
>  ----------------------------------
> If disabling predicate push down (set hive.optimize.ppd=true), the error is gone; I tried disabling map side aggregate, the error is gone,too. 
> *Changing the alias of subquery 't1' (either the inner one or the join result), the bug disappears, too.*

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.