You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@hive.apache.org by "John Sichi (JIRA)" <ji...@apache.org> on 2010/09/16 23:49:33 UTC

[jira] Commented: (HIVE-1342) Predicate push down get error result when sub-queries have the same alias name

    [ https://issues.apache.org/jira/browse/HIVE-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910336#action_12910336 ] 

John Sichi commented on HIVE-1342:
----------------------------------

Finallly got back to this one.  Let me provide some specific examples to better explain what I wrote.

First, latest trunk without any patch.

{noformat}
-- Q1.trunk:  Without a nested select, the plan is correct for this query.
-- (we're not allowed to push filter down into null-generating side of outer join)
hive> explain
    > SELECT a.foo as foo1, b.foo as foo2, b.bar
    > FROM pokes a LEFT OUTER JOIN pokes2 b
    > ON a.foo=b.foo
    > WHERE b.bar=3;
OK
ABSTRACT SYNTAX TREE:
  (TOK_QUERY (TOK_FROM (TOK_LEFTOUTERJOIN (TOK_TABREF pokes a) (TOK_TABREF pokes2 b) (= (. (TOK_TABLE_OR_COL a) foo) (. (TOK_TABLE_OR_COL b) foo)))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) foo) foo1) (TOK_SELEXPR (. (TOK_TABLE_OR_COL b) foo) foo2) (TOK_SELEXPR (. (TOK_TABLE_OR_COL b) bar))) (TOK_WHERE (= (. (TOK_TABLE_OR_COL b) bar) 3))))

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        a 
          TableScan
            alias: a
            Reduce Output Operator
              key expressions:
                    expr: foo
                    type: int
              sort order: +
              Map-reduce partition columns:
                    expr: foo
                    type: int
              tag: 0
              value expressions:
                    expr: foo
                    type: int
        b 
          TableScan
            alias: b
            Reduce Output Operator
              key expressions:
                    expr: foo
                    type: int
              sort order: +
              Map-reduce partition columns:
                    expr: foo
                    type: int
              tag: 1
              value expressions:
                    expr: foo
                    type: int
                    expr: bar
                    type: string
      Reduce Operator Tree:
        Join Operator
          condition map:
               Left Outer Join0 to 1
          condition expressions:
            0 {VALUE._col0}
            1 {VALUE._col0} {VALUE._col1}
          handleSkewJoin: false
          outputColumnNames: _col0, _col4, _col5
          Filter Operator
            predicate:
                expr: (_col5 = 3)
                type: boolean
            Select Operator
              expressions:
                    expr: _col0
                    type: int
                    expr: _col4
                    type: int
                    expr: _col5
                    type: string
              outputColumnNames: _col0, _col1, _col2
              File Output Operator
                compressed: false
                GlobalTableId: 0
                table:
                    input format: org.apache.hadoop.mapred.TextInputFormat
                    output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  Stage: Stage-0
    Fetch Operator
      limit: -1

-- Q2.trunk:  For this equivalent query written using a nested select, the plan is incorrect.
-- (filter got pushed down when it shouldn't; note that in the wrapping select, a.bar should resolve to b.bar in the nested select)
hive> explain
    > SELECT * FROM
    >     (SELECT a.foo as foo1, b.foo as foo2, b.bar
    >     FROM pokes a LEFT OUTER JOIN pokes2 b
    >     ON a.foo=b.foo) a
    > WHERE a.bar=3;
OK
ABSTRACT SYNTAX TREE:
  (TOK_QUERY (TOK_FROM (TOK_SUBQUERY (TOK_QUERY (TOK_FROM (TOK_LEFTOUTERJOIN (TOK_TABREF pokes a) (TOK_TABREF pokes2 b) (= (. (TOK_TABLE_OR_COL a) foo) (. (TOK_TABLE_OR_COL b) foo)))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) foo) foo1) (TOK_SELEXPR (. (TOK_TABLE_OR_COL b) foo) foo2) (TOK_SELEXPR (. (TOK_TABLE_OR_COL b) bar))))) a)) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR TOK_ALLCOLREF)) (TOK_WHERE (= (. (TOK_TABLE_OR_COL a) bar) 3))))

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        a:a 
          TableScan
            alias: a
            Reduce Output Operator
              key expressions:
                    expr: foo
                    type: int
              sort order: +
              Map-reduce partition columns:
                    expr: foo
                    type: int
              tag: 0
              value expressions:
                    expr: foo
                    type: int
        a:b 
          TableScan
            alias: b
            Filter Operator
              predicate:
                  expr: (bar = 3)
                  type: boolean
              Reduce Output Operator
                key expressions:
                      expr: foo
                      type: int
                sort order: +
                Map-reduce partition columns:
                      expr: foo
                      type: int
                tag: 1
                value expressions:
                      expr: foo
                      type: int
                      expr: bar
                      type: string
      Reduce Operator Tree:
        Join Operator
          condition map:
               Left Outer Join0 to 1
          condition expressions:
            0 {VALUE._col0}
            1 {VALUE._col0} {VALUE._col1}
          handleSkewJoin: false
          outputColumnNames: _col0, _col4, _col5
          Select Operator
            expressions:
                  expr: _col0
                  type: int
                  expr: _col4
                  type: int
                  expr: _col5
                  type: string
            outputColumnNames: _col0, _col1, _col2
            Filter Operator
              predicate:
                  expr: (_col2 = 3)
                  type: boolean
              Select Operator
                expressions:
                      expr: _col0
                      type: int
                      expr: _col1
                      type: int
                      expr: _col2
                      type: string
                outputColumnNames: _col0, _col1, _col2
                File Output Operator
                  compressed: false
                  GlobalTableId: 0
                  table:
                      input format: org.apache.hadoop.mapred.TextInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  Stage: Stage-0
    Fetch Operator
      limit: -1

-- Q3.trunk:  However, for this semantically different case, the plan is correct.
-- (we're allowed to push the filter down for an inner join)
hive> 
    > explain
    > SELECT * FROM
    >     (SELECT a.foo as foo1, b.foo as foo2, a.bar
    >     FROM pokes a JOIN pokes2 b
    >     ON a.foo=b.foo) a
    > WHERE a.bar=3;
OK
ABSTRACT SYNTAX TREE:
  (TOK_QUERY (TOK_FROM (TOK_SUBQUERY (TOK_QUERY (TOK_FROM (TOK_JOIN (TOK_TABREF pokes a) (TOK_TABREF pokes2 b) (= (. (TOK_TABLE_OR_COL a) foo) (. (TOK_TABLE_OR_COL b) foo)))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) foo) foo1) (TOK_SELEXPR (. (TOK_TABLE_OR_COL b) foo) foo2) (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) bar))))) a)) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR TOK_ALLCOLREF)) (TOK_WHERE (= (. (TOK_TABLE_OR_COL a) bar) 3))))

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        a:a 
          TableScan
            alias: a
            Filter Operator
              predicate:
                  expr: (bar = 3)
                  type: boolean
              Reduce Output Operator
                key expressions:
                      expr: foo
                      type: int
                sort order: +
                Map-reduce partition columns:
                      expr: foo
                      type: int
                tag: 0
                value expressions:
                      expr: foo
                      type: int
                      expr: bar
                      type: string
        a:b 
          TableScan
            alias: b
            Reduce Output Operator
              key expressions:
                    expr: foo
                    type: int
              sort order: +
              Map-reduce partition columns:
                    expr: foo
                    type: int
              tag: 1
              value expressions:
                    expr: foo
                    type: int
      Reduce Operator Tree:
        Join Operator
          condition map:
               Inner Join 0 to 1
          condition expressions:
            0 {VALUE._col0} {VALUE._col1}
            1 {VALUE._col0}
          handleSkewJoin: false
          outputColumnNames: _col0, _col1, _col4
          Select Operator
            expressions:
                  expr: _col0
                  type: int
                  expr: _col4
                  type: int
                  expr: _col1
                  type: string
            outputColumnNames: _col0, _col1, _col2
            Filter Operator
              predicate:
                  expr: (_col2 = 3)
                  type: boolean
              Select Operator
                expressions:
                      expr: _col0
                      type: int
                      expr: _col1
                      type: int
                      expr: _col2
                      type: string
                outputColumnNames: _col0, _col1, _col2
                File Output Operator
                  compressed: false
                  GlobalTableId: 0
                  table:
                      input format: org.apache.hadoop.mapred.TextInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  Stage: Stage-0
    Fetch Operator
      limit: -1
{noformat}

Now, repeating Q1/Q2/Q3 with the patch:

{noformat}

-- Q1.patch:  this plan is good
-- (same result as Q1.trunk, as expected)
hive> explain
    > SELECT a.foo as foo1, b.foo as foo2, b.bar
    > FROM pokes a LEFT OUTER JOIN pokes2 b
    > ON a.foo=b.foo
    > WHERE b.bar=3;
OK
ABSTRACT SYNTAX TREE:
  (TOK_QUERY (TOK_FROM (TOK_LEFTOUTERJOIN (TOK_TABREF pokes a) (TOK_TABREF pokes2 b) (= (. (TOK_TABLE_OR_COL a) foo) (. (TOK_TABLE_OR_COL b) foo)))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) foo) foo1) (TOK_SELEXPR (. (TOK_TABLE_OR_COL b) foo) foo2) (TOK_SELEXPR (. (TOK_TABLE_OR_COL b) bar))) (TOK_WHERE (= (. (TOK_TABLE_OR_COL b) bar) 3))))

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        a 
          TableScan
            alias: a
            Reduce Output Operator
              key expressions:
                    expr: foo
                    type: int
              sort order: +
              Map-reduce partition columns:
                    expr: foo
                    type: int
              tag: 0
              value expressions:
                    expr: foo
                    type: int
        b 
          TableScan
            alias: b
            Reduce Output Operator
              key expressions:
                    expr: foo
                    type: int
              sort order: +
              Map-reduce partition columns:
                    expr: foo
                    type: int
              tag: 1
              value expressions:
                    expr: foo
                    type: int
                    expr: bar
                    type: string
      Reduce Operator Tree:
        Join Operator
          condition map:
               Left Outer Join0 to 1
          condition expressions:
            0 {VALUE._col0}
            1 {VALUE._col0} {VALUE._col1}
          handleSkewJoin: false
          outputColumnNames: _col0, _col4, _col5
          Filter Operator
            predicate:
                expr: (_col5 = 3)
                type: boolean
            Select Operator
              expressions:
                    expr: _col0
                    type: int
                    expr: _col4
                    type: int
                    expr: _col5
                    type: string
              outputColumnNames: _col0, _col1, _col2
              File Output Operator
                compressed: false
                GlobalTableId: 0
                table:
                    input format: org.apache.hadoop.mapred.TextInputFormat
                    output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  Stage: Stage-0
    Fetch Operator
      limit: -1

-- Q2.patch:  this time, the plan is good (no pushdown)
-- (the patch fixes the bug exhibited on trunk)
hive> explain
    > SELECT * FROM
    >     (SELECT a.foo as foo1, b.foo as foo2, b.bar
    >     FROM pokes a LEFT OUTER JOIN pokes2 b
    >     ON a.foo=b.foo) a
    > WHERE a.bar=3;
OK
ABSTRACT SYNTAX TREE:
  (TOK_QUERY (TOK_FROM (TOK_SUBQUERY (TOK_QUERY (TOK_FROM (TOK_LEFTOUTERJOIN (TOK_TABREF pokes a) (TOK_TABREF pokes2 b) (= (. (TOK_TABLE_OR_COL a) foo) (. (TOK_TABLE_OR_COL b) foo)))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) foo) foo1) (TOK_SELEXPR (. (TOK_TABLE_OR_COL b) foo) foo2) (TOK_SELEXPR (. (TOK_TABLE_OR_COL b) bar))))) a)) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR TOK_ALLCOLREF)) (TOK_WHERE (= (. (TOK_TABLE_OR_COL a) bar) 3))))

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        a:a 
          TableScan
            alias: a
            Reduce Output Operator
              key expressions:
                    expr: foo
                    type: int
              sort order: +
              Map-reduce partition columns:
                    expr: foo
                    type: int
              tag: 0
              value expressions:
                    expr: foo
                    type: int
        a:b 
          TableScan
            alias: b
            Reduce Output Operator
              key expressions:
                    expr: foo
                    type: int
              sort order: +
              Map-reduce partition columns:
                    expr: foo
                    type: int
              tag: 1
              value expressions:
                    expr: foo
                    type: int
                    expr: bar
                    type: string
      Reduce Operator Tree:
        Join Operator
          condition map:
               Left Outer Join0 to 1
          condition expressions:
            0 {VALUE._col0}
            1 {VALUE._col0} {VALUE._col1}
          handleSkewJoin: false
          outputColumnNames: _col0, _col4, _col5
          Select Operator
            expressions:
                  expr: _col0
                  type: int
                  expr: _col4
                  type: int
                  expr: _col5
                  type: string
            outputColumnNames: _col0, _col1, _col2
            Filter Operator
              predicate:
                  expr: (_col2 = 3)
                  type: boolean
              Select Operator
                expressions:
                      expr: _col0
                      type: int
                      expr: _col1
                      type: int
                      expr: _col2
                      type: string
                outputColumnNames: _col0, _col1, _col2
                File Output Operator
                  compressed: false
                  GlobalTableId: 0
                  table:
                      input format: org.apache.hadoop.mapred.TextInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  Stage: Stage-0
    Fetch Operator
      limit: -1

-- Q3.patch:  whoops, now the plan is valid but suboptimal since the filter pushdown did not happen
-- (whereas it did with trunk)
hive> explain
    > SELECT * FROM
    >     (SELECT a.foo as foo1, b.foo as foo2, a.bar
    >     FROM pokes a JOIN pokes2 b
    >     ON a.foo=b.foo) a
    > WHERE a.bar=3;
OK
ABSTRACT SYNTAX TREE:
  (TOK_QUERY (TOK_FROM (TOK_SUBQUERY (TOK_QUERY (TOK_FROM (TOK_JOIN (TOK_TABREF pokes a) (TOK_TABREF pokes2 b) (= (. (TOK_TABLE_OR_COL a) foo) (. (TOK_TABLE_OR_COL b) foo)))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) foo) foo1) (TOK_SELEXPR (. (TOK_TABLE_OR_COL b) foo) foo2) (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) bar))))) a)) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR TOK_ALLCOLREF)) (TOK_WHERE (= (. (TOK_TABLE_OR_COL a) bar) 3))))

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        a:a 
          TableScan
            alias: a
            Reduce Output Operator
              key expressions:
                    expr: foo
                    type: int
              sort order: +
              Map-reduce partition columns:
                    expr: foo
                    type: int
              tag: 0
              value expressions:
                    expr: foo
                    type: int
                    expr: bar
                    type: string
        a:b 
          TableScan
            alias: b
            Reduce Output Operator
              key expressions:
                    expr: foo
                    type: int
              sort order: +
              Map-reduce partition columns:
                    expr: foo
                    type: int
              tag: 1
              value expressions:
                    expr: foo
                    type: int
      Reduce Operator Tree:
        Join Operator
          condition map:
               Inner Join 0 to 1
          condition expressions:
            0 {VALUE._col0} {VALUE._col1}
            1 {VALUE._col0}
          handleSkewJoin: false
          outputColumnNames: _col0, _col1, _col4
          Select Operator
            expressions:
                  expr: _col0
                  type: int
                  expr: _col4
                  type: int
                  expr: _col1
                  type: string
            outputColumnNames: _col0, _col1, _col2
            Filter Operator
              predicate:
                  expr: (_col2 = 3)
                  type: boolean
              Select Operator
                expressions:
                      expr: _col0
                      type: int
                      expr: _col1
                      type: int
                      expr: _col2
                      type: string
                outputColumnNames: _col0, _col1, _col2
                File Output Operator
                  compressed: false
                  GlobalTableId: 0
                  table:
                      input format: org.apache.hadoop.mapred.TextInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  Stage: Stage-0
    Fetch Operator
      limit: -1
{noformat}

So, we need a patch which takes care of Q2 while not causing a plan optimality regression for Q3.


> Predicate push down get error result when sub-queries have the same alias name 
> -------------------------------------------------------------------------------
>
>                 Key: HIVE-1342
>                 URL: https://issues.apache.org/jira/browse/HIVE-1342
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>    Affects Versions: 0.6.0
>            Reporter: Ted Xu
>            Assignee: Ted Xu
>            Priority: Critical
>             Fix For: 0.7.0
>
>         Attachments: cmd.hql, explain, ppd_same_alias_1.patch, ppd_same_alias_2.patch
>
>
> Query is over-optimized by PPD when sub-queries have the same alias name, see the query:
> -------------------------------
> create table if not exists dm_fact_buyer_prd_info_d (
> 		category_id string
> 		,gmv_trade_num  int
> 		,user_id    int
> 		)
> PARTITIONED BY (ds int);
> set hive.optimize.ppd=true;
> set hive.map.aggr=true;
> explain select category_id1,category_id2,assoc_idx
> from (
> 		select 
> 			category_id1
> 			, category_id2
> 			, count(distinct user_id) as assoc_idx
> 		from (
> 			select 
> 				t1.category_id as category_id1
> 				, t2.category_id as category_id2
> 				, t1.user_id
> 			from (
> 				select category_id, user_id
> 				from dm_fact_buyer_prd_info_d
> 				group by category_id, user_id ) t1
> 			join (
> 				select category_id, user_id
> 				from dm_fact_buyer_prd_info_d
> 				group by category_id, user_id ) t2 on t1.user_id=t2.user_id 
> 			) t1
> 			group by category_id1, category_id2 ) t_o
> 			where category_id1 <> category_id2
> 			and assoc_idx > 2;
> -----------------------------
> The query above will fail when execute, throwing exception: "can not cast UDFOpNotEqual(Text, IntWritable) to UDFOpNotEqual(Text, Text)". 
> I explained the query and the execute plan looks really wired ( only Stage-1, see the highlighted predicate):
> -------------------------------
> Stage: Stage-1
>     Map Reduce
>       Alias -> Map Operator Tree:
>         t_o:t1:t1:dm_fact_buyer_prd_info_d 
>           TableScan
>             alias: dm_fact_buyer_prd_info_d
>             Filter Operator
>               predicate:
>                   expr: *(category_id <> user_id)*
>                   type: boolean
>               Select Operator
>                 expressions:
>                       expr: category_id
>                       type: string
>                       expr: user_id
>                       type: bigint
>                 outputColumnNames: category_id, user_id
>                 Group By Operator
>                   keys:
>                         expr: category_id
>                         type: string
>                         expr: user_id
>                         type: bigint
>                   mode: hash
>                   outputColumnNames: _col0, _col1
>                   Reduce Output Operator
>                     key expressions:
>                           expr: _col0
>                           type: string
>                           expr: _col1
>                           type: bigint
>                     sort order: ++
>                     Map-reduce partition columns:
>                           expr: _col0
>                           type: string
>                           expr: _col1
>                           type: bigint
>                     tag: -1
>       Reduce Operator Tree:
>         Group By Operator
>           keys:
>                 expr: KEY._col0
>                 type: string
>                 expr: KEY._col1
>                 type: bigint
>           mode: mergepartial
>           outputColumnNames: _col0, _col1
>           Select Operator
>             expressions:
>                   expr: _col0
>                   type: string
>                   expr: _col1
>                   type: bigint
>             outputColumnNames: _col0, _col1
>             File Output Operator
>               compressed: true
>               GlobalTableId: 0
>               table:
>                   input format: org.apache.hadoop.mapred.SequenceFileInputFormat
>                   output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
>  ----------------------------------
> If disabling predicate push down (set hive.optimize.ppd=true), the error is gone; I tried disabling map side aggregate, the error is gone,too. 
> *Changing the alias of subquery 't1' (either the inner one or the join result), the bug disappears, too.*

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.