Posted to dev@hive.apache.org by "Charles Chen (JIRA)" <ji...@apache.org> on 2011/08/17 21:05:27 UTC
[jira] [Updated] (HIVE-2382) Invalid predicate pushdown from incorrect column expression map for select operator generated by GROUP BY operation

     [ https://issues.apache.org/jira/browse/HIVE-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Charles Chen updated HIVE-2382:
-------------------------------

    Description: 
When a GROUP BY is specified, a select operator is added before the GROUP BY in SemanticAnalyzer.insertSelectAllPlanForGroupBy.  Currently, the column expression map for this new operator is set to the column expression map of its parent operator.  This is incorrect: the parent operator may, for example, reorder its columns (_col0 => _col0, _col1 => _col2, _col2 => _col1), and the new operator, which passes its columns through unchanged, should not repeat that reordering.

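The predicate pushdown optimization uses the column expression map to track which columns a filter expression refers to at each operator.  Because the new select operator carries its parent's map, the parent's mapping is applied twice during pushdown, and the filter ends up on the wrong columns.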
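
To make the effect concrete, here is a toy model of the pushdown, written in Python rather than taken from Hive's code. It assumes each operator's column expression map can be represented as a dict from an output column name to the input column it comes from, and that the union passes columns through unchanged. The names push_down, inner_select, swap_select and identity_select are hypothetical and exist only for this illustration; the maps are modeled on the query in the example below.

```python
def push_down(column, col_expr_maps):
    """Rewrite a column name through each operator's map, top-most operator first."""
    for col_expr_map in col_expr_maps:
        column = col_expr_map[column]
    return column

# Innermost select over the table: "select bar, foo from invites".
inner_select = {"_col0": "bar", "_col1": "foo"}
# Outer select "select foo, bar from (...) b" swaps the two columns.
swap_select = {"_col0": "_col1", "_col1": "_col0"}
# The select inserted before the GROUP BY passes columns through unchanged,
# so its map should be the identity.
identity_select = {"_col0": "_col0", "_col1": "_col1"}

# "having bar = 1" refers to _col1 at the output of the inserted select.
# Correct: the inserted select uses its own (identity) map.
correct = push_down("_col1", [identity_select, swap_select, inner_select])
# Buggy: the inserted select reuses its parent's map, so the swap runs twice.
buggy = push_down("_col1", [swap_select, swap_select, inner_select])

print(correct, buggy)
```

In this model the identity map resolves the predicate column to bar, while the copied map resolves it to foo.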

Here is a simple case of this going wrong: Using
{noformat}
create table invites (id int, foo int, bar int);
{noformat}
executing the query
{noformat}
explain select * from (select foo, bar from (select bar, foo from invites c union all select bar, foo from invites d) b) a group by bar, foo having bar=1;
{noformat}
results in
{noformat}
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        a-subquery1:b-subquery1:c 
          TableScan
            alias: c
            Filter Operator
              predicate:
                  expr: (foo = 1)
                  type: boolean
              Select Operator
                expressions:
                      expr: bar
                      type: int
                      expr: foo
                      type: int
                outputColumnNames: _col0, _col1
                Union
                  Select Operator
                    expressions:
                          expr: _col1
                          type: int
                          expr: _col0
                          type: int
                    outputColumnNames: _col0, _col1
                    Select Operator
                      expressions:
                            expr: _col0
                            type: int
                            expr: _col1
                            type: int
                      outputColumnNames: _col0, _col1
                      Group By Operator
                        bucketGroup: false
                        keys:
                              expr: _col1
                              type: int
                              expr: _col0
                              type: int
                        mode: hash
                        outputColumnNames: _col0, _col1
                        Reduce Output Operator
                          key expressions:
                                expr: _col0
                                type: int
                                expr: _col1
                                type: int
                          sort order: ++
                          Map-reduce partition columns:
                                expr: _col0
                                type: int
                                expr: _col1
                                type: int
                          tag: -1
        a-subquery2:b-subquery2:d 
          TableScan
            alias: d
            Filter Operator
              predicate:
                  expr: (foo = 1)
                  type: boolean
              Select Operator
                expressions:
                      expr: bar
                      type: int
                      expr: foo
                      type: int
                outputColumnNames: _col0, _col1
                Union
                  Select Operator
                    expressions:
                          expr: _col1
                          type: int
                          expr: _col0
                          type: int
                    outputColumnNames: _col0, _col1
                    Select Operator
                      expressions:
                            expr: _col0
                            type: int
                            expr: _col1
                            type: int
                      outputColumnNames: _col0, _col1
                      Group By Operator
                        bucketGroup: false
                        keys:
                              expr: _col1
                              type: int
                              expr: _col0
                              type: int
                        mode: hash
                        outputColumnNames: _col0, _col1
                        Reduce Output Operator
                          key expressions:
                                expr: _col0
                                type: int
                                expr: _col1
                                type: int
                          sort order: ++
                          Map-reduce partition columns:
                                expr: _col0
                                type: int
                                expr: _col1
                                type: int
                          tag: -1
      Reduce Operator Tree:
        Group By Operator
          bucketGroup: false
          keys:
                expr: KEY._col0
                type: int
                expr: KEY._col1
                type: int
          mode: mergepartial
          outputColumnNames: _col0, _col1
          Select Operator
            expressions:
                  expr: _col0
                  type: int
                  expr: _col1
                  type: int
            outputColumnNames: _col0, _col1
            File Output Operator
              compressed: false
              GlobalTableId: 0
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  Stage: Stage-0
    Fetch Operator
      limit: -1
{noformat}
Note that the filter is now "foo = 1", while the correct behavior is to have "bar = 1".  If we remove the group by, the behavior is correct.

  was:
When a GROUP BY is specified, a select operator is added before the GROUP BY in SemanticAnalyzer.insertSelectAllPlanForGroupBy.  Currently, the column expression map for this is set to the column expression map for the parent operator.  This behavior is incorrect as, for example, the parent operator could rearrange the order of the columns (_col0 => _col0, _col1 => _col2, _col2 => _col1) and the new operator should not repeat this.

The predicate pushdown optimization uses the column expression map to track which columns a filter expression refers to at different operators.  This results in a filter on incorrect columns.

Here is a simple case of this going wrong: Using
{noformat}
create table invites (id int, foo int, bar int)
{noformat}
executing the query
{noformat}
explain select * from (select foo, bar from (select bar, foo from invites c union all select bar, foo from invites d) b) a group by bar, foo having bar=1;
{noformat}
results in
{noformat}
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        a-subquery1:b-subquery1:c 
          TableScan
            alias: c
            Filter Operator
              predicate:
                  expr: (foo = 1)
                  type: boolean
              Select Operator
                expressions:
                      expr: bar
                      type: int
                      expr: foo
                      type: int
                outputColumnNames: _col0, _col1
                Union
                  Select Operator
                    expressions:
                          expr: _col1
                          type: int
                          expr: _col0
                          type: int
                    outputColumnNames: _col0, _col1
                    Select Operator
                      expressions:
                            expr: _col0
                            type: int
                            expr: _col1
                            type: int
                      outputColumnNames: _col0, _col1
                      Group By Operator
                        bucketGroup: false
                        keys:
                              expr: _col1
                              type: int
                              expr: _col0
                              type: int
                        mode: hash
                        outputColumnNames: _col0, _col1
                        Reduce Output Operator
                          key expressions:
                                expr: _col0
                                type: int
                                expr: _col1
                                type: int
                          sort order: ++
                          Map-reduce partition columns:
                                expr: _col0
                                type: int
                                expr: _col1
                                type: int
                          tag: -1
        a-subquery2:b-subquery2:d 
          TableScan
            alias: d
            Filter Operator
              predicate:
                  expr: (foo = 1)
                  type: boolean
              Select Operator
                expressions:
                      expr: bar
                      type: int
                      expr: foo
                      type: int
                outputColumnNames: _col0, _col1
                Union
                  Select Operator
                    expressions:
                          expr: _col1
                          type: int
                          expr: _col0
                          type: int
                    outputColumnNames: _col0, _col1
                    Select Operator
                      expressions:
                            expr: _col0
                            type: int
                            expr: _col1
                            type: int
                      outputColumnNames: _col0, _col1
                      Group By Operator
                        bucketGroup: false
                        keys:
                              expr: _col1
                              type: int
                              expr: _col0
                              type: int
                        mode: hash
                        outputColumnNames: _col0, _col1
                        Reduce Output Operator
                          key expressions:
                                expr: _col0
                                type: int
                                expr: _col1
                                type: int
                          sort order: ++
                          Map-reduce partition columns:
                                expr: _col0
                                type: int
                                expr: _col1
                                type: int
                          tag: -1
      Reduce Operator Tree:
        Group By Operator
          bucketGroup: false
          keys:
                expr: KEY._col0
                type: int
                expr: KEY._col1
                type: int
          mode: mergepartial
          outputColumnNames: _col0, _col1
          Select Operator
            expressions:
                  expr: _col0
                  type: int
                  expr: _col1
                  type: int
            outputColumnNames: _col0, _col1
            File Output Operator
              compressed: false
              GlobalTableId: 0
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  Stage: Stage-0
    Fetch Operator
      limit: -1
{noformat}
Note that the filter is now "foo = 1", while the correct behavior is to have "bar = 1".


> Invalid predicate pushdown from incorrect column expression map for select operator generated by GROUP BY operation
> -------------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-2382
>                 URL: https://issues.apache.org/jira/browse/HIVE-2382
>             Project: Hive
>          Issue Type: Bug
>          Components: Query Processor
>    Affects Versions: 0.6.0
>            Reporter: Charles Chen
>            Assignee: Charles Chen
>            Priority: Critical
