You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@hive.apache.org by Navis Ryu <na...@nexr.com> on 2014/07/04 02:15:21 UTC

Review Request 23270: Wrong results when union all of grouping followed by group by with correlation optimization

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/23270/
-----------------------------------------------------------

Review request for hive.


Bugs: HIVE-7205
    https://issues.apache.org/jira/browse/HIVE-7205


Repository: hive-git


Description
-------

use case :

table TBL (a string,b string) contains single row : 'a','a'
the following query :
{code:sql}
select b, sum(cc) from (
        select b,count(1) as cc from TBL group by b
        union all
        select a as b,count(1) as cc from TBL group by a
) z
group by b
{code}
returns 

a 1
a 1
while set hive.optimize.correlation=true;

if we change set hive.optimize.correlation=false;
it returns correct results : a 2


The plan with correlation optimization :
{code:sql}
ABSTRACT SYNTAX TREE:
  (TOK_QUERY (TOK_FROM (TOK_SUBQUERY (TOK_UNION (TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME DB TBL))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL b)) (TOK_SELEXPR (TOK_FUNCTION count 1) cc)) (TOK_GROUPBY (TOK_TABLE_OR_COL b)))) (TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME DB TBL))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL a) b) (TOK_SELEXPR (TOK_FUNCTION count 1) cc)) (TOK_GROUPBY (TOK_TABLE_OR_COL a))))) z)) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL b)) (TOK_SELEXPR (TOK_FUNCTION sum (TOK_TABLE_OR_COL cc)))) (TOK_GROUPBY (TOK_TABLE_OR_COL b))))

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        null-subquery1:z-subquery1:TBL 
          TableScan
            alias: TBL
            Select Operator
              expressions:
                    expr: b
                    type: string
              outputColumnNames: b
              Group By Operator
                aggregations:
                      expr: count(1)
                bucketGroup: false
                keys:
                      expr: b
                      type: string
                mode: hash
                outputColumnNames: _col0, _col1
                Reduce Output Operator
                  key expressions:
                        expr: _col0
                        type: string
                  sort order: +
                  Map-reduce partition columns:
                        expr: _col0
                        type: string
                  tag: 0
                  value expressions:
                        expr: _col1
                        type: bigint
        null-subquery2:z-subquery2:TBL 
          TableScan
            alias: TBL
            Select Operator
              expressions:
                    expr: a
                    type: string
              outputColumnNames: a
              Group By Operator
                aggregations:
                      expr: count(1)
                bucketGroup: false
                keys:
                      expr: a
                      type: string
                mode: hash
                outputColumnNames: _col0, _col1
                Reduce Output Operator
                  key expressions:
                        expr: _col0
                        type: string
                  sort order: +
                  Map-reduce partition columns:
                        expr: _col0
                        type: string
                  tag: 1
                  value expressions:
                        expr: _col1
                        type: bigint
      Reduce Operator Tree:
        Demux Operator
          Group By Operator
            aggregations:
                  expr: count(VALUE._col0)
            bucketGroup: false
            keys:
                  expr: KEY._col0
                  type: string
            mode: mergepartial
            outputColumnNames: _col0, _col1
            Select Operator
              expressions:
                    expr: _col0
                    type: string
                    expr: _col1
                    type: bigint
              outputColumnNames: _col0, _col1
              Union
                Select Operator
                  expressions:
                        expr: _col0
                        type: string
                        expr: _col1
                        type: bigint
                  outputColumnNames: _col0, _col1
                  Mux Operator
                    Group By Operator
                      aggregations:
                            expr: sum(_col1)
                      bucketGroup: false
                      keys:
                            expr: _col0
                            type: string
                      mode: complete
                      outputColumnNames: _col0, _col1
                      Select Operator
                        expressions:
                              expr: _col0
                              type: string
                              expr: _col1
                              type: bigint
                        outputColumnNames: _col0, _col1
                        File Output Operator
                          compressed: false
                          GlobalTableId: 0
                          table:
                              input format: org.apache.hadoop.mapred.TextInputFormat
                              output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
          Group By Operator
            aggregations:
                  expr: count(VALUE._col0)
            bucketGroup: false
            keys:
                  expr: KEY._col0
                  type: string
            mode: mergepartial
            outputColumnNames: _col0, _col1
            Select Operator
              expressions:
                    expr: _col0
                    type: string
                    expr: _col1
                    type: bigint
              outputColumnNames: _col0, _col1
              Union
                Select Operator
                  expressions:
                        expr: _col0
                        type: string
                        expr: _col1
                        type: bigint
                  outputColumnNames: _col0, _col1
                  Mux Operator
                    Group By Operator
                      aggregations:
                            expr: sum(_col1)
                      bucketGroup: false
                      keys:
                            expr: _col0
                            type: string
                      mode: complete
                      outputColumnNames: _col0, _col1
                      Select Operator
                        expressions:
                              expr: _col0
                              type: string
                              expr: _col1
                              type: bigint
                        outputColumnNames: _col0, _col1
                        File Output Operator
                          compressed: false
                          GlobalTableId: 0
                          table:
                              input format: org.apache.hadoop.mapred.TextInputFormat
                              output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  Stage: Stage-0
    Fetch Operator
      limit: -1
{code}

Plan without correlation optimization :

{code:sql}
ABSTRACT SYNTAX TREE:
  (TOK_QUERY (TOK_FROM (TOK_SUBQUERY (TOK_UNION (TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME DB TBL))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL b)) (TOK_SELEXPR (TOK_FUNCTION count 1) cc)) (TOK_GROUPBY (TOK_TABLE_OR_COL b)))) (TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME DB TBL))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL a) b) (TOK_SELEXPR (TOK_FUNCTION count 1) cc)) (TOK_GROUPBY (TOK_TABLE_OR_COL a))))) z)) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL b)) (TOK_SELEXPR (TOK_FUNCTION sum (TOK_TABLE_OR_COL cc)))) (TOK_GROUPBY (TOK_TABLE_OR_COL b))))

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-2 depends on stages: Stage-1, Stage-3
  Stage-3 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        null-subquery2:z-subquery2:TBL 
          TableScan
            alias: TBL
            Select Operator
              expressions:
                    expr: a
                    type: string
              outputColumnNames: a
              Group By Operator
                aggregations:
                      expr: count(1)
                bucketGroup: false
                keys:
                      expr: a
                      type: string
                mode: hash
                outputColumnNames: _col0, _col1
                Reduce Output Operator
                  key expressions:
                        expr: _col0
                        type: string
                  sort order: +
                  Map-reduce partition columns:
                        expr: _col0
                        type: string
                  tag: -1
                  value expressions:
                        expr: _col1
                        type: bigint
      Reduce Operator Tree:
        Group By Operator
          aggregations:
                expr: count(VALUE._col0)
          bucketGroup: false
          keys:
                expr: KEY._col0
                type: string
          mode: mergepartial
          outputColumnNames: _col0, _col1
          Select Operator
            expressions:
                  expr: _col0
                  type: string
                  expr: _col1
                  type: bigint
            outputColumnNames: _col0, _col1
            File Output Operator
              compressed: false
              GlobalTableId: 0
              table:
                  input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat

  Stage: Stage-2
    Map Reduce
      Alias -> Map Operator Tree:
        maprfs:/user/hadoop/tmp/hive/hive_2014-06-10_15-42-37_188_2403850033480056671-5/-mr-10002 
          TableScan
            Union
              Select Operator
                expressions:
                      expr: _col0
                      type: string
                      expr: _col1
                      type: bigint
                outputColumnNames: _col0, _col1
                Group By Operator
                  aggregations:
                        expr: sum(_col1)
                  bucketGroup: false
                  keys:
                        expr: _col0
                        type: string
                  mode: hash
                  outputColumnNames: _col0, _col1
                  Reduce Output Operator
                    key expressions:
                          expr: _col0
                          type: string
                    sort order: +
                    Map-reduce partition columns:
                          expr: _col0
                          type: string
                    tag: -1
                    value expressions:
                          expr: _col1
                          type: bigint
        maprfs:/user/hadoop/tmp/hive/hive_2014-06-10_15-42-37_188_2403850033480056671-5/-mr-10003 
          TableScan
            Union
              Select Operator
                expressions:
                      expr: _col0
                      type: string
                      expr: _col1
                      type: bigint
                outputColumnNames: _col0, _col1
                Group By Operator
                  aggregations:
                        expr: sum(_col1)
                  bucketGroup: false
                  keys:
                        expr: _col0
                        type: string
                  mode: hash
                  outputColumnNames: _col0, _col1
                  Reduce Output Operator
                    key expressions:
                          expr: _col0
                          type: string
                    sort order: +
                    Map-reduce partition columns:
                          expr: _col0
                          type: string
                    tag: -1
                    value expressions:
                          expr: _col1
                          type: bigint
      Reduce Operator Tree:
        Group By Operator
          aggregations:
                expr: sum(VALUE._col0)
          bucketGroup: false
          keys:
                expr: KEY._col0
                type: string
          mode: mergepartial
          outputColumnNames: _col0, _col1
          Select Operator
            expressions:
                  expr: _col0
                  type: string
                  expr: _col1
                  type: bigint
            outputColumnNames: _col0, _col1
            File Output Operator
              compressed: false
              GlobalTableId: 0
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  Stage: Stage-3
    Map Reduce
      Alias -> Map Operator Tree:
        null-subquery1:z-subquery1:TBL 
          TableScan
            alias: TBL
            Select Operator
              expressions:
                    expr: b
                    type: string
              outputColumnNames: b
              Group By Operator
                aggregations:
                      expr: count(1)
                bucketGroup: false
                keys:
                      expr: b
                      type: string
                mode: hash
                outputColumnNames: _col0, _col1
                Reduce Output Operator
                  key expressions:
                        expr: _col0
                        type: string
                  sort order: +
                  Map-reduce partition columns:
                        expr: _col0
                        type: string
                  tag: -1
                  value expressions:
                        expr: _col1
                        type: bigint
      Reduce Operator Tree:
        Group By Operator
          aggregations:
                expr: count(VALUE._col0)
          bucketGroup: false
          keys:
                expr: KEY._col0
                type: string
          mode: mergepartial
          outputColumnNames: _col0, _col1
          Select Operator
            expressions:
                  expr: _col0
                  type: string
                  expr: _col1
                  type: bigint
            outputColumnNames: _col0, _col1
            File Output Operator
              compressed: false
              GlobalTableId: 0
              table:
                  input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat

  Stage: Stage-0
    Fetch Operator
      limit: -1
{code}


Diffs
-----

  ql/src/java/org/apache/hadoop/hive/ql/exec/CommonJoinOperator.java 03194a4 
  ql/src/java/org/apache/hadoop/hive/ql/exec/DemuxOperator.java 772dda6 
  ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java 1dde78e 
  ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java 792d87f 
  ql/src/java/org/apache/hadoop/hive/ql/exec/HashTableDummyOperator.java 91b2369 
  ql/src/java/org/apache/hadoop/hive/ql/exec/JoinOperator.java c747099 
  ql/src/java/org/apache/hadoop/hive/ql/exec/MapJoinOperator.java e877cd4 
  ql/src/java/org/apache/hadoop/hive/ql/exec/MuxOperator.java b10a7fa 
  ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java db94271 
  ql/src/java/org/apache/hadoop/hive/ql/exec/SkewJoinHandler.java a63466a 
  ql/src/java/org/apache/hadoop/hive/ql/exec/mr/ExecReducer.java 58d1638 
  ql/src/java/org/apache/hadoop/hive/ql/exec/tez/ReduceRecordProcessor.java e884afd 
  ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java c52f753 
  ql/src/test/org/apache/hadoop/hive/ql/exec/vector/util/FakeCaptureOutputOperator.java 43458d9 
  ql/src/test/queries/clientpositive/correlationoptimizer16.q PRE-CREATION 
  ql/src/test/results/clientpositive/correlationoptimizer16.q.out PRE-CREATION 

Diff: https://reviews.apache.org/r/23270/diff/


Testing
-------


Thanks,

Navis Ryu

Re: Review Request 23270: Wrong results when union all of grouping followed by group by with correlation optimization

Posted by Yin Huai <hu...@cse.ohio-state.edu>.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/23270/#review47707
-----------------------------------------------------------



ql/src/java/org/apache/hadoop/hive/ql/exec/CommonJoinOperator.java
<https://reviews.apache.org/r/23270/#comment83868>

    What does flush do?



ql/src/java/org/apache/hadoop/hive/ql/exec/DemuxOperator.java
<https://reviews.apache.org/r/23270/#comment83867>

    Why remove this method? The rows in a key group are sorted by tags. If we see a new tag, we can call end group for operators which have smaller tags. Also, the JoinOperator assumes that the rows are sorted by tags. I think we need this method to make sure for the optimized plan, JoinOperator still get rows sorted by tags (within a key group).



ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java
<https://reviews.apache.org/r/23270/#comment83873>

    Do we need this?



ql/src/java/org/apache/hadoop/hive/ql/exec/JoinOperator.java
<https://reviews.apache.org/r/23270/#comment83877>

    Seems the logic at here is used to check if we are processing the last alias of this JoinOperaotr and because endGroupIfNecessary is removed in this patch, rows within a key group may not sorted by tags. I am not sure if this is what we want because the behavior of the JoinOperator when we have an optimized plan may be different from a not optimized plan. I mean the right most table may not be the stream table for a plan optimized by the correlation optimizer.



ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java
<https://reviews.apache.org/r/23270/#comment83874>

    Do we need this?



ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java
<https://reviews.apache.org/r/23270/#comment83872>

    Do we need this?



ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java
<https://reviews.apache.org/r/23270/#comment83866>

    Seems we do not need this line, right?



ql/src/java/org/apache/hadoop/hive/ql/exec/mr/ExecReducer.java
<https://reviews.apache.org/r/23270/#comment83869>

    Do we need this?



ql/src/test/queries/clientpositive/correlationoptimizer16.q
<https://reviews.apache.org/r/23270/#comment83870>

    I think correlationoptimizer8 is for cases with UNION ALL. Can we add test queries in that file?


- Yin Huai


On July 4, 2014, 12:15 a.m., Navis Ryu wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/23270/
> -----------------------------------------------------------
> 
> (Updated July 4, 2014, 12:15 a.m.)
> 
> 
> Review request for hive.
> 
> 
> Bugs: HIVE-7205
>     https://issues.apache.org/jira/browse/HIVE-7205
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> use case :
> 
> table TBL (a string,b string) contains single row : 'a','a'
> the following query :
> {code:sql}
> select b, sum(cc) from (
>         select b,count(1) as cc from TBL group by b
>         union all
>         select a as b,count(1) as cc from TBL group by a
> ) z
> group by b
> {code}
> returns 
> 
> a 1
> a 1
> while set hive.optimize.correlation=true;
> 
> if we change set hive.optimize.correlation=false;
> it returns correct results : a 2
> 
> 
> The plan with correlation optimization :
> {code:sql}
> ABSTRACT SYNTAX TREE:
>   (TOK_QUERY (TOK_FROM (TOK_SUBQUERY (TOK_UNION (TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME DB TBL))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL b)) (TOK_SELEXPR (TOK_FUNCTION count 1) cc)) (TOK_GROUPBY (TOK_TABLE_OR_COL b)))) (TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME DB TBL))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL a) b) (TOK_SELEXPR (TOK_FUNCTION count 1) cc)) (TOK_GROUPBY (TOK_TABLE_OR_COL a))))) z)) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL b)) (TOK_SELEXPR (TOK_FUNCTION sum (TOK_TABLE_OR_COL cc)))) (TOK_GROUPBY (TOK_TABLE_OR_COL b))))
> 
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-0 is a root stage
> 
> STAGE PLANS:
>   Stage: Stage-1
>     Map Reduce
>       Alias -> Map Operator Tree:
>         null-subquery1:z-subquery1:TBL 
>           TableScan
>             alias: TBL
>             Select Operator
>               expressions:
>                     expr: b
>                     type: string
>               outputColumnNames: b
>               Group By Operator
>                 aggregations:
>                       expr: count(1)
>                 bucketGroup: false
>                 keys:
>                       expr: b
>                       type: string
>                 mode: hash
>                 outputColumnNames: _col0, _col1
>                 Reduce Output Operator
>                   key expressions:
>                         expr: _col0
>                         type: string
>                   sort order: +
>                   Map-reduce partition columns:
>                         expr: _col0
>                         type: string
>                   tag: 0
>                   value expressions:
>                         expr: _col1
>                         type: bigint
>         null-subquery2:z-subquery2:TBL 
>           TableScan
>             alias: TBL
>             Select Operator
>               expressions:
>                     expr: a
>                     type: string
>               outputColumnNames: a
>               Group By Operator
>                 aggregations:
>                       expr: count(1)
>                 bucketGroup: false
>                 keys:
>                       expr: a
>                       type: string
>                 mode: hash
>                 outputColumnNames: _col0, _col1
>                 Reduce Output Operator
>                   key expressions:
>                         expr: _col0
>                         type: string
>                   sort order: +
>                   Map-reduce partition columns:
>                         expr: _col0
>                         type: string
>                   tag: 1
>                   value expressions:
>                         expr: _col1
>                         type: bigint
>       Reduce Operator Tree:
>         Demux Operator
>           Group By Operator
>             aggregations:
>                   expr: count(VALUE._col0)
>             bucketGroup: false
>             keys:
>                   expr: KEY._col0
>                   type: string
>             mode: mergepartial
>             outputColumnNames: _col0, _col1
>             Select Operator
>               expressions:
>                     expr: _col0
>                     type: string
>                     expr: _col1
>                     type: bigint
>               outputColumnNames: _col0, _col1
>               Union
>                 Select Operator
>                   expressions:
>                         expr: _col0
>                         type: string
>                         expr: _col1
>                         type: bigint
>                   outputColumnNames: _col0, _col1
>                   Mux Operator
>                     Group By Operator
>                       aggregations:
>                             expr: sum(_col1)
>                       bucketGroup: false
>                       keys:
>                             expr: _col0
>                             type: string
>                       mode: complete
>                       outputColumnNames: _col0, _col1
>                       Select Operator
>                         expressions:
>                               expr: _col0
>                               type: string
>                               expr: _col1
>                               type: bigint
>                         outputColumnNames: _col0, _col1
>                         File Output Operator
>                           compressed: false
>                           GlobalTableId: 0
>                           table:
>                               input format: org.apache.hadoop.mapred.TextInputFormat
>                               output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>           Group By Operator
>             aggregations:
>                   expr: count(VALUE._col0)
>             bucketGroup: false
>             keys:
>                   expr: KEY._col0
>                   type: string
>             mode: mergepartial
>             outputColumnNames: _col0, _col1
>             Select Operator
>               expressions:
>                     expr: _col0
>                     type: string
>                     expr: _col1
>                     type: bigint
>               outputColumnNames: _col0, _col1
>               Union
>                 Select Operator
>                   expressions:
>                         expr: _col0
>                         type: string
>                         expr: _col1
>                         type: bigint
>                   outputColumnNames: _col0, _col1
>                   Mux Operator
>                     Group By Operator
>                       aggregations:
>                             expr: sum(_col1)
>                       bucketGroup: false
>                       keys:
>                             expr: _col0
>                             type: string
>                       mode: complete
>                       outputColumnNames: _col0, _col1
>                       Select Operator
>                         expressions:
>                               expr: _col0
>                               type: string
>                               expr: _col1
>                               type: bigint
>                         outputColumnNames: _col0, _col1
>                         File Output Operator
>                           compressed: false
>                           GlobalTableId: 0
>                           table:
>                               input format: org.apache.hadoop.mapred.TextInputFormat
>                               output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
> 
>   Stage: Stage-0
>     Fetch Operator
>       limit: -1
> {code}
> 
> Plan without correlation optimization :
> 
> {code:sql}
> ABSTRACT SYNTAX TREE:
>   (TOK_QUERY (TOK_FROM (TOK_SUBQUERY (TOK_UNION (TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME DB TBL))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL b)) (TOK_SELEXPR (TOK_FUNCTION count 1) cc)) (TOK_GROUPBY (TOK_TABLE_OR_COL b)))) (TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME DB TBL))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL a) b) (TOK_SELEXPR (TOK_FUNCTION count 1) cc)) (TOK_GROUPBY (TOK_TABLE_OR_COL a))))) z)) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL b)) (TOK_SELEXPR (TOK_FUNCTION sum (TOK_TABLE_OR_COL cc)))) (TOK_GROUPBY (TOK_TABLE_OR_COL b))))
> 
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-2 depends on stages: Stage-1, Stage-3
>   Stage-3 is a root stage
>   Stage-0 is a root stage
> 
> STAGE PLANS:
>   Stage: Stage-1
>     Map Reduce
>       Alias -> Map Operator Tree:
>         null-subquery2:z-subquery2:TBL 
>           TableScan
>             alias: TBL
>             Select Operator
>               expressions:
>                     expr: a
>                     type: string
>               outputColumnNames: a
>               Group By Operator
>                 aggregations:
>                       expr: count(1)
>                 bucketGroup: false
>                 keys:
>                       expr: a
>                       type: string
>                 mode: hash
>                 outputColumnNames: _col0, _col1
>                 Reduce Output Operator
>                   key expressions:
>                         expr: _col0
>                         type: string
>                   sort order: +
>                   Map-reduce partition columns:
>                         expr: _col0
>                         type: string
>                   tag: -1
>                   value expressions:
>                         expr: _col1
>                         type: bigint
>       Reduce Operator Tree:
>         Group By Operator
>           aggregations:
>                 expr: count(VALUE._col0)
>           bucketGroup: false
>           keys:
>                 expr: KEY._col0
>                 type: string
>           mode: mergepartial
>           outputColumnNames: _col0, _col1
>           Select Operator
>             expressions:
>                   expr: _col0
>                   type: string
>                   expr: _col1
>                   type: bigint
>             outputColumnNames: _col0, _col1
>             File Output Operator
>               compressed: false
>               GlobalTableId: 0
>               table:
>                   input format: org.apache.hadoop.mapred.SequenceFileInputFormat
>                   output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
> 
>   Stage: Stage-2
>     Map Reduce
>       Alias -> Map Operator Tree:
>         maprfs:/user/hadoop/tmp/hive/hive_2014-06-10_15-42-37_188_2403850033480056671-5/-mr-10002 
>           TableScan
>             Union
>               Select Operator
>                 expressions:
>                       expr: _col0
>                       type: string
>                       expr: _col1
>                       type: bigint
>                 outputColumnNames: _col0, _col1
>                 Group By Operator
>                   aggregations:
>                         expr: sum(_col1)
>                   bucketGroup: false
>                   keys:
>                         expr: _col0
>                         type: string
>                   mode: hash
>                   outputColumnNames: _col0, _col1
>                   Reduce Output Operator
>                     key expressions:
>                           expr: _col0
>                           type: string
>                     sort order: +
>                     Map-reduce partition columns:
>                           expr: _col0
>                           type: string
>                     tag: -1
>                     value expressions:
>                           expr: _col1
>                           type: bigint
>         maprfs:/user/hadoop/tmp/hive/hive_2014-06-10_15-42-37_188_2403850033480056671-5/-mr-10003 
>           TableScan
>             Union
>               Select Operator
>                 expressions:
>                       expr: _col0
>                       type: string
>                       expr: _col1
>                       type: bigint
>                 outputColumnNames: _col0, _col1
>                 Group By Operator
>                   aggregations:
>                         expr: sum(_col1)
>                   bucketGroup: false
>                   keys:
>                         expr: _col0
>                         type: string
>                   mode: hash
>                   outputColumnNames: _col0, _col1
>                   Reduce Output Operator
>                     key expressions:
>                           expr: _col0
>                           type: string
>                     sort order: +
>                     Map-reduce partition columns:
>                           expr: _col0
>                           type: string
>                     tag: -1
>                     value expressions:
>                           expr: _col1
>                           type: bigint
>       Reduce Operator Tree:
>         Group By Operator
>           aggregations:
>                 expr: sum(VALUE._col0)
>           bucketGroup: false
>           keys:
>                 expr: KEY._col0
>                 type: string
>           mode: mergepartial
>           outputColumnNames: _col0, _col1
>           Select Operator
>             expressions:
>                   expr: _col0
>                   type: string
>                   expr: _col1
>                   type: bigint
>             outputColumnNames: _col0, _col1
>             File Output Operator
>               compressed: false
>               GlobalTableId: 0
>               table:
>                   input format: org.apache.hadoop.mapred.TextInputFormat
>                   output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
> 
>   Stage: Stage-3
>     Map Reduce
>       Alias -> Map Operator Tree:
>         null-subquery1:z-subquery1:TBL 
>           TableScan
>             alias: TBL
>             Select Operator
>               expressions:
>                     expr: b
>                     type: string
>               outputColumnNames: b
>               Group By Operator
>                 aggregations:
>                       expr: count(1)
>                 bucketGroup: false
>                 keys:
>                       expr: b
>                       type: string
>                 mode: hash
>                 outputColumnNames: _col0, _col1
>                 Reduce Output Operator
>                   key expressions:
>                         expr: _col0
>                         type: string
>                   sort order: +
>                   Map-reduce partition columns:
>                         expr: _col0
>                         type: string
>                   tag: -1
>                   value expressions:
>                         expr: _col1
>                         type: bigint
>       Reduce Operator Tree:
>         Group By Operator
>           aggregations:
>                 expr: count(VALUE._col0)
>           bucketGroup: false
>           keys:
>                 expr: KEY._col0
>                 type: string
>           mode: mergepartial
>           outputColumnNames: _col0, _col1
>           Select Operator
>             expressions:
>                   expr: _col0
>                   type: string
>                   expr: _col1
>                   type: bigint
>             outputColumnNames: _col0, _col1
>             File Output Operator
>               compressed: false
>               GlobalTableId: 0
>               table:
>                   input format: org.apache.hadoop.mapred.SequenceFileInputFormat
>                   output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
> 
>   Stage: Stage-0
>     Fetch Operator
>       limit: -1
> {code}
> 
> 
> Diffs
> -----
> 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/CommonJoinOperator.java 03194a4 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/DemuxOperator.java 772dda6 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java 1dde78e 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java 792d87f 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/HashTableDummyOperator.java 91b2369 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/JoinOperator.java c747099 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/MapJoinOperator.java e877cd4 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/MuxOperator.java b10a7fa 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java db94271 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/SkewJoinHandler.java a63466a 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/mr/ExecReducer.java 58d1638 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/tez/ReduceRecordProcessor.java e884afd 
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java c52f753 
>   ql/src/test/org/apache/hadoop/hive/ql/exec/vector/util/FakeCaptureOutputOperator.java 43458d9 
>   ql/src/test/queries/clientpositive/correlationoptimizer16.q PRE-CREATION 
>   ql/src/test/results/clientpositive/correlationoptimizer16.q.out PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/23270/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Navis Ryu
> 
>