Posted to user@hive.apache.org by Richard <co...@163.com> on 2013/01/23 04:45:13 UTC

how many map-reduce jobs are needed in a hive query

I am wondering how to determine the number of map-reduce jobs a Hive query needs.


For example, consider the following query:


select
  sum(c1),
  sum(c2),
  k1
from (
  select transform(*) using 'mymapper' as c1, c2, k1
  from t1
) a
group by k1;


When I run this query, it takes two map-reduce jobs, but I expected it to take only one:
in the map stage, use 'mymapper' as the mapper, then shuffle the mapper output by k1 and perform the sums in the reducer.


So why does Hive take two map-reduce jobs?
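The expected single-job flow can be sketched in Python. This is a toy model, not Hive's actual code, and the mapper is a hypothetical stand-in for the external 'mymapper' script:

```python
# Sketch of the single map-reduce job the query is expected to need:
# map rows through a 'mymapper' stand-in, shuffle on k1, sum in the reducer.
from collections import defaultdict

def mymapper(row):
    # hypothetical stand-in for the external transform script: emits (c1, c2, k1)
    return row[0], row[1], row[2]

def single_job(rows):
    shuffled = defaultdict(list)            # shuffle phase, keyed by k1
    for row in rows:
        c1, c2, k1 = mymapper(row)
        shuffled[k1].append((c1, c2))
    # reduce phase: per-key sums of c1 and c2
    return {k1: (sum(c1 for c1, _ in vals), sum(c2 for _, c2 in vals))
            for k1, vals in shuffled.items()}

print(single_job([(1, 2, "x"), (3, 4, "x"), (5, 6, "y")]))
# -> {'x': (4, 6), 'y': (5, 6)}
```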

Re: how many map-reduce jobs are needed in a hive query

Posted by Nitin Pawar <ni...@gmail.com>.
You can run EXPLAIN EXTENDED on your query to get more details.




On Wed, Jan 23, 2013 at 9:15 AM, Richard <co...@163.com> wrote:

> I am wondering how to determine the number of map-reduce jobs a Hive query needs.
>
> For example, consider the following query:
>
> select
>   sum(c1),
>   sum(c2),
>   k1
> from (
>   select transform(*) using 'mymapper' as c1, c2, k1
>   from t1
> ) a
> group by k1;
>
> When I run this query, it takes two map-reduce jobs, but I expected it to take
> only one: in the map stage, use 'mymapper' as the mapper, then shuffle the
> mapper output by k1 and perform the sums in the reducer.
>
> So why does Hive take two map-reduce jobs?
>
>
>


-- 
Nitin Pawar

Re: how many map-reduce jobs are needed in a hive query

Posted by Nitin Pawar <ni...@gmail.com>.
If you look closely: the first phase executes your transform, and the second
does your sum operation.
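That two-job shape can be sketched in Python. This is a toy model, not Hive internals: it imitates the rand() partitioning and the partials/final Group By modes visible in the plan (a pattern Hive produces when the skew-handling setting hive.groupby.skewindata is enabled):

```python
# Illustrative two-job flow matching the plan: job 1 spreads per-key partial
# sums across reducers chosen at random (mode: partials); job 2 shuffles
# those partials on k1 and merges them (mode: final).
import random
from collections import defaultdict

def job1(map_splits):
    partials = []                       # (k1, partial_c1, partial_c2) records
    for split in map_splits:            # each map task aggregates its own split
        agg = defaultdict(lambda: [0, 0])
        for c1, c2, k1 in split:
            agg[k1][0] += c1
            agg[k1][1] += c2
        partials.extend((k1, s[0], s[1]) for k1, s in agg.items())
    random.shuffle(partials)            # stand-in for partitioning on rand()
    return partials

def job2(partials):
    final = defaultdict(lambda: [0, 0])  # shuffle on k1, merge the partials
    for k1, s1, s2 in partials:
        final[k1][0] += s1
        final[k1][1] += s2
    return {k1: tuple(s) for k1, s in final.items()}

splits = [[(1, 2, "x"), (5, 6, "y")], [(3, 4, "x")]]
print(job2(job1(splits)))  # per-key totals, e.g. 'x' -> (4, 6)
```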


On Wed, Jan 23, 2013 at 11:24 AM, Richard <co...@163.com> wrote:

> Thanks. I used the EXPLAIN command and got the plan, but I am still confused.
> Below is the description of the two map-reduce stages:
>
> It seems that the aggregation has already been done in Stage-1, so why does
> Stage-2 aggregate again?
>
>
> ==========================
> STAGE PLANS:
>   Stage: Stage-1
>     Map Reduce
>       Alias -> Map Operator Tree:
>         a:t1
>           TableScan
>             alias: t1
>             Select Operator
>               expressions:
>                     expr: f
>                     type: string
>               outputColumnNames: _col0
>               Transform Operator
>                 command: mymapper
>                 output info:
>                     input format: org.apache.hadoop.mapred.TextInputFormat
>                     output format:
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>                 Select Operator
>                   expressions:
>                         expr: _col0
>                         type: string
>                         expr: _col1
>                         type: string
>                         expr: _col2
>                         type: string
>                   outputColumnNames: _col0, _col1, _col2
>                   Group By Operator
>                     aggregations:
>                           expr: sum(_col0)
>                           expr: sum(_col1)
>                     bucketGroup: false
>                     keys:
>                           expr: _col2
>                           type: string
>                     mode: hash
>                     outputColumnNames: _col0, _col1, _col2
>                     Reduce Output Operator
>                       key expressions:
>                             expr: _col0
>                             type: string
>                       sort order: +
>                       Map-reduce partition columns:
>                             expr: rand()
>                             type: double
>                       tag: -1
>                       value expressions:
>                             expr: _col1
>                             type: double
>                             expr: _col2
>                             type: double
>       Reduce Operator Tree:
>         Group By Operator
>           aggregations:
>                 expr: sum(VALUE._col0)
>                 expr: sum(VALUE._col1)
>           bucketGroup: false
>           keys:
>                 expr: KEY._col0
>                 type: string
>           mode: partials
>           outputColumnNames: _col0, _col1, _col2
>           File Output Operator
>             compressed: false
>             GlobalTableId: 0
>             table:
>                 input format:
> org.apache.hadoop.mapred.SequenceFileInputFormat
>                 output format:
> org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
>
>   Stage: Stage-2
>     Map Reduce
>       Alias -> Map Operator Tree:
>
> hdfs://hdpnn:9000/mydata/hive/hive_2013-01-23_13-46-09_628_5487089660360786955/10002
>             Reduce Output Operator
>               key expressions:
>                     expr: _col0
>                     type: string
>               sort order: +
>               Map-reduce partition columns:
>                     expr: _col0
>                     type: string
>               tag: -1
>               value expressions:
>                     expr: _col1
>                     type: double
>                     expr: _col2
>                     type: double
>       Reduce Operator Tree:
>         Group By Operator
>           aggregations:
>                 expr: sum(VALUE._col0)
>                 expr: sum(VALUE._col1)
>           bucketGroup: false
>           keys:
>                 expr: KEY._col0
>                 type: string
>           mode: final
>           outputColumnNames: _col0, _col1, _col2
>           Select Operator
>             expressions:
>                   expr: _col1
>                   type: double
>                   expr: _col2
>                   type: double
>                   expr: _col0
>                   type: string
>             outputColumnNames: _col0, _col1, _col2
>             File Output Operator
>               compressed: false
>               GlobalTableId: 0
>               table:
>                   input format:
> org.apache.hadoop.mapred.TextInputFormat
>                   output format:
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
> ============================
>
>
>
>
>
>
> At 2013-01-23 11:45:13, Richard <co...@163.com> wrote:
>
> I am wondering how to determine the number of map-reduce jobs a Hive query needs.
>
> For example, consider the following query:
>
> select
>   sum(c1),
>   sum(c2),
>   k1
> from (
>   select transform(*) using 'mymapper' as c1, c2, k1
>   from t1
> ) a
> group by k1;
>
> When I run this query, it takes two map-reduce jobs, but I expected it to take
> only one: in the map stage, use 'mymapper' as the mapper, then shuffle the
> mapper output by k1 and perform the sums in the reducer.
>
> So why does Hive take two map-reduce jobs?
>
>
>
>
>


-- 
Nitin Pawar

Re: how many map-reduce jobs are needed in a hive query

Posted by Richard <co...@163.com>.
Thanks. I used the EXPLAIN command and got the plan, but I am still confused.
Below is the description of the two map-reduce stages.


It seems that the aggregation has already been done in Stage-1, so why does Stage-2 aggregate again?




==========================
STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        a:t1 
          TableScan
            alias: t1
            Select Operator
              expressions:
                    expr: f
                    type: string
              outputColumnNames: _col0
              Transform Operator
                command: mymapper
                output info:
                    input format: org.apache.hadoop.mapred.TextInputFormat
                    output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                Select Operator
                  expressions:
                        expr: _col0
                        type: string
                        expr: _col1
                        type: string
                        expr: _col2
                        type: string
                  outputColumnNames: _col0, _col1, _col2
                  Group By Operator
                    aggregations:
                          expr: sum(_col0)
                          expr: sum(_col1)
                    bucketGroup: false
                    keys:
                          expr: _col2
                          type: string
                    mode: hash
                    outputColumnNames: _col0, _col1, _col2
                    Reduce Output Operator
                      key expressions:
                            expr: _col0
                            type: string
                      sort order: +
                      Map-reduce partition columns:
                            expr: rand()
                            type: double
                      tag: -1
                      value expressions:
                            expr: _col1
                            type: double
                            expr: _col2
                            type: double
      Reduce Operator Tree:
        Group By Operator
          aggregations:
                expr: sum(VALUE._col0)
                expr: sum(VALUE._col1)
          bucketGroup: false
          keys:
                expr: KEY._col0
                type: string
          mode: partials
          outputColumnNames: _col0, _col1, _col2
          File Output Operator
            compressed: false
            GlobalTableId: 0
            table:
                input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat

  Stage: Stage-2
    Map Reduce
      Alias -> Map Operator Tree:
        hdfs://hdpnn:9000/mydata/hive/hive_2013-01-23_13-46-09_628_5487089660360786955/10002 
            Reduce Output Operator
              key expressions:
                    expr: _col0
                    type: string
              sort order: +
              Map-reduce partition columns:
                    expr: _col0
                    type: string
              tag: -1
              value expressions:
                    expr: _col1
                    type: double
                    expr: _col2
                    type: double
      Reduce Operator Tree:
        Group By Operator
          aggregations:
                expr: sum(VALUE._col0)
                expr: sum(VALUE._col1)
          bucketGroup: false
          keys:
                expr: KEY._col0
                type: string
          mode: final
          outputColumnNames: _col0, _col1, _col2
          Select Operator
            expressions:
                  expr: _col1
                  type: double
                  expr: _col2
                  type: double
                  expr: _col0
                  type: string
            outputColumnNames: _col0, _col1, _col2
            File Output Operator
              compressed: false
              GlobalTableId: 0
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

============================
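The repeated aggregation in the plan above is safe because sum is decomposable: Stage-1 emits per-reducer partial sums (mode: partials) and Stage-2 merges them (mode: final). A tiny Python illustration of that property:

```python
# Why aggregating twice is correct: summing per-reducer partial sums
# (Stage-1, mode: partials) and then summing those partials again
# (Stage-2, mode: final) equals a single global sum.
data = [1, 2, 3, 4, 5, 6]
stage1_partials = [sum(data[:3]), sum(data[3:])]  # two reducers' partial sums
stage2_final = sum(stage1_partials)               # final merge
assert stage2_final == sum(data)
print(stage2_final)  # 21
```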








At 2013-01-23 11:45:13, Richard <co...@163.com> wrote:

I am wondering how to determine the number of map-reduce jobs a Hive query needs.


For example, consider the following query:


select
  sum(c1),
  sum(c2),
  k1
from (
  select transform(*) using 'mymapper' as c1, c2, k1
  from t1
) a
group by k1;


When I run this query, it takes two map-reduce jobs, but I expected it to take only one:
in the map stage, use 'mymapper' as the mapper, then shuffle the mapper output by k1 and perform the sums in the reducer.


So why does Hive take two map-reduce jobs?