You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hive.apache.org by Richard <co...@163.com> on 2013/01/23 04:45:13 UTC
how may map-reduce needed in a hive query
I am wondering how to determine the number of map-reduce for a hive query.
for example, the following query
select
sum(c1),
sum(c2),
k1
from
{
select transform(*) using 'mymapper' as c1, c2, k1
from t1
} a group by k1;
when i run this query, it takes two map-reduce, but I expect it to take only 1.
in the map stage, using 'mymapper' as the mapper, then shuffle the mapper output by k1 and perform sum reduce in the reducer.
so why hive takes 2 map-reduce?
Re: how may map-reduce needed in a hive query
Posted by Nitin Pawar <ni...@gmail.com>.
you can run explain extended (your query) to get more details
On Wed, Jan 23, 2013 at 9:15 AM, Richard <co...@163.com> wrote:
> I am wondering how to determine the number of map-reduce for a hive query.
>
> for example, the following query
>
> select
> sum(c1),
> sum(c2),
> k1
> from
> {
> select transform(*) using 'mymapper' as c1, c2, k1
> from t1
> } a group by k1;
>
> when i run this query, it takes two map-reduce, but I expect it to take
> only 1.
> in the map stage, using 'mymapper' as the mapper, then shuffle the mapper
> output by k1 and perform sum reduce in the reducer.
>
> so why hive takes 2 map-reduce?
>
>
>
--
Nitin Pawar
Re: how may map-reduce needed in a hive query
Posted by Nitin Pawar <ni...@gmail.com>.
if you look closely in first phase it executes your transform and in second
it does your sum operation
On Wed, Jan 23, 2013 at 11:24 AM, Richard <co...@163.com> wrote:
> thanks. I used explain command and get the plan, but I am still confused.
> The below is the description of two map-reduce stages:
>
> it seems that in stage-1 the aggregation has already been done, why
> stage-2 has aggregation again?
>
>
> ==========================
> STAGE PLANS:
> Stage: Stage-1
> Map Reduce
> Alias -> Map Operator Tree:
> a:t1
> TableScan
> alias: t1
> Select Operator
> expressions:
> &nbs p; expr: f
> type: string
> outputColumnNames: _col0
> Transform Operator
> command: mymapper
> output info:
> input format: org.apache.hadoop.mapred.TextInputFormat
> output format:
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
> Select Operator
> expressions:
> expr: _col0
> type: string
> expr: _col1
> type: string
> expr: _col2
> &n bsp; type: string
> outputColumnNames: _col0, _col1, _col2
> Group By Operator
> aggregations:
> expr: sum(_col0)
> expr: sum(_col1)
>   ; bucketGroup: false
> keys:
> expr: _col2
> type: string
> mode: hash
> outputColumnNames: _col0, _col1, _col2
> Reduce Output Operator
>   ; key expressions:
> expr: _col0
> type: string
> sort order: +
> Map-reduce partition columns:
> expr: rand()
> &nb sp; type: double
> tag: -1
> value expressions:
> expr: _col1
> type: double
> expr: _col2
>   ; type: double
> Reduce Operator Tree:
> Group By Operator
> aggregations:
> expr: sum(VALUE._col0)
> expr: sum(VALUE._col1)
> bucketGroup: false
> keys:
> expr: KEY._col0
> type: string
> mode: partials
> &n bsp; outputColumnNames: _col0, _col1, _col2
> File Output Operator
> compressed: false
> GlobalTableId: 0
> table:
> input format:
> org.apache.hadoop.mapred.SequenceFileInputFormat
> output format:
> org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
>
> Stage: Stage-2
> Map Reduce
> Alias -> Map Operator Tree:
>
> hdfs://hdpnn:9000/mydata/hive/hive_2013-01-23_13-46-09_628_5487089660360786955/10002
> & nbsp; Reduce Output Operator
> key expressions:
> expr: _col0
> type: string
> sort order: +
> Map-reduce partition columns:
> expr: _col0
> type: string
> tag: -1
> &nbs p; value expressions:
> expr: _col1
> type: double
> expr: _col2
> type: double
> Reduce Operator Tree:
> Group By Operator
> aggregations:
> expr: sum(VALUE._col0)
> &nb sp; expr: sum(VALUE._col1)
> bucketGroup: false
> keys:
> expr: KEY._col0
> type: string
> mode: final
> outputColumnNames: _col0, _col1, _col2
> Select Operator
> expressions:
> expr: _col1
> type: double
> expr: _col2
> type: double
> expr: _col0
> type: string
> outputColumnNames: _col0, _col1, _col2
> File Output Operator
> compressed: false
> GlobalTableId: 0
> table:
>   ; input format:
> org.apache.hadoop.mapred.TextInputFormat
> output format:
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
> ============================
>
>
>
>
>
>
> At 2013-01-23 11:45:13,Richard <co...@163.com> wrote:
>
> I am wondering how to determine the number of map-reduce for a hive query.
>
> for example, the following query
>
> select
> sum(c1),
> sum(c2),
> k1
> from
> {
> select transform(*) using 'mymapper' as c1, c2, k1
> from t 1
> } a group by k1;
>
> when i run this query, it takes two map-reduce, but I expect it to take
> only 1.
> in the map stage, using 'mymapper' as the mapper, then shuffle the mapper
> output by k1 and perform sum reduce in the reducer.
>
> so why hive takes 2 map-reduce?
>
>
>
>
>
--
Nitin Pawar
Re:how may map-reduce needed in a hive query
Posted by Richard <co...@163.com>.
thanks. I used explain command and get the plan, but I am still confused.
The below is the description of two map-reduce stages:
it seems that in stage-1 the aggregation has already been done, why stage-2 has aggregation again?
==========================
STAGE PLANS:
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
a:t1
TableScan
alias: t1
Select Operator
expressions:
expr: f
type: string
outputColumnNames: _col0
Transform Operator
command: mymapper
output info:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Select Operator
expressions:
expr: _col0
type: string
expr: _col1
type: string
expr: _col2
type: string
outputColumnNames: _col0, _col1, _col2
Group By Operator
aggregations:
expr: sum(_col0)
expr: sum(_col1)
bucketGroup: false
keys:
expr: _col2
type: string
mode: hash
outputColumnNames: _col0, _col1, _col2
Reduce Output Operator
key expressions:
expr: _col0
type: string
sort order: +
Map-reduce partition columns:
expr: rand()
type: double
tag: -1
value expressions:
expr: _col1
type: double
expr: _col2
type: double
Reduce Operator Tree:
Group By Operator
aggregations:
expr: sum(VALUE._col0)
expr: sum(VALUE._col1)
bucketGroup: false
keys:
expr: KEY._col0
type: string
mode: partials
outputColumnNames: _col0, _col1, _col2
File Output Operator
compressed: false
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
Stage: Stage-2
Map Reduce
Alias -> Map Operator Tree:
hdfs://hdpnn:9000/mydata/hive/hive_2013-01-23_13-46-09_628_5487089660360786955/10002
Reduce Output Operator
key expressions:
expr: _col0
type: string
sort order: +
Map-reduce partition columns:
expr: _col0
type: string
tag: -1
value expressions:
expr: _col1
type: double
expr: _col2
type: double
Reduce Operator Tree:
Group By Operator
aggregations:
expr: sum(VALUE._col0)
expr: sum(VALUE._col1)
bucketGroup: false
keys:
expr: KEY._col0
type: string
mode: final
outputColumnNames: _col0, _col1, _col2
Select Operator
expressions:
expr: _col1
type: double
expr: _col2
type: double
expr: _col0
type: string
outputColumnNames: _col0, _col1, _col2
File Output Operator
compressed: false
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
============================
At 2013-01-23 11:45:13,Richard <co...@163.com> wrote:
I am wondering how to determine the number of map-reduce for a hive query.
for example, the following query
select
sum(c1),
sum(c2),
k1
from
{
select transform(*) using 'mymapper' as c1, c2, k1
from t1
} a group by k1;
when i run this query, it takes two map-reduce, but I expect it to take only 1.
in the map stage, using 'mymapper' as the mapper, then shuffle the mapper output by k1 and perform sum reduce in the reducer.
so why hive takes 2 map-reduce?