Posted to mapreduce-dev@hadoop.apache.org by Jeff Zhang <zj...@gmail.com> on 2009/11/22 16:42:25 UTC

Ideas for dynamic change reducer task number ?

Hi all,



During my work, I often run the same MapReduce job on data sets of
different sizes. The number of mapper tasks adjusts automatically to the
input data set, but I have to set a different reducer count for each data
set size.

I do not want to be bothered with that; in my opinion it is inconvenient
for users, and it also hurts the system's automation, because in an
automated system we cannot predict the size of the input data set. So I
think Hadoop should have a more intelligent mechanism that controls the
reducer count according to the input data.

Here I suggest adding a new interface named ReduceNumManager with a
method getReduceNum(InputFormat inputFormat); a code snippet follows (the
interface needs to be refined):



public interface ReduceNumManager {

    int getReduceNum(InputFormat inputFormat);

}



Users could set this class in JobConf via JobConf.setReduceNumManager,
and the JobClient would use it to determine the reducer count.

e.g. if the InputFormat is a FileInputFormat, we could have a
FileReduceNumManager that implements this interface and computes the
reducer count from the size of the input files.
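To make the idea concrete, here is a minimal standalone sketch of what the proposed interface and a file-size-based implementation could look like. Everything in it is hypothetical: the interface is the one proposed above (simplified to take a raw input size rather than an InputFormat, so the sketch stays self-contained), and the one-reducer-per-GB heuristic is just an assumed example policy, not anything that exists in Hadoop.

```java
// Hypothetical sketch of the proposed ReduceNumManager idea; not a Hadoop API.
public class ReduceNumDemo {

    // Proposed callback: map total input size (in bytes) to a reducer count.
    interface ReduceNumManager {
        int getReduceNum(long inputSizeBytes);
    }

    // Illustrative implementation: one reducer per GB of input, at least one.
    static class FileReduceNumManager implements ReduceNumManager {
        private static final long BYTES_PER_REDUCER = 1L << 30; // 1 GB, assumed heuristic

        public int getReduceNum(long inputSizeBytes) {
            // Round up so a partial gigabyte still gets its own reducer.
            long n = (inputSizeBytes + BYTES_PER_REDUCER - 1) / BYTES_PER_REDUCER;
            return (int) Math.max(1, n);
        }
    }

    public static void main(String[] args) {
        ReduceNumManager mgr = new FileReduceNumManager();
        System.out.println(mgr.getReduceNum(0L));             // empty input -> 1
        System.out.println(mgr.getReduceNum(5L << 30));       // 5 GB -> 5
        System.out.println(mgr.getReduceNum((5L << 30) + 1)); // just over 5 GB -> 6
    }
}
```

A real integration would presumably pass the InputFormat (or its computed splits) instead of a raw byte count, with JobClient calling getReduceNum before submitting the job.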



I think this work would benefit users, and Pig and Hive (I'm not sure)
might benefit as well, because it is inconvenient for users to set a
different reducer count each time the same script runs against a
different size of data set.

If we provide such a mechanism, they only need to supply their customized
implementation.



This is my initial idea; I look forward to hearing the experts' feedback.



Thank you



Jeff Zhang

Re: Ideas for dynamic change reducer task number ?

Posted by Jeff Zhang <zj...@gmail.com>.
Owen,

It works, thank you for your help.


Jeff Zhang



On Tue, Nov 24, 2009 at 8:36 AM, Jeff Zhang <zj...@gmail.com> wrote:

>
> You're right, I will try that.
>
> Thank you
>
>
> Jeff Zhang
>
>
>
> On Mon, Nov 23, 2009 at 9:19 AM, Owen O'Malley <om...@apache.org> wrote:
>
>>
>> On Nov 22, 2009, at 4:48 PM, Jeff Zhang wrote:
>>
>>> My concern is that using conf.setNumReduceTasks on the configuration is
>>> just like hard-coding. It is not flexible, so my idea is to add an
>>> interface that changes the reducer count dynamically according to the
>>> size of the input data set.
>>>
>>
>> You misunderstand. I meant doing something like:
>>
>> public class MyInputFormat ....
>>
>>  public InputSplit[] getSplits(JobConf conf) {
>>     InputSplit[] result = ...;
>>     // compute total size of input
>>     conf.setNumReduceTasks(max(6, size / 10G));
>>  }
>> }
>>
>> I haven't checked the code to make sure it will work, but I believe it
>> will.
>>
>> -- Owen
>>
>
>

Re: Ideas for dynamic change reducer task number ?

Posted by Jeff Zhang <zj...@gmail.com>.
You're right, I will try that.

Thank you


Jeff Zhang


On Mon, Nov 23, 2009 at 9:19 AM, Owen O'Malley <om...@apache.org> wrote:

>
> On Nov 22, 2009, at 4:48 PM, Jeff Zhang wrote:
>
>> My concern is that using conf.setNumReduceTasks on the configuration is
>> just like hard-coding. It is not flexible, so my idea is to add an
>> interface that changes the reducer count dynamically according to the
>> size of the input data set.
>>
>
> You misunderstand. I meant doing something like:
>
> public class MyInputFormat ....
>
>  public InputSplit[] getSplits(JobConf conf) {
>     InputSplit[] result = ...;
>     // compute total size of input
>     conf.setNumReduceTasks(max(6, size / 10G));
>  }
> }
>
> I haven't checked the code to make sure it will work, but I believe it
> will.
>
> -- Owen
>

Re: Ideas for dynamic change reducer task number ?

Posted by Owen O'Malley <om...@apache.org>.
On Nov 22, 2009, at 4:48 PM, Jeff Zhang wrote:

> My concern is that using conf.setNumReduceTasks on the configuration is
> just like hard-coding. It is not flexible, so my idea is to add an
> interface that changes the reducer count dynamically according to the
> size of the input data set.

You misunderstand. I meant doing something like:

public class MyInputFormat ....

   public InputSplit[] getSplits(JobConf conf) {
      InputSplit[] result = ...;
      // compute total size of input
      conf.setNumReduceTasks(max(6, size / 10G));
   }
}

I haven't checked the code to make sure it will work, but I believe it  
will.

-- Owen
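The arithmetic in the snippet above (a floor of 6 reducers, otherwise one per 10 GB of total input) can be isolated as a plain helper. This standalone version is only my illustration of that heuristic: the constants come from Owen's sketch, and in a real getSplits the total input size would be computed by summing the lengths of the splits.

```java
// Illustrative helper for the reducer-count heuristic in Owen's sketch.
public class ReducerHeuristic {
    private static final long TEN_GB = 10L << 30; // the "10G" divisor from the sketch

    // At least 6 reducers; beyond that, one per 10 GB of total input.
    static int reducersFor(long totalInputBytes) {
        return (int) Math.max(6, totalInputBytes / TEN_GB);
    }

    public static void main(String[] args) {
        System.out.println(reducersFor(1L << 30));   // small input -> floor of 6
        System.out.println(reducersFor(200L << 30)); // 200 GB -> 20
    }
}
```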

Re: Ideas for dynamic change reducer task number ?

Posted by Jeff Zhang <zj...@gmail.com>.
Owen,

My concern is that using conf.setNumReduceTasks on the configuration is
just like hard-coding. It is not flexible, so my idea is to add an
interface that changes the reducer count dynamically according to the
size of the input data set.

Jeff Zhang



On Sun, Nov 22, 2009 at 11:10 AM, Owen O'Malley <ow...@gmail.com> wrote:

> I'd suggest trying to do conf.setNumReduceTasks on the configuration passed
> to the InputFormat in getSplits. It will probably just work.
>
> -- Owen
>

Re: Ideas for dynamic change reducer task number ?

Posted by Owen O'Malley <ow...@gmail.com>.
I'd suggest trying to do conf.setNumReduceTasks on the configuration passed
to the InputFormat in getSplits. It will probably just work.

-- Owen

Re: Ideas for dynamic change reducer task number ?

Posted by Zhang Zhaoning <zz...@gmail.com>.
I think we have the same goal; could you please read MAPREDUCE-1226 for
some ideas, or give your comments?


2009/11/22 Jeff Zhang <zj...@gmail.com>

> Hi all,
>
>
>
> During my work, I often run the same MapReduce job on data sets of
> different sizes. The number of mapper tasks adjusts automatically to the
> input data set, but I have to set a different reducer count for each data
> set size.
>
> I do not want to be bothered with that; in my opinion it is inconvenient
> for users, and it also hurts the system's automation, because in an
> automated system we cannot predict the size of the input data set. So I
> think Hadoop should have a more intelligent mechanism that controls the
> reducer count according to the input data.
>
> Here I suggest adding a new interface named ReduceNumManager with a
> method getReduceNum(InputFormat inputFormat); a code snippet follows (the
> interface needs to be refined):
>
>
>
> public interface ReduceNumManager {
>
>     int getReduceNum(InputFormat inputFormat);
>
> }
>
>
>
> Users could set this class in JobConf via JobConf.setReduceNumManager,
> and the JobClient would use it to determine the reducer count.
>
> e.g. if the InputFormat is a FileInputFormat, we could have a
> FileReduceNumManager that implements this interface and computes the
> reducer count from the size of the input files.
>
>
>
> I think this work would benefit users, and Pig and Hive (I'm not sure)
> might benefit as well, because it is inconvenient for users to set a
> different reducer count each time the same script runs against a
> different size of data set.
>
> If we provide such a mechanism, they only need to supply their customized
> implementation.
>
>
>
> This is my initial idea; I look forward to hearing the experts' feedback.
>
>
>
> Thank you
>
>
>
> Jeff Zhang
>



-- 
Thank you!

Zhang Zhaoning
zzningxp