Posted to common-user@hadoop.apache.org by jason hadoop <ja...@gmail.com> on 2009/02/02 02:51:54 UTC

Re: How does Hadoop choose machines for Reducers?

mapred.tasktracker.reduce.tasks.maximum is a start-time-only parameter and
cannot be changed by setting it in the job conf.
The issue is that each TaskTracker reads it only once, at TaskTracker start
time.
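
For reference, a sketch of where the parameter has to live instead: the
hadoop-site.xml (or equivalent daemon config) on each TaskTracker node,
followed by a TaskTracker restart. The value of 1 here is just an example:

  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>1</value>
  </property>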


On Fri, Jan 30, 2009 at 11:08 AM, Nathan Marz <na...@rapleaf.com> wrote:

> This is a huge problem for my application. I tried setting
> mapred.tasktracker.reduce.tasks.maximum to 1 in the job's JobConf, but that
> didn't have any effect. I'm using a custom output format, and it's essential
> that Hadoop distribute the reduce tasks across all the machines, as there is
> contention when multiple reduce tasks run on one machine. Since my number of
> reduce tasks is guaranteed to be less than the number of machines in the
> cluster, there's no reason for Hadoop not to make use of the full cluster.
>
> Does anyone know of a way to force Hadoop to distribute reduce tasks evenly
> across all the machines?
>
>
>
> On Jan 30, 2009, at 7:32 AM, jason hadoop wrote:
>
>  Hadoop just distributes to the available reduce execution slots. I don't
>> believe it pays attention to which machine they are on.
>> I believe the plan is to take data locality into account in the future
>> (i.e., distribute tasks to machines that are topologically closer to their
>> input split first), but I don't think this is available to most users.
>>
>>
>> On Thu, Jan 29, 2009 at 7:05 PM, Nathan Marz <na...@rapleaf.com> wrote:
>>
>>  I have a MapReduce application in which I configure 16 reducers to run on
>>> 15 machines. My mappers output exactly 16 keys, IntWritables from 0 to 15.
>>> However, only 12 out of the 15 machines are used to run the 16 reducers
>>> (4 machines have 2 reducers running on each). Is there a way to get Hadoop
>>> to use all the machines for reducing?
>>>
>>>
>

Re: How does Hadoop choose machines for Reducers?

Posted by jason hadoop <ja...@gmail.com>.
The only way to force an even distribution is to ensure that the number of
reduces requested by the job is an exact multiple of the total number of
reduce slots available in your cluster
(mapred.tasktracker.reduce.tasks.maximum * Number_Of_Slaves).
Even then, if some of the reduces complete quickly, some TaskTrackers may end
up handling more reduces than others.
If mapred.tasktracker.reduce.tasks.maximum * Number_Of_Slaves == number of
reduces configured, and mapred.tasktracker.reduce.tasks.maximum == 1, you
will get one reduce per TaskTracker (almost always).
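
As an illustration, a minimal sketch of that arithmetic in an old-API job
driver. The concrete numbers and the MyJob class are hypothetical;
JobConf.setNumReduceTasks is the real call:

  import org.apache.hadoop.mapred.JobConf;

  // Hypothetical cluster: 15 TaskTrackers, each started with
  // mapred.tasktracker.reduce.tasks.maximum = 1.
  int maxReducesPerTracker = 1;
  int numSlaves = 15;
  int totalReduceSlots = maxReducesPerTracker * numSlaves;  // 15 slots

  JobConf conf = new JobConf(MyJob.class);
  // Request exactly one reduce per slot, so the JobTracker has no spare
  // slots with which to double up reduces on any one TaskTracker.
  conf.setNumReduceTasks(totalReduceSlots);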
