Posted to mapreduce-user@hadoop.apache.org by Shai Erera <se...@gmail.com> on 2010/11/25 19:35:53 UTC

Control the number of Mappers

Hi

Is there a way to make MapReduce create exactly N Mappers? More
specifically, if, say, my data can be split into 200 Mappers and I have only
100 cores, how can I ensure only 100 Mappers will be created? The number of
cores is not something I know in advance, so writing a special InputFormat
might be tricky, unless I can query Hadoop for the available # of cores (in
the entire cluster).

Thanks
Shai
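[Editorial note: to make the sizing question concrete, here is a sketch of the arithmetic involved. If the framework derives the number of map tasks from the split size, then getting at most N mappers means choosing a split size of at least ceil(totalInputBytes / N). All numbers below are hypothetical illustrations, not from the thread.]

```java
// Sketch of the split-size arithmetic behind "exactly N mappers".
// Hadoop itself derives the split size from settings such as
// dfs.block.size and mapred.min.split.size; this only shows the math.
public class SplitMath {
    // Smallest split size that yields at most n splits for the given input.
    static long splitSizeFor(long totalInputBytes, int n) {
        return (totalInputBytes + n - 1) / n; // ceiling division
    }

    // Number of splits (hence map tasks) produced by a given split size.
    static int numSplits(long totalInputBytes, long splitSize) {
        return (int) ((totalInputBytes + splitSize - 1) / splitSize);
    }

    public static void main(String[] args) {
        long total = 200L * 64 * 1024 * 1024; // e.g. 200 blocks of 64 MB
        long size = splitSizeFor(total, 100);
        System.out.println(numSplits(total, size)); // 100 splits -> 100 mappers
    }
}
```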

Re: Control the number of Mappers

Posted by Niels Basjes <Ni...@basjes.nl>.
Ah,

In that case this should answer your question:
http://wiki.apache.org/hadoop/HowManyMapsAndReduces


2010/11/25 Shai Erera <se...@gmail.com>:
> I wasn't talking about how to configure the cluster to not invoke more than
> a certain # of Mappers simultaneously. Instead, I'd like to configure a
> (certain) job to invoke exactly N Mappers, where N is the number of cores in
> the cluster, regardless of the size of the data. This is not critical if
> it can't be done, but it can improve the performance of my job if it can be
> done.
>
> Thanks
> Shai
>
> On Thu, Nov 25, 2010 at 9:55 PM, Niels Basjes <Ni...@basjes.nl> wrote:
>>
>> Hi,
>>
>> 2010/11/25 Shai Erera <se...@gmail.com>:
>> > Is there a way to make MapReduce create exactly N Mappers? More
>> > specifically, if say my data can be split to 200 Mappers, and I have
>> > only
>> > 100 cores, how can I ensure only 100 Mappers will be created? The number
>> > of
>> > cores is not something I know in advance, so writing a special
>> > InputFormat
>> > might be tricky, unless I can query Hadoop for the available # of cores
>> > (in
>> > the entire cluster).
>>
>> You can configure on a node by node basis how many map and reduce
>> tasks can be started by the task tracker on that node.
>> This is done via the conf/mapred-site.xml using these two settings:
>> mapred.tasktracker.{map|reduce}.tasks.maximum
>>
>> Have a look at this page for more information
>> http://hadoop.apache.org/common/docs/current/cluster_setup.html
>>
>> --
>> Met vriendelijke groeten,
>>
>> Niels Basjes
>
>



-- 
Met vriendelijke groeten,

Niels Basjes

Re: Control the number of Mappers

Posted by Shrijeet Paliwal <sh...@rocketfuel.com>.
More to your need (I had missed this earlier):
>>The number of cores is not something I know in advance, so writing a special InputFormat might be tricky, unless I can query Hadoop for the available # of cores

You don't have to write a fancy InputFormat.
Once you have a (correct) implementation of MultiFileInputFormat,
then from my driver program, which launches my MapReduce job, I would do
something like this:

int numMappers = myMagicalFunctionReturningNumOfCores();
job.setNumMapTasks(numMappers);

-Shrijeet
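[Editorial note: the helper above is the poster's placeholder. One possible stand-in, purely as an assumed sketch: compute the count from cluster facts known out-of-band, since the old client API does not expose a cluster-wide core count. Note also that JobConf.setNumMapTasks() is documented as only a hint to the framework; the InputFormat's split count ultimately decides how many map tasks run.]

```java
// Hypothetical stand-in for myMagicalFunctionReturningNumOfCores():
// derive a mapper count from cluster shape you already know about your
// deployment (node count, configured map slots per node). These inputs
// are assumptions supplied by the operator, not queried from Hadoop.
public class MapperCount {
    static int numMappersFor(int nodes, int mapSlotsPerNode) {
        return nodes * mapSlotsPerNode;
    }

    public static void main(String[] args) {
        // e.g. 25 nodes, each configured with 4 map slots
        System.out.println(numMappersFor(25, 4)); // 100
    }
}
```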

On Thu, Nov 25, 2010 at 12:23 PM, Shai Erera <se...@gmail.com> wrote:
>
> Thanks, I'll take a look
>
> On Thu, Nov 25, 2010 at 10:20 PM, Shrijeet Paliwal <sh...@rocketfuel.com> wrote:
>>
>> Shai,
>> You will have to implement MultiFileInputFormat and set that as your input format.
>> You may find http://hadoop.apache.org/common/docs/r0.20.0/api/org/apache/hadoop/examples/MultiFileWordCount.html useful.
>>
>> On Thu, Nov 25, 2010 at 12:01 PM, Shai Erera <se...@gmail.com> wrote:
>>>
>>> I wasn't talking about how to configure the cluster to not invoke more than a certain # of Mappers simultaneously. Instead, I'd like to configure a (certain) job to invoke exactly N Mappers, where N is the number of cores in the cluster, regardless of the size of the data. This is not critical if it can't be done, but it can improve the performance of my job if it can be done.
>>>
>>> Thanks
>>> Shai
>>>
>>> On Thu, Nov 25, 2010 at 9:55 PM, Niels Basjes <Ni...@basjes.nl> wrote:
>>>>
>>>> Hi,
>>>>
>>>> 2010/11/25 Shai Erera <se...@gmail.com>:
>>>> > Is there a way to make MapReduce create exactly N Mappers? More
>>>> > specifically, if say my data can be split to 200 Mappers, and I have only
>>>> > 100 cores, how can I ensure only 100 Mappers will be created? The number of
>>>> > cores is not something I know in advance, so writing a special InputFormat
>>>> > might be tricky, unless I can query Hadoop for the available # of cores (in
>>>> > the entire cluster).
>>>>
>>>> You can configure on a node by node basis how many map and reduce
>>>> tasks can be started by the task tracker on that node.
>>>> This is done via the conf/mapred-site.xml using these two settings:
>>>> mapred.tasktracker.{map|reduce}.tasks.maximum
>>>>
>>>> Have a look at this page for more information
>>>> http://hadoop.apache.org/common/docs/current/cluster_setup.html
>>>>
>>>> --
>>>> Met vriendelijke groeten,
>>>>
>>>> Niels Basjes
>>>
>>
>

Re: Control the number of Mappers

Posted by Shai Erera <se...@gmail.com>.
Thanks, I'll take a look

On Thu, Nov 25, 2010 at 10:20 PM, Shrijeet Paliwal
<sh...@rocketfuel.com> wrote:

> Shai,
>
> You will have to implement MultiFileInputFormat <http://hadoop.apache.org/common/docs/r0.20.0/api/org/apache/hadoop/mapred/MultiFileInputFormat.html> and
> set that as your input format.
> You may find
> http://hadoop.apache.org/common/docs/r0.20.0/api/org/apache/hadoop/examples/MultiFileWordCount.html
>  useful.
>
>
> On Thu, Nov 25, 2010 at 12:01 PM, Shai Erera <se...@gmail.com> wrote:
>
>> I wasn't talking about how to configure the cluster to not invoke more
>> than a certain # of Mappers simultaneously. Instead, I'd like to configure a
>> (certain) job to invoke exactly N Mappers, where N is the number of cores in
>> the cluster, regardless of the size of the data. This is not critical if
>> it can't be done, but it can improve the performance of my job if it can be
>> done.
>>
>> Thanks
>> Shai
>>
>>
>> On Thu, Nov 25, 2010 at 9:55 PM, Niels Basjes <Ni...@basjes.nl> wrote:
>>
>>> Hi,
>>>
>>> 2010/11/25 Shai Erera <se...@gmail.com>:
>>> > Is there a way to make MapReduce create exactly N Mappers? More
>>> > specifically, if say my data can be split to 200 Mappers, and I have
>>> only
>>> > 100 cores, how can I ensure only 100 Mappers will be created? The
>>> number of
>>> > cores is not something I know in advance, so writing a special
>>> InputFormat
>>> > might be tricky, unless I can query Hadoop for the available # of cores
>>> (in
>>> > the entire cluster).
>>>
>>> You can configure on a node by node basis how many map and reduce
>>> tasks can be started by the task tracker on that node.
>>> This is done via the conf/mapred-site.xml using these two settings:
>>> mapred.tasktracker.{map|reduce}.tasks.maximum
>>>
>>> Have a look at this page for more information
>>> http://hadoop.apache.org/common/docs/current/cluster_setup.html
>>>
>>> --
>>> Met vriendelijke groeten,
>>>
>>> Niels Basjes
>>>
>>
>>
>

Re: Control the number of Mappers

Posted by Shrijeet Paliwal <sh...@rocketfuel.com>.
Shai,

You will have to implement MultiFileInputFormat
<http://hadoop.apache.org/common/docs/r0.20.0/api/org/apache/hadoop/mapred/MultiFileInputFormat.html>
and set that as your input format.
You may find
http://hadoop.apache.org/common/docs/r0.20.0/api/org/apache/hadoop/examples/MultiFileWordCount.html
 useful.
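[Editorial note: the idea behind MultiFileInputFormat is to pack many input files into each split, so the number of splits, and hence map tasks, can be capped at the requested count. A rough sketch of that grouping idea follows, purely for illustration; this is not Hadoop's actual implementation, which also balances splits by total bytes and data locality.]

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: pack an arbitrary number of files into at most
// numSplits groups, so a job would run at most numSplits map tasks.
public class FilePacker {
    static List<List<Long>> pack(List<Long> fileSizes, int numSplits) {
        List<List<Long>> splits = new ArrayList<>();
        for (int i = 0; i < numSplits; i++) {
            splits.add(new ArrayList<>());
        }
        // Round-robin by index; Hadoop's version balances by size instead.
        for (int i = 0; i < fileSizes.size(); i++) {
            splits.get(i % numSplits).add(fileSizes.get(i));
        }
        // Drop empty groups when there are fewer files than splits.
        splits.removeIf(List::isEmpty);
        return splits;
    }
}
```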

On Thu, Nov 25, 2010 at 12:01 PM, Shai Erera <se...@gmail.com> wrote:

> I wasn't talking about how to configure the cluster to not invoke more than
> a certain # of Mappers simultaneously. Instead, I'd like to configure a
> (certain) job to invoke exactly N Mappers, where N is the number of cores in
> the cluster, regardless of the size of the data. This is not critical if
> it can't be done, but it can improve the performance of my job if it can be
> done.
>
> Thanks
> Shai
>
>
> On Thu, Nov 25, 2010 at 9:55 PM, Niels Basjes <Ni...@basjes.nl> wrote:
>
>> Hi,
>>
>> 2010/11/25 Shai Erera <se...@gmail.com>:
>> > Is there a way to make MapReduce create exactly N Mappers? More
>> > specifically, if say my data can be split to 200 Mappers, and I have
>> only
>> > 100 cores, how can I ensure only 100 Mappers will be created? The number
>> of
>> > cores is not something I know in advance, so writing a special
>> InputFormat
>> > might be tricky, unless I can query Hadoop for the available # of cores
>> (in
>> > the entire cluster).
>>
>> You can configure on a node by node basis how many map and reduce
>> tasks can be started by the task tracker on that node.
>> This is done via the conf/mapred-site.xml using these two settings:
>> mapred.tasktracker.{map|reduce}.tasks.maximum
>>
>> Have a look at this page for more information
>> http://hadoop.apache.org/common/docs/current/cluster_setup.html
>>
>> --
>> Met vriendelijke groeten,
>>
>> Niels Basjes
>>
>
>

Re: Control the number of Mappers

Posted by Shai Erera <se...@gmail.com>.
I wasn't talking about how to configure the cluster to not invoke more than
a certain # of Mappers simultaneously. Instead, I'd like to configure a
(certain) job to invoke exactly N Mappers, where N is the number of cores in
the cluster, regardless of the size of the data. This is not critical if
it can't be done, but it can improve the performance of my job if it can be
done.

Thanks
Shai

On Thu, Nov 25, 2010 at 9:55 PM, Niels Basjes <Ni...@basjes.nl> wrote:

> Hi,
>
> 2010/11/25 Shai Erera <se...@gmail.com>:
> > Is there a way to make MapReduce create exactly N Mappers? More
> > specifically, if say my data can be split to 200 Mappers, and I have only
> > 100 cores, how can I ensure only 100 Mappers will be created? The number
> of
> > cores is not something I know in advance, so writing a special
> InputFormat
> > might be tricky, unless I can query Hadoop for the available # of cores
> (in
> > the entire cluster).
>
> You can configure on a node by node basis how many map and reduce
> tasks can be started by the task tracker on that node.
> This is done via the conf/mapred-site.xml using these two settings:
> mapred.tasktracker.{map|reduce}.tasks.maximum
>
> Have a look at this page for more information
> http://hadoop.apache.org/common/docs/current/cluster_setup.html
>
> --
> Met vriendelijke groeten,
>
> Niels Basjes
>

Re: Control the number of Mappers

Posted by Niels Basjes <Ni...@basjes.nl>.
Hi,

2010/11/25 Shai Erera <se...@gmail.com>:
> Is there a way to make MapReduce create exactly N Mappers? More
> specifically, if say my data can be split to 200 Mappers, and I have only
> 100 cores, how can I ensure only 100 Mappers will be created? The number of
> cores is not something I know in advance, so writing a special InputFormat
> might be tricky, unless I can query Hadoop for the available # of cores (in
> the entire cluster).

You can configure, on a node-by-node basis, how many map and reduce
tasks the TaskTracker on that node may run concurrently.
This is done via conf/mapred-site.xml using these two settings:
mapred.tasktracker.{map|reduce}.tasks.maximum

Have a look at this page for more information
http://hadoop.apache.org/common/docs/current/cluster_setup.html
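[Editorial note: a sketch of how those two settings might look in conf/mapred-site.xml. The values here are examples only; tune them to each node's hardware.]

```xml
<!-- Example values only: cap each TaskTracker at 4 concurrent map
     tasks and 2 concurrent reduce tasks. -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>4</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>2</value>
</property>
```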

-- 
Met vriendelijke groeten,

Niels Basjes