Posted to common-user@hadoop.apache.org by Safdar Kureishy <sa...@gmail.com> on 2012/09/10 11:06:14 UTC

Restricting the number of slave nodes used for a given job (regardless of the # of map/reduce tasks involved)

Hi,

I need to run some benchmarking tests for a given MapReduce job on a *subset*
of a 10-node Hadoop cluster. Not that it matters, but the current cluster
settings allow for ~20 map slots and 10 reduce slots per node.

Without loss of generality, let's say I want a job with the constraints below:
- to use only *5* out of the 10 nodes for running the mappers,
- to use only *5* out of the 10 nodes for running the reducers.

Is there any way of achieving this through Hadoop property overrides at
job-submission time? I understand that the Fair Scheduler can potentially be
used to create pools with a proportionate # of map and reduce slots, to
achieve a similar outcome, but the problem is that I still cannot tie such a
pool to a fixed # of machines (right?). Essentially, regardless of the # of
map/reduce tasks involved, I only want a *fixed # of machines* to handle the
job.
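For concreteness, this is the kind of pool I have in mind: a rough sketch of a
fair-scheduler.xml allocation file for Hadoop 1.x, where the pool name "bench"
and the slot caps are made up for illustration. As far as I can tell, it caps
slots, not machines, which is exactly my problem:

  <?xml version="1.0"?>
  <allocations>
    <!-- hypothetical pool for the benchmark job: limits concurrent slots,
         but says nothing about *which* TaskTrackers run them -->
    <pool name="bench">
      <maxMaps>100</maxMaps>      <!-- 5 nodes x ~20 map slots -->
      <maxReduces>50</maxReduces> <!-- 5 nodes x 10 reduce slots -->
    </pool>
  </allocations>

The job would then be submitted with something like
-Dmapred.fairscheduler.pool=bench (assuming I have the property name right).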

Any tips on how I can go about achieving this?

Thanks,
Safdar

Re: Restricting the number of slave nodes used for a given job (regardless of the # of map/reduce tasks involved)

Posted by Safdar Kureishy <sa...@gmail.com>.
Thanks, Bertrand and Hemanth, for your prompt replies! This helps :)

Regards,
Safdar



Re: Restricting the number of slave nodes used for a given job (regardless of the # of map/reduce tasks involved)

Posted by Bertrand Dechoux <de...@gmail.com>.
If that is only for benchmarking, you could stop the task-trackers on the
machines you don't want to use.
Or you could set up another cluster.
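
For the first option, roughly this on each node you want to exclude (a sketch,
assuming a Hadoop 1.x tarball layout with HADOOP_HOME pointing at the install):

  # on each of the 5 nodes that should NOT run tasks during the benchmark
  $HADOOP_HOME/bin/hadoop-daemon.sh stop tasktracker

  # and to bring the slots back afterwards
  $HADOOP_HOME/bin/hadoop-daemon.sh start tasktracker

The DataNodes stay up, so HDFS is untouched; only the map/reduce slots on
those machines disappear.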

But yes, there is no standard way to limit the slots taken by a job to a
specified set of machines.
You might be able to do it using a custom scheduler, but that would be out of
scope for you, I guess.

Regards

Bertrand

-- 
Bertrand Dechoux

Re: Restricting the number of slave nodes used for a given job (regardless of the # of map/reduce tasks involved)

Posted by Hemanth Yamijala <yh...@gmail.com>.
Hi,

I am not sure if there's any way to restrict the tasks to specific
machines. However, I think there are some ways of restricting the
number of 'slots' that can be used by the job.

Also, I am not sure which version of Hadoop you are on. The
CapacityScheduler
(http://hadoop.apache.org/common/docs/r2.0.1-alpha/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html)
has ways by which you can set up a queue with a hard capacity limit.
The capacity controls the number of slots that can be used by jobs
submitted to the queue. So, if you submit a job to the queue,
irrespective of the number of tasks it has, it should be limited to
those slots. However, please note that this does not restrict the
tasks to specific machines.
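
For example (a rough sketch only, using the MR1 slot-based property names
since your cluster seems to be on MR1; the queue name "bench" and the
percentages are made up, and on YARN the equivalent properties are the
yarn.scheduler.capacity.* ones in the link above):

  In mapred-site.xml:
    <property>
      <name>mapred.queue.names</name>
      <value>default,bench</value>
    </property>

  In capacity-scheduler.xml:
    <property>
      <name>mapred.capacity-scheduler.queue.bench.capacity</name>
      <value>50</value>
      <!-- share of the cluster's slots given to this queue -->
    </property>
    <property>
      <name>mapred.capacity-scheduler.queue.bench.maximum-capacity</name>
      <value>50</value>
      <!-- hard cap, so the queue never grows beyond this share -->
    </property>

You would then submit the benchmark job with -Dmapred.job.queue.name=bench.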

Thanks
Hemanth


Re: Restricting the number of slave nodes used for a given job (regardless of the # of map/reduce tasks involved)

Posted by "M. C. Srivas" <mc...@gmail.com>.
What you are asking for (and much more sophisticated "slicing/dicing" of
the cluster) is possible with MapR's distro. Please contact me offline if
you are interested, or try it for yourself at www.mapr.com/download
