Posted to hdfs-user@hadoop.apache.org by jeremy p <at...@gmail.com> on 2013/03/22 22:48:02 UTC

Capacity Scheduler question

I have two jobs, Job A and Job B.  Job A needs to run with 18 mappers per
machine, Job B needs to run with 1 mapper per machine.  Hadoop doesn't give
you a way to specify number of mappers on a per-job basis.
mapred.tasktracker.map.tasks.maximum and mapred.map.tasks do absolutely
nothing.  I've been looking into the Capacity Scheduler, but I'm unsure if
it can help me.  In this documentation
<http://hadoop.apache.org/docs/stable/capacity_scheduler.html>,
all the settings under "Resource Allocation" are
cluster-wide.  I need to be able to set the maximum capacity on a given
machine.  It does look like you have the option to set the required amount
of memory per slot, but that setting applies to all the queues.  If I could
set that value on a per-queue basis, that would be helpful.

Will the capacity scheduler help me here?  Or am I barking up the wrong
tree?  If the capacity scheduler won't help me, can you think of anything
that will?

Thanks!

--Jeremy

Re: Capacity Scheduler question

Posted by jeremy p <at...@gmail.com>.
Thanks for the help.  Sadly, I don't think the Fair Scheduler will help
me here.  It will let you specify the number of concurrent task slots
for a pool, but that applies to the entire cluster.  For a given pool,
I need to set the maximum number of task slots per machine.

On Fri, Mar 22, 2013 at 3:06 PM, Serge Blazhievsky <ha...@gmail.com> wrote:

> Take a look at fair scheduler it will do what you ask for
>
> Sent from my iPhone
>
> On Mar 22, 2013, at 2:48 PM, jeremy p <at...@gmail.com>
> wrote:
>
> I have two jobs, Job A and Job B.  Job A needs to run with 18 mappers per
> machine, Job B needs to run with 1 mapper per machine.  Hadoop doesn't give
> you a way to specify number of mappers on a per-job basis.
> mapred.tasktracker.map.tasks.maximum and mapred.map.tasks do absolutely
> nothing.  I've been looking into the Capacity Scheduler, but I'm unsure if
> it can help me.  In this documentation <http://hadoop.apache.org/docs/stable/capacity_scheduler.html>, all the settings under "Resource Allocation" are
> cluster-wide.  I need to be able to set the maximum capacity on a given
> machine.  It does look like you have the option to set the required amount
> of memory per slot, but that setting applies to all the queues.  If I could
> set that value on a per-queue basis, that would be helpful.
>
> Will the capacity scheduler help me here?  Or am I barking up the wrong
> tree?  If the capacity scheduler won't help me, can you think of anything
> that will?
>
> Thanks!
>
> --Jeremy
>
>

Re: Capacity Scheduler question

Posted by Serge Blazhievsky <ha...@gmail.com>.
Take a look at the Fair Scheduler; it will do what you ask for.

Sent from my iPhone

On Mar 22, 2013, at 2:48 PM, jeremy p <at...@gmail.com> wrote:

> I have two jobs, Job A and Job B.  Job A needs to run with 18 mappers per machine, Job B needs to run with 1 mapper per machine.  Hadoop doesn't give you a way to specify number of mappers on a per-job basis.  mapred.tasktracker.map.tasks.maximum and mapred.map.tasks do absolutely nothing.  I've been looking into the Capacity Scheduler, but I'm unsure if it can help me.  In this documentation, all the settings under "Resource Allocation" are cluster-wide.  I need to be able to set the maximum capacity on a given machine.  It does look like you have the option to set the required amount of memory per slot, but that setting applies to all the queues.  If I could set that value on a per-queue basis, that would be helpful.
> 
> Will the capacity scheduler help me here?  Or am I barking up the wrong tree?  If the capacity scheduler won't help me, can you think of anything that will?
> 
> Thanks!
> 
> --Jeremy

Re: Capacity Scheduler question

Posted by Harsh J <ha...@cloudera.com>.
If you're looking to set a fixed # of maps per job and also control
their parallel distributed execution (by numbers), a Scheduler cannot
solve that for you but may assist in the process.

Setting a specific # of maps in a job to match something is certainly
not a Scheduler's work, as it only deals with what task needs to go
where. For you to control your job's # of maps (i.e. input splits),
tweak your Job's InputFormat#getSplits(…). The size of the array it
returns dictates the total number of maps your job ends up running.
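
For illustration, a minimal sketch of that hook (new mapreduce API;
the class name is made up for the example):

  import java.io.IOException;
  import java.util.List;
  import org.apache.hadoop.mapreduce.InputSplit;
  import org.apache.hadoop.mapreduce.JobContext;
  import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

  public class TunableSplitsInputFormat extends TextInputFormat {
    @Override
    public List<InputSplit> getSplits(JobContext ctx) throws IOException {
      // Whatever list this returns, its size is the number of map tasks
      // the job will run; regroup or rebuild it here to pin the count
      // (CombineFileInputFormat is the usual helper for merging splits).
      List<InputSplit> splits = super.getSplits(ctx);
      return splits;
    }
  }

Driver-side, FileInputFormat.setMinInputSplitSize(job, bytes) and
setMaxInputSplitSize(job, bytes) influence the same list without a
custom class.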

You are further limited by the fixed task slot behavior in 0.20.x/1.x
releases which use the MR1 framework (i.e. a JobTracker and a
TaskTracker). The property "mapred.tasktracker.map.tasks.maximum"
applies to a whole TaskTracker, not to a single job (as its name
suggests), and isn't what you'd configure to achieve what you want.
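
For reference, that property is a per-daemon setting in each
TaskTracker's mapred-site.xml (it needs a TaskTracker restart to take
effect), e.g.:

  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>18</value>
  </property>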

In addition to this, YARN has a slotless NodeManager, wherein you can
ask for a certain amount of resources from your job on a per-task
level and have it granted globally. Meaning, if your NodeManager is
configured to use up to 8 GB and your job/app requests 8 GB per
task/container, then at most 1 such container can run at a time on any
chosen NodeManager that serves 8 GB of memory resources. Likewise, if
your demand becomes 8/18 GB (~455 MB) per container/task, then up to
18 containers can run in parallel on a given NM.
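
To make that concrete with Hadoop 2.x property names (a sketch; the
values are only illustrative for an 8 GB NodeManager):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.mapreduce.Job;

  // Cluster side (yarn-site.xml): the NodeManager offers 8 GB in total:
  //   yarn.nodemanager.resource.memory-mb = 8192
  Configuration conf = new Configuration();
  // One 8 GB container fills a node -> at most 1 map per machine:
  conf.setInt("mapreduce.map.memory.mb", 8192);
  // Or ask for ~8192/18 MB per container -> up to ~18 maps can pack
  // onto a node, subject to the scheduler's placement (see below):
  // conf.setInt("mapreduce.map.memory.mb", 455);
  Job job = Job.getInstance(conf, "my-job"); // rest of job setup omitted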

This is still not rigid though (fewer than 18 may run at the same time
on an NM as well, depending on the scheduler's distribution of
containers across all nodes), as that isn't MapReduce's goal in the
first place. If you want more rigidity, consider writing your own YARN
application that implements such a distribution goal.

On Sat, Mar 23, 2013 at 3:18 AM, jeremy p
<at...@gmail.com> wrote:
> I have two jobs, Job A and Job B.  Job A needs to run with 18 mappers per
> machine, Job B needs to run with 1 mapper per machine.  Hadoop doesn't give
> you a way to specify number of mappers on a per-job basis.
> mapred.tasktracker.map.tasks.maximum and mapred.map.tasks do absolutely
> nothing.  I've been looking into the Capacity Scheduler, but I'm unsure if
> it can help me.  In this documentation, all the settings under "Resource
> Allocation" are cluster-wide.  I need to be able to set the maximum capacity
> on a given machine.  It does look like you have the option to set the
> required amount of memory per slot, but that setting applies to all the
> queues.  If I could set that value on a per-queue basis, that would be
> helpful.
>
> Will the capacity scheduler help me here?  Or am I barking up the wrong
> tree?  If the capacity scheduler won't help me, can you think of anything
> that will?
>
> Thanks!
>
> --Jeremy



-- 
Harsh J
