Posted to common-commits@hadoop.apache.org by Apache Wiki <wi...@apache.org> on 2010/10/22 17:47:10 UTC

[Hadoop Wiki] Update of "LimitingTaskSlotUsage" by SomeOtherAccount

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "LimitingTaskSlotUsage" page has been changed by SomeOtherAccount.
http://wiki.apache.org/hadoop/LimitingTaskSlotUsage

--------------------------------------------------

New page:


There are several reasons why one might want to limit the number of tasks a job runs concurrently.

* Job is consuming all task slots

The most common reason is that a given job is consuming all of the available task slots, preventing other jobs from running.  The easiest and best solution is to switch from the default FIFO scheduler to another scheduler, such as the FairScheduler or the CapacityScheduler.  Both support per-job task limits.
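
For example, selecting the FairScheduler is a cluster-wide change in conf/mapred-site.xml (property name as of the 0.20/0.21 releases; substitute org.apache.hadoop.mapred.CapacityTaskScheduler to use the CapacityScheduler instead):

{{{
<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.FairScheduler</value>
</property>
}}}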

* Job has taken too many reduce slots that sit idle waiting for maps to finish

There is a job tunable called mapred.reduce.slowstart.completed.maps that sets the percentage of map tasks that must complete before reduce tasks are fired off.  By default this is set to 5% (0.05), which is likely too low for most shared clusters; recommended values are closer to 80% (0.80) or higher.  Note that for jobs with a significant amount of intermediate data, setting this value higher means the reduces start later and therefore spend more time fetching that data before performing work.
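
A minimal sketch of setting this from job code, using the old org.apache.hadoop.mapred API (the class name here is illustrative):

{{{
import org.apache.hadoop.mapred.JobConf;

public class SlowstartExample {
  public static void main(String[] args) {
    JobConf conf = new JobConf();
    // Hold reduces back until 80% of this job's maps have completed.
    conf.setFloat("mapred.reduce.slowstart.completed.maps", 0.80f);
    // ... set input/output paths, mapper, reducer, then submit
    //     with JobClient.runJob(conf) ...
  }
}
}}}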

* Job is referencing an external, limited resource (such as a database)

In Hadoop terms, we call this a 'side-effect'.

One of the general assumptions of the framework is that tasks have no side-effects.  Every task is expected to be restartable: a task may be re-executed after a failure, or run more than once by speculative execution, and a side-effect typically goes against the grain of this rule.

If a task absolutely must break the rules, there are a few things one can do:

** Deploy ZooKeeper and use it as a distributed lock to keep track of how many tasks are running concurrently (a sketch of this approach follows this list)
** Use a scheduler with a maximum task-per-queue feature and submit the job to that queue
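
To illustrate the ZooKeeper approach, here is a minimal counting-semaphore sketch.  The connection string, znode path, and limit are all hypothetical, and a production version would set watches instead of polling and would handle connection loss:

{{{
import java.util.Collections;
import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

/** Sketch: cap concurrent tasks touching a shared resource. */
public class TaskSlotSemaphore {
  private static final String PARENT = "/tasklimit";  // hypothetical path
  private static final int MAX_CONCURRENT = 10;       // hypothetical limit

  public static void main(String[] args) throws Exception {
    ZooKeeper zk = new ZooKeeper("zkhost:2181", 30000, event -> {});

    // Ensure the parent znode exists (tolerate a race with other tasks).
    try {
      zk.create(PARENT, new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE,
                CreateMode.PERSISTENT);
    } catch (KeeperException.NodeExistsException ignored) {}

    // Each task registers an ephemeral, sequential znode; the znode
    // disappears automatically if the task dies, releasing its slot.
    String me = zk.create(PARENT + "/task-", new byte[0],
                          ZooDefs.Ids.OPEN_ACL_UNSAFE,
                          CreateMode.EPHEMERAL_SEQUENTIAL);
    String myName = me.substring(me.lastIndexOf('/') + 1);

    // Simple polling loop: we hold a slot once our sequence number is
    // among the MAX_CONCURRENT lowest.
    while (true) {
      List<String> children = zk.getChildren(PARENT, false);
      Collections.sort(children);
      if (children.indexOf(myName) < MAX_CONCURRENT) break;
      Thread.sleep(1000);
    }

    // ... access the external, limited resource here ...

    zk.delete(me, -1);  // release the slot when done
    zk.close();
  }
}
}}}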

* Job consumes too much RAM, disk I/O, etc. on a given node

The CapacityScheduler in 0.21 has a feature whereby one may use a task's declared RAM requirement to determine how many slots that task occupies.  By careful use of this feature, one may limit how many concurrent tasks a job may run on a given node.
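
A minimal sketch, assuming the job-side memory properties as named in the 0.20/0.21 releases (the cluster must also define slot memory sizes via the matching mapred.cluster.* settings for this to take effect):

{{{
import org.apache.hadoop.mapred.JobConf;

public class MemoryLimitExample {
  public static void main(String[] args) {
    JobConf conf = new JobConf();
    // Ask for 2 GB per map and per reduce task.  If the cluster's slot
    // size (mapred.cluster.map.memory.mb) is 1 GB, each map task then
    // occupies two slots, halving this job's per-node concurrency.
    conf.setLong("mapred.job.map.memory.mb", 2048);
    conf.setLong("mapred.job.reduce.memory.mb", 2048);
    // ... configure and submit the job as usual ...
  }
}
}}}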