Posted to mapreduce-issues@hadoop.apache.org by "Joydeep Sen Sarma (JIRA)" <ji...@apache.org> on 2010/12/11 03:11:03 UTC

[jira] Commented: (MAPREDUCE-2205) FairScheduler should only preempt tasks for pools/jobs that are up next for scheduling

    [ https://issues.apache.org/jira/browse/MAPREDUCE-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12970365#action_12970365 ] 

Joydeep Sen Sarma commented on MAPREDUCE-2205:
----------------------------------------------

I think I phrased this JIRA very badly.

Here's the real problem: job A preempts job B. After preemption, job B keeps getting slots before job A has received all the slots it asked for. After some time, job A requests even more preemption.

Note that it's OK for some job C to get slots (if it ranks higher than job A in priority based on fair share). I don't want to reintroduce some kind of logic to handle starvation (which is what forcing job A to be scheduled before job C would basically amount to). We had deficits earlier to deal with starvation, but they were very difficult to explain and deal with. I am OK with the current behavior: ultimately, resources will be taken from jobs that are overscheduled and given to those that are underscheduled.

I spent a fair bit of time looking at our logs to see why this phenomenon might be happening. The biggest contributing factor (so far) seems to be the policy in FS.assignTasks() of cycling through jobs (in 0.20) while assigning tasks for a given heartbeat. We frequently get into situations where a TT is advertising multiple slots (because we are bottlenecked on the JT and heartbeat processing is slow) and the JT will not give all the slots to the highest-priority job. The FairComparator is doing the right thing (the logs indicate that job B gets slots only after job A has gotten the first slot from the heartbeat).

So perhaps we need to strike a better balance between having a diversity of jobs on a machine and giving higher-priority jobs multiple slots when heartbeats arrive.
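
To make the trade-off concrete, here is a toy model of handing out the free slots from a single heartbeat. It is only a sketch and not the actual FairScheduler code: the Job class, the deficit() ordering, and the maxPerVisit knob are all hypothetical names invented for illustration. With maxPerVisit = 1 it behaves like the cycling policy described above (job B gets a slot right after job A gets its first one); with maxPerVisit equal to the number of free slots, the most underscheduled job takes everything it can use.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of slot assignment within one heartbeat; NOT the real FairScheduler code. */
public class HeartbeatAssignSketch {

    /** Hypothetical stand-in for a job: fair share, running tasks, pending tasks. */
    static final class Job {
        final String name;
        final int fairShare;
        int running;
        int pending;

        Job(String name, int fairShare, int running, int pending) {
            this.name = name;
            this.fairShare = fairShare;
            this.running = running;
            this.pending = pending;
        }

        /** How far below its fair share the job is (negative if it is over). */
        int deficit() {
            return fairShare - running;
        }
    }

    /**
     * Hands out freeSlots slots. Jobs are sorted once, most underscheduled first,
     * and then visited in that order; each visit grants at most maxPerVisit slots
     * before moving on to the next job. maxPerVisit is an assumed tuning knob.
     */
    static Map<String, Integer> assign(List<Job> jobs, int freeSlots, int maxPerVisit) {
        List<Job> order = new ArrayList<>(jobs);
        order.sort((a, b) -> Integer.compare(b.deficit(), a.deficit()));
        Map<String, Integer> granted = new LinkedHashMap<>();
        boolean progress = true;
        while (freeSlots > 0 && progress) {
            progress = false;
            for (Job job : order) {
                int n = Math.min(Math.min(maxPerVisit, job.pending), freeSlots);
                if (n <= 0) {
                    continue; // nothing to run for this job
                }
                job.pending -= n;
                job.running += n;
                freeSlots -= n;
                granted.merge(job.name, n, Integer::sum);
                progress = true;
                if (freeSlots == 0) {
                    break;
                }
            }
        }
        return granted;
    }

    /** Job A is well below its share (it just preempted); job B is above its share. */
    static List<Job> sample() {
        List<Job> jobs = new ArrayList<>();
        jobs.add(new Job("A", 10, 2, 20));
        jobs.add(new Job("B", 10, 14, 20));
        return jobs;
    }

    public static void main(String[] args) {
        System.out.println(assign(sample(), 4, 1)); // cycling: {A=2, B=2}
        System.out.println(assign(sample(), 4, 4)); // head job first: {A=4}
    }
}
```

Values of maxPerVisit between the two extremes would trade some job diversity on a machine for faster catch-up by the underscheduled job.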

> FairScheduler should only preempt tasks for pools/jobs that are up next for scheduling
> --------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-2205
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2205
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: contrib/fair-share
>            Reporter: Joydeep Sen Sarma
>
> We have hit a problem with the preemption implementation in the FairScheduler where the following happens:
> # job X runs short of fair share or min share and requests/causes N tasks to be preempted
> # when slots are then scheduled - tasks from some other job are actually scheduled
> # after preemption_interval has passed, job X finds it's still underscheduled and requests preemption. goto 1.
> This has caused widespread preemption of tasks and the cluster going from high utilization to low utilization in a few minutes.
> Some of the problems are specific to our internal version of hadoop (still 0.20 and doesn't have the hierarchical FairScheduler) - but i think the issue here is generic (just took a look at the trunk assignTasks and tasksToPreempt routines). The basic problem seems to be that the logic of assignTasks+FairShareComparator is not consistent with the logic in tasksToPreempt(). The latter can choose to preempt tasks on behalf of jobs that may not be first up for scheduling based on the FairComparator. Understanding whether these two separate pieces of logic are consistent and keeping it that way is difficult.
> It seems that a much safer preemption implementation is to walk the jobs in the order they would be scheduled on the next heartbeat - and only preempt for jobs that are at the head of this sorted queue. In MAPREDUCE-2048 - we have already introduced a pre-sorted list of jobs ordered by current scheduling priority. It seems much easier to preempt only jobs at the head of this sorted list.
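
One possible reading of the head-of-queue proposal in the description, as a rough sketch. The issue does not say how many jobs count as "the head", so the stopping rule here is an assumption: walk the jobs in scheduling order and stop at the first job that has pending work but is not itself entitled to preemption, since that job would receive freed slots before anyone behind it. The Job fields and the slotsToPreempt name are hypothetical and do not come from the Hadoop source.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of "preempt only for jobs at the head of the scheduling order". */
public class HeadOfQueuePreemptionSketch {

    /** Hypothetical per-job view used only for this illustration. */
    static final class Job {
        final String name;
        final boolean starvedPastTimeout; // below min/fair share for longer than the timeout
        final int shortfall;              // slots the job is owed
        final int pendingTasks;           // tasks waiting to be launched

        Job(String name, boolean starvedPastTimeout, int shortfall, int pendingTasks) {
            this.name = name;
            this.starvedPastTimeout = starvedPastTimeout;
            this.shortfall = shortfall;
            this.pendingTasks = pendingTasks;
        }
    }

    /**
     * schedulingOrder is assumed to be sorted the way the next heartbeat would
     * serve jobs (a stand-in for the pre-sorted list mentioned in MAPREDUCE-2048).
     * Returns how many slots to free, counting only starved jobs that are not
     * stuck behind a non-starved job with pending tasks.
     */
    static int slotsToPreempt(List<Job> schedulingOrder) {
        int total = 0;
        for (Job job : schedulingOrder) {
            if (job.pendingTasks == 0) {
                continue; // cannot absorb freed slots, so it blocks nobody
            }
            if (!job.starvedPastTimeout) {
                break; // freed slots would go to this job first; stop here
            }
            total += job.shortfall;
        }
        return total;
    }

    public static void main(String[] args) {
        List<Job> order = new ArrayList<>();
        order.add(new Job("A", true, 5, 10));
        order.add(new Job("C", false, 0, 3));
        order.add(new Job("D", true, 4, 8));
        // Only A's shortfall counts; D sits behind C, which would take the slots.
        System.out.println(slotsToPreempt(order)); // 5
    }
}
```

A preemption routine that simply summed the shortfall of every starved job would ask for 9 slots in this example, and the 4 freed on behalf of D would likely go to C instead, which is the loop described in the steps above.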
