Posted to yarn-issues@hadoop.apache.org by "Carlo Curino (JIRA)" <ji...@apache.org> on 2013/04/09 04:29:14 UTC

[jira] [Commented] (YARN-45) Scheduler feedback to AM to release containers

    [ https://issues.apache.org/jira/browse/YARN-45?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13626139#comment-13626139 ] 

Carlo Curino commented on YARN-45:
----------------------------------


High level idea:

The philosophy behind preemption is that we give the AM a heads-up about resources that are likely to be taken away, and give it the opportunity to save the state of its tasks. A separate kill-based mechanism already exists (leveraged by the FairScheduler for its preemption) to forcibly recover containers.
This is our fallback if the AM does not release containers within a certain amount of time (note that, given the quickly evolving conditions of the cluster, we might not kill a container if we later realize it is not strictly needed to achieve fairness/capacity). This means an AM can be written to completely ignore preemption hints and still work correctly (although it might waste useful work).

The goal is to allow for smart local policies in the AM, which leverage application-level understanding of the ongoing computation to cope with the imminent reduction of resources (e.g., by saving the state of the computation to a checkpoint, by promoting partial output, by migrating computation to other tasks, or by trying to complete the work quickly). The goal is to spare the RM from application-level optimization concerns and let it focus on resource-management issues. As a consequence, we envision (among others) preemption requests that are not fully bound to specific containers, allowing the AM to leverage some flexibility. Note that the significant "lag" imposed by the heartbeat protocols between RM-AM, AM-Tasks, and NM-RM forces us to consider, in most cases, preemption actions that are limited to a rather long time horizon. We cannot expect to operate in a tight sub-second control loop, but rather to trigger changes in the cluster allocation on the order of tens of seconds. As a consequence, preemption should be used to correct macroscopic issues that are likely to be somewhat stable over time, rather than to micro-manage container allocations.

We consider the following use cases for preemption: 
# Scheduling policies aimed at rebalancing some global property such as capacity or fairness. This allows, for example, letting a queue go over capacity and getting the resources back as cluster conditions change.
# Scheduling policies that make point decisions about individual containers (e.g., preempting a container on one machine and restarting it elsewhere to improve data locality, or preempting containers on a box that is experiencing excessive I/O).
# Administrative actions aimed at modifying cluster allocations without wasting work (e.g., draining a machine or a rack before taking it offline for maintenance, manually reducing the allocation of a job, etc.).

Use cases 1 and 3 can be implemented either by picking containers at the RM, or by expressing a "broad" request for a certain amount of resources (we reuse the ResourceRequest for this, in a way that is symmetric to the AM's requests) and letting the AM bind it to specific containers, as sketched below. Use case 2 is more likely to be implemented using ContainerIds.
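
To make the AM-side binding concrete, here is a minimal sketch (purely illustrative, not part of the attached patch) of one possible policy: satisfy a broad ask by giving up the containers whose tasks have made the least progress. The BroadPreemptionBinder class and the ProgressLookup interface are hypothetical; only Container, ContainerId, and ResourceRequest are existing YARN records.

{code:java}
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.ContainerId;
import org.apache.hadoop.yarn.api.records.ResourceRequest;

/**
 * Hypothetical AM-side policy: bind a "broad" preemption ask to concrete
 * containers by sacrificing the least-progressed work first.
 */
public class BroadPreemptionBinder {

  /** Assumed AM-internal lookup of task progress (0..1) per container. */
  public interface ProgressLookup {
    float progressOf(ContainerId id);
  }

  public List<ContainerId> bind(ResourceRequest ask, List<Container> running,
      final ProgressLookup progress) {
    List<Container> candidates = new ArrayList<Container>(running);
    // Give up the containers whose tasks have made the least progress.
    Collections.sort(candidates, new Comparator<Container>() {
      public int compare(Container a, Container b) {
        return Float.compare(progress.progressOf(a.getId()),
            progress.progressOf(b.getId()));
      }
    });
    List<ContainerId> toRelease = new ArrayList<ContainerId>();
    for (Container c : candidates) {
      if (toRelease.size() >= ask.getNumContainers()) {
        break;
      }
      // The AM would checkpoint the task here before releasing the container.
      toRelease.add(c.getId());
    }
    return toRelease;
  }
}
{code}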



Protocol change proposal:

Our proposal consists of extending the ResourceResponse with a PreemptRequest message (further extensible in the future) that contains a Set<ContainerId> and a Set<ResourceRequest>. The current semantics is that these two sets are non-overlapping (i.e., if the RM asks for a specific container and also sends a ResourceRequest, the AM is supposed to satisfy both). Once again, since we never rely on the AM to "enforce" preemption but rather have a kill-based fallback, an AM implementation is not required to understand preemption requests (or even to acknowledge receiving them). This makes for a simple upgrade story, and one could run a mix of preemption-aware and non-preemption-aware AMs on the same cluster.
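
Purely for illustration, the message described above can be pictured roughly as follows; the class, field, and method names here are placeholders and not the ones used in the attached patch (where this is a protobuf-backed record hanging off the ResourceResponse).

{code:java}
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.yarn.api.records.ContainerId;
import org.apache.hadoop.yarn.api.records.ResourceRequest;

/**
 * Illustrative shape of the preemption message: specific containers the RM
 * has picked (use case 2), plus "broad" resource asks that the AM is free to
 * bind to containers of its choosing (use cases 1 and 3).
 */
public class PreemptRequestSketch {

  private final Set<ContainerId> containers = new HashSet<ContainerId>();
  private final Set<ResourceRequest> resourceAsks = new HashSet<ResourceRequest>();

  public void addContainer(ContainerId id) {
    containers.add(id);
  }

  public void addResourceAsk(ResourceRequest req) {
    resourceAsks.add(req);
  }

  /** Containers explicitly targeted by the scheduler. */
  public Set<ContainerId> getContainers() {
    return Collections.unmodifiableSet(containers);
  }

  /** Broad asks, symmetric to the AM's own allocation requests. */
  public Set<ResourceRequest> getResourceAsks() {
    return Collections.unmodifiableSet(resourceAsks);
  }
}
{code}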

An open question we would like input on is whether the PreemptRequest should be a union type carrying either of the two sets (but not both together), or whether to allow both to co-exist in the same PreemptRequest, as the attached patch does. We do not have a current need for the "both" case, but maybe others do. Thoughts?
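
To make the union-type alternative concrete, a small sketch (names again illustrative, not from the patch): a tag plus exactly one of the two sets, as opposed to the version above where both sets can be populated in the same message.

{code:java}
import java.util.Set;

import org.apache.hadoop.yarn.api.records.ContainerId;
import org.apache.hadoop.yarn.api.records.ResourceRequest;

/**
 * Union-type alternative: a PreemptRequest carries either a set of specific
 * containers or a set of broad resource asks, but never both.
 */
public class PreemptRequestUnion {

  public enum Kind { CONTAINERS, RESOURCES }

  private final Kind kind;
  private final Set<ContainerId> containers;       // non-null iff kind == CONTAINERS
  private final Set<ResourceRequest> resourceAsks; // non-null iff kind == RESOURCES

  private PreemptRequestUnion(Kind kind, Set<ContainerId> containers,
      Set<ResourceRequest> resourceAsks) {
    this.kind = kind;
    this.containers = containers;
    this.resourceAsks = resourceAsks;
  }

  public static PreemptRequestUnion forContainers(Set<ContainerId> ids) {
    return new PreemptRequestUnion(Kind.CONTAINERS, ids, null);
  }

  public static PreemptRequestUnion forResources(Set<ResourceRequest> asks) {
    return new PreemptRequestUnion(Kind.RESOURCES, null, asks);
  }

  public Kind getKind() { return kind; }
  public Set<ContainerId> getContainers() { return containers; }
  public Set<ResourceRequest> getResourceAsks() { return resourceAsks; }
}
{code}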



Coming up next:

We are cleaning up further patches to the FairScheduler, CapacityScheduler, and ApplicationMasterService that leverage this AM-RM protocol, as well as changes to the MapReduce AM that implement work-saving preemption via checkpointing for the Shuffle and Reducers (for Mappers we are currently "making a run for it", given the commonly short runtime of maps). The other patches will be posted soon.

                
> Scheduler feedback to AM to release containers
> ----------------------------------------------
>
>                 Key: YARN-45
>                 URL: https://issues.apache.org/jira/browse/YARN-45
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: resourcemanager
>            Reporter: Chris Douglas
>         Attachments: YARN-45.patch
>
>
> The ResourceManager strikes a balance between cluster utilization and strict enforcement of resource invariants in the cluster. Individual allocations of containers must be reclaimed- or reserved- to restore the global invariants when cluster load shifts. In some cases, the ApplicationMaster can respond to fluctuations in resource availability without losing the work already completed by that task (MAPREDUCE-4584). Supplying it with this information would be helpful for overall cluster utilization [1]. To this end, we want to establish a protocol for the RM to ask the AM to release containers.
> [1] http://research.yahoo.com/files/yl-2012-003.pdf

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira