Posted to issues@mesos.apache.org by "Benjamin Bannier (JIRA)" <ji...@apache.org> on 2017/04/11 13:56:41 UTC
[jira] [Commented] (MESOS-7375) provide additional insight for framework developers re: GPU_RESOURCES capability

    [ https://issues.apache.org/jira/browse/MESOS-7375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15964402#comment-15964402 ] 

Benjamin Bannier commented on MESOS-7375:
-----------------------------------------

The {{GPU_RESOURCES}} framework capability exists as a fix for clusters with a low number of GPU agents. If we unconditionally offered resources on GPU agents to frameworks not running any tasks using {{gpus}}, we might run out of auxiliary resources also needed for GPU tasks (e.g., {{cpus}} or {{disk}}). This might render these agents unusable even for frameworks that want to use {{gpus}} (but also require auxiliary resources).

At the other extreme, which you described here, our fix has adverse effects. When every agent has GPUs attached but no framework wants to run GPU tasks (i.e., when no framework declared {{GPU_RESOURCES}}), no offers will be made and all cluster resources will sit idle. The fix you proposed does address this extreme, but I think it then cannot guarantee that enough resources will be available when the number of GPU agents is low, so I am unsure how just adding such a flag would be enough to fix the issue for all (or even the majority of) possible setups.
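
To make the two extremes concrete, here is a minimal sketch of the filtering behavior described above. It is a toy model and not the actual allocator code; the names {{Agent}}, {{Framework}} and {{offerable_agents}} are made up for illustration, and only the capability name {{GPU_RESOURCES}} and the {{gpus}} resource come from Mesos itself.

```python
from dataclasses import dataclass

GPU_RESOURCES = "GPU_RESOURCES"


@dataclass
class Agent:
    # Hypothetical stand-in for an agent; `resources` maps names to amounts.
    name: str
    resources: dict


@dataclass
class Framework:
    # Hypothetical stand-in for a framework and its declared capabilities.
    name: str
    capabilities: frozenset = frozenset()


def offerable_agents(framework, agents):
    """Return the agents whose resources may be offered to `framework`.

    Agents with GPUs are skipped entirely (including their cpus, disk, ...)
    unless the framework declared the GPU_RESOURCES capability.
    """
    result = []
    for agent in agents:
        has_gpus = agent.resources.get("gpus", 0) > 0
        if has_gpus and GPU_RESOURCES not in framework.capabilities:
            continue
        result.append(agent)
    return result


if __name__ == "__main__":
    plain = Framework("plain")
    gpu_aware = Framework("gpu-aware", frozenset({GPU_RESOURCES}))

    # Few GPU agents: the filter keeps their auxiliary resources for GPU users.
    mixed = [Agent("a1", {"cpus": 8, "gpus": 2}), Agent("a2", {"cpus": 8})]
    print([a.name for a in offerable_agents(plain, mixed)])

    # Every agent has GPUs: a framework without the capability gets nothing.
    all_gpu = [Agent("a1", {"cpus": 8, "gpus": 2}), Agent("a2", {"cpus": 8, "gpus": 1})]
    print([a.name for a in offerable_agents(plain, all_gpu)])
    print([a.name for a in offerable_agents(gpu_aware, all_gpu)])
```

In this model a framework that declares the capability without ever using {{gpus}} sees every agent again, which is exactly why it can consume the auxiliary resources on a scarce GPU agent.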

It seems one of the deeper issues surfacing here is that the way our allocator takes topology into account is limited (only coarse-grained offers, wDRF taking only globally accumulated resources into account). At the same time, it is hard for schedulers to get a global picture without capturing, e.g., a lot of the state known to Mesos. An operator, on the other hand, already has knowledge of the resources eventually available in the cluster and their topology. So I wonder whether, where e.g., multirole is available, operators could make sure that sufficient auxiliary resources are available to make use of GPUs on agents, e.g., with reservations to dedicated roles.

> provide additional insight for framework developers re: GPU_RESOURCES capability
> --------------------------------------------------------------------------------
>
>                 Key: MESOS-7375
>                 URL: https://issues.apache.org/jira/browse/MESOS-7375
>             Project: Mesos
>          Issue Type: Documentation
>            Reporter: James DeFelice
>              Labels: mesosphere
>
> On clusters where all nodes are equal and every node has a GPU, frameworks that **don't** opt in to the `GPU_RESOURCES` capability won't get any offers. This is surprising for operators.
> Even when a framework doesn't **need** GPU resources, it may make sense for a framework scheduler to provide a `--gpu-cluster-compat` (or similar) flag that results in the framework advertising the `GPU_RESOURCES` capability even though it does not intend to consume any GPU. The effect is that said framework will now receive offers on clusters where all nodes have GPU resources.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)