Posted to issues@mesos.apache.org by "Yan Xu (JIRA)" <ji...@apache.org> on 2017/06/19 20:43:00 UTC

[jira] [Comment Edited] (MESOS-7639) Oversubscription could crash the master due to CHECK failure in the allocator

    [ https://issues.apache.org/jira/browse/MESOS-7639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16054700#comment-16054700 ] 

Yan Xu edited comment on MESOS-7639 at 6/19/17 8:42 PM:
--------------------------------------------------------

I think you are right that you stopped seeing the crash because

{quote}it now updates the frameworkSorter by offeredResources rather than frameworkAllocation{quote}

In your test the following is happening:

1. {{updateSlave}} changes {{cpus\(\*\)\{REV\}:10}} to {{cpus\(\*\)\{REV\}:8}} in totals.
2. {{RESERVE}} changes {{cpus\(\*\)\{REV\}:10}} to {{cpus(default-role)\{REV\}:10}} in allocations.

When you operate on the full framework allocation, the reserve includes the revocable resources, which fails the CHECK. When you operate on the offered resources only, the check doesn't include the revocable resources, so it passes.
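
To make the difference concrete, here is a minimal sketch. It is not the allocator code: {{Quantities}}, {{contains}} and the three helper functions are hypothetical stand-ins, and the {{mem}} entry is an assumed non-revocable resource added only so the offer is non-empty.

```cpp
#include <map>
#include <string>

// Hypothetical stand-in for a resources object: name -> scalar quantity.
using Quantities = std::map<std::string, double>;

// True if `total` holds at least as much of every resource as `subset`.
// This mirrors the kind of containment condition a CHECK would assert.
bool contains(const Quantities& total, const Quantities& subset) {
  for (const auto& entry : subset) {
    auto it = total.find(entry.first);
    if (it == total.end() || it->second < entry.second) {
      return false;
    }
  }
  return true;
}

// Totals after updateSlave shrank the revocable cpus from 10 to 8.
Quantities totalAfterUpdate() {
  return {{"cpus(*){REV}", 8.0}, {"mem(*)", 1024.0}};
}

// Everything the framework holds, including the stale 10 revocable cpus.
Quantities frameworkAllocation() {
  return {{"cpus(*){REV}", 10.0}, {"mem(*)", 1024.0}};
}

// Only the resources the operation was offered (assumed non-revocable here).
Quantities offeredResources() {
  return {{"mem(*)", 1024.0}};
}
```

With these toy values, {{contains(totalAfterUpdate(), frameworkAllocation())}} is false, while {{contains(totalAfterUpdate(), offeredResources())}} is true. The inconsistency is still there in the second case; the check just no longer looks at it.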

However, I don't think the situation is corrected by {{Master::_accept}}, because it doesn't update totals. Eventually the total will be corrected by [this|https://github.com/apache/mesos/blob/e00cceda4c31b71017cc9db860e3cf038bbf1d77/src/master/allocator/mesos/hierarchical.cpp#L668], but in the meantime I think that if a task is launched with the over-allocated revocable cpus, it's going to cause trouble on the agent (it doesn't look like this is caught by the current master validation). Perhaps you can change your test to verify task launch instead of reservation?

We can probably change the master validation to catch this, but in general I feel we should make sure that "all allocator operations should atomically maintain the consistency of its internal state". Relying on a follow-up operation to attempt to fix the inconsistent state is problematic, and it is hard to troubleshoot when it doesn't crash but rather messes with the allocator math in a subtle way. However, this is a design limitation of the current allocator API and harder to fix.
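
As an illustration of that principle only, here is one possible shape for such an operation. {{ToyLedger}} and its methods are hypothetical and say nothing about how the allocator API would actually have to change; the point is just that the call that shrinks the total also resolves the over-allocation before it returns.

```cpp
// Hypothetical ledger for a single scalar resource on one agent.
class ToyLedger {
 public:
  explicit ToyLedger(double total) : total_(total), allocated_(0.0) {}

  // Allocates `amount` or refuses; never leaves allocated_ > total_.
  bool allocate(double amount) {
    if (amount < 0.0 || allocated_ + amount > total_) {
      return false;
    }
    allocated_ += amount;
    return true;
  }

  // Sets a new total and, in the same call, trims the allocation to fit.
  // Returns the excess the caller must reclaim (assumed policy, not Mesos').
  double setTotal(double newTotal) {
    total_ = newTotal;
    double excess = allocated_ > total_ ? allocated_ - total_ : 0.0;
    allocated_ -= excess;
    return excess;
  }

  // The invariant every public operation preserves.
  bool consistent() const { return allocated_ <= total_; }

 private:
  double total_;
  double allocated_;
};
```

For example, after {{allocate(10)}} on a ledger with a total of 10, {{setTotal(8)}} returns 2 and the ledger is consistent as soon as the call returns, with no follow-up operation needed.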



> Oversubscription could crash the master due to CHECK failure in the allocator
> -----------------------------------------------------------------------------
>
>                 Key: MESOS-7639
>                 URL: https://issues.apache.org/jira/browse/MESOS-7639
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Yan Xu
>
> As I described in MESOS-7566, the following scenario is possible when the agent sends updated oversubscribed resources to the master:
> - The agent's {{UpdateSlaveMessage}} reduces the oversubscribed resources.
> - {{Master::updateSlave}} upon receiving the update would first call {{HierarchicalAllocatorProcess::updateSlave}}, followed by {{allocator->recoverResources}}.
> - {{HierarchicalAllocatorProcess::updateSlave}} would update {{roleSorter.total_}} to reduce the total, so the total could go below the allocation.
> - In the subsequent {{allocator->recoverResources}} call the attempt to remove outstanding allocation may fail to reduce it to below the total because some allocation may not be in outstanding offers. It could be in offered resources pending between {{Master::accept}} and {{Master::_accept}}. So the end result could still be {{total < allocation}}.
> - Then, when {{Master::_accept}} is executed, it will call {{allocator->updateAllocation}}, in which the {{total < allocation}} condition could trigger such a crash.
> The gist is that there are resources that are neither in the master's {{offers}} nor tracked in the allocator when {{Master::updateSlave}} is called.
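
The sequence above can be modelled with a small toy simulation. This is a sketch under simplifying assumptions, not the master or allocator code: {{ToyMaster}}, {{ToyAllocator}} and {{invariantHoldsAfterRace}} are hypothetical names, and a single scalar stands in for the revocable cpus.

```cpp
// Hypothetical allocator state for one scalar resource on one agent.
struct ToyAllocator {
  double total = 0.0;      // what the allocator believes the agent has
  double allocated = 0.0;  // what the allocator believes is allocated
};

struct ToyMaster {
  ToyAllocator allocator;
  double outstandingOffers = 0.0;  // offered, not yet accepted
  double inFlightAccepts = 0.0;    // accepted, but _accept has not run yet

  void offer(double cpus) {
    allocator.allocated += cpus;
    outstandingOffers += cpus;
  }

  // Models accept: the resources leave `offers` before _accept runs.
  void accept(double cpus) {
    outstandingOffers -= cpus;
    inFlightAccepts += cpus;
  }

  // Models updateSlave followed by recoverResources: the total shrinks,
  // but only resources still sitting in outstanding offers are recovered.
  void updateSlave(double newTotal) {
    allocator.total = newTotal;
    allocator.allocated -= outstandingOffers;
    outstandingOffers = 0.0;
  }

  bool allocationFitsTotal() const {
    return allocator.allocated <= allocator.total;
  }
};

// Returns whether allocation <= total holds by the time _accept would run.
bool invariantHoldsAfterRace(bool acceptBeforeUpdate) {
  ToyMaster master;
  master.allocator.total = 10.0;
  master.offer(10.0);
  if (acceptBeforeUpdate) {
    master.accept(10.0);
  }
  master.updateSlave(8.0);
  return master.allocationFitsTotal();
}
```

In this model {{invariantHoldsAfterRace(true)}} is false: the 10 cpus are in flight when the total drops to 8, so nothing is recovered and {{total < allocation}} survives until the later update. {{invariantHoldsAfterRace(false)}} is true, because the offer is still outstanding and gets recovered.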


