
Posted to issues@mesos.apache.org by "Yan Xu (JIRA)" <ji...@apache.org> on 2017/06/01 04:58:04 UTC

[jira] [Commented] (MESOS-7566) Master crash due to failed check in DRFSorter::remove

    [ https://issues.apache.org/jira/browse/MESOS-7566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16032442#comment-16032442 ] 

Yan Xu commented on MESOS-7566:
-------------------------------

Certain scenarios do seem problematic to me, e.g.,

- The agent's {{UpdateSlaveMessage}} reduces the oversubscribed resources.
- {{Master::updateSlave}} upon receiving the update would first call {{HierarchicalAllocatorProcess::updateSlave}}, followed by {{allocator->recoverResources}}.
- {{HierarchicalAllocatorProcess::updateSlave}} would update {{roleSorter.total_}} to reduce the total, so the total could drop below the allocation.
- In the subsequent {{allocator->recoverResources}} call, removing the outstanding allocation may fail to bring the allocation back below the total, because some of the allocation may not be in outstanding offers: it could be in offered resources pending between {{Master::accept}} and {{Master::_accept}}. So the end result could still be {{total < allocation}}.
- When {{Master::_accept}} is later executed, it calls {{allocator->updateAllocation}}, in which the {{total < allocation}} condition could trigger this crash.
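
To make the ordering above concrete, here is a minimal toy model of the accounting. It is only a sketch: {{ToySorter}}, {{updateTotal}}, {{recoverOutstanding}} and {{invariantHoldsAtDeferredAccept}} are hypothetical names invented for illustration and are not the Mesos sorter or allocator API. The values 26 and 25 are borrowed from the log in the issue description; how the allocation splits between outstanding offers and in-flight accepts is an assumption.

```cpp
#include <cassert>

// Toy model of one role's revocable CPU accounting. All names are
// hypothetical; this is not the Mesos sorter or allocator code.
struct ToySorter {
  double total = 0;      // Resources the sorter believes exist.
  double allocation = 0; // Resources charged to the role.

  // The invariant the scenario above argues can be violated.
  bool consistent() const { return allocation <= total; }
};

// Step 1: the agent reports fewer oversubscribed resources.
void updateTotal(ToySorter& sorter, double newTotal) {
  sorter.total = newTotal;
}

// Step 2: recover resources from outstanding offers. Only resources still
// in outstanding offers can be recovered here; resources in an accept that
// has been received but not yet processed stay allocated.
void recoverOutstanding(ToySorter& sorter, double outstanding) {
  sorter.allocation -= outstanding;
}

// Runs steps 1 and 2 and reports whether allocation <= total still holds
// at the point where the deferred accept would update the allocation.
bool invariantHoldsAtDeferredAccept(
    double total, double newTotal, double outstanding, double inFlight) {
  ToySorter sorter;
  sorter.total = total;
  sorter.allocation = outstanding + inFlight;

  updateTotal(sorter, newTotal);
  recoverOutstanding(sorter, outstanding);

  return sorter.consistent();
}

// Two assumed splits of the same 26 CPUs, with the total reduced to 25.
void demo() {
  // Everything is in outstanding offers: recovery restores the invariant.
  assert(invariantHoldsAtDeferredAccept(26, 25, 26, 0));

  // Everything is in an in-flight accept: nothing can be recovered, so the
  // allocation (26) stays above the new total (25).
  assert(!invariantHoldsAtDeferredAccept(26, 25, 0, 26));
}
```

In this model the invariant breaks only when the in-flight portion alone exceeds the reduced total, which matches the narrow window between {{Master::accept}} and {{Master::_accept}} described above.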

The root issue indeed looks to be MESOS-4553.

/cc [~bmahler] [~mcypark]

> Master crash due to failed check in DRFSorter::remove
> -----------------------------------------------------
>
>                 Key: MESOS-7566
>                 URL: https://issues.apache.org/jira/browse/MESOS-7566
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 1.1.1, 1.1.2
>            Reporter: Zhitao Li
>            Priority: Critical
>
> A check in [sorter.cpp#L355 in 1.1.2 | https://github.com/apache/mesos/blob/1.1.2/src/master/allocator/sorter/drf/sorter.cpp#L355] is triggered occasionally in our cluster and crashes the master leader.
> I manually modified that check to print out the related variables, and the following is a master log.
> https://gist.github.com/zhitaoli/0662d9fe1f6d57de344951c05b536bad#file-gistfile1-txt
> From the log, it seems the check was using a stale value of {{26}} for revocable CPU while the new value had been updated to 25, thus the check failed.
> So far, the two verified occurrences of this bug were both observed near an {{UNRESERVE}} operation (see lines above in the log).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)