Posted to issues@mesos.apache.org by "Greg Mann (JIRA)" <ji...@apache.org> on 2019/01/17 00:50:00 UTC

[jira] [Commented] (MESOS-9460) Speculative operations may make master and agent resource views out of sync.

    [ https://issues.apache.org/jira/browse/MESOS-9460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16744571#comment-16744571 ] 

Greg Mann commented on MESOS-9460:
----------------------------------

Moving this back to In Progress since I identified an issue with the patch that I need to address.

> Speculative operations may make master and agent resource views out of sync.
> ----------------------------------------------------------------------------
>
>                 Key: MESOS-9460
>                 URL: https://issues.apache.org/jira/browse/MESOS-9460
>             Project: Mesos
>          Issue Type: Bug
>          Components: agent, master
>    Affects Versions: 1.5.1, 1.6.1, 1.7.0
>            Reporter: Meng Zhu
>            Assignee: Greg Mann
>            Priority: Blocker
>              Labels: foundations
>
> When speculative operations (RESERVE, UNRESERVE, CREATE, DESTROY) are issued via the master operator API, the master updates the allocator state in {{Master::apply()}} and only later updates its internal state in {{Master::_apply()}}. Other updates to the allocator may be interleaved between these two continuations, leaving the master state out of sync with the allocator state.
> This bug could happen with the following sequence of events:
> - agent (re)registers with the master
> - multiple speculative operation calls are made to the master via the operator API
> - the allocator is speculatively updated in https://github.com/apache/mesos/blob/1d1af190b0eb674beecf20646d0b6ce082db4ed0/src/master/master.cpp#L11326
> - before the agent's resources are updated, the agent sends an `UpdateSlaveMessage` upon receiving the (re)registered message, if it has the `RESOURCE_PROVIDER` capability or oversubscription is used (https://github.com/apache/mesos/blob/3badf7179992e61f30f5a79da9d481dd451c7c2f/src/slave/slave.cpp#L1560-L1566 and https://github.com/apache/mesos/blob/3badf7179992e61f30f5a79da9d481dd451c7c2f/src/slave/slave.cpp#L1643-L1648)
> - as long as the first operation issued via the operator API has already been added to the {{Slave}} struct at this point, the master won't hit [this block here|https://github.com/apache/mesos/blob/1d1af190b0eb674beecf20646d0b6ce082db4ed0/src/master/master.cpp#L7940-L7945], and the `UpdateSlaveMessage` triggers the allocator to update the total resources with STALE info from the {{Slave}} struct [here|https://github.com/apache/mesos/blob/1d1af190b0eb674beecf20646d0b6ce082db4ed0/src/master/master.cpp#L8207]. Since the {{Slave}} struct has not yet been updated, the allocator update at that point uses stale resources from {{slave->totalResources}}, so the update from the previous operation is overwritten and LOST.
> - the agent finishes the operation and informs the master through `UpdateOperationStatusMessage`, but for speculative operations we do not update the allocator (https://github.com/apache/mesos/blob/3badf7179992e61f30f5a79da9d481dd451c7c2f/src/master/master.cpp#L11187-L11189)
> - the resource views of the master/agent state and the allocator state are now inconsistent
> This caused MESOS-7971 and likely MESOS-9458 as well. 
> To fix this issue, we should make sure that updates to the allocator state and the master state are performed in a single synchronous block of code.
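
The interleaving described above can be illustrated with a small toy model. This is only a sketch under simplifying assumptions: `ToyMaster`, `Allocator`, `SlaveView`, and every method name below are hypothetical stand-ins invented for illustration, not Mesos types or APIs, and resources are reduced to a map from role to a reserved CPU count.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for a resource set: role -> reserved cpus.
using Resources = std::map<std::string, int>;

struct Allocator { Resources total; };          // allocator's view of the agent
struct SlaveView { Resources totalResources; }; // master's per-agent record

// Toy model only; it does not mirror the real Master class.
struct ToyMaster {
  Allocator allocator;
  SlaveView slave;

  // Racy path, first continuation: only the allocator is updated.
  void applyToAllocator(const std::string& role, int cpus) {
    assert(cpus > 0);
    allocator.total[role] += cpus;
  }

  // Racy path, later continuation: the master's own record is updated.
  void applyToMasterState(const std::string& role, int cpus) {
    assert(cpus > 0);
    slave.totalResources[role] += cpus;
  }

  // Models the UpdateSlaveMessage handling: the allocator total is
  // overwritten with whatever the master currently has on record.
  void onUpdateSlaveMessage() { allocator.total = slave.totalResources; }

  // Proposed shape of the fix: both views change in one synchronous step.
  void applyAtomically(const std::string& role, int cpus) {
    assert(cpus > 0);
    allocator.total[role] += cpus;
    slave.totalResources[role] += cpus;
  }

  bool inSync() const { return allocator.total == slave.totalResources; }
};

// Allocator updated, stale overwrite arrives, master state updated last.
bool racySequenceInSync() {
  ToyMaster m;
  m.applyToAllocator("roleA", 2);   // speculative update of the allocator
  m.onUpdateSlaveMessage();         // overwrites allocator with stale totals
  m.applyToMasterState("roleA", 2); // master record catches up too late
  return m.inSync();
}

// Same events, but the operation is applied to both views at once.
bool fixedSequenceInSync() {
  ToyMaster m;
  m.applyAtomically("roleA", 2);
  m.onUpdateSlaveMessage();         // overwrite is now harmless
  return m.inSync();
}
```

In this model `racySequenceInSync()` returns false, because the reservation survives only in the master's record, while `fixedSequenceInSync()` returns true, because the overwrite copies totals that already include the operation.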


