Posted to issues@mesos.apache.org by "Benjamin Bannier (JIRA)" <ji...@apache.org> on 2019/03/18 08:17:00 UTC

[jira] [Commented] (MESOS-9313) Document speculative offer operation semantics for framework writers.

    [ https://issues.apache.org/jira/browse/MESOS-9313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16794823#comment-16794823 ] 

Benjamin Bannier commented on MESOS-9313:
-----------------------------------------

I would argue that since a framework _always_ needs to anticipate failures (e.g., temporary or permanent agent failures, or messages lost during failovers), how Mesos implements a particular offer operation internally should be inconsequential to the framework implementation. A speculation failure (e.g., an agent failing to checkpoint a speculatively applied {{RESERVE}} operation) should look no different to the framework than, e.g., the operation being lost in an agent failover while it is in flight. In both cases the framework would use the same approach to reconcile agent state.

Note that before the introduction of the non-speculative operations {{CREATE_DISK}}, {{CREATE_VOLUME}} (and their {{DESTROY}} counterparts), _all operations_ were applied speculatively.
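
To illustrate the point about a single recovery path, here is a minimal sketch of framework-side bookkeeping. It is not Mesos API: the {{Outcome}} values, the {{FrameworkView}} class and its methods are hypothetical names invented for this example, and the way a framework actually learns about an outcome (offers, operation feedback, reconciliation) is left out.

```python
from dataclasses import dataclass, field
from enum import Enum


class Outcome(Enum):
    """Hypothetical outcomes of an offer operation, as seen by a framework."""
    FINISHED = "finished"  # the agent checkpointed the operation
    FAILED = "failed"      # e.g. a speculation failure: checkpointing failed
    DROPPED = "dropped"    # e.g. the operation was lost in an agent failover


@dataclass
class FrameworkView:
    """Hypothetical per-agent bookkeeping kept by a framework."""
    confirmed: set = field(default_factory=set)  # reservations known to hold
    pending: set = field(default_factory=set)    # operations still in flight

    def send_reserve(self, resource_id):
        # Record the operation as in flight; nothing is assumed to hold yet.
        self.pending.add(resource_id)

    def on_outcome(self, resource_id, outcome):
        # The operation is no longer in flight, whatever happened to it.
        self.pending.discard(resource_id)
        if outcome is Outcome.FINISHED:
            self.confirmed.add(resource_id)
            return "proceed"
        # FAILED and DROPPED deliberately share one path: the framework
        # does not need to know why the operation did not take effect.
        return "reconcile"
```

For example, {{on_outcome("cpus-1", Outcome.FAILED)}} and {{on_outcome("cpus-1", Outcome.DROPPED)}} both return {{"reconcile"}}, so the state machine needs no branch that depends on whether the operation was applied speculatively.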

> Document speculative offer operation semantics for framework writers.
> ---------------------------------------------------------------------
>
>                 Key: MESOS-9313
>                 URL: https://issues.apache.org/jira/browse/MESOS-9313
>             Project: Mesos
>          Issue Type: Documentation
>          Components: documentation
>            Reporter: James DeFelice
>            Priority: Major
>              Labels: mesosphere, operation-feedback, operations
>
> It recently came to my attention that a subset of offer operations (e.g. RESERVE, UNRESERVE, et al.) are implemented speculatively within the Mesos master, meaning that the master applies the resource conversion internally **before** the conversion is checkpointed on the agent. The master may then re-offer the converted resource to a framework even though the agent may not yet have checkpointed the conversion. If checkpointing on the agent fails, subsequent operations issued for the falsely offered resource will fail, because the master essentially "lied" to the framework about the true state of the supposedly converted resource.
> It's also been explained to me that this case is expected to be rare. However, it *can* impact the design/implementation of framework state machines, so it's critical that this information be documented clearly, outside of the C++ code base.


