Posted to issues@mesos.apache.org by "Jie Yu (JIRA)" <ji...@apache.org> on 2017/10/29 15:17:00 UTC

[jira] [Commented] (MESOS-8058) Agent and master can race when updating agent state

    [ https://issues.apache.org/jira/browse/MESOS-8058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16224035#comment-16224035 ] 

Jie Yu commented on MESOS-8058:
-------------------------------

commit 33d1ff1798f8cbf83b4e5f7bc79dbf8e231dff1f
Author: Jie Yu <yu...@gmail.com>
Date:   Tue Oct 10 20:17:53 2017 -0700

    Stopped sending checkpoint resources message on agent re-registration.

    Given that resource-provider-capable agents send an update slave
    message to the master during re-registration, there is no need for
    the master to send a checkpoint resources message to the agent anymore.

    This also makes the code more consistent, because the agent should be
    the source of truth. It also eliminates the possible retry incurred
    by this message, which was never the intention.

    Review: https://reviews.apache.org/r/62879

> Agent and master can race when updating agent state
> ---------------------------------------------------
>
>                 Key: MESOS-8058
>                 URL: https://issues.apache.org/jira/browse/MESOS-8058
>             Project: Mesos
>          Issue Type: Bug
>          Components: agent
>    Affects Versions: 1.5.0
>            Reporter: Benjamin Bannier
>            Assignee: Benjamin Bannier
>            Priority: Critical
>              Labels: mesosphere
>             Fix For: 1.5.0
>
>
> In {{2af9a5b07dc80151154264e974d03f56a1c25838}} we introduced the use of {{UpdateSlaveMessage}} for the agent to inform the master about its current total resources. Currently we trigger this message only on agent registration and re-registration.
> This can race with operations that are applied in the master and communicated to the agent via {{CheckpointResourcesMessage}}.
> Example:
> 1. Agent ({{cpus:4(\*)}}) registers.
> 2. Master is triggered to apply an operation to the agent's resources, e.g., a reservation: {{cpus:4(\*) -> cpus:4(A)}}. The master applies the operation to its current view of the agent's resources and sends the agent a {{CheckpointResourcesMessage}} so the agent can persist the result.
> 3. The agent sends the master an {{UpdateSlaveMessage}}, e.g., {{cpus:4(\*)}} since it hasn't received the {{CheckpointResourcesMessage}} yet.
> 4. The master processes the {{UpdateSlaveMessage}} and updates its view of the agent's resources to be {{cpus:4(\*)}}.
> 5. The agent processes the {{CheckpointResourcesMessage}} and updates its view of its resources to be {{cpus:4(A)}}.
> 6. The agent and the master have an inconsistent view of the agent's resources.
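
The interleaving in steps 1-6 above can be replayed with a small, self-contained simulation. This is only a sketch under simplifying assumptions: resources are modeled as plain strings, the two messages are modeled as return values held "in flight", and the `Agent` and `Master` classes and all of their method names are hypothetical stand-ins, not the actual Mesos implementation.

```python
from dataclasses import dataclass


@dataclass
class Agent:
    """Hypothetical stand-in for the agent; resources are a plain string."""
    resources: str

    def make_update_slave_message(self) -> str:
        # Snapshot of the agent's current total resources.
        return self.resources

    def handle_checkpoint_resources(self, resources: str) -> None:
        # The agent persists whatever the master told it to checkpoint.
        self.resources = resources


@dataclass
class Master:
    """Hypothetical stand-in for the master's view of a single agent."""
    agent_resources: str = ""

    def handle_register(self, resources: str) -> None:
        self.agent_resources = resources

    def apply_reservation(self, role: str) -> str:
        # Apply the operation to the master's current view and return the
        # payload of the checkpoint message that would be sent to the agent.
        self.agent_resources = self.agent_resources.replace("(*)", f"({role})")
        return self.agent_resources

    def handle_update_slave(self, resources: str) -> None:
        # The master overwrites its view with what the agent reported.
        self.agent_resources = resources


def simulate_race() -> tuple[str, str]:
    agent = Agent("cpus:4(*)")
    master = Master()

    master.handle_register(agent.resources)        # step 1
    checkpoint = master.apply_reservation("A")     # step 2 (message in flight)
    update = agent.make_update_slave_message()     # step 3 (sent before the
                                                   # checkpoint message arrives)
    master.handle_update_slave(update)             # step 4
    agent.handle_checkpoint_resources(checkpoint)  # step 5

    # Step 6: return (master's view, agent's view) for comparison.
    return master.agent_resources, agent.resources


if __name__ == "__main__":
    master_view, agent_view = simulate_race()
    print("master:", master_view)
    print("agent: ", agent_view)
```

In this model the master ends with the unreserved view while the agent ends with the reserved one, because each side applies the other's message after having already produced its own.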


