Posted to issues@mesos.apache.org by "Guangya Liu (JIRA)" <ji...@apache.org> on 2016/10/06 01:34:21 UTC

[jira] [Created] (MESOS-6317) Race in master update slave.

Guangya Liu created MESOS-6317:
----------------------------------

             Summary: Race in master update slave.
                 Key: MESOS-6317
                 URL: https://issues.apache.org/jira/browse/MESOS-6317
             Project: Mesos
          Issue Type: Bug
            Reporter: Guangya Liu
            Assignee: Guangya Liu


Currently, {{updateSlave}} in the master first rescinds offers and then calls updateSlave in the allocator. This is racy: a batch allocation can be inserted between the two steps. In that case, the order is rescind offers -> batch allocation -> update slave, which causes problems when the oversubscribed resources have decreased.

Suppose the oversubscribed resources decrease from 2 to 1. After the offers are rescinded, the batch allocation allocates the old 2 oversubscribed resources again, and only then does update slave set the total oversubscribed resources to 1. The agent host is then overcommitted for some time, because tasks can still use 2 oversubscribed resources rather than 1. Once the tasks using the 2 oversubscribed resources finish, everything returns to normal.
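
The sequence above can be replayed with a toy model. This is only a sketch, not Mesos code: {{ToyAllocator}} and its methods are hypothetical names, the allocator is reduced to two counters for a single agent, and every allocated resource is assumed to be an outstanding offer.

```python
class ToyAllocator:
    """Hypothetical stand-in for the allocator's view of one agent."""

    def __init__(self, total):
        self.total = total      # total oversubscribed resources on the agent
        self.allocated = 0      # resources currently offered to frameworks

    def batch_allocate(self):
        # Offer whatever is not allocated yet; never offer a negative amount.
        available = max(0, self.total - self.allocated)
        self.allocated += available
        return available

    def recover(self, amount):
        # Resources from a rescinded offer return to the allocator.
        self.allocated -= amount

    def update_total(self, total):
        self.total = total


def rescind_then_update(old_total, new_total):
    """Replay: rescind offers -> batch allocation -> update slave."""
    allocator = ToyAllocator(old_total)
    outstanding = allocator.batch_allocate()  # an offer for everything is out

    allocator.recover(outstanding)            # 1. master rescinds the offer
    allocator.batch_allocate()                # 2. batch allocation slips in
    allocator.update_total(new_total)         # 3. allocator learns new total

    return allocator.allocated, allocator.total


allocated, total = rescind_then_update(old_total=2, new_total=1)
print("allocated:", allocated, "total:", total)
```

In this model the run ends with more resources allocated than the agent's new total, which is the overcommit described above.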

So we should swap the order of rescinding offers and updateSlave in the master to avoid resource overcommit.

If we update the slave first and then rescind offers, the order is update slave -> batch allocation -> rescind offers. This order causes no problem when decreasing resources. Suppose the oversubscribed resources decrease from 2 to 1. Update slave sets the total oversubscribed resources to 1 directly; the batch allocation then allocates no oversubscribed resources, since more are already allocated than the new total; finally, rescinding offers rescinds all offers that use oversubscribed resources. The agent host is therefore not overcommitted.
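
The proposed order can be replayed the same way. Again this is only a sketch under assumptions: the function and variable names are hypothetical, the allocator is modelled as two counters for one agent, and all allocated resources are assumed to be outstanding offers (no running tasks).

```python
def update_then_rescind(old_total, new_total):
    """Replay: update slave -> batch allocation -> rescind offers."""
    total = old_total
    allocated = old_total          # an offer for all old resources is out
    outstanding = allocated

    # 1. Update slave: the allocator learns the new total first.
    total = new_total

    # 2. A batch allocation slips in. Nothing is available, because the
    #    outstanding offer already exceeds the new total.
    offered_in_batch = max(0, total - allocated)
    allocated += offered_in_batch

    # 3. Rescind offers: the outstanding resources return to the allocator.
    allocated -= outstanding

    # A later allocation can only hand out the new, smaller total.
    next_offer = max(0, total - allocated)
    return offered_in_batch, next_offer, total


offered_in_batch, next_offer, total = update_then_rescind(2, 1)
print("offered during race:", offered_in_batch)
print("next offer:", next_offer, "of total:", total)
```

Here the batch allocation that lands in the race window offers nothing, and the next offer is bounded by the reduced total.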



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)