Posted to dev@mesos.apache.org by "Vinod Kone (JIRA)" <ji...@apache.org> on 2013/09/09 19:55:55 UTC

[jira] [Created] (MESOS-682) Master should properly consolidate "slaves" and "deactivated" maps

Vinod Kone created MESOS-682:
--------------------------------

             Summary: Master should properly consolidate "slaves" and "deactivated" maps
                 Key: MESOS-682
                 URL: https://issues.apache.org/jira/browse/MESOS-682
             Project: Mesos
          Issue Type: Bug
    Affects Versions: 0.13.0, 0.14.0
            Reporter: Vinod Kone
            Assignee: Benjamin Mahler
             Fix For: 0.15.0


Currently, the master keeps track of active slaves in the "slaves" map and deactivated slaves in the "deactivated" map. While the former is indexed on SlaveID, the latter is indexed on pid. Because the two maps use different keys, they can disagree about the state of the same slave.
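
To illustrate the mismatch, here is a minimal toy model in Python. It is not Mesos code: the function names, the shortened ids, and the pid string are hypothetical, and the only thing taken from the description above is that "slaves" is keyed on SlaveID while "deactivated" is keyed on pid.

```python
# Toy model only; names and values are assumptions, not the Mesos master's API.
slaves = {}        # SlaveID -> pid, for active slaves
deactivated = {}   # pid -> SlaveID, for deactivated slaves

def register(slave_id, pid):
    """Record an active slave under its SlaveID."""
    slaves[slave_id] = pid

def deactivate(slave_id):
    """Move a slave from "slaves" to "deactivated", switching key type."""
    pid = slaves.pop(slave_id)
    deactivated[pid] = slave_id

def inconsistent_pids():
    """Pids that are marked deactivated while still owned by an active slave."""
    return set(deactivated) & set(slaves.values())

PID = "slave(1)@10.0.0.1:5051"   # hypothetical pid of one slave process

register("id-5186", PID)   # stale id, accepted as if it were a new slave
register("id-5194", PID)   # current id of the same process
deactivate("id-5186")      # the pid lands in "deactivated"...

print(sorted(inconsistent_pids()))   # ...while "id-5194" still maps to it
```

Nothing ties the two maps together, so one process registered under two ids can end up both active and deactivated.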

We have seen this in production at Twitter. 

The slave was given id 201308072143-2082809866-5050-35234-5186 at 16:35:59. After ~22 minutes the master removed the slave, presumably because of a network partition. The slave received a shutdown message and restarted at 17:08:01. It then registered with the master at 17:08:31 and got a new id 201308072143-2082809866-5050-35234-5193. But then it was immediately considered "disconnected" (not sure why) by the master and removed. When the slave came back up it got yet another id, 201308072143-2082809866-5050-35234-5194.

The surprising bit is that at 17:08:32 the master got another re-register message (probably backed up somewhere in the network?) from the same slave with the old id 201308072143-2082809866-5050-35234-5186. Since this id didn't exist in the master's slaves map, the master thought it was a new slave and added it. When the slave got the ack for this re-registration message it committed suicide (as expected) because the id it received was unexpected. The master then removed the slave with id 201308072143-2082809866-5050-35234-5186 from its slaves map based on the pid. Note that this choice was completely arbitrary, because the master could just as well have removed the slave with id 201308072143-2082809866-5050-35234-5194 from its map. This is because the master just loops through all entries in "slaves" and picks the first one that matches the pid.
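
The first-match lookup can be sketched as follows. This is a Python illustration under assumptions, not the master's actual code: remove_by_pid and the pid string are hypothetical, and Python dicts iterate in insertion order, which stands in here for whatever order the master's map happens to produce.

```python
def remove_by_pid(slaves, pid):
    """Remove and return the first SlaveID whose pid matches, or None."""
    match = next((sid for sid, p in slaves.items() if p == pid), None)
    if match is not None:
        del slaves[match]
    return match

PID = "slave(1)@10.0.0.1:5051"   # hypothetical pid shared by both entries

# Same two entries, inserted in opposite orders.
order_a = {"id-5186": PID, "id-5194": PID}
order_b = {"id-5194": PID, "id-5186": PID}

removed_a = remove_by_pid(order_a, PID)
removed_b = remove_by_pid(order_b, PID)
print(removed_a, removed_b)
```

With two entries sharing a pid, the entry that gets removed depends only on iteration order, which is why the outcome described above is arbitrary.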

At this point the slave's pid was added to "deactivated" but there exists a slave (201308072143-2082809866-5050-35234-5194) in the slaves map with the same pid!
When the master eventually received a status update from the slave, it crashed (as expected) because the message came from a slave whose pid was in "deactivated" while the slave was still present in "slaves".
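
One way to read the consolidation asked for in the summary is a single table keyed on SlaveID, with the active/deactivated state stored per entry and a pid index kept in step with it. The Python sketch below is only an assumption about what that could look like; it is not the fix that went into Mesos. SlaveRegistry and its methods are hypothetical, and rejecting a second id for a known pid is one possible policy among several.

```python
class SlaveRegistry:
    """Hypothetical single source of truth for slave state."""

    def __init__(self):
        self._slaves = {}   # SlaveID -> {"pid": str, "active": bool}
        self._by_pid = {}   # pid -> SlaveID, always consistent with _slaves

    def register(self, slave_id, pid):
        """Add a slave; refuse a second id for a pid that is already known."""
        owner = self._by_pid.get(pid)
        if owner is not None and owner != slave_id:
            raise ValueError(f"pid {pid} already registered as {owner}")
        self._slaves[slave_id] = {"pid": pid, "active": True}
        self._by_pid[pid] = slave_id

    def deactivate(self, slave_id):
        """Flip the state in place instead of moving between maps."""
        self._slaves[slave_id]["active"] = False

    def is_active(self, pid):
        """A pid is active only if its single owning entry is active."""
        owner = self._by_pid.get(pid)
        return owner is not None and self._slaves[owner]["active"]


PID = "slave(1)@10.0.0.1:5051"   # hypothetical pid

registry = SlaveRegistry()
registry.register("id-5194", PID)

try:
    registry.register("id-5186", PID)   # delayed re-registration, stale id
    stale_accepted = True
except ValueError:
    stale_accepted = False

active_before = registry.is_active(PID)
registry.deactivate("id-5194")
active_after = registry.is_active(PID)
print(stale_accepted, active_before, active_after)
```

Because each pid maps to at most one entry, a pid cannot be active and deactivated at the same time in this model.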
