Posted to dev@mesos.apache.org by "Benjamin Hindman (JIRA)" <ji...@apache.org> on 2014/03/06 00:12:43 UTC

[jira] [Commented] (MESOS-1058) Master CHECK failure: hierarchical_allocator_process.hpp:421 Check failed: !slaves.contains(slaveId)

    [ https://issues.apache.org/jira/browse/MESOS-1058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13921588#comment-13921588 ] 

Benjamin Hindman commented on MESOS-1058:
-----------------------------------------

Waiting for the registrar to fix the second issue SGTM.

> Master CHECK failure: hierarchical_allocator_process.hpp:421 Check failed: !slaves.contains(slaveId)
> ----------------------------------------------------------------------------------------------------
>
>                 Key: MESOS-1058
>                 URL: https://issues.apache.org/jira/browse/MESOS-1058
>             Project: Mesos
>          Issue Type: Bug
>          Components: master, slave
>    Affects Versions: 0.17.0, 0.16.0, 0.15.0, 0.18.0
>            Reporter: Benjamin Mahler
>            Assignee: Benjamin Mahler
>             Fix For: 0.19.0
>
>
> We've observed this CHECK failure in production when the following situation occurs:
> 1. Slave asks to Register with Master.
> 2. Master adds slave with ID 1 and sends acknowledgement.
> 3. Acknowledgement to the slave is dropped due to one-way partition.
> 4. Slave continues to retry.
> 5. Master detects spurious socket closure on slave, marks slave as disconnected.
> 6. Slave did not exit, re-detects Master, and asks to Register.
> 7. Master::registerSlave decides to remove "old disconnected slave".
> BUG: Master::removeSlave does not remove the old slave from the allocator!
> 8. Master::registerSlave adds slave with ID 2 and sends acknowledgement.
> 9. Slave receives ID 1 acknowledgement, and checkpoints.
> 10. Slave receives ID 2 acknowledgement, and exits from mismatch.
> 11. Slave recovers and attempts to re-register with checkpointed ID 1.
> 12. Master allows this (no Registrar yet), and attempts to add the slave to the allocator (because of BUG above, CHECK fails in the allocator).
> The first bug here is that the Master does not remove a slave from the allocator in Master::removeSlave if the slave is disconnected! This was likely a regression when Allocator::slaveDisconnected was introduced, and we neglected to make the necessary update to Master::removeSlave. This is an easy fix.
> The second bug is that the Slave's ID was inconsistent with the Master, and the slave exited, only to re-register with the inconsistent ID. If the above bug is fixed, this means we'll allow the slave to re-register in the Master after having told frameworks the slave is lost. I'm tempted to punt on this bug since with the Registrar, this situation would be prevented as the re-registration would be denied. Also, we already expose this edge-case slave inconsistency to frameworks in other situations without the Registrar.
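
For illustration only, the first bug in the quoted description can be replayed with a toy model. This is a minimal sketch, not Mesos code: ToyAllocator, ToyMaster, the `fixed` flag, and replay() are hypothetical names invented here, and the allocator returns false where the real one would hit the CHECK on !slaves.contains(slaveId).

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical, heavily simplified stand-ins; these are not the Mesos classes.
struct ToyAllocator {
  std::set<std::string> slaves;

  // Returns false where the real allocator would fail its CHECK
  // (the slave ID is already known to the allocator).
  bool slaveAdded(const std::string& id) { return slaves.insert(id).second; }

  void slaveRemoved(const std::string& id) { slaves.erase(id); }
};

struct ToyMaster {
  ToyAllocator allocator;
  std::set<std::string> disconnected;
  bool fixed;  // assumption: toggles the proposed fix on or off

  explicit ToyMaster(bool fixed_) : fixed(fixed_) {}

  bool addSlave(const std::string& id) { return allocator.slaveAdded(id); }

  void disconnect(const std::string& id) { disconnected.insert(id); }

  void removeSlave(const std::string& id) {
    const bool wasDisconnected = disconnected.erase(id) > 0;
    // Modelled bug: when the slave is disconnected, the unfixed master
    // skips removing it from the allocator.
    if (fixed || !wasDisconnected) {
      allocator.slaveRemoved(id);
    }
  }
};

// Replays the master-side steps of the report. Returns true if the final
// re-registration with ID 1 would pass the allocator's check.
bool replay(bool fixed) {
  ToyMaster master(fixed);
  bool ok = master.addSlave("1");   // step 2: master adds slave with ID 1
  master.disconnect("1");           // step 5: slave marked as disconnected
  master.removeSlave("1");          // step 7: old disconnected slave removed
  ok = master.addSlave("2") && ok;  // step 8: master adds slave with ID 2
  assert(ok);                       // steps 2 and 8 succeed in both variants
  return master.addSlave("1");      // steps 11-12: re-register with ID 1
}
```

In this sketch replay(false) returns false, which corresponds to the CHECK failure in step 12, and replay(true) returns true once removeSlave always informs the allocator. The second issue (re-registration after frameworks were told the slave is lost) is deliberately not modelled.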



--
This message was sent by Atlassian JIRA
(v6.2#6252)