Posted to dev@mesos.apache.org by "Dominic Hamon (JIRA)" <ji...@apache.org> on 2014/06/23 01:42:24 UTC
[jira] [Created] (MESOS-1529) Handle a network partition between Master and Slave
Dominic Hamon created MESOS-1529:
------------------------------------
Summary: Handle a network partition between Master and Slave
Key: MESOS-1529
URL: https://issues.apache.org/jira/browse/MESOS-1529
Project: Mesos
Issue Type: Bug
Reporter: Dominic Hamon
If a network partition occurs between a Master and a Slave, the Master will remove the Slave (because it fails the health check) and mark the tasks running there as LOST. However, the Slave is not aware that it has been removed, so the tasks will continue to run.
There are at least two possible approaches to solving this issue:
1. Introduce a health check from Slave to Master so that both sides have a consistent view of a network partition. The issue could still occur if the connection fails in only one direction.
2. Be less aggressive about marking tasks and Slaves as lost. Instead, wait until the Slave reappears and reconcile then. We'd still need to mark Slaves and tasks as potentially lost (a zombie state), but the Scheduler might then be able to make a more intelligent decision.
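
To make the two approaches concrete, here is a rough sketch in Python. It is not Mesos code: the class names, the UNREACHABLE state (standing in for the "zombie state" above), the timeout values, and the reconcile() method are all hypothetical, and time is passed in explicitly so the behaviour is easy to follow.

```python
from dataclasses import dataclass, field
from enum import Enum


class SlaveState(Enum):
    ACTIVE = "active"
    UNREACHABLE = "unreachable"  # hypothetical "zombie" state from approach 2
    LOST = "lost"


@dataclass
class SlaveRecord:
    last_pong: float                         # time of the last health-check reply
    tasks: set = field(default_factory=set)  # tasks the master believes are running
    state: SlaveState = SlaveState.ACTIVE


class Master:
    """Approach 2: degrade a silent slave in two steps instead of removing it."""

    def __init__(self, unreachable_after=10.0, lost_after=100.0):
        # Illustrative timeouts in seconds, not Mesos defaults.
        self.unreachable_after = unreachable_after
        self.lost_after = lost_after
        self.slaves = {}

    def register(self, slave_id, tasks, now):
        self.slaves[slave_id] = SlaveRecord(last_pong=now, tasks=set(tasks))

    def check(self, now):
        """Update each slave's state from how long it has been silent."""
        for rec in self.slaves.values():
            silent = now - rec.last_pong
            if silent >= self.lost_after:
                rec.state = SlaveState.LOST
            elif silent >= self.unreachable_after:
                rec.state = SlaveState.UNREACHABLE

    def reconcile(self, slave_id, running_tasks, now):
        """The slave reappeared: compare its task list with the master's view.

        Returns (tasks still running, tasks that vanished during the partition).
        """
        rec = self.slaves[slave_id]
        running = set(running_tasks)
        still_running = rec.tasks & running
        vanished = rec.tasks - running
        rec.tasks = still_running
        rec.last_pong = now
        rec.state = SlaveState.ACTIVE
        return still_running, vanished


class Slave:
    """Approach 1: the slave tracks master pings so it can notice a partition."""

    def __init__(self, now, master_timeout=10.0):
        self.last_ping = now
        self.master_timeout = master_timeout

    def on_master_ping(self, now):
        self.last_ping = now

    def suspects_partition(self, now):
        return now - self.last_ping >= self.master_timeout
```

With these assumptions, a slave that goes silent is first marked UNREACHABLE, and its tasks are only declared LOST once the longer timeout passes. If the slave comes back earlier, reconcile() reports which tasks survived, which is the point where the Scheduler could decide what to do. As noted in approach 1, the slave-side check does not help when only one direction of the connection fails, since the slave would keep receiving pings.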
--
This message was sent by Atlassian JIRA
(v6.2#6252)