Posted to issues@mesos.apache.org by "Ian Downes (JIRA)" <ji...@apache.org> on 2015/12/08 00:42:11 UTC

[jira] [Created] (MESOS-4092) Try to re-establish connection on ping timeouts with agent before removing it

Ian Downes created MESOS-4092:
---------------------------------

             Summary: Try to re-establish connection on ping timeouts with agent before removing it
                 Key: MESOS-4092
                 URL: https://issues.apache.org/jira/browse/MESOS-4092
             Project: Mesos
          Issue Type: Improvement
          Components: master
    Affects Versions: 0.25.0
            Reporter: Ian Downes


The SlaveObserver triggers removal of an agent after {{flags.max_slave_ping_timeouts}} ping timeouts, each of length {{flags.slave_ping_timeout}}. This can happen because of transient network failures, e.g., gray failures of a switch uplink exhibiting heavy or total packet loss. Some network architectures are designed to tolerate such gray failures by supporting multiple paths between hosts. One way to implement this is equal-cost multi-path routing (ECMP), where flows are hashed by their 5-tuple onto one of several possible uplinks. In such networks, re-establishing a TCP connection will almost certainly use a new source port, so the new flow will likely be hashed to a different uplink, avoiding the failed uplink and restoring connectivity with the agent.
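
To illustrate why a new source port can move a flow onto a different uplink, here is a minimal sketch of 5-tuple hashing. It is not how any particular switch implements ECMP: the function name {{pick_uplink}}, the use of SHA-256, and the addresses and ports are all assumptions made for the example.

```python
import hashlib


def pick_uplink(src_ip, src_port, dst_ip, dst_port, protocol, num_uplinks):
    """Map a flow's 5-tuple to an uplink index (hypothetical hash scheme).

    Real switches use vendor-specific hash functions; SHA-256 is used here
    only because it is deterministic and spreads keys evenly.
    """
    key = f"{src_ip}|{src_port}|{dst_ip}|{dst_port}|{protocol}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % num_uplinks


if __name__ == "__main__":
    # Same endpoints, different ephemeral source ports (made-up values).
    # Each reconnect gets a new source port and may land on another uplink.
    for port in (40000, 40001, 40002, 40003):
        print(port, pick_uplink("10.0.0.1", port, "10.0.1.7", 5051, "tcp", 4))
```

Only the source port changes between the flows above, yet the chosen uplink index can differ. With N equally likely uplinks and one of them failed, a fresh connection would avoid the failed uplink with probability of roughly (N-1)/N under this simplified model.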

After failing to receive pongs, the SlaveObserver should first try to re-establish a TCP connection (with exponential back-off) before declaring the agent lost. This can avoid significant disruption in cases where large numbers of agents reached through a single failed link would otherwise be removed unnecessarily, while still ensuring that agents that are truly lost are recognized as such.
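
The proposed behaviour could be sketched roughly as follows. This is not Mesos code: the function {{try_reconnect}}, its parameters, and the default delays are hypothetical, and the real change would live in the master's C++ SlaveObserver. The sketch only shows the shape of "retry with exponential back-off, then give up".

```python
import socket
import time


def try_reconnect(host, port, attempts=5, initial_delay=0.5, max_delay=8.0,
                  connect_timeout=2.0, connect=socket.create_connection,
                  sleep=time.sleep):
    """Try to open a fresh TCP connection, backing off between attempts.

    Returns the connection on success, or None once all attempts fail,
    at which point the caller would declare the agent lost.
    `connect` and `sleep` are injectable so the logic can be tested
    without a network.
    """
    delay = initial_delay
    for attempt in range(attempts):
        try:
            # Each new connection normally gets a new ephemeral source port.
            return connect((host, port), timeout=connect_timeout)
        except OSError:
            if attempt < attempts - 1:
                sleep(delay)
                delay = min(delay * 2, max_delay)  # exponential back-off
    return None
```

A caller would invoke this only after the ping timeouts have been exhausted and would proceed with agent removal only if it returns None. Capping the delay keeps the total time before an agent is declared lost bounded.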



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)