Posted to issues@mesos.apache.org by "Ilya Pronin (JIRA)" <ji...@apache.org> on 2017/06/27 19:44:00 UTC
[jira] [Commented] (MESOS-4092) Try to re-establish connection on ping timeouts with agent before removing it

    [ https://issues.apache.org/jira/browse/MESOS-4092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16065357#comment-16065357 ] 

Ilya Pronin commented on MESOS-4092:
------------------------------------

It looks like the problem here is that we use the same health check to detect both remote-peer failure and link failure, without distinguishing between them. When a connection breaks, libprocess issues an {{ExitedEvent}} and opens a new connection when one is required. But in the case of a network problem, a relatively long time may pass before the TCP retransmission limit is reached and the connection is declared dead.

One possible solution is to use the aforementioned "relink" functionality at some point during agent pinging. We could use a strategy similar to the one used by TCP: after N consecutive failed pings, "relink" before sending the next ping. The agent's side would do something similar.
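
To make the "relink after N failed pings" idea concrete, here is a minimal sketch of the counting logic only. It is not the actual SlaveObserver code: {{PingTracker}}, {{Action}}, and both thresholds are hypothetical names, the relink itself is left to the caller, and relinking again on every further multiple of N is an assumption that goes beyond what is proposed above.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical: what the observer should do after a ping timeout.
enum class Action { PING, RELINK_THEN_PING, REMOVE_AGENT };

// Hypothetical helper that counts consecutive ping timeouts.
class PingTracker {
public:
  PingTracker(std::size_t relinkAfter, std::size_t maxTimeouts)
    : relinkAfter_(relinkAfter), maxTimeouts_(maxTimeouts), timeouts_(0) {
    assert(relinkAfter > 0 && maxTimeouts > 0);
  }

  // A pong arrived: the peer and the link are both fine.
  void onPong() { timeouts_ = 0; }

  // A ping timed out: decide what to do before the next ping.
  Action onTimeout() {
    ++timeouts_;
    if (timeouts_ >= maxTimeouts_) {
      return Action::REMOVE_AGENT;
    }
    // Assumption: relink on every N-th consecutive timeout.
    if (timeouts_ % relinkAfter_ == 0) {
      return Action::RELINK_THEN_PING;
    }
    return Action::PING;
  }

private:
  std::size_t relinkAfter_;
  std::size_t maxTimeouts_;
  std::size_t timeouts_;
};
```

With {{PingTracker(2, 5)}}, the second and fourth consecutive timeouts ask for a relink and the fifth asks for removal; any pong in between resets the count.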

Another possible solution is to use the TCP keepalive mechanism, tuned to detect broken connections faster than {{agent_ping_timeout * max_agent_ping_timeouts}}. Or we could mess with the TCP user timeout, but IMO it's a road to hell, and AFAIK the user timeout is available only on Linux.
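
As a rough illustration of the keepalive option, the sketch below sets the per-socket keepalive knobs and computes the resulting detection time. It assumes Linux ({{TCP_KEEPIDLE}}, {{TCP_KEEPINTVL}}, {{TCP_KEEPCNT}} from {{netinet/tcp.h}}); {{KeepaliveConfig}}, {{detectionSecs}}, and {{enableKeepalive}} are hypothetical names, and nothing here reflects how libprocess actually configures its sockets.

```cpp
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

// Hypothetical bundle of keepalive settings, all in seconds except probes.
struct KeepaliveConfig {
  int idleSecs;      // idle time before the first probe
  int intervalSecs;  // time between unanswered probes
  int probes;        // unanswered probes before the connection is dropped
};

// Approximate time for an idle connection to be declared dead.
inline int detectionSecs(const KeepaliveConfig& c) {
  return c.idleSecs + c.intervalSecs * c.probes;
}

// Applies the settings to a TCP socket; returns false if any call fails.
inline bool enableKeepalive(int fd, const KeepaliveConfig& c) {
  int on = 1;
  return setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) == 0 &&
         setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE,
                    &c.idleSecs, sizeof(c.idleSecs)) == 0 &&
         setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL,
                    &c.intervalSecs, sizeof(c.intervalSecs)) == 0 &&
         setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT,
                    &c.probes, sizeof(c.probes)) == 0;
}
```

For example, {{KeepaliveConfig{10, 5, 3}}} gives about 25 seconds, which would be shorter than a 15-second ping timeout multiplied by 5 allowed timeouts (example values assumed here). The estimate applies to an otherwise idle connection, so treat it as approximate.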

> Try to re-establish connection on ping timeouts with agent before removing it
> -----------------------------------------------------------------------------
>
>                 Key: MESOS-4092
>                 URL: https://issues.apache.org/jira/browse/MESOS-4092
>             Project: Mesos
>          Issue Type: Improvement
>          Components: master
>    Affects Versions: 0.25.0
>            Reporter: Ian Downes
>
> The SlaveObserver will trigger an agent to be removed after {{flags.max_slave_ping_timeouts}} timeouts of {{flags.slave_ping_timeout}}. This can occur because of transient network failures, e.g., gray failures of a switch uplink exhibiting heavy or total packet loss. Some network architectures are designed to tolerate such gray failures and support multiple paths between hosts. This can be implemented with equal-cost multi-path routing (ECMP) where flows are hashed by their 5-tuple to multiple possible uplinks. In such networks re-establishing a TCP connection will almost certainly use a new source port and thus will likely be hashed to a different uplink, avoiding the failed uplink and re-establishing connectivity with the agent.
> After failing to receive pongs the SlaveObserver should next try to re-establish a TCP connection (with exponential back-off) before declaring the agent as lost. This can avoid significant disruption where large numbers of agents reached through a single failed link could be removed unnecessarily while still ensuring that agents that are truly lost are recognized as such.


