Posted to issues@mesos.apache.org by "David Robinson (JIRA)" <ji...@apache.org> on 2016/05/10 04:29:12 UTC

[jira] [Issue Comment Deleted] (MESOS-5330) Agent should backoff before connecting to the master

     [ https://issues.apache.org/jira/browse/MESOS-5330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Robinson updated MESOS-5330:
----------------------------------
    Comment: was deleted

(was: https://reviews.apache.org/r/47080/)

> Agent should backoff before connecting to the master
> ----------------------------------------------------
>
>                 Key: MESOS-5330
>                 URL: https://issues.apache.org/jira/browse/MESOS-5330
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: David Robinson
>            Assignee: David Robinson
>
> When an agent starts, it launches a background task (libprocess process?) to detect the leading master. When the leading master is detected (or changes), the [SocketManager's link() method is called and a TCP connection to the master is established|https://github.com/apache/mesos/blob/a138e2246a30c4b5c9bc3f7069ad12204dcaffbc/src/slave/slave.cpp#L954]. The agent _then_ backs off before sending a ReRegisterSlave message over the newly established connection. The agent needs to back off _before_ attempting to establish a TCP connection to the master, not before sending the first message over the connection.
> During scale tests at Twitter, we discovered that agents can SYN flood the master upon leader changes. The problem described in MESOS-5200, where ephemeral connections are used, can then occur and exacerbate the flood. The end result is a lot of hosts setting up and tearing down TCP connections every slave_ping_timeout seconds (15 by default), connections failing to be established, and hosts being marked as unhealthy and shut down. We observed ~800 passive TCP connections per second on the leading master during scale tests.
> The problem can be somewhat mitigated by tuning the kernel to handle a thundering herd of TCP connections, but ideally there would not be a thundering herd to begin with.
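
The ordering requested in the description (back off first, then connect) can be illustrated with a minimal sketch. This is not Mesos code: the function names, the uniform random delay, and the 2-second default are all assumptions chosen for illustration, and the real agent is written in C++ on top of libprocess.

```python
import random
import socket
import time


def backoff_delay(max_backoff_secs, rng=random):
    """Pick a uniformly random delay in [0, max_backoff_secs).

    Hypothetical helper: spreading agents across the window keeps them
    from all opening connections at the same instant.
    """
    return rng.random() * max_backoff_secs


def connect_after_backoff(host, port, max_backoff_secs=2.0,
                          sleep=time.sleep,
                          connect=socket.create_connection):
    """Wait for a random delay, and only then open the TCP connection.

    The sleep and connect callables are injectable so the ordering can be
    exercised without real waiting or a real master (an assumption made
    for this sketch).
    """
    delay = backoff_delay(max_backoff_secs)
    sleep(delay)                    # back off BEFORE the SYN is sent
    return connect((host, port))    # the connection is established here
```

With N agents and a window of W seconds, the expected rate of new connections at the master right after a leader change is roughly N / W per second instead of a single burst, which is the effect the issue asks for.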



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)