Posted to issues@mesos.apache.org by "Benjamin Mahler (JIRA)" <ji...@apache.org> on 2017/06/16 23:23:00 UTC
[jira] [Updated] (MESOS-7688) Improve master failover performance by reducing unnecessary agent retries.
[ https://issues.apache.org/jira/browse/MESOS-7688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Benjamin Mahler updated MESOS-7688:
-----------------------------------
Description:
Currently, during a failover, the agents (re-)register with the master. While the master is recovering, it may drop messages from the agents, so the agents must retry registration using a backoff mechanism. For large clusters, processing these unnecessary retries can add a lot of overhead: every retried message must be deserialized, and each one contains all of the agent's task / executor information, so the master handles the same data many times over.
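To get a feel for the overhead described above, here is a minimal sketch that counts how many registration attempts a single agent sends, and the master drops, before recovery completes. The doubling backoff, the 1 second initial delay and the 60 second cap are assumptions chosen for illustration; they are not the actual Mesos agent settings, and the function name is hypothetical.

```python
def dropped_attempts(recovery_secs: float,
                     initial_backoff_secs: float = 1.0,
                     factor: float = 2.0,
                     max_backoff_secs: float = 60.0) -> int:
    """Count the attempts one agent sends while the master is still recovering.

    All backoff parameters are illustrative assumptions, not Mesos defaults.
    """
    attempts = 0
    elapsed = 0.0
    backoff = initial_backoff_secs
    while elapsed < recovery_secs:
        attempts += 1  # sent at time `elapsed`; the recovering master drops it
        elapsed += backoff
        backoff = min(backoff * factor, max_backoff_secs)
    return attempts


if __name__ == "__main__":
    per_agent = dropped_attempts(120.0)
    print(per_agent, per_agent * 10_000)
```

Under these assumed settings, a 120 second recovery costs 7 dropped attempts per agent, so a 10,000 agent cluster would have the master deserialize 70,000 messages that it then discards.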
In order to reduce this overhead, the idea is to avoid the need for agents to blindly retry (re-)registration with the master. Two approaches for this are:
(1) Update the MasterInfo in ZooKeeper (ZK) once the master has recovered. This is unfortunately a bit of an abuse of MasterInfo, but the idea is for agents to (re-)register only when they see that the master has reached a recovered state. Once recovered, the master will not drop messages, so agents only need to retry when the connection breaks.
(2) Have the master reply with a retry message while it is in the recovering state, so that agents get a clear signal that their messages were dropped. Agents then retry only when the connection breaks or when they receive a retry message. This approach is less efficient than (1), because the master may still have to process a lot of messages and send retries while recovering; once it has recovered, however, it processes only a single (re-)registration from each agent. The number of (re-)registrations per agent that occur while the master is recovering can be reduced to 1 if the master sends the retry message only after it completes recovery.
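The two approaches can be compared with a toy model that counts how many registration messages the master has to process. This is only a sketch: the `Master` class, its methods and both helper functions are hypothetical names invented for illustration and do not correspond to Mesos code. Approach (2) is modelled in its deferred variant, where the retry message is sent only after recovery completes.

```python
from dataclasses import dataclass, field


@dataclass
class Master:
    recovered: bool = False
    processed: int = 0  # registration messages the master had to deserialize
    waiting: set = field(default_factory=set)  # agents owed a retry message

    def handle_register(self, agent_id: int) -> bool:
        """Process one (re-)registration; return True if it was accepted."""
        self.processed += 1
        if self.recovered:
            return True
        self.waiting.add(agent_id)  # dropped for now; reply after recovery
        return False

    def finish_recovery(self) -> list:
        """Enter the recovered state; return the agents to send a retry message to."""
        self.recovered = True
        owed, self.waiting = sorted(self.waiting), set()
        return owed


def approach_1(num_agents: int) -> int:
    """Agents wait for the recovered state to be published, then register once."""
    master = Master()
    master.finish_recovery()  # stands in for the MasterInfo update in ZK
    for agent_id in range(num_agents):
        master.handle_register(agent_id)
    return master.processed


def approach_2_deferred(num_agents: int) -> tuple:
    """Agents register during recovery and retry only on a retry message."""
    master = Master()
    for agent_id in range(num_agents):  # first attempt arrives during recovery
        master.handle_register(agent_id)
    during_recovery = master.processed
    for agent_id in master.finish_recovery():  # retry sent only after recovery
        master.handle_register(agent_id)
    return during_recovery, master.processed - during_recovery
```

In this model, `approach_1(1000)` has the master process 1000 messages in total, while `approach_2_deferred(1000)` returns `(1000, 1000)`: one message per agent while recovering and one more after recovery. Both are bounded per agent, unlike blind retries with backoff.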
was:
Currently, during a failover the agents will (re-)register with the master. While the master is recovering, the master may drop messages from the agents, and so the agents must retry registration using a backoff mechanism. For large clusters, there can be a lot of overhead in processing unnecessary retries from the agents, given that these messages must be deserialized and contain all of the task / executor information many times over.
In order to reduce this overhead, the idea is to avoid the need for agents to blindly retry (re-)registration with the master. Two approaches for this are:
(1) Update the MasterInfo in ZK when the master is recovered. This is a bit of an abuse of MasterInfo unfortunately, but the idea is for agents to only (re-)register when they see that the master reaches a recovered state. Once recovered, the master will not drop messages, and therefore agents only need to retry when the connection breaks.
(2) Have the master reply with a retry message when it's in the recovering state, so that agents get a clear signal that their messages were dropped. This one is less optimal, because the master may have to process a lot of messages and send retries, but once the master is recovered, the master will process only a single (re-)registration from each agent. Here, agents only retry when the connection breaks or they get a retry message.
> Improve master failover performance by reducing unnecessary agent retries.
> --------------------------------------------------------------------------
>
> Key: MESOS-7688
> URL: https://issues.apache.org/jira/browse/MESOS-7688
> Project: Mesos
> Issue Type: Improvement
> Components: agent, master
> Reporter: Benjamin Mahler
> Labels: scalability
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)