Posted to issues@mesos.apache.org by "Dmitry Zhuk (JIRA)" <ji...@apache.org> on 2017/06/22 12:10:00 UTC

[jira] [Commented] (MESOS-7688) Improve master failover performance by reducing unnecessary agent retries.

    [ https://issues.apache.org/jira/browse/MESOS-7688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16059249#comment-16059249 ] 

Dmitry Zhuk commented on MESOS-7688:
------------------------------------

[~bmahler], dropping messages during recovery is not a big deal actually... They are processed pretty fast and have no significant impact on what happens next (they may even make things easier for the master, since backoff is increased on these agents).

See [^1.2.0.png] for some details of what happens after recovery. The chart is based on counting messages in the logs of a 1.2.0 master. I expect it to be even worse on HEAD, due to the extra continuation added here: https://github.com/apache/mesos/commit/29fc2dfcb110a51923d4d7c144bdd797b348f96b#diff-28ebad5255c4c1a70b4abf35651198fdR5430
00:00 - time when the first re-registration request is seen by the master.
*reregistering* ("Re-registering agent ..." in log) - total number of re-registrations the master started. There are two kinds: the initial re-registration, and extra re-registrations caused by an agent sending several requests, with the master processing the later ones after the initial re-registration had already completed.
*ignoring* ("Ignoring re-register agent message from agent ...") - total number of re-registrations ignored because a re-registration for that agent was already in progress. Some of these are caused by an oversight in the backoff algorithm: once an agent has sent a re-registration request, it chooses a random delay between 0 and the current backoff, which is close to 0 for a lot of agents in a large cluster. See MESOS-7087 and the sketch after this list.
*reregistered* ("Re-registered agent ...") - total number of agents reregistered (includes only initial reregistrations).
*sending* ("Sending updated checkpointed resources ...") - total number of replies sent to re-registration requests. This should eventually match *reregistering*.
*update* ("Received update of agent ...") - the agent sends this once it has received the re-registration confirmation.
*applied* ("Applied ... operations in ...") - total number of operations applied to the registry. This stops growing after some time, as no new agents are being registered.
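To illustrate the backoff oversight from the *ignoring* item, here is a minimal standalone sketch (not the actual agent code; the constants are made up): the backoff ceiling doubles on every attempt, but the actual delay is drawn uniformly from [0, ceiling), so in a large cluster many agents keep retrying almost immediately.
{code}
#include <algorithm>
#include <iostream>
#include <random>

int main()
{
  std::mt19937 rng{std::random_device{}()};

  // Hypothetical values; the real agent flags and constants differ.
  double ceiling = 2.0;            // Seconds; current backoff ceiling.
  const double maxCeiling = 60.0;  // Seconds; cap on the ceiling.

  for (int attempt = 1; attempt <= 6; ++attempt) {
    // The delay is uniform in [0, ceiling): even after several attempts,
    // a large fraction of agents still draw a delay close to zero, so the
    // recovering master keeps receiving duplicate re-registration requests.
    std::uniform_real_distribution<double> delay(0.0, ceiling);
    std::cout << "attempt " << attempt << ": delay " << delay(rng)
              << "s (ceiling " << ceiling << "s)" << std::endl;

    ceiling = std::min(ceiling * 2, maxCeiling);
  }

  return 0;
}
{code}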

Almost all of this time the master's libprocess queue is clogged with {{MessageEvent}} and {{DispatchEvent}} (tens of thousands of events in the queue).

So why is this happening?
{{Master::reregisterSlave}} and {{Master::_reregisterSlave}} are processed slower than re-registrations are received.
One optimization is here: https://reviews.apache.org/r/60003/. To use it to full effect, it also requires adding some std::moves and some tricks to move protobuf objects. I think it can be improved to handle moves of protobufs transparently.
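For illustration, a minimal sketch of the kind of trick meant here (not the actual code from the review; {{ReregistrationContext}} and {{processReregistration}} are made-up names): with protobuf versions whose generated messages are not move-constructible, {{Swap()}} gives move-like behaviour instead of a deep copy of the message.
{code}
#include <utility>

#include <mesos/mesos.pb.h>  // For mesos::SlaveInfo, as a representative message.

struct ReregistrationContext
{
  mesos::SlaveInfo slaveInfo;
};

void processReregistration(ReregistrationContext&& context);

void handle(mesos::SlaveInfo& incoming)
{
  ReregistrationContext context;

  // Instead of `context.slaveInfo = incoming;` (a deep copy), steal the
  // contents of the incoming message in O(1).
  context.slaveInfo.Swap(&incoming);

  processReregistration(std::move(context));
}
{code}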
Another optimization was discussed in MESOS-6972.
Other messages processed by the master during this time (such as updates from agents, or sending offers) do not have a major impact. I tried prioritising events in the master queue (those related to re-registration were processed first), and it somewhat improved things, but the low-priority queue was drained within 1-3 seconds once all agents had re-registered.
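For reference, a minimal sketch of that prioritisation experiment, assuming a hypothetical two-level queue in front of the master actor (this is not the libprocess run queue itself):
{code}
#include <deque>
#include <functional>
#include <utility>

class PrioritizedQueue
{
public:
  void enqueue(std::function<void()> event, bool reregistrationRelated)
  {
    (reregistrationRelated ? high_ : low_).push_back(std::move(event));
  }

  // Runs one event, preferring re-registration related ones.
  // Returns false when both queues are empty.
  bool dequeueAndRun()
  {
    std::deque<std::function<void()>>& queue = high_.empty() ? low_ : high_;
    if (queue.empty()) {
      return false;
    }

    std::function<void()> event = std::move(queue.front());
    queue.pop_front();
    event();
    return true;
  }

private:
  std::deque<std::function<void()>> high_;  // Re-registration related events.
  std::deque<std::function<void()>> low_;   // Offers, status updates, etc.
};
{code}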
There are also some minor things, like converting protobuf {{Resource}} lists to {{Resources}} multiple times, or the unjustified use of {{Owned}} in {{Framework::completedTasks}}, but these are really micro-optimizations.
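For the {{Resources}} point, a minimal sketch of the idea ({{checkResources}} is a hypothetical helper, not actual Mesos code): construct the {{Resources}} object once per message and reuse it, rather than converting the same {{Resource}} list at every use site, since each conversion walks the whole list.
{code}
#include <google/protobuf/repeated_field.h>

#include <mesos/resources.hpp>

using mesos::Resource;
using mesos::Resources;

void checkResources(const google::protobuf::RepeatedPtrField<Resource>& list)
{
  // Convert once; each construction of `Resources` walks the whole list.
  const Resources resources(list);

  // ... reuse `resources` for all subsequent operations (contains(),
  // filtering, arithmetic) instead of converting `list` again.
}
{code}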

> Improve master failover performance by reducing unnecessary agent retries.
> --------------------------------------------------------------------------
>
>                 Key: MESOS-7688
>                 URL: https://issues.apache.org/jira/browse/MESOS-7688
>             Project: Mesos
>          Issue Type: Improvement
>          Components: agent, master
>            Reporter: Benjamin Mahler
>              Labels: scalability
>         Attachments: 1.2.0.png
>
>
> Currently, during a failover the agents will (re-)register with the master. While the master is recovering, the master may drop messages from the agents, and so the agents must retry registration using a backoff mechanism. For large clusters, there can be a lot of overhead in processing unnecessary retries from the agents, given that these messages must be deserialized and contain all of the task / executor information many times over.
> In order to reduce this overhead, the idea is to avoid the need for agents to blindly retry (re-)registration with the master. Two approaches for this are:
> (1) Update the MasterInfo in ZK when the master is recovered. This is a bit of an abuse of MasterInfo unfortunately, but the idea is for agents to only (re-)register when they see that the master reaches a recovered state. Once recovered, the master will not drop messages, and therefore agents only need to retry when the connection breaks.
> (2) Have the master reply with a retry message when it's in the recovering state, so that agents get a clear signal that their messages were dropped. The agents only retry when the connection breaks or they get a retry message. This one is less optimal, because the master may have to process a lot of messages and send retries, but once the master is recovered, the master will process only a single (re-)registration from each agent. The number of (re-)registrations that occur while the master is recovering can be reduced to 1 in this approach if the master sends the retry message only after the master completes recovery.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)