Posted to issues@mesos.apache.org by "Yan Xu (JIRA)" <ji...@apache.org> on 2017/06/22 15:59:00 UTC

[jira] [Created] (MESOS-7710) Mesos agent registration retry backoff window always has a zero lower-bound

Yan Xu created MESOS-7710:
-----------------------------

             Summary: Mesos agent registration retry backoff window always has a zero lower-bound
                 Key: MESOS-7710
                 URL: https://issues.apache.org/jira/browse/MESOS-7710
             Project: Mesos
          Issue Type: Bug
          Components: agent, master
            Reporter: Yan Xu


In a large cluster, when the master fails over, agents retry reregistration using a backoff algorithm that expands a randomization window whose lower bound stays at zero. However, in such a situation the master is heavily backlogged, so even if only a portion of the agents retry too fast, it still aggravates the situation for everyone.

The proposal is to increase the lower bound during the backoff. However, we should probably not create a customized backoff algorithm for this particular case, but instead have it depend on the generic solution in MESOS-7646.
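
To illustrate the difference, here is a minimal sketch in Python (not Mesos code). All names and defaults (zero_floor_delay, raised_floor_delay, base, cap, floor_fraction) are hypothetical, and this issue does not specify how the lower bound should grow; the sketch simply assumes it is a fixed fraction of the current window.

```python
import random


def backoff_window(attempt, base=1.0, cap=60.0):
    """Upper bound of the randomization window: doubles per attempt, up to cap."""
    return min(cap, base * (2 ** attempt))


def zero_floor_delay(attempt, base=1.0, cap=60.0, rng=random):
    """Behavior described in this issue: the delay is drawn from [0, window],
    so some agents can retry almost immediately on every attempt."""
    return rng.uniform(0.0, backoff_window(attempt, base, cap))


def raised_floor_delay(attempt, base=1.0, cap=60.0, floor_fraction=0.5, rng=random):
    """Assumed variant: the lower bound grows with the window, so no agent
    retries sooner than floor_fraction * window."""
    window = backoff_window(attempt, base, cap)
    return rng.uniform(floor_fraction * window, window)
```

With these assumed defaults, the fifth retry (attempt=4) draws from [0, 16] seconds in the first function and from [8, 16] seconds in the second, so the fastest agents are pushed back while the randomization still spreads the load.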

This shouldn't increase the operator's burden by requiring them to tune these parameters according to cluster size; it should rely on sensible defaults instead.

To combat dropped messages, this would perhaps work better together with MESOS-7688: if agents only start reregistration once the master has recovered, then it is more reasonable to back off more aggressively.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)