Posted to issues@mesos.apache.org by "Yan Xu (JIRA)" <ji...@apache.org> on 2018/11/02 18:01:00 UTC

[jira] [Created] (MESOS-9368) The agent can be resending status updates too aggressively and the backoff is not configurable

Yan Xu created MESOS-9368:
-----------------------------

             Summary: The agent can be resending status updates too aggressively and the backoff is not configurable
                 Key: MESOS-9368
                 URL: https://issues.apache.org/jira/browse/MESOS-9368
             Project: Mesos
          Issue Type: Bug
            Reporter: Yan Xu


Currently the agent queues status updates in a "stream" that retries with an exponential backoff from 10 secs to 10 mins. On each retry only the front of the queue is sent, so if multiple statuses are queued up, subsequent ones are not attempted until the first one is acked. So if the frameworks are for some reason unable to ack at all, there is one update per task in flight at a time.
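
To make the queueing behavior concrete, here is a minimal Python sketch of one such stream. It is a model, not the Mesos implementation: the class and method names are hypothetical, and resetting the backoff after an ack is an assumption that the description above does not state.

```python
from collections import deque

MIN_BACKOFF_SECS = 10    # initial retry interval, from the description above
MAX_BACKOFF_SECS = 600   # 10 mins cap, from the description above


class StatusUpdateStream:
    """Hypothetical model of one task's update queue (not the Mesos class)."""

    def __init__(self):
        self.pending = deque()
        self.backoff = MIN_BACKOFF_SECS

    def enqueue(self, update):
        self.pending.append(update)

    def next_to_send(self):
        # Only the front of the queue is ever in flight.
        return self.pending[0] if self.pending else None

    def on_timeout(self):
        # No ack arrived: double the wait, up to the cap, and resend the front.
        self.backoff = min(self.backoff * 2, MAX_BACKOFF_SECS)
        return self.next_to_send()

    def on_ack(self):
        # The front update was acked: drop it and move on to the next one.
        # Resetting the backoff here is an assumption of this sketch.
        self.pending.popleft()
        self.backoff = MIN_BACKOFF_SECS
        return self.next_to_send()
```

With two queued updates, the second is never returned by `on_timeout()`; it only becomes the in-flight update after `on_ack()` removes the first.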

If a cluster has 500,000 tasks with pending status updates and the master fails over, each agent starts sending these updates as soon as it has reregistered. So we are looking at 500,000 updates ~immediately, another 500,000 updates 10 secs later, and 500,000 more at each subsequent backoff of 20, 40, 80, 160, 320 and 600 secs.
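
The load this implies can be estimated with a short Python sketch. It assumes that the delays are successive intervals (each measured from the previous attempt), that every agent reregisters at roughly the same moment, and that no update is acked; the function and variable names are illustrative only.

```python
from itertools import accumulate


def retry_delays(initial=10, cap=600):
    """Doubling retry intervals in secs, clamped at the cap."""
    delays = []
    delay = initial
    while delay < cap:
        delays.append(delay)
        delay *= 2
    delays.append(cap)
    return delays


PENDING_TASKS = 500_000  # figure used in the scenario above

delays = retry_delays()
# → [10, 20, 40, 80, 160, 320, 600]

# Secs after reregistration at which each burst goes out (first send at 0).
send_times = [0] + list(accumulate(delays))
# → [0, 10, 30, 70, 150, 310, 630, 1230]

# Updates the master receives during the first 1230 secs if nothing is acked.
total_updates = PENDING_TASKS * len(send_times)
# → 4000000
```

Under these assumptions the master sees eight bursts of 500,000 updates in about 20 mins, five of them within the first 150 secs.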

Given that the initial communication of task state is covered by the agent reregistration message and the framework reconciliation requests, it seems that we can safely reduce the retry frequency further, as an opt-in. The backoff is not currently configurable, so we need to expose a flag for it.


