
Posted to issues@aurora.apache.org by "Renan DelValle (JIRA)" <ji...@apache.org> on 2017/07/18 21:12:00 UTC

[jira] [Created] (AURORA-1942) Improve Aurora behavior with regards to Mesos Agents violating reregistration timeouts

Renan DelValle created AURORA-1942:
--------------------------------------

             Summary: Improve Aurora behavior with regards to Mesos Agents violating reregistration timeouts
                 Key: AURORA-1942
                 URL: https://issues.apache.org/jira/browse/AURORA-1942
             Project: Aurora
          Issue Type: Task
          Components: Scheduler
            Reporter: Renan DelValle


A Mesos Agent Lost message can be received in two scenarios, each with a different outcome:

1) A Mesos Agent can fail the health check done by the Mesos Master (max_agent_ping_timeouts violation), which leads to an Agent Lost message along with a TASK_LOST message for each task running on the unhealthy Agent.

2) A Mesos Agent can fail to re-register after an election has taken place (agent_reregister_timeout violation). In this situation, the newly elected Mesos Master is unable to send a TASK_LOST message for the tasks that were running on the Agent that failed to re-register, because Masters do not store any information about the tasks that are currently running.

Scenario 2 can lead to (a) "missing" instances for the tasks scheduled on the rogue Agent, until an explicit reconciliation is done, and/or (b) "leaked" tasks, if the Agent re-registers after Aurora has replaced the missing tasks. Leaked tasks are only cleaned up upon an implicit reconciliation.
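
As a rough illustration of (a) and (b), the two cases can be modeled as set differences between what the scheduler believes is running and what the Master reports. This is only a toy sketch: the function name and the use of plain sets of task IDs are assumptions made for illustration, not Aurora or Mesos APIs.

```python
def classify_discrepancies(scheduler_tasks, master_tasks):
    """Compare the scheduler's view of running tasks with the master's view.

    Hypothetical helper for illustration only.
    Returns (missing, leaked):
      missing: tasks the scheduler tracks but the master does not know about
      leaked:  tasks the master reports but the scheduler no longer tracks
    """
    scheduler_tasks = set(scheduler_tasks)
    master_tasks = set(master_tasks)
    missing = scheduler_tasks - master_tasks
    leaked = master_tasks - scheduler_tasks
    return missing, leaked
```

In this model, an explicit reconciliation (asking about the tasks the scheduler knows) would surface the "missing" set, while an implicit reconciliation (asking for everything the Master knows) would surface the "leaked" set.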

For (a), one solution is to transition the tasks on a missing Agent to the LOST state upon receiving a Slave Lost message.
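
A minimal sketch of that idea follows, assuming an in-memory list of tasks. The `Task` and `State` types and the `on_agent_lost` handler are hypothetical names chosen for illustration and do not reflect the actual scheduler code.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    RUNNING = "RUNNING"
    LOST = "LOST"
    FINISHED = "FINISHED"


@dataclass
class Task:
    task_id: str
    agent_id: str
    state: State


def on_agent_lost(tasks, agent_id):
    """Hypothetical handler: mark running tasks on the lost agent as LOST.

    Returns the IDs of the tasks whose state was changed.
    """
    changed = []
    for task in tasks:
        # Only tasks still considered running on the lost agent are affected;
        # tasks already in a terminal state are left untouched.
        if task.agent_id == agent_id and task.state is State.RUNNING:
            task.state = State.LOST
            changed.append(task.task_id)
    return changed
```

With the tasks moved to LOST as soon as the message arrives, the scheduler could replace them without waiting for the next explicit reconciliation.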


