Posted to issues@mesos.apache.org by "Gaurav Garg (JIRA)" <ji...@apache.org> on 2019/05/03 21:02:00 UTC

[jira] [Created] (MESOS-9767) Add self health monitoring in Mesos master

Gaurav Garg created MESOS-9767:
----------------------------------

Summary: Add self health monitoring in Mesos master
Key: MESOS-9767
URL: https://issues.apache.org/jira/browse/MESOS-9767
Project: Mesos
Issue Type: Task
Components: master
Affects Versions: 1.6.2
Reporter: Gaurav Garg
Fix For: 1.7.2

We have seen an issue where the Mesos master got stuck and stopped responding to HTTP endpoints such as "/metrics/snapshot". This causes calls to the master from frameworks and the metrics collector to hang. Currently we emit a 'master alive' metric using Prometheus. If the master hangs, this metric is not published, and we detect the hang through alerts built on top of it. By the time someone receives the alert and restarts the master process, 15-30 minutes have passed. This results in SLA violations for Mesos cluster users.

It would be nice to implement self health monitoring to detect when the Mesos master is hung or stuck. This would let us crash the master process quickly so that one of the other members of the quorum can acquire the ZK leadership lock.

We can use the "/master/health" endpoint for health checks.
Health checks can be initiated in [src/master/main.cpp|https://github.com/apache/mesos/blob/master/src/master/main.cpp] just after the child master process is [spawned|https://github.com/apache/mesos/blob/master/src/master/main.cpp#L543].

We can leverage the [HealthChecker|https://github.com/apache/mesos/blob/master/src/checks/health_checker.hpp] for this. One downside is that HealthChecker currently takes a TaskId as input, which is not meaningful for a master health check.

We can add the following flags to control the self health checking:
# self_monitoring_enabled: Whether self monitoring is enabled.
# self_monitoring_consecutive_failures: After this many consecutive health check failures, the master is crashed.
# self_monitoring_interval_secs: Interval at which health checks are performed.
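
The interaction between these three flags could be sketched roughly as follows. This is only an illustration and not existing Mesos code: SelfMonitoringFlags, SelfMonitor, tick(), the probe callback and the onUnhealthy callback are all hypothetical names, and the default values are placeholders. The probe stands in for a request to "/master/health", and the sketch does not use HealthChecker or libprocess.

```cpp
#include <functional>
#include <utility>

// Hypothetical container for the three proposed flags. The defaults
// are placeholders chosen for illustration only.
struct SelfMonitoringFlags {
  bool enabled = false;          // self_monitoring_enabled
  int consecutive_failures = 3;  // self_monitoring_consecutive_failures
  int interval_secs = 10;        // self_monitoring_interval_secs
};

// Hypothetical monitor. It counts consecutive probe failures and
// invokes onUnhealthy once the configured threshold is reached.
class SelfMonitor {
public:
  SelfMonitor(SelfMonitoringFlags flags,
              std::function<bool()> probe,
              std::function<void()> onUnhealthy)
    : flags_(flags),
      probe_(std::move(probe)),
      onUnhealthy_(std::move(onUnhealthy)) {}

  // Runs a single health probe. The caller is expected to invoke this
  // once every flags.interval_secs seconds. Returns true if the
  // failure threshold was reached on this call.
  bool tick() {
    if (!flags_.enabled) {
      return false;
    }

    if (probe_()) {
      failures_ = 0;  // Any success resets the consecutive count.
      return false;
    }

    if (++failures_ < flags_.consecutive_failures) {
      return false;
    }

    onUnhealthy_();  // For example, abort the master process.
    return true;
  }

  int failures() const { return failures_; }

private:
  SelfMonitoringFlags flags_;
  std::function<bool()> probe_;
  std::function<void()> onUnhealthy_;
  int failures_ = 0;
};
```

In a real implementation the probe would presumably issue an HTTP request to "/master/health" with a timeout, tick() would be driven by a timer that does not depend on the possibly hung master actor, and onUnhealthy could call something like std::abort() so that another quorum member can take over leadership.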

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)