Posted to dev@mesos.apache.org by "Dominic Hamon (JIRA)" <ji...@apache.org> on 2014/05/30 21:47:03 UTC

[jira] [Resolved] (MESOS-1028) expose internal metrics

     [ https://issues.apache.org/jira/browse/MESOS-1028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dominic Hamon resolved MESOS-1028.
----------------------------------

    Resolution: Duplicate

> expose internal metrics
> -----------------------
>
>                 Key: MESOS-1028
>                 URL: https://issues.apache.org/jira/browse/MESOS-1028
>             Project: Mesos
>          Issue Type: Improvement
>          Components: general
>            Reporter: David Robinson
>            Assignee: Dominic Hamon
>
> Mesos should export statistics that provide visibility into its internals. This would allow users to detect numerous problems without resorting to trolling log files.
> E.g. export counters of (some of these already exist, most don't):
> * cgroup create
> * cgroup destroy
> * cgroup destroy attempts
> * resource offers made
> * resource offers accepted
> * tasks launched
> * tasks destroyed
> * tasks lost
> * writes to replicated log
> * queue length
> Export the 50th, 90th, 95th, and 99th percentiles of the time taken to:
> * start mesos (reach a certain state)
> * move tasks between two given states (starting -> started)
> * create a cgroup
> * destroy a cgroup
> * send a message from slave to master
> * start a task
> * stop a task
> * register in zookeeper
> * write to the replicated log
> Ideally all these metrics would be exposed via an HTTP+JSON endpoint. See [metrics|http://metrics.codahale.com/getting-started/] for an example (albeit Java) library (or [medida|http://dln.github.io/medida/] for an unmaintained(?) C++ port).
> We've previously seen problems where tasks were stuck in cgroup destroy with >30,000 attempts. Exposing metrics would allow us to easily detect problems like this.
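
For illustration only, the request above (named counters, latency percentiles, and a JSON view of both) could be sketched roughly as follows. This is not Mesos code and not the API that resolved this issue; the class `MetricsRegistry`, its method names, and the metric names in the usage note are all hypothetical, and the percentile uses a simple nearest-rank rule over every recorded sample rather than the bounded reservoirs that libraries such as metrics or medida use.

```python
import json
import threading
from collections import defaultdict


class MetricsRegistry:
    """Hypothetical in-process metrics registry (not a Mesos API)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counters = defaultdict(int)   # name -> running count
        self._timers = defaultdict(list)    # name -> recorded durations

    def increment(self, name, amount=1):
        """Bump a named counter, e.g. one per cgroup destroy attempt."""
        with self._lock:
            self._counters[name] += amount

    def record_duration(self, name, seconds):
        """Record how long one operation took, in seconds."""
        with self._lock:
            self._timers[name].append(seconds)

    @staticmethod
    def _percentile(sorted_samples, pct):
        # Nearest-rank percentile with integer arithmetic:
        # rank = ceil(pct * n / 100), clamped to at least 1.
        n = len(sorted_samples)
        rank = max(1, (pct * n + 99) // 100)
        return sorted_samples[rank - 1]

    def snapshot(self):
        """Return counters and the 50/90/95/99th percentiles as a dict."""
        with self._lock:
            counters = dict(self._counters)
            timers = {k: sorted(v) for k, v in self._timers.items()}
        result = {"counters": counters, "timers": {}}
        for name, samples in timers.items():
            result["timers"][name] = {
                "p%d" % p: self._percentile(samples, p)
                for p in (50, 90, 95, 99)
            }
        return result

    def to_json(self):
        """Serialize the snapshot; an HTTP handler could return this body."""
        return json.dumps(self.snapshot(), sort_keys=True)
```

A caller would invoke something like `registry.increment("cgroup_destroy_attempts")` on each attempt and `registry.record_duration("cgroup_destroy_seconds", elapsed)` on completion. A monitoring system polling the JSON could then alert on a runaway attempt counter such as the >30,000 case described above. Serving the JSON over HTTP is left out of the sketch.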


