Posted to dev@mesos.apache.org by "Dominic Hamon (JIRA)" <ji...@apache.org> on 2014/05/30 21:47:02 UTC

[jira] [Commented] (MESOS-1028) expose internal metrics

    [ https://issues.apache.org/jira/browse/MESOS-1028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14014154#comment-14014154 ] 

Dominic Hamon commented on MESOS-1028:
--------------------------------------

Closing this: some of these metrics are already being exported, and the others are covered by other tickets. If there are specific metrics that should still be exported, please open individual tickets for the relevant subsystems.

> expose internal metrics
> -----------------------
>
>                 Key: MESOS-1028
>                 URL: https://issues.apache.org/jira/browse/MESOS-1028
>             Project: Mesos
>          Issue Type: Improvement
>          Components: general
>            Reporter: David Robinson
>            Assignee: Dominic Hamon
>
> Mesos should export statistics that provide visibility into its internals. This would allow users to detect numerous problems without resorting to trawling through log files.
> E.g. export counters of (some of these already exist, most don't):
> * cgroup create
> * cgroup destroy
> * cgroup destroy attempts
> * resource offers made
> * resource offers accepted
> * tasks launched
> * tasks destroyed
> * tasks lost
> * writes to replicated log
> * queue length
>
> export 50th, 90th, 95th, 99th percentile of time taken to:
> * start mesos (reach a certain state)
> * move tasks between two given states (starting -> started)
> * create a cgroup
> * destroy a cgroup
> * send a message from slave to master
> * start a task
> * stop a task
> * register in zookeeper
> * write to the replicated log
>
> Ideally all these metrics would be exposed via an HTTP+JSON endpoint. See [metrics|http://metrics.codahale.com/getting-started/] for an example (albeit Java) library (or [medida|http://dln.github.io/medida/] for an unmaintained(?) C++ port)
> We've previously seen problems where tasks were stuck in cgroup destroy with >30,000 attempts. Exposing metrics would allow us to easily detect problems like this.
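
For illustration only, the kind of counter and percentile reporting requested above could be sketched as follows. This is a minimal sketch in Python, not Mesos code: the MetricsRegistry class, its method names, and the metric names (cgroup_destroy_attempts, cgroup_destroy_seconds) are all hypothetical, and the nearest-rank percentile is just one simple choice of definition.

```python
import json
import threading
from collections import defaultdict


class MetricsRegistry:
    """Hypothetical in-process registry of counters and timings (not a Mesos API)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counters = defaultdict(int)
        self._timings = defaultdict(list)

    def increment(self, name, amount=1):
        # Counters only ever go up, e.g. "cgroup destroy attempts".
        with self._lock:
            self._counters[name] += amount

    def record_duration(self, name, seconds):
        # Keep raw samples so percentiles can be computed at snapshot time.
        with self._lock:
            self._timings[name].append(seconds)

    @staticmethod
    def _percentile(sorted_samples, pct):
        # Nearest-rank percentile: rank = ceil(pct * n / 100), done with
        # integer arithmetic to avoid floating-point rounding surprises.
        rank = max(1, (pct * len(sorted_samples) + 99) // 100)
        return sorted_samples[rank - 1]

    def snapshot(self):
        # Copy under the lock, compute percentiles outside it.
        with self._lock:
            counters = dict(self._counters)
            timings = {k: sorted(v) for k, v in self._timings.items()}
        percentiles = {
            name: {f"p{p}": self._percentile(samples, p) for p in (50, 90, 95, 99)}
            for name, samples in timings.items()
        }
        return {"counters": counters, "percentiles": percentiles}

    def to_json(self):
        # The JSON document an HTTP endpoint could return as its body.
        return json.dumps(self.snapshot(), sort_keys=True)


# Example usage with made-up data.
registry = MetricsRegistry()
for _ in range(3):
    registry.increment("cgroup_destroy_attempts")
for ms in range(1, 101):
    registry.record_duration("cgroup_destroy_seconds", ms / 1000)
payload = registry.to_json()
```

The payload string could then be returned by whatever HTTP handler the process already has. A monitoring system polling it would see a counter such as cgroup_destroy_attempts climbing without bound, which is the stuck-destroy situation described above. Storing every sample is the simplest approach but grows without limit; a real implementation would likely use a bounded reservoir or histogram instead.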



--
This message was sent by Atlassian JIRA
(v6.2#6252)