Posted to dev@mesos.apache.org by James Peach <jo...@gmail.com> on 2017/09/08 17:14:54 UTC

[Design Doc] Native Support for Prometheus Metrics

Hi all,

This document discusses a proposal to implement native support for Prometheus metrics in libprocess. Note the proposal to alter how Timer metrics are exposed in an incompatible way (I argue this is OK because you can't really make use of these metrics now).

> https://docs.google.com/document/d/1j1TkckxGrKixvAUoz_TJRl-YayFMCNIUYv8Cq19Kal8

thanks,
James

Re: [Design Doc] Native Support for Prometheus Metrics

Posted by James Peach <jo...@gmail.com>.
> On Sep 9, 2017, at 5:29 AM, Benjamin Bannier <be...@mesosphere.io> wrote:
> 
> Hi James,
> 
> I'd like to make a longer comment here to make it easier to discuss.
> 
>> [...]
>> 
>> Note the proposal to alter how Timer metrics are exposed in an incompatible
>> way (I argue this is OK because you can't really make use of these metrics
>> now).
> 
> I am not sure I follow your argument around `Timer`. It is similar to a gauge
> caching the last value, with associated statistics calculated from a time series.

I'm arguing that this does not provide useful semantics. When we think about the real-world objects we are representing with Timers, they don't look at all like what we represent with Gauges. For example, if I'm asked how much disk space is free, giving an instantaneous value with no reference to prior state (i.e. a Gauge) is informative and useful. Conversely, if I were asked to bill for my work time over the last month and I handed you a bill for 10 minutes because that was the last interval I worked, that answer is seriously unhelpful.

> I have never used Prometheus, but a brief look at the Prometheus
> docs seems to suggest that a `Timer` could be mapped onto a Prometheus summary
> type with minimal modifications (namely, by adding a `sum` value that you
> propose as sole replacement).

Right, that's what the current implementation does.
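To make the mapping concrete, here is a minimal sketch of how a Timer's percentile statistics plus the proposed `sum` could be rendered in the Prometheus text exposition format as a summary type. The metric name and values are illustrative, not taken from the actual libprocess implementation:

```python
# Sketch: rendering a Timer's percentile statistics in the Prometheus
# text exposition format as a "summary" type. The metric name and the
# numbers are illustrative, not the real libprocess output.

def render_summary(name, percentiles, total_sum, count):
    """percentiles: dict mapping quantile (0..1) to the observed value."""
    lines = [f"# TYPE {name} summary"]
    for q, value in sorted(percentiles.items()):
        lines.append(f'{name}{{quantile="{q}"}} {value}')
    lines.append(f"{name}_sum {total_sum}")   # the proposed cumulative sum
    lines.append(f"{name}_count {count}")     # cumulative observation count
    return "\n".join(lines)

print(render_summary(
    "allocator_allocation_run_seconds",
    {0.5: 0.015, 0.9: 0.18, 0.99: 0.2},
    total_sum=47.3, count=3161331))
```
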

> I believe that exposing statistics is useful, and moving all `Timer` metrics to
> counters (cumulative value and number of samples) would lead to information
> loss.

I'm not proposing that we remove the Timer statistics. I am, however, proposing that representing a Timer as a cumulative count of elapsed time units makes it possible to actually use Timers for practical purposes. When we plotted the allocation_run Timer, we would see the difference between full and partial allocation runs in the area under the graph. We could see the difference over time, and we could even see how allocation runs behave across failover.
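A minimal sketch of the cumulative representation I mean (names are illustrative): every observation adds its duration to a running total, and the per-scrape rate of that total gives the fraction of wall-clock time spent in allocation runs, so the area under the rate graph recovers total time spent.

```python
# Sketch of the proposed cumulative Timer representation: each
# observation adds its duration to a monotonically increasing total.
# Class and field names are illustrative, not libprocess API.

class CumulativeTimer:
    def __init__(self):
        self.total_seconds = 0.0  # cumulative elapsed time across all runs
        self.count = 0            # cumulative number of runs

    def observe(self, duration_seconds):
        self.total_seconds += duration_seconds
        self.count += 1

timer = CumulativeTimer()
for d in (0.015, 0.200, 0.015):  # two fast runs, one slow run
    timer.observe(d)

# Between scrapes t0 and t1, rate = (total(t1) - total(t0)) / (t1 - t0);
# integrating that rate over time recovers the total time spent.
print(round(timer.total_seconds, 3), timer.count)  # → 0.23 3
```
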

> Since most of your criticism of `Timer` is about its associated statistics,

That wasn't my intention. The problem with the Timer is the value and count fields. While I did mention that I think a raw histogram would be more useful, I explicitly put that out of scope.

> maybe we can make fixes to libprocess' `TimeSeries` and the derived
> `Statistics` to make them more usable. Right now `Statistics` seems to be more
> apt for dealing with timing measurements where one probably worries more about
> the long tail of the distribution (it only exposes the median and higher
> percentiles). It seems that if one were, e.g., to make the exposed percentiles
> configurable, it should be possible to expose a useful characterization of the
> underlying distribution (think: box plot). It might be that one would need to
> revisit how `TimeSeries` sparsifies older data to make sure the quantiles we
> expose are meaningful.

I agree that it is possible to measure and improve the statistics. Probably how I'd approach this is to add extra instrumentation to capture all the raw Timer observations. Then I would attempt to show that the running percentile summary approximates the actual percentiles measured from the complete data.
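A rough sketch of that validation, assuming synthetic data in place of real Timer observations and a 1000-sample window mirroring the bounded TimeSeries capacity:

```python
# Sketch of the validation described above: record every raw observation,
# then compare a percentile computed over the bounded window with the
# exact percentile over the complete data. The exponential distribution
# and the window size of 1000 are assumptions for illustration.

import random
from statistics import quantiles

random.seed(42)
raw = [random.expovariate(1 / 0.05) for _ in range(10_000)]  # all samples
window = raw[-1000:]  # what a bounded time series would retain

exact_p99 = quantiles(raw, n=100)[98]     # p99 over the complete data
window_p99 = quantiles(window, n=100)[98]  # p99 over the bounded window
print(f"exact p99={exact_p99:.4f}  windowed p99={window_p99:.4f}")
```
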

>> First, note that the “allocator/mesos/allocation_run_ms/count” sample is not
>> useful at all. It has the semantics of a saturating counter that saturates at
>> the size of the bounded time series. To address this, there is another metric
>> “allocator/mesos/allocation_runs”, which tracks the actual count of
>> allocation runs (3161331.00 in this case). If you plot this counter over time
> (i.e. as a rate), it will be zero for all time once it reaches saturation. In
>> the case of allocation runs, this is almost all the time, since 1000
>> allocations will be performed within a few hours.
> 
> While `count` is not a useful measure of the behavior of the measured datum, it
> is critical to assess whether the derived statistic is meaningful (sample
> size). Like you write, it becomes less interesting once enough data was
> collected.

If the count doesn't saturate, it is always meaningful. If it is possible for the metric to become non-meaningful, that's pretty bad. I'm not sure I accept your premise here, though. Once the count saturates at 1000 samples, how do you know whether the statistics are for the last hour, or 3 hours ago? It is possible to accumulate no samples and for that to be invisible in the metrics.
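The saturation problem can be shown in a few lines (the capacity of 1000 matches the bounded time series discussed above; the scrape values are made up):

```python
# Sketch of the saturation problem: a count capped at the time-series
# capacity stops changing, so its per-scrape delta (the rate) drops to
# zero and any further samples become invisible in the metrics.

CAPACITY = 1000  # size of the bounded time series

def saturating_count(total_runs):
    return min(total_runs, CAPACITY)

scrapes = [500, 900, 1400, 2000]  # true cumulative runs at each scrape
counts = [saturating_count(n) for n in scrapes]
rates = [b - a for a, b in zip(counts, counts[1:])]
print(counts)  # [500, 900, 1000, 1000]
print(rates)   # [400, 100, 0] -- zero once saturated, despite 600 new runs
```
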

>> Finally, while the derived statistics metrics can be informative, they are
>> actually less expressive than a raw histogram would be. A raw histogram of
>> timed values would allow an observer to distinguish cases where there are
>> clear performance bands (e.g. when allocation times cluster at either 15ms or
>> 200ms), but the percentile statistics obscure this information.
> 
> I would argue that this is more a problem of `Statistics` only reporting
> percentiles from the far out, large-value tail. Were the reported percentiles,
> e.g., placed more evenly, it should be possible to recognize bands. After all,
> percentiles are just samples from the cumulative distribution, from which one
> can derive the underlying distribution (with some resolution) by taking a
> derivative.

I think that's a stretch. I guess that if you have a multimodal distribution you would recognize it because the difference between some quantiles is smaller than others? That's pretty non-obvious compared to what you would get with histogram data.
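To illustrate with the example from the design doc excerpt (allocation times clustering at roughly 15ms and 200ms), here is a sketch contrasting a raw histogram with percentile summaries on synthetic bimodal data; the distributions and bin width are assumptions:

```python
# Sketch: a bimodal timing distribution (modes near 15ms and 200ms)
# viewed as a raw histogram vs. as percentile summaries. The mixture
# weights and the 25ms bin width are arbitrary illustrative choices.

import random
from collections import Counter
from statistics import quantiles

random.seed(7)
samples = ([random.gauss(0.015, 0.002) for _ in range(700)] +   # fast band
           [random.gauss(0.200, 0.010) for _ in range(300)])    # slow band

# Histogram: bucket into 25ms bins -- the two bands show up directly
# as two separated clusters of occupied bins.
hist = Counter(int(s / 0.025) * 25 for s in samples)
print(sorted(hist.items()))

# Percentiles: the two modes must be inferred indirectly from where
# adjacent quantiles jump.
qs = quantiles(samples, n=100)
p50, p90, p99 = qs[49], qs[89], qs[98]
print(f"p50={p50:.3f} p90={p90:.3f} p99={p99:.3f}")
```
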

> Note that a poorly binned and ranged histogram can obscure features as well.
> Reporting percentiles/quantiles has the advantage of adapting to the data
> automatically.

Yes. This is why I'm not proposing to change the statistics, and am instead focusing on what I believe to be the obvious, uncontroversial Timer changes.

J


Re: [Design Doc] Native Support for Prometheus Metrics

Posted by Benjamin Bannier <be...@mesosphere.io>.
Hi James,

I'd like to make a longer comment here to make it easier to discuss.

> [...]
> 
> Note the proposal to alter how Timer metrics are exposed in an incompatible
> way (I argue this is OK because you can't really make use of these metrics
> now).

I am not sure I follow your argument around `Timer`. It is similar to a gauge
caching the last value, with associated statistics calculated from a time series.

I have never used Prometheus, but a brief look at the Prometheus
docs seems to suggest that a `Timer` could be mapped onto a Prometheus summary
type with minimal modifications (namely, by adding a `sum` value that you
propose as sole replacement).

I believe that exposing statistics is useful, and moving all `Timer` metrics to
counters (cumulative value and number of samples) would lead to information
loss.

Since most of your criticism of `Timer` is about its associated statistics,
maybe we can make fixes to libprocess' `TimeSeries` and the derived
`Statistics` to make them more usable. Right now `Statistics` seems to be more
apt for dealing with timing measurements where one probably worries more about
the long tail of the distribution (it only exposes the median and higher
percentiles). It seems that if one were, e.g., to make the exposed percentiles
configurable, it should be possible to expose a useful characterization of the
underlying distribution (think: box plot). It might be that one would need to
revisit how `TimeSeries` sparsifies older data to make sure the quantiles we
expose are meaningful.

> First, note that the “allocator/mesos/allocation_run_ms/count” sample is not
> useful at all. It has the semantics of a saturating counter that saturates at
> the size of the bounded time series. To address this, there is another metric
> “allocator/mesos/allocation_runs”, which tracks the actual count of
> allocation runs (3161331.00 in this case). If you plot this counter over time
> (i.e. as a rate), it will be zero for all time once it reaches saturation. In
> the case of allocation runs, this is almost all the time, since 1000
> allocations will be performed within a few hours.

While `count` is not a useful measure of the behavior of the measured datum, it
is critical to assess whether the derived statistic is meaningful (sample
size). Like you write, it becomes less interesting once enough data was
collected.

> Finally, while the derived statistics metrics can be informative, they are
> actually less expressive than a raw histogram would be. A raw histogram of
> timed values would allow an observer to distinguish cases where there are
> clear performance bands (e.g. when allocation times cluster at either 15ms or
> 200ms), but the percentile statistics obscure this information.

I would argue that this is more a problem of `Statistics` only reporting
percentiles from the far out, large-value tail. Were the reported percentiles,
e.g., placed more evenly, it should be possible to recognize bands. After all,
percentiles are just samples from the cumulative distribution, from which one
can derive the underlying distribution (with some resolution) by taking a
derivative.

Note that a poorly binned and ranged histogram can obscure features as well.
Reporting percentiles/quantiles has the advantage of adapting to the data
automatically.


Cheers,

Benjamin