Posted to dev@mesos.apache.org by Zhitao Li <zh...@gmail.com> on 2016/12/19 05:11:17 UTC

Metrics collection affected when libprocess queue builds up

Hi all,

While I was debugging an allocator message queue build-up issue on the
master (which I plan to share in another thread), I noticed that
`/metrics/snapshot` is also badly affected.

For example, when the allocator queue has ~3k dispatches in it (revealed by
the allocator/mesos/event_queue_dispatches gauge), the `/metrics/snapshot`
could take 10-30 seconds to respond.

During active debugging or outage firefighting, this is quite undesirable.

My guess is that much of the stats collection code relies on *deferring*
to another libprocess actor and collecting the result.
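
For context, the deferred-gauge pattern I'm referring to looks roughly
like the sketch below (illustrative only, not the actual allocator code):
the gauge's value is computed by dispatching onto the owning actor, so a
snapshot request has to wait behind everything already sitting in that
actor's queue.

// Illustrative sketch only, not the actual Mesos allocator code. The
// gauge's callback is deferred onto the owning actor, so evaluating it
// for /metrics/snapshot waits behind every event already queued there.
#include <deque>

#include <process/defer.hpp>
#include <process/future.hpp>
#include <process/process.hpp>

#include <process/metrics/gauge.hpp>
#include <process/metrics/metrics.hpp>

class QueueOwnerProcess : public process::Process<QueueOwnerProcess>
{
public:
  QueueOwnerProcess()
    : queueSize(
          "example/event_queue_size",  // made-up metric name
          process::defer(self(), &QueueOwnerProcess::_queueSize))
  {
    process::metrics::add(queueSize);
  }

  ~QueueOwnerProcess() override
  {
    process::metrics::remove(queueSize);
  }

private:
  // Runs on the actor itself: if the actor is backlogged, so is this.
  process::Future<double> _queueSize()
  {
    return static_cast<double>(pending.size());
  }

  std::deque<int> pending;
  process::metrics::Gauge queueSize;
};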

Should we explore a more reliable way to track metrics independently from
libprocess's queue?

-- 
Cheers,

Zhitao Li

Re: Metrics collection affected when libprocess queue builds up

Posted by Benjamin Mahler <bm...@apache.org>.
Yep, thanks!

For https://issues.apache.org/jira/browse/MESOS-6872, it sounds like you're
referring to the help information? We already list the timeout, but perhaps
we need an example section in our help pages:
http://mesos.apache.org/documentation/latest/endpoints/metrics/snapshot/

Or are you referring to the
http://mesos.apache.org/documentation/latest/monitoring/ page? That likely
warrants a NOTE as well, as you mentioned earlier.

For https://issues.apache.org/jira/browse/MESOS-6873, I'd suggest we instead
introduce a Gauge overload that takes a 'function<Future<double>>' (rather
than a 'Deferred<Future<double>>'), i.e. approach (3), so that we can write
a version that calls queue.size() outside of the actor's context. That won't
impose any overhead, unlike the counter approach, where counting each
incoming and outgoing event may add non-trivial overhead to the
event-processing hot path.
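
To make that concrete, here is a rough sketch of how such an overload could
be used. The 'std::function<Future<double>()>'-taking constructor below is
the hypothetical API proposed in MESOS-6873 (it does not exist yet), and
the names are made up:

// Sketch only: assumes the proposed Gauge constructor overload taking a
// plain std::function<process::Future<double>()> (MESOS-6873) is added.
#include <deque>
#include <functional>

#include <process/future.hpp>
#include <process/metrics/gauge.hpp>

class QueueOwner
{
public:
  process::metrics::Gauge sizeGauge()
  {
    // No dispatch: the callback runs on whatever thread serves the
    // metrics request. Reading std::deque::size() without the actor's
    // lock is the assumption being made here; acceptable as long as the
    // queue outlives the gauge and a slightly stale value is fine.
    return process::metrics::Gauge(
        "example/event_queue_size",
        std::function<process::Future<double>()>(
            [this]() -> process::Future<double> {
              return static_cast<double>(pending.size());
            }));
  }

private:
  std::deque<int> pending;
};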

On Fri, Jan 6, 2017 at 7:09 AM, Zhitao Li <zh...@gmail.com> wrote:

> Hi Benjamin,
>
> I've filed MESOS-6872 <https://issues.apache.org/jira/browse/MESOS-6872>
>  and MESOS-6873 <https://issues.apache.org/jira/browse/MESOS-6873> for doc
> and gauge change, and  will fix them. Can you shepherd these?
>
> I'll do another pass of other gauge usage in allocator to see whether there
> is easy low hanging fruits.
>
> Thanks.
>
> On Wed, Jan 4, 2017 at 6:28 PM, Benjamin Mahler <bm...@apache.org>
> wrote:
>
> > A patch to update the documentation with a NOTE about this would be
> great.
> > It excludes all metrics that were not available within the timeout, there
> > is no indication within a particular result whether any timed out and
> were
> > excluded.
> >
> > My feeling is that taking the difference between enqueued and dequeued is
> > > not as obvious as a `Gauge`. If we take this path, we should document
> it
> > > clearly how to use the metric.
> >
> >
> > We could also explore avoiding the dispatch per (3) for this case, since
> > getting the size is a read only operation and should be thread-safe
> without
> > acquiring the lock.
> >
> > I'm also interested in how other `Gauge` type systems are implemented in
> > > other well know OSS projects. Maybe we can do some more research on
> their
> > > approach?
> >
> >
> > We did research when the metrics library was originally added. Coda
> Hale's
> > library (now called "dropwizard metrics" FWICT) was one of the libraries
> we
> > looked at, and its Gauge is equivalent to suggestion (3), as it was
> > designed for Java-style programs with locking as opposed to actors:
> >
> > http://metrics.dropwizard.io/3.1.0/getting-started/#gauges
> >
> > On Fri, Dec 30, 2016 at 10:34 AM, Zhitao Li <zh...@gmail.com>
> wrote:
> >
> > > Hi Benjamin,
> > >
> > > Thanks for the response. First time heard of the `timeout` parameter.
> > I'll
> > > fix our monitoring scripts to always specify this.
> > >
> > > One question on timeout: does it simply drop any metric callback which
> is
> > > not collected within the timeout? Does caller know which metrics are
> > > dropped due to timeout?
> > >
> > > Also, it is not documented in
> > > http://mesos.apache.org/documentation/latest/monitoring/. Should we
> > send a
> > > patch to update it?
> > >
> > > On Tue, Dec 27, 2016 at 6:12 PM, Benjamin Mahler <bm...@apache.org>
> > > wrote:
> > >
> > > > The /metrics endpoint exposes a timeout parameter if you want to
> > receive
> > > a
> > > > response with all of the metrics that were available within the
> > timeout,
> > > > e.g. /metrics/snapshot.json?timeout=10secs
> > > >
> > > > I'd recommend using this when collecting metrics so that you can
> > maintain
> > > > visibility when a particular component is backlogged.
> > > >
> > > > Should we explore a more reliable way to track metrics independently
> > from
> > > > > libprocess's queue?
> > > >
> > > >
> > > > Note that this problem applies only to our defer-based "Gauge"
> metrics
> > > that
> > > > execute on the actor. Counters and Timers are immune to this. I would
> > say
> > > > there are a couple of improvements we can make in increasing order of
> > > > difficulty:
> > > >
> > > > (1) There are instances of Gauges that might be better represented as
> > > > Counters. For example, we expose the actor queue sizes using a gauge
> > > (known
> > > > to be unfortunate!), when instead we could expose two counters for
> > > > "enqueued" and "dequeued" messages and infer size from these. We can
> > also
> > > > add the ability for callers to manually increment and decrement their
> > > > Gauges rather than go through a dispatch.
> > >
> > >
> > > My feeling is that taking the difference between enqueued and dequeued
> is
> > > not as obvious as a `Gauge`. If we take this path, we should document
> it
> > > clearly how to use the metric.
> > >
> > >
> > > >
> > > > (2) Allow Gauge dispatches to be sent to the front of the actor's
> > queue,
> > > > rather than the back. I would hope that we don't wind up with a
> notion
> > of
> > > > integer priority for messages. Note that this doesn't solve the
> problem
> > > for
> > > > when the "backlog" is occurring inside a single expensive function.
> It
> > > also
> > > > has the issue of preventing "progress" if metrics are hit frequently
> > > enough
> > > > and are expensive enough.
> > >
> > >
> > > > (3) There are instances of Gauges that might be better represented as
> > > > thread-safe logic. For example, if we need an actor's std::map
> member's
> > > > .size(), we could call .size() safely so long as the map is not
> > > destructed.
> > > > In other cases, explicit locking may be needed and is more
> complicated.
> > > >
> > > > (4) There are instances of Gauges that might be better represented
> as a
> > > > "wrapping" around a data-structure. For example, the std::map could
> be
> > > > wrapped as a 'map_wrapper' that injects metric updates into each
> > > non-const
> > > > operation that affects the size of the map.
> > > >
> > > > So far I've felt that the timeout and (1) will be sufficient for the
> > > > foreseeable future, while (3) and (4) seem to require a significant
> > > impact
> > > > to non-metrics related code complexity, let me know what you think.
> > > >
> > >
> > > I agree that we should not adopt (2) only to address this problem: it
> > seems
> > > like something larger and also affects how libprocess was generally
> > > designed, so we should think more carefully about that.
> > >
> > > I like the idea of (3) since it can be implemented gradually, and it
> can
> > > completely avoid paying the cost of enque/deque message (which is
> another
> > > interesting question: how expensive it could be?)
> > >
> > > (4) seems like a bigger
> > >
> > > I'm also interested in how other `Gauge` type systems are implemented
> in
> > > other well know OSS projects. Maybe we can do some more research on
> their
> > > approach?
> > >
> > >
> > > > Ben
> > > >
> > > > On Mon, Dec 19, 2016 at 6:32 PM, Zameer Manji <zm...@apache.org>
> > wrote:
> > > >
> > > > > I believe Zhitao is referring to `/metrics/snapshot` returning a
> > result
> > > > > after 10-30 seconds.
> > > > >
> > > > > I think in a typical environment, this will cause most metrics
> > > collection
> > > > > tooling to timeout. This causes the operator to not have any
> > visibility
> > > > > into the system, making debugging/fighting the problem very hard.
> > > > >
> > > > > On Mon, Dec 19, 2016 at 9:23 PM, haosdent <ha...@gmail.com>
> > wrote:
> > > > >
> > > > > > Hi, @zhitao
> > > > > >
> > > > > > > the `/metrics/snapshot` could take 10-30 seconds to respond.
> > > > > >
> > > > > > Do you mean it `/metrics/snapshot` return result after 10~30
> > seconds?
> > > > > > Or `/metrics/snapshot` takes 10~30 seconds to reflect the change
> > of `
> > > > > > allocator/mesos/event_queue_dispatches gauge`?
> > > > > >
> > > > > > On Mon, Dec 19, 2016 at 1:11 PM, Zhitao Li <
> zhitaoli.cs@gmail.com>
> > > > > wrote:
> > > > > >
> > > > > > > Hi all,
> > > > > > >
> > > > > > > While I was debugging an allocator message queue build up issue
> > on
> > > > > master
> > > > > > > (which I plan to share another thread), I noticed that
> > > > > > `/metrics/snapshot`
> > > > > > > is also badly affected.
> > > > > > >
> > > > > > > For example, when the allocator queue has ~3k dispatches in it
> > > > > (revealed
> > > > > > by
> > > > > > > the allocator/mesos/event_queue_dispatches gauge), the
> > > > > > `/metrics/snapshot`
> > > > > > > could take 10-30 seconds to respond.
> > > > > > >
> > > > > > > During an active debugging or outage fighting, this is pretty
> > > > > undesired.
> > > > > > >
> > > > > > > My guess is that many stats collection code relies on
> *deferring*
> > > to
> > > > > > > another libprocess and collect the result.
> > > > > > >
> > > > > > > Should we explore a more reliable way to track metrics
> > > independently
> > > > > from
> > > > > > > libprocess's queue?
> > > > > > >
> > > > > > > --
> > > > > > > Cheers,
> > > > > > >
> > > > > > > Zhitao Li
> > > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Best Regards,
> > > > > > Haosdent Huang
> > > > > >
> > > > > > --
> > > > > > Zameer Manji
> > > > > >
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Cheers,
> > >
> > > Zhitao Li
> > >
> >
>
>
>
> --
> Cheers,
>
> Zhitao Li
>

Re: Metrics collection affected when libprocess queue builds up

Posted by Zhitao Li <zh...@gmail.com>.
Hi Benjamin,

I've filed MESOS-6872 <https://issues.apache.org/jira/browse/MESOS-6872>
and MESOS-6873 <https://issues.apache.org/jira/browse/MESOS-6873> for the
doc and gauge changes, and I will fix them. Can you shepherd these?

I'll do another pass over the other gauge usage in the allocator to see
whether there is any easy low-hanging fruit.

Thanks.

On Wed, Jan 4, 2017 at 6:28 PM, Benjamin Mahler <bm...@apache.org> wrote:

> A patch to update the documentation with a NOTE about this would be great.
> It excludes all metrics that were not available within the timeout, there
> is no indication within a particular result whether any timed out and were
> excluded.
>
> My feeling is that taking the difference between enqueued and dequeued is
> > not as obvious as a `Gauge`. If we take this path, we should document it
> > clearly how to use the metric.
>
>
> We could also explore avoiding the dispatch per (3) for this case, since
> getting the size is a read only operation and should be thread-safe without
> acquiring the lock.
>
> I'm also interested in how other `Gauge` type systems are implemented in
> > other well know OSS projects. Maybe we can do some more research on their
> > approach?
>
>
> We did research when the metrics library was originally added. Coda Hale's
> library (now called "dropwizard metrics" FWICT) was one of the libraries we
> looked at, and its Gauge is equivalent to suggestion (3), as it was
> designed for Java-style programs with locking as opposed to actors:
>
> http://metrics.dropwizard.io/3.1.0/getting-started/#gauges
>
> On Fri, Dec 30, 2016 at 10:34 AM, Zhitao Li <zh...@gmail.com> wrote:
>
> > Hi Benjamin,
> >
> > Thanks for the response. First time heard of the `timeout` parameter.
> I'll
> > fix our monitoring scripts to always specify this.
> >
> > One question on timeout: does it simply drop any metric callback which is
> > not collected within the timeout? Does caller know which metrics are
> > dropped due to timeout?
> >
> > Also, it is not documented in
> > http://mesos.apache.org/documentation/latest/monitoring/. Should we
> send a
> > patch to update it?
> >
> > On Tue, Dec 27, 2016 at 6:12 PM, Benjamin Mahler <bm...@apache.org>
> > wrote:
> >
> > > The /metrics endpoint exposes a timeout parameter if you want to
> receive
> > a
> > > response with all of the metrics that were available within the
> timeout,
> > > e.g. /metrics/snapshot.json?timeout=10secs
> > >
> > > I'd recommend using this when collecting metrics so that you can
> maintain
> > > visibility when a particular component is backlogged.
> > >
> > > Should we explore a more reliable way to track metrics independently
> from
> > > > libprocess's queue?
> > >
> > >
> > > Note that this problem applies only to our defer-based "Gauge" metrics
> > that
> > > execute on the actor. Counters and Timers are immune to this. I would
> say
> > > there are a couple of improvements we can make in increasing order of
> > > difficulty:
> > >
> > > (1) There are instances of Gauges that might be better represented as
> > > Counters. For example, we expose the actor queue sizes using a gauge
> > (known
> > > to be unfortunate!), when instead we could expose two counters for
> > > "enqueued" and "dequeued" messages and infer size from these. We can
> also
> > > add the ability for callers to manually increment and decrement their
> > > Gauges rather than go through a dispatch.
> >
> >
> > My feeling is that taking the difference between enqueued and dequeued is
> > not as obvious as a `Gauge`. If we take this path, we should document it
> > clearly how to use the metric.
> >
> >
> > >
> > > (2) Allow Gauge dispatches to be sent to the front of the actor's
> queue,
> > > rather than the back. I would hope that we don't wind up with a notion
> of
> > > integer priority for messages. Note that this doesn't solve the problem
> > for
> > > when the "backlog" is occurring inside a single expensive function. It
> > also
> > > has the issue of preventing "progress" if metrics are hit frequently
> > enough
> > > and are expensive enough.
> >
> >
> > > (3) There are instances of Gauges that might be better represented as
> > > thread-safe logic. For example, if we need an actor's std::map member's
> > > .size(), we could call .size() safely so long as the map is not
> > destructed.
> > > In other cases, explicit locking may be needed and is more complicated.
> > >
> > > (4) There are instances of Gauges that might be better represented as a
> > > "wrapping" around a data-structure. For example, the std::map could be
> > > wrapped as a 'map_wrapper' that injects metric updates into each
> > non-const
> > > operation that affects the size of the map.
> > >
> > > So far I've felt that the timeout and (1) will be sufficient for the
> > > foreseeable future, while (3) and (4) seem to require a significant
> > impact
> > > to non-metrics related code complexity, let me know what you think.
> > >
> >
> > I agree that we should not adopt (2) only to address this problem: it
> seems
> > like something larger and also affects how libprocess was generally
> > designed, so we should think more carefully about that.
> >
> > I like the idea of (3) since it can be implemented gradually, and it can
> > completely avoid paying the cost of enque/deque message (which is another
> > interesting question: how expensive it could be?)
> >
> > (4) seems like a bigger
> >
> > I'm also interested in how other `Gauge` type systems are implemented in
> > other well know OSS projects. Maybe we can do some more research on their
> > approach?
> >
> >
> > > Ben
> > >
> > > On Mon, Dec 19, 2016 at 6:32 PM, Zameer Manji <zm...@apache.org>
> wrote:
> > >
> > > > I believe Zhitao is referring to `/metrics/snapshot` returning a
> result
> > > > after 10-30 seconds.
> > > >
> > > > I think in a typical environment, this will cause most metrics
> > collection
> > > > tooling to timeout. This causes the operator to not have any
> visibility
> > > > into the system, making debugging/fighting the problem very hard.
> > > >
> > > > On Mon, Dec 19, 2016 at 9:23 PM, haosdent <ha...@gmail.com>
> wrote:
> > > >
> > > > > Hi, @zhitao
> > > > >
> > > > > > the `/metrics/snapshot` could take 10-30 seconds to respond.
> > > > >
> > > > > Do you mean it `/metrics/snapshot` return result after 10~30
> seconds?
> > > > > Or `/metrics/snapshot` takes 10~30 seconds to reflect the change
> of `
> > > > > allocator/mesos/event_queue_dispatches gauge`?
> > > > >
> > > > > On Mon, Dec 19, 2016 at 1:11 PM, Zhitao Li <zh...@gmail.com>
> > > > wrote:
> > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > While I was debugging an allocator message queue build up issue
> on
> > > > master
> > > > > > (which I plan to share another thread), I noticed that
> > > > > `/metrics/snapshot`
> > > > > > is also badly affected.
> > > > > >
> > > > > > For example, when the allocator queue has ~3k dispatches in it
> > > > (revealed
> > > > > by
> > > > > > the allocator/mesos/event_queue_dispatches gauge), the
> > > > > `/metrics/snapshot`
> > > > > > could take 10-30 seconds to respond.
> > > > > >
> > > > > > During an active debugging or outage fighting, this is pretty
> > > > undesired.
> > > > > >
> > > > > > My guess is that many stats collection code relies on *deferring*
> > to
> > > > > > another libprocess and collect the result.
> > > > > >
> > > > > > Should we explore a more reliable way to track metrics
> > independently
> > > > from
> > > > > > libprocess's queue?
> > > > > >
> > > > > > --
> > > > > > Cheers,
> > > > > >
> > > > > > Zhitao Li
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Best Regards,
> > > > > Haosdent Huang
> > > > >
> > > > > --
> > > > > Zameer Manji
> > > > >
> > > >
> > >
> >
> >
> >
> > --
> > Cheers,
> >
> > Zhitao Li
> >
>



-- 
Cheers,

Zhitao Li

Re: Metrics collection affected when libprocess queue builds up

Posted by Benjamin Mahler <bm...@apache.org>.
A patch to update the documentation with a NOTE about this would be great.
The endpoint excludes all metrics that were not available within the
timeout; there is no indication within a particular result whether any
timed out and were excluded.

My feeling is that taking the difference between enqueued and dequeued is
> not as obvious as a `Gauge`. If we take this path, we should document it
> clearly how to use the metric.


We could also explore avoiding the dispatch per (3) for this case, since
getting the size is a read-only operation and should be thread-safe without
acquiring the lock.

I'm also interested in how other `Gauge` type systems are implemented in
> other well know OSS projects. Maybe we can do some more research on their
> approach?


We did research when the metrics library was originally added. Coda Hale's
library (now called "dropwizard metrics" FWICT) was one of the libraries we
looked at, and its Gauge is equivalent to suggestion (3), as it was
designed for Java-style programs with locking as opposed to actors:

http://metrics.dropwizard.io/3.1.0/getting-started/#gauges

On Fri, Dec 30, 2016 at 10:34 AM, Zhitao Li <zh...@gmail.com> wrote:

> Hi Benjamin,
>
> Thanks for the response. First time heard of the `timeout` parameter. I'll
> fix our monitoring scripts to always specify this.
>
> One question on timeout: does it simply drop any metric callback which is
> not collected within the timeout? Does caller know which metrics are
> dropped due to timeout?
>
> Also, it is not documented in
> http://mesos.apache.org/documentation/latest/monitoring/. Should we send a
> patch to update it?
>
> On Tue, Dec 27, 2016 at 6:12 PM, Benjamin Mahler <bm...@apache.org>
> wrote:
>
> > The /metrics endpoint exposes a timeout parameter if you want to receive
> a
> > response with all of the metrics that were available within the timeout,
> > e.g. /metrics/snapshot.json?timeout=10secs
> >
> > I'd recommend using this when collecting metrics so that you can maintain
> > visibility when a particular component is backlogged.
> >
> > Should we explore a more reliable way to track metrics independently from
> > > libprocess's queue?
> >
> >
> > Note that this problem applies only to our defer-based "Gauge" metrics
> that
> > execute on the actor. Counters and Timers are immune to this. I would say
> > there are a couple of improvements we can make in increasing order of
> > difficulty:
> >
> > (1) There are instances of Gauges that might be better represented as
> > Counters. For example, we expose the actor queue sizes using a gauge
> (known
> > to be unfortunate!), when instead we could expose two counters for
> > "enqueued" and "dequeued" messages and infer size from these. We can also
> > add the ability for callers to manually increment and decrement their
> > Gauges rather than go through a dispatch.
>
>
> My feeling is that taking the difference between enqueued and dequeued is
> not as obvious as a `Gauge`. If we take this path, we should document it
> clearly how to use the metric.
>
>
> >
> > (2) Allow Gauge dispatches to be sent to the front of the actor's queue,
> > rather than the back. I would hope that we don't wind up with a notion of
> > integer priority for messages. Note that this doesn't solve the problem
> for
> > when the "backlog" is occurring inside a single expensive function. It
> also
> > has the issue of preventing "progress" if metrics are hit frequently
> enough
> > and are expensive enough.
>
>
> > (3) There are instances of Gauges that might be better represented as
> > thread-safe logic. For example, if we need an actor's std::map member's
> > .size(), we could call .size() safely so long as the map is not
> destructed.
> > In other cases, explicit locking may be needed and is more complicated.
> >
> > (4) There are instances of Gauges that might be better represented as a
> > "wrapping" around a data-structure. For example, the std::map could be
> > wrapped as a 'map_wrapper' that injects metric updates into each
> non-const
> > operation that affects the size of the map.
> >
> > So far I've felt that the timeout and (1) will be sufficient for the
> > foreseeable future, while (3) and (4) seem to require a significant
> impact
> > to non-metrics related code complexity, let me know what you think.
> >
>
> I agree that we should not adopt (2) only to address this problem: it seems
> like something larger and also affects how libprocess was generally
> designed, so we should think more carefully about that.
>
> I like the idea of (3) since it can be implemented gradually, and it can
> completely avoid paying the cost of enque/deque message (which is another
> interesting question: how expensive it could be?)
>
> (4) seems like a bigger
>
> I'm also interested in how other `Gauge` type systems are implemented in
> other well know OSS projects. Maybe we can do some more research on their
> approach?
>
>
> > Ben
> >
> > On Mon, Dec 19, 2016 at 6:32 PM, Zameer Manji <zm...@apache.org> wrote:
> >
> > > I believe Zhitao is referring to `/metrics/snapshot` returning a result
> > > after 10-30 seconds.
> > >
> > > I think in a typical environment, this will cause most metrics
> collection
> > > tooling to timeout. This causes the operator to not have any visibility
> > > into the system, making debugging/fighting the problem very hard.
> > >
> > > On Mon, Dec 19, 2016 at 9:23 PM, haosdent <ha...@gmail.com> wrote:
> > >
> > > > Hi, @zhitao
> > > >
> > > > > the `/metrics/snapshot` could take 10-30 seconds to respond.
> > > >
> > > > Do you mean it `/metrics/snapshot` return result after 10~30 seconds?
> > > > Or `/metrics/snapshot` takes 10~30 seconds to reflect the change of `
> > > > allocator/mesos/event_queue_dispatches gauge`?
> > > >
> > > > On Mon, Dec 19, 2016 at 1:11 PM, Zhitao Li <zh...@gmail.com>
> > > wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > While I was debugging an allocator message queue build up issue on
> > > master
> > > > > (which I plan to share another thread), I noticed that
> > > > `/metrics/snapshot`
> > > > > is also badly affected.
> > > > >
> > > > > For example, when the allocator queue has ~3k dispatches in it
> > > (revealed
> > > > by
> > > > > the allocator/mesos/event_queue_dispatches gauge), the
> > > > `/metrics/snapshot`
> > > > > could take 10-30 seconds to respond.
> > > > >
> > > > > During an active debugging or outage fighting, this is pretty
> > > undesired.
> > > > >
> > > > > My guess is that many stats collection code relies on *deferring*
> to
> > > > > another libprocess and collect the result.
> > > > >
> > > > > Should we explore a more reliable way to track metrics
> independently
> > > from
> > > > > libprocess's queue?
> > > > >
> > > > > --
> > > > > Cheers,
> > > > >
> > > > > Zhitao Li
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Best Regards,
> > > > Haosdent Huang
> > > >
> > > > --
> > > > Zameer Manji
> > > >
> > >
> >
>
>
>
> --
> Cheers,
>
> Zhitao Li
>

Re: Metrics collection affected when libprocess queue builds up

Posted by Zhitao Li <zh...@gmail.com>.
Hi Benjamin,

Thanks for the response. This is the first time I've heard of the `timeout`
parameter. I'll fix our monitoring scripts to always specify it.

One question on the timeout: does it simply drop any metric callback that
is not collected within the timeout? Does the caller know which metrics
were dropped due to the timeout?

Also, it is not documented in
http://mesos.apache.org/documentation/latest/monitoring/. Should we send a
patch to update it?

On Tue, Dec 27, 2016 at 6:12 PM, Benjamin Mahler <bm...@apache.org> wrote:

> The /metrics endpoint exposes a timeout parameter if you want to receive a
> response with all of the metrics that were available within the timeout,
> e.g. /metrics/snapshot.json?timeout=10secs
>
> I'd recommend using this when collecting metrics so that you can maintain
> visibility when a particular component is backlogged.
>
> Should we explore a more reliable way to track metrics independently from
> > libprocess's queue?
>
>
> Note that this problem applies only to our defer-based "Gauge" metrics that
> execute on the actor. Counters and Timers are immune to this. I would say
> there are a couple of improvements we can make in increasing order of
> difficulty:
>
> (1) There are instances of Gauges that might be better represented as
> Counters. For example, we expose the actor queue sizes using a gauge (known
> to be unfortunate!), when instead we could expose two counters for
> "enqueued" and "dequeued" messages and infer size from these. We can also
> add the ability for callers to manually increment and decrement their
> Gauges rather than go through a dispatch.


My feeling is that taking the difference between enqueued and dequeued is
not as obvious as a `Gauge`. If we take this path, we should clearly
document how to use the metric.


>
> (2) Allow Gauge dispatches to be sent to the front of the actor's queue,
> rather than the back. I would hope that we don't wind up with a notion of
> integer priority for messages. Note that this doesn't solve the problem for
> when the "backlog" is occurring inside a single expensive function. It also
> has the issue of preventing "progress" if metrics are hit frequently enough
> and are expensive enough.


> (3) There are instances of Gauges that might be better represented as
> thread-safe logic. For example, if we need an actor's std::map member's
> .size(), we could call .size() safely so long as the map is not destructed.
> In other cases, explicit locking may be needed and is more complicated.
>
> (4) There are instances of Gauges that might be better represented as a
> "wrapping" around a data-structure. For example, the std::map could be
> wrapped as a 'map_wrapper' that injects metric updates into each non-const
> operation that affects the size of the map.
>
> So far I've felt that the timeout and (1) will be sufficient for the
> foreseeable future, while (3) and (4) seem to require a significant impact
> to non-metrics related code complexity, let me know what you think.
>

I agree that we should not adopt (2) only to address this problem: it seems
like something larger that also affects how libprocess was generally
designed, so we should think more carefully about that.

I like the idea of (3) since it can be implemented gradually, and it
completely avoids paying the cost of enqueuing/dequeuing a message (which
raises another interesting question: how expensive is that?).

(4) seems like a bigger change.

I'm also interested in how `Gauge`-type systems are implemented in other
well-known OSS projects. Maybe we can do some more research on their
approach?


> Ben
>
> On Mon, Dec 19, 2016 at 6:32 PM, Zameer Manji <zm...@apache.org> wrote:
>
> > I believe Zhitao is referring to `/metrics/snapshot` returning a result
> > after 10-30 seconds.
> >
> > I think in a typical environment, this will cause most metrics collection
> > tooling to timeout. This causes the operator to not have any visibility
> > into the system, making debugging/fighting the problem very hard.
> >
> > On Mon, Dec 19, 2016 at 9:23 PM, haosdent <ha...@gmail.com> wrote:
> >
> > > Hi, @zhitao
> > >
> > > > the `/metrics/snapshot` could take 10-30 seconds to respond.
> > >
> > > Do you mean it `/metrics/snapshot` return result after 10~30 seconds?
> > > Or `/metrics/snapshot` takes 10~30 seconds to reflect the change of `
> > > allocator/mesos/event_queue_dispatches gauge`?
> > >
> > > On Mon, Dec 19, 2016 at 1:11 PM, Zhitao Li <zh...@gmail.com>
> > wrote:
> > >
> > > > Hi all,
> > > >
> > > > While I was debugging an allocator message queue build up issue on
> > master
> > > > (which I plan to share another thread), I noticed that
> > > `/metrics/snapshot`
> > > > is also badly affected.
> > > >
> > > > For example, when the allocator queue has ~3k dispatches in it
> > (revealed
> > > by
> > > > the allocator/mesos/event_queue_dispatches gauge), the
> > > `/metrics/snapshot`
> > > > could take 10-30 seconds to respond.
> > > >
> > > > During an active debugging or outage fighting, this is pretty
> > undesired.
> > > >
> > > > My guess is that many stats collection code relies on *deferring* to
> > > > another libprocess and collect the result.
> > > >
> > > > Should we explore a more reliable way to track metrics independently
> > from
> > > > libprocess's queue?
> > > >
> > > > --
> > > > Cheers,
> > > >
> > > > Zhitao Li
> > > >
> > >
> > >
> > >
> > > --
> > > Best Regards,
> > > Haosdent Huang
> > >
> > > --
> > > Zameer Manji
> > >
> >
>



-- 
Cheers,

Zhitao Li

Re: Metrics collection affected when libprocess queue builds up

Posted by Benjamin Mahler <bm...@apache.org>.
The /metrics endpoint exposes a timeout parameter if you want to receive a
response with all of the metrics that were available within the timeout,
e.g. /metrics/snapshot.json?timeout=10secs

I'd recommend using this when collecting metrics so that you can maintain
visibility when a particular component is backlogged.
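
As a concrete example, a collector written against libprocess's own HTTP
client might pass the timeout like this (the host, port, and error handling
below are only a sketch; any HTTP client works the same way):

// Sketch of polling /metrics/snapshot with an explicit timeout.
// The master host and port below are placeholders.
#include <iostream>
#include <string>

#include <process/future.hpp>
#include <process/http.hpp>

#include <stout/duration.hpp>
#include <stout/hashmap.hpp>

int main()
{
  hashmap<std::string, std::string> query;
  query["timeout"] = "10secs";  // return whatever metrics are ready in 10s

  process::http::URL url(
      "http", "master.example.com", 5050, "/metrics/snapshot", query);

  process::Future<process::http::Response> response =
    process::http::get(url);

  // Block for demonstration only; a real collector would compose futures.
  // Give the server slightly longer than its own timeout.
  if (response.await(Seconds(15)) && response.isReady()) {
    std::cout << response->body << std::endl;
    return 0;
  }

  std::cerr << "metrics request did not complete" << std::endl;
  return 1;
}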

Should we explore a more reliable way to track metrics independently from
> libprocess's queue?


Note that this problem applies only to our defer-based "Gauge" metrics that
execute on the actor. Counters and Timers are immune to this. I would say
there are a couple of improvements we can make in increasing order of
difficulty:

(1) There are instances of Gauges that might be better represented as
Counters. For example, we expose the actor queue sizes using a gauge (known
to be unfortunate!), when instead we could expose two counters for
"enqueued" and "dequeued" messages and infer size from these. We can also
add the ability for callers to manually increment and decrement their
Gauges rather than go through a dispatch.
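
For what it's worth, the counter variant might look roughly like the sketch
below (metric names are made up); the collector computes the queue size as
enqueued minus dequeued:

// Sketch of option (1): two counters maintained on the event path, so no
// dispatch to the actor is needed when the metrics are read.
#include <process/metrics/counter.hpp>
#include <process/metrics/metrics.hpp>

struct QueueMetrics
{
  QueueMetrics()
    : enqueued("example/events_enqueued"),
      dequeued("example/events_dequeued")
  {
    process::metrics::add(enqueued);
    process::metrics::add(dequeued);
  }

  ~QueueMetrics()
  {
    process::metrics::remove(enqueued);
    process::metrics::remove(dequeued);
  }

  process::metrics::Counter enqueued;
  process::metrics::Counter dequeued;
};

// On the event path (incremented once per event):
//   ++metrics.enqueued;   // when an event is queued
//   ++metrics.dequeued;   // when an event is dequeued for processing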

(2) Allow Gauge dispatches to be sent to the front of the actor's queue,
rather than the back. I would hope that we don't wind up with a notion of
integer priority for messages. Note that this doesn't solve the problem for
when the "backlog" is occurring inside a single expensive function. It also
has the issue of preventing "progress" if metrics are hit frequently enough
and are expensive enough.

(3) There are instances of Gauges that might be better represented as
thread-safe logic. For example, if we need an actor's std::map member's
.size(), we could call .size() safely so long as the map is not destructed.
In other cases, explicit locking may be needed and is more complicated.
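
A sketch of the explicit-locking flavor of (3), where the value is read
under a mutex on the caller's thread rather than via a dispatch (the types
and names are illustrative):

// Sketch only: a gauge callback would call taskCount() directly from the
// metrics endpoint's thread, paying for a mutex instead of a dispatch.
#include <map>
#include <mutex>
#include <string>

class TaskTracker
{
public:
  void add(const std::string& id, int value)
  {
    std::lock_guard<std::mutex> lock(mutex);
    tasks[id] = value;
  }

  // Safe to call without touching the actor's queue; the cost is that
  // every update to `tasks` must also take `mutex`.
  double taskCount() const
  {
    std::lock_guard<std::mutex> lock(mutex);
    return static_cast<double>(tasks.size());
  }

private:
  mutable std::mutex mutex;
  std::map<std::string, int> tasks;
};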

(4) There are instances of Gauges that might be better represented as a
"wrapping" around a data-structure. For example, the std::map could be
wrapped as a 'map_wrapper' that injects metric updates into each non-const
operation that affects the size of the map.
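
And a sketch of what the wrapping in (4) could look like, keeping an atomic
count alongside the map so a metric can read the size cheaply (the
'map_wrapper' below is illustrative, not existing code):

// Sketch of option (4): every size-changing operation also maintains an
// atomic count that a gauge can read without locking or dispatching.
#include <atomic>
#include <cstddef>
#include <map>

template <typename K, typename V>
class map_wrapper
{
public:
  void put(const K& key, const V& value)
  {
    if (data.emplace(key, value).second) {
      ++count;
    }
  }

  void erase(const K& key)
  {
    count -= data.erase(key);
  }

  // Cheap, thread-safe read for a gauge callback; the map itself is
  // still only mutated from the owning actor.
  std::size_t size() const { return count.load(); }

private:
  std::map<K, V> data;
  std::atomic<std::size_t> count{0};
};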

So far I've felt that the timeout and (1) will be sufficient for the
foreseeable future, while (3) and (4) seem to add significant complexity to
non-metrics-related code. Let me know what you think.

Ben

On Mon, Dec 19, 2016 at 6:32 PM, Zameer Manji <zm...@apache.org> wrote:

> I believe Zhitao is referring to `/metrics/snapshot` returning a result
> after 10-30 seconds.
>
> I think in a typical environment, this will cause most metrics collection
> tooling to timeout. This causes the operator to not have any visibility
> into the system, making debugging/fighting the problem very hard.
>
> On Mon, Dec 19, 2016 at 9:23 PM, haosdent <ha...@gmail.com> wrote:
>
> > Hi, @zhitao
> >
> > > the `/metrics/snapshot` could take 10-30 seconds to respond.
> >
> > Do you mean it `/metrics/snapshot` return result after 10~30 seconds?
> > Or `/metrics/snapshot` takes 10~30 seconds to reflect the change of `
> > allocator/mesos/event_queue_dispatches gauge`?
> >
> > On Mon, Dec 19, 2016 at 1:11 PM, Zhitao Li <zh...@gmail.com>
> wrote:
> >
> > > Hi all,
> > >
> > > While I was debugging an allocator message queue build up issue on
> master
> > > (which I plan to share another thread), I noticed that
> > `/metrics/snapshot`
> > > is also badly affected.
> > >
> > > For example, when the allocator queue has ~3k dispatches in it
> (revealed
> > by
> > > the allocator/mesos/event_queue_dispatches gauge), the
> > `/metrics/snapshot`
> > > could take 10-30 seconds to respond.
> > >
> > > During an active debugging or outage fighting, this is pretty
> undesired.
> > >
> > > My guess is that many stats collection code relies on *deferring* to
> > > another libprocess and collect the result.
> > >
> > > Should we explore a more reliable way to track metrics independently
> from
> > > libprocess's queue?
> > >
> > > --
> > > Cheers,
> > >
> > > Zhitao Li
> > >
> >
> >
> >
> > --
> > Best Regards,
> > Haosdent Huang
> >
> > --
> > Zameer Manji
> >
>

Re: Metrics collection affected when libprocess queue builds up

Posted by Zameer Manji <zm...@apache.org>.
I believe Zhitao is referring to `/metrics/snapshot` returning a result
after 10-30 seconds.

I think in a typical environment, this will cause most metrics collection
tooling to time out. That leaves the operator without any visibility into
the system, making debugging/firefighting the problem very hard.

On Mon, Dec 19, 2016 at 9:23 PM, haosdent <ha...@gmail.com> wrote:

> Hi, @zhitao
>
> > the `/metrics/snapshot` could take 10-30 seconds to respond.
>
> Do you mean it `/metrics/snapshot` return result after 10~30 seconds?
> Or `/metrics/snapshot` takes 10~30 seconds to reflect the change of `
> allocator/mesos/event_queue_dispatches gauge`?
>
> On Mon, Dec 19, 2016 at 1:11 PM, Zhitao Li <zh...@gmail.com> wrote:
>
> > Hi all,
> >
> > While I was debugging an allocator message queue build up issue on master
> > (which I plan to share another thread), I noticed that
> `/metrics/snapshot`
> > is also badly affected.
> >
> > For example, when the allocator queue has ~3k dispatches in it (revealed
> by
> > the allocator/mesos/event_queue_dispatches gauge), the
> `/metrics/snapshot`
> > could take 10-30 seconds to respond.
> >
> > During an active debugging or outage fighting, this is pretty undesired.
> >
> > My guess is that many stats collection code relies on *deferring* to
> > another libprocess and collect the result.
> >
> > Should we explore a more reliable way to track metrics independently from
> > libprocess's queue?
> >
> > --
> > Cheers,
> >
> > Zhitao Li
> >
>
>
>
> --
> Best Regards,
> Haosdent Huang
>
> --
> Zameer Manji
>

Re: Metrics collection affected when libprocess queue builds up

Posted by haosdent <ha...@gmail.com>.
Hi, @zhitao

> the `/metrics/snapshot` could take 10-30 seconds to respond.

Do you mean that `/metrics/snapshot` returns a result after 10-30 seconds?
Or that `/metrics/snapshot` takes 10-30 seconds to reflect a change in the
`allocator/mesos/event_queue_dispatches` gauge?

On Mon, Dec 19, 2016 at 1:11 PM, Zhitao Li <zh...@gmail.com> wrote:

> Hi all,
>
> While I was debugging an allocator message queue build up issue on master
> (which I plan to share another thread), I noticed that `/metrics/snapshot`
> is also badly affected.
>
> For example, when the allocator queue has ~3k dispatches in it (revealed by
> the allocator/mesos/event_queue_dispatches gauge), the `/metrics/snapshot`
> could take 10-30 seconds to respond.
>
> During an active debugging or outage fighting, this is pretty undesired.
>
> My guess is that many stats collection code relies on *deferring* to
> another libprocess and collect the result.
>
> Should we explore a more reliable way to track metrics independently from
> libprocess's queue?
>
> --
> Cheers,
>
> Zhitao Li
>



-- 
Best Regards,
Haosdent Huang