Posted to user@mesos.apache.org by Michał Łowicki <ml...@gmail.com> on 2016/07/07 19:48:49 UTC

Monitoring at container level

Hi,

Before introducing Mesos we were using mainly Graphite / Grafana. Ideally we
would like to have metrics per container as an easy way to detect whether a
problem affects a single container, a subset of containers, or all of them.

Unfortunately, using Graphite for that is far from perfect. Having the
container identifier as part of the metric name has many negative
implications, such as tons of new metrics on every Marathon release
(new containers = new identifiers).

I have investigated InfluxDB so far, but the project isn't mature enough, as
components like
https://github.com/influxdata/telegraf/blob/master/plugins/inputs/statsd/README.md#influx-statsd
still have major blockers:

COMING SOON: there will be a way to specify multiple fields.


What do you use to monitor your Mesos clusters and, for example, to detect
that some containers are having issues?

-- 
BR,
Michał Łowicki

Re: Monitoring at container level

Posted by Michał Łowicki <ml...@gmail.com>.

For now the easiest solution that doesn't require changing our monitoring
infrastructure would be the one proposed by Steven Schlansker, so I will try
to get some information about it from the Marathon team. Thanks!

On Fri, Jul 8, 2016 at 5:20 AM, <co...@gmail.com> wrote:

> Small plug for snap (https://github.com/intelsdi-x/snap). It's a
> telemetry framework with a lot of useful plugins for collecting, processing
> and publishing metrics. There's a go API (and soon more langs) for writing
> your own plugins. Plugin catalog:
> https://github.com/intelsdi-x/snap/blob/master/docs/PLUGIN_CATALOG.md
>
> On Jul 7, 2016, at 17:34, Guangya Liu <gy...@gmail.com> wrote:
>
> Have you ever tried prometheus + Grafana? Please take a look at
> https://prometheus.io/docs/visualization/grafana/ to see if it helps.
>
> On Fri, Jul 8, 2016 at 5:51 AM, David Kesler <DK...@yodle.com> wrote:
>
>> We use a combination of new relic for application level monitoring and a
>> custom python script that scrapes a bunch of stats from the docker socket
>> file and throws them into elastic so we can use kibana to make graphs.
>>
>>
>>
>> *From:* Gregory Durham [mailto:gregory.durham@gmail.com]
>> *Sent:* Thursday, July 07, 2016 4:58 PM
>> *To:* user@mesos.apache.org
>> *Cc:* krishnan.k.iyer@gmail.com; Michał Łowicki
>> *Subject:* Re: Monitoring at container level
>>
>>
>>
>> I have been using datadog to monitor my infrastructure. The integration
>> into service discovery has been really helpful for these environments.
>>
>>
>>
>> On Thu, Jul 7, 2016 at 1:37 PM, Steven Schlansker <
>> sschlansker@opentable.com> wrote:
>>
>> We use Graphite and ran into similar problems with huge metric namespaces.
>> We use the Singularity framework which provides both the task "request
>> id" (name)
>> and "instance number" (0..N) to the task.
>>
>> So we set our Graphite namespace to be "request-number" e.g. "myservice-3"
>> This has the downside of discontinuous data when you deploy a new release
>> but we haven't had too many issues due to that in practice.
>>
>>
>>
>> > On Jul 7, 2016, at 1:26 PM, Krish <kr...@gmail.com> wrote:
>> >
>> > I have had a good experience so far with bosun and scollector with
>> cadvisor.
>> > Check it out at bosun.org.
>> >
>> >
>> > On Friday 8 July 2016, Pradeep Chhetri <pr...@gmail.com>
>> wrote:
>> > Hi Michal,
>> >
>> > Do have a look at sysdig (http://www.sysdig.org). It is basically an
>> open-source tool which provides container insights. Maybe your will find
>> something helpful over there.
>> >
>> > To tackle the case of new metrics for new containers, maybe you should
>> tag metrics by service-name instead of container id. (Graphite doesn't have
>> concept of tags but something like opentsdb and influxdb do have. I don't
>> see a reason to replace graphite for that. You can use your service-name
>> (which the container is representing) instead of hostname in the metrics
>> name)
>> >
>> > On Fri, Jul 8, 2016 at 1:18 AM, Michał Łowicki <ml...@gmail.com>
>> wrote:
>> > Hi,
>> >
>> > Before introducing Mesos we're using mainly Graphite / Grafana. Ideally
>> we would like to have metrics per container as an easy way to detect if
>> problem touches only single, subset of containers or it's global.
>> >
>> > Unfortunately using Graphite for that is far from being perfect. Having
>> container identifier as a part of metric has many negative implications
>> like having tons of new metrics every release on Marathon (new containers =
>> new identifiers).
>> >
>> > Investigated InfluxDB so far but project isn't mature enough as still
>> components like
>> https://github.com/influxdata/telegraf/blob/master/plugins/inputs/statsd/README.md#influx-statsd
>> have major blockers:
>> >
>> > COMING SOON: there will be a way to specify multiple fields.
>> >
>> > What do you use to monitor your Mesos clusters and f.ex. to detect that
>> some containers are having issues?
>> >
>> > --
>> > BR,
>> > Michał Łowicki
>> >
>> >
>> >
>> > --
>> > Regards,
>> > Pradeep Chhetri
>> >
>> >
>> > --
>> >
>> > Thumb typed mail
>> >
>>
>>
>>
>
>


-- 
BR,
Michał Łowicki

Re: Monitoring at container level

Posted by co...@gmail.com.

Small plug for snap (https://github.com/intelsdi-x/snap). It's a telemetry framework with a lot of useful plugins for collecting, processing and publishing metrics. There's a Go API (and soon more languages) for writing your own plugins. Plugin catalog: https://github.com/intelsdi-x/snap/blob/master/docs/PLUGIN_CATALOG.md

> On Jul 7, 2016, at 17:34, Guangya Liu <gy...@gmail.com> wrote:
> 
> Have you ever tried prometheus + Grafana? Please take a look at https://prometheus.io/docs/visualization/grafana/ to see if it helps.
> 
>> On Fri, Jul 8, 2016 at 5:51 AM, David Kesler <DK...@yodle.com> wrote:
>> We use a combination of new relic for application level monitoring and a custom python script that scrapes a bunch of stats from the docker socket file and throws them into elastic so we can use kibana to make graphs. 
>> 
>>  
>> 
>> From: Gregory Durham [mailto:gregory.durham@gmail.com] 
>> Sent: Thursday, July 07, 2016 4:58 PM
>> To: user@mesos.apache.org
>> Cc: krishnan.k.iyer@gmail.com; Michał Łowicki
>> Subject: Re: Monitoring at container level
>> 
>>  
>> 
>> I have been using datadog to monitor my infrastructure. The integration into service discovery has been really helpful for these environments. 
>> 
>>  
>> 
>> On Thu, Jul 7, 2016 at 1:37 PM, Steven Schlansker <ss...@opentable.com> wrote:
>> 
>> We use Graphite and ran into similar problems with huge metric namespaces.
>> We use the Singularity framework which provides both the task "request id" (name)
>> and "instance number" (0..N) to the task.
>> 
>> So we set our Graphite namespace to be "request-number" e.g. "myservice-3"
>> This has the downside of discontinuous data when you deploy a new release
>> but we haven't had too many issues due to that in practice.
>> 
>> 
>> 
>> > On Jul 7, 2016, at 1:26 PM, Krish <kr...@gmail.com> wrote:
>> >
>> > I have had a good experience so far with bosun and scollector with cadvisor.
>> > Check it out at bosun.org.
>> >
>> >
>> > On Friday 8 July 2016, Pradeep Chhetri <pr...@gmail.com> wrote:
>> > Hi Michal,
>> >
>> > Do have a look at sysdig (http://www.sysdig.org). It is basically an open-source tool which provides container insights. Maybe your will find something helpful over there.
>> >
>> > To tackle the case of new metrics for new containers, maybe you should tag metrics by service-name instead of container id. (Graphite doesn't have concept of tags but something like opentsdb and influxdb do have. I don't see a reason to replace graphite for that. You can use your service-name (which the container is representing) instead of hostname in the metrics name)
>> >
>> > On Fri, Jul 8, 2016 at 1:18 AM, Michał Łowicki <ml...@gmail.com> wrote:
>> > Hi,
>> >
>> > Before introducing Mesos we're using mainly Graphite / Grafana. Ideally we would like to have metrics per container as an easy way to detect if problem touches only single, subset of containers or it's global.
>> >
>> > Unfortunately using Graphite for that is far from being perfect. Having container identifier as a part of metric has many negative implications like having tons of new metrics every release on Marathon (new containers = new identifiers).
>> >
>> > Investigated InfluxDB so far but project isn't mature enough as still components like https://github.com/influxdata/telegraf/blob/master/plugins/inputs/statsd/README.md#influx-statsd have major blockers:
>> >
>> > COMING SOON: there will be a way to specify multiple fields.
>> >
>> > What do you use to monitor your Mesos clusters and f.ex. to detect that some containers are having issues?
>> >
>> > --
>> > BR,
>> > Michał Łowicki
>> >
>> >
>> >
>> > --
>> > Regards,
>> > Pradeep Chhetri
>> >
>> >
>> > --
>> >
>> > Thumb typed mail
>> >
>> 
>

Re: Monitoring at container level

Posted by Guangya Liu <gy...@gmail.com>.

Have you tried Prometheus + Grafana? Please take a look at
https://prometheus.io/docs/visualization/grafana/ to see if it helps.

On Fri, Jul 8, 2016 at 5:51 AM, David Kesler <DK...@yodle.com> wrote:

> We use a combination of new relic for application level monitoring and a
> custom python script that scrapes a bunch of stats from the docker socket
> file and throws them into elastic so we can use kibana to make graphs.
>
>
>
> *From:* Gregory Durham [mailto:gregory.durham@gmail.com]
> *Sent:* Thursday, July 07, 2016 4:58 PM
> *To:* user@mesos.apache.org
> *Cc:* krishnan.k.iyer@gmail.com; Michał Łowicki
> *Subject:* Re: Monitoring at container level
>
>
>
> I have been using datadog to monitor my infrastructure. The integration
> into service discovery has been really helpful for these environments.
>
>
>
> On Thu, Jul 7, 2016 at 1:37 PM, Steven Schlansker <
> sschlansker@opentable.com> wrote:
>
> We use Graphite and ran into similar problems with huge metric namespaces.
> We use the Singularity framework which provides both the task "request id"
> (name)
> and "instance number" (0..N) to the task.
>
> So we set our Graphite namespace to be "request-number" e.g. "myservice-3"
> This has the downside of discontinuous data when you deploy a new release
> but we haven't had too many issues due to that in practice.
>
>
>
> > On Jul 7, 2016, at 1:26 PM, Krish <kr...@gmail.com> wrote:
> >
> > I have had a good experience so far with bosun and scollector with
> cadvisor.
> > Check it out at bosun.org.
> >
> >
> > On Friday 8 July 2016, Pradeep Chhetri <pr...@gmail.com>
> wrote:
> > Hi Michal,
> >
> > Do have a look at sysdig (http://www.sysdig.org). It is basically an
> open-source tool which provides container insights. Maybe your will find
> something helpful over there.
> >
> > To tackle the case of new metrics for new containers, maybe you should
> tag metrics by service-name instead of container id. (Graphite doesn't have
> concept of tags but something like opentsdb and influxdb do have. I don't
> see a reason to replace graphite for that. You can use your service-name
> (which the container is representing) instead of hostname in the metrics
> name)
> >
> > On Fri, Jul 8, 2016 at 1:18 AM, Michał Łowicki <ml...@gmail.com>
> wrote:
> > Hi,
> >
> > Before introducing Mesos we're using mainly Graphite / Grafana. Ideally
> we would like to have metrics per container as an easy way to detect if
> problem touches only single, subset of containers or it's global.
> >
> > Unfortunately using Graphite for that is far from being perfect. Having
> container identifier as a part of metric has many negative implications
> like having tons of new metrics every release on Marathon (new containers =
> new identifiers).
> >
> > Investigated InfluxDB so far but project isn't mature enough as still
> components like
> https://github.com/influxdata/telegraf/blob/master/plugins/inputs/statsd/README.md#influx-statsd
> have major blockers:
> >
> > COMING SOON: there will be a way to specify multiple fields.
> >
> > What do you use to monitor your Mesos clusters and f.ex. to detect that
> some containers are having issues?
> >
> > --
> > BR,
> > Michał Łowicki
> >
> >
> >
> > --
> > Regards,
> > Pradeep Chhetri
> >
> >
> > --
> >
> > Thumb typed mail
> >
>
>
>

RE: Monitoring at container level

Posted by David Kesler <DK...@yodle.com>.

We use a combination of New Relic for application-level monitoring and a custom Python script that scrapes a bunch of stats from the Docker socket file and throws them into Elastic so we can use Kibana to make graphs.
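
A minimal sketch of what such a scraper could look like, not the actual script described above. It assumes the Docker daemon listens on the default /var/run/docker.sock and reads one stats snapshot through the Docker Engine API; the names UnixHTTPConnection, fetch_stats and to_document are hypothetical, and the indexing step into Elastic is left out.

```python
import http.client
import json
import socket

# Default Docker socket location; an assumption about the host setup.
DOCKER_SOCKET = "/var/run/docker.sock"


class UnixHTTPConnection(http.client.HTTPConnection):
    """HTTP connection that talks to a Unix domain socket instead of TCP."""

    def __init__(self, socket_path: str):
        super().__init__("localhost")
        self.socket_path = socket_path

    def connect(self):
        self.sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        self.sock.connect(self.socket_path)


def fetch_stats(container_id: str, socket_path: str = DOCKER_SOCKET) -> dict:
    """Read a single (non-streaming) stats snapshot for one container."""
    conn = UnixHTTPConnection(socket_path)
    try:
        conn.request("GET", f"/containers/{container_id}/stats?stream=false")
        return json.loads(conn.getresponse().read())
    finally:
        conn.close()


def to_document(container_name: str, stats: dict) -> dict:
    """Flatten a stats snapshot into a small document suitable for indexing."""
    cpu = stats.get("cpu_stats", {})
    mem = stats.get("memory_stats", {})
    return {
        "container": container_name,
        "timestamp": stats.get("read"),
        "cpu_total_usage": cpu.get("cpu_usage", {}).get("total_usage"),
        "memory_usage": mem.get("usage"),
        "memory_limit": mem.get("limit"),
    }
```

A collector loop would call fetch_stats for each running container and hand the result of to_document to whatever client writes into the search index.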

From: Gregory Durham [mailto:gregory.durham@gmail.com]
Sent: Thursday, July 07, 2016 4:58 PM
To: user@mesos.apache.org
Cc: krishnan.k.iyer@gmail.com; Michał Łowicki
Subject: Re: Monitoring at container level

I have been using datadog to monitor my infrastructure. The integration into service discovery has been really helpful for these environments.

On Thu, Jul 7, 2016 at 1:37 PM, Steven Schlansker <ss...@opentable.com> wrote:
We use Graphite and ran into similar problems with huge metric namespaces.
We use the Singularity framework which provides both the task "request id" (name)
and "instance number" (0..N) to the task.

So we set our Graphite namespace to be "request-number" e.g. "myservice-3"
This has the downside of discontinuous data when you deploy a new release
but we haven't had too many issues due to that in practice.


> On Jul 7, 2016, at 1:26 PM, Krish <kr...@gmail.com> wrote:
>
> I have had a good experience so far with bosun and scollector with cadvisor.
> Check it out at bosun.org.
>
>
> On Friday 8 July 2016, Pradeep Chhetri <pr...@gmail.com> wrote:
> Hi Michal,
>
> Do have a look at sysdig (http://www.sysdig.org). It is basically an open-source tool which provides container insights. Maybe your will find something helpful over there.
>
> To tackle the case of new metrics for new containers, maybe you should tag metrics by service-name instead of container id. (Graphite doesn't have concept of tags but something like opentsdb and influxdb do have. I don't see a reason to replace graphite for that. You can use your service-name (which the container is representing) instead of hostname in the metrics name)
>
> On Fri, Jul 8, 2016 at 1:18 AM, Michał Łowicki <ml...@gmail.com> wrote:
> Hi,
>
> Before introducing Mesos we're using mainly Graphite / Grafana. Ideally we would like to have metrics per container as an easy way to detect if problem touches only single, subset of containers or it's global.
>
> Unfortunately using Graphite for that is far from being perfect. Having container identifier as a part of metric has many negative implications like having tons of new metrics every release on Marathon (new containers = new identifiers).
>
> Investigated InfluxDB so far but project isn't mature enough as still components like https://github.com/influxdata/telegraf/blob/master/plugins/inputs/statsd/README.md#influx-statsd have major blockers:
>
> COMING SOON: there will be a way to specify multiple fields.
>
> What do you use to monitor your Mesos clusters and f.ex. to detect that some containers are having issues?
>
> --
> BR,
> Michał Łowicki
>
>
>
> --
> Regards,
> Pradeep Chhetri
>
>
> --
>
> Thumb typed mail
>

Re: Monitoring at container level

Posted by Gregory Durham <gr...@gmail.com>.

I have been using Datadog to monitor my infrastructure. Its integration
with service discovery has been really helpful for these environments.

On Thu, Jul 7, 2016 at 1:37 PM, Steven Schlansker <sschlansker@opentable.com> wrote:

> We use Graphite and ran into similar problems with huge metric namespaces.
> We use the Singularity framework which provides both the task "request id"
> (name)
> and "instance number" (0..N) to the task.
>
> So we set our Graphite namespace to be "request-number" e.g. "myservice-3"
> This has the downside of discontinuous data when you deploy a new release
> but we haven't had too many issues due to that in practice.
>
>
> > On Jul 7, 2016, at 1:26 PM, Krish <kr...@gmail.com> wrote:
> >
> > I have had a good experience so far with bosun and scollector with
> cadvisor.
> > Check it out at bosun.org.
> >
> >
> > On Friday 8 July 2016, Pradeep Chhetri <pr...@gmail.com>
> wrote:
> > Hi Michal,
> >
> > Do have a look at sysdig (http://www.sysdig.org). It is basically an
> open-source tool which provides container insights. Maybe your will find
> something helpful over there.
> >
> > To tackle the case of new metrics for new containers, maybe you should
> tag metrics by service-name instead of container id. (Graphite doesn't have
> concept of tags but something like opentsdb and influxdb do have. I don't
> see a reason to replace graphite for that. You can use your service-name
> (which the container is representing) instead of hostname in the metrics
> name)
> >
> > On Fri, Jul 8, 2016 at 1:18 AM, Michał Łowicki <ml...@gmail.com>
> wrote:
> > Hi,
> >
> > Before introducing Mesos we're using mainly Graphite / Grafana. Ideally
> we would like to have metrics per container as an easy way to detect if
> problem touches only single, subset of containers or it's global.
> >
> > Unfortunately using Graphite for that is far from being perfect. Having
> container identifier as a part of metric has many negative implications
> like having tons of new metrics every release on Marathon (new containers =
> new identifiers).
> >
> > Investigated InfluxDB so far but project isn't mature enough as still
> components like
> https://github.com/influxdata/telegraf/blob/master/plugins/inputs/statsd/README.md#influx-statsd
> have major blockers:
> >
> > COMING SOON: there will be a way to specify multiple fields.
> >
> > What do you use to monitor your Mesos clusters and f.ex. to detect that
> some containers are having issues?
> >
> > --
> > BR,
> > Michał Łowicki
> >
> >
> >
> > --
> > Regards,
> > Pradeep Chhetri
> >
> >
> > --
> >
> > Thumb typed mail
> >
>
>

Re: Monitoring at container level

Posted by Steven Schlansker <ss...@opentable.com>.

We use Graphite and ran into similar problems with huge metric namespaces.
We use the Singularity framework, which provides both the task "request id" (name)
and "instance number" (0..N) to the task.

So we set our Graphite namespace to be "request-number", e.g. "myservice-3".
This has the downside of discontinuous data when you deploy a new release,
but we haven't had too many issues due to that in practice.
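
The naming scheme above can be sketched as follows. This is an illustration under assumptions, not the code used at OpenTable: the helper names graphite_namespace and graphite_line are hypothetical, and how the request id and instance number reach the task is not shown. The output line follows Graphite's plaintext protocol ("path value timestamp").

```python
import time


def graphite_namespace(request_id: str, instance_no: int) -> str:
    """Build a stable namespace such as "myservice-3" from the task's
    request id and instance number, so it survives container restarts."""
    return f"{request_id}-{instance_no}"


def graphite_line(namespace: str, metric: str, value, timestamp=None) -> str:
    """Format one data point for Graphite's plaintext protocol."""
    ts = int(time.time()) if timestamp is None else int(timestamp)
    return f"{namespace}.{metric} {value} {ts}\n"


ns = graphite_namespace("myservice", 3)
print(graphite_line(ns, "requests.count", 42, 1467920929), end="")
# prints: myservice-3.requests.count 42 1467920929
```

Because the instance number is reused across releases, the set of metric names stays bounded by the number of instances rather than growing with every new container.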


> On Jul 7, 2016, at 1:26 PM, Krish <kr...@gmail.com> wrote:
> 
> I have had a good experience so far with bosun and scollector with cadvisor.
> Check it out at bosun.org.
> 
> 
> On Friday 8 July 2016, Pradeep Chhetri <pr...@gmail.com> wrote:
> Hi Michal,
> 
> Do have a look at sysdig (http://www.sysdig.org). It is basically an open-source tool which provides container insights. Maybe your will find something helpful over there.
> 
> To tackle the case of new metrics for new containers, maybe you should tag metrics by service-name instead of container id. (Graphite doesn't have concept of tags but something like opentsdb and influxdb do have. I don't see a reason to replace graphite for that. You can use your service-name (which the container is representing) instead of hostname in the metrics name)
> 
> On Fri, Jul 8, 2016 at 1:18 AM, Michał Łowicki <ml...@gmail.com> wrote:
> Hi,
> 
> Before introducing Mesos we're using mainly Graphite / Grafana. Ideally we would like to have metrics per container as an easy way to detect if problem touches only single, subset of containers or it's global.
> 
> Unfortunately using Graphite for that is far from being perfect. Having container identifier as a part of metric has many negative implications like having tons of new metrics every release on Marathon (new containers = new identifiers).
> 
> Investigated InfluxDB so far but project isn't mature enough as still components like https://github.com/influxdata/telegraf/blob/master/plugins/inputs/statsd/README.md#influx-statsd have major blockers:
> 
> COMING SOON: there will be a way to specify multiple fields.
> 
> What do you use to monitor your Mesos clusters and f.ex. to detect that some containers are having issues?
> 
> --
> BR,
> Michał Łowicki
> 
> 
> 
> --
> Regards,
> Pradeep Chhetri
> 
> 
> --
> 
> Thumb typed mail
>

Re: Monitoring at container level

Posted by Krish <kr...@gmail.com>.

I have had a good experience so far with Bosun and scollector with cAdvisor.
Check it out at bosun.org.


On Friday 8 July 2016, Pradeep Chhetri <pr...@gmail.com> wrote:

> Hi Michal,
>
> Do have a look at sysdig (http://www.sysdig.org). It is basically an
> open-source tool which provides container insights. Maybe your will find
> something helpful over there.
>
> To tackle the case of new metrics for new containers, maybe you should tag
> metrics by service-name instead of container id. (Graphite doesn't have
> concept of tags but something like opentsdb and influxdb do have. I don't
> see a reason to replace graphite for that. You can use your service-name
> (which the container is representing) instead of hostname in the metrics
> name)
>
> On Fri, Jul 8, 2016 at 1:18 AM, Michał Łowicki <mlowicki@gmail.com> wrote:
>
>> Hi,
>>
>> Before introducing Mesos we're using mainly Graphite / Grafana. Ideally
>> we would like to have metrics per container as an easy way to detect if
>> problem touches only single, subset of containers or it's global.
>>
>> Unfortunately using Graphite for that is far from being perfect. Having
>> container identifier as a part of metric has many negative implications
>> like having tons of new metrics every release on Marathon (new containers =
>> new identifiers).
>>
>> Investigated InfluxDB so far but project isn't mature enough as still
>> components like
>> https://github.com/influxdata/telegraf/blob/master/plugins/inputs/statsd/README.md#influx-statsd
>> have major blockers:
>>
>> COMING SOON: there will be a way to specify multiple fields.
>>
>>
>> What do you use to monitor your Mesos clusters and f.ex. to detect that
>> some containers are having issues?
>>
>> --
>> BR,
>> Michał Łowicki
>>
>
>
>
> --
> Regards,
> Pradeep Chhetri
>


-- 

Thumb typed mail

Re: Monitoring at container level

Posted by Pradeep Chhetri <pr...@gmail.com>.

Hi Michal,

Do have a look at sysdig (http://www.sysdig.org). It is basically an
open-source tool which provides container insights. Maybe you will find
something helpful over there.

To tackle the case of new metrics for new containers, maybe you should tag
metrics by service name instead of container id. (Graphite doesn't have a
concept of tags, but something like OpenTSDB and InfluxDB do. I don't
see a reason to replace Graphite for that. You can use your service name
(which the container represents) instead of the hostname in the metric
name.)
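
To make the difference concrete, here is a small simulation that counts how many distinct metric names each naming scheme produces over several releases. The name layouts ("containers.<id>...", "services.<name>...") and the generated container ids are hypothetical and only serve the illustration.

```python
def name_by_container(container_id: str, metric: str) -> str:
    """Metric name keyed by container id (new name for every new container)."""
    return f"containers.{container_id}.{metric}"


def name_by_service(service: str, metric: str) -> str:
    """Metric name keyed by service name (stable across releases)."""
    return f"services.{service}.{metric}"


def count_series(releases: int, containers_per_release: int):
    """Return (names keyed by container id, names keyed by service name)."""
    by_container = set()
    by_service = set()
    for release in range(releases):
        for index in range(containers_per_release):
            # Stand-in for the fresh identifier each new container receives.
            container_id = f"r{release}c{index}"
            by_container.add(name_by_container(container_id, "cpu.user"))
            by_service.add(name_by_service("myservice", "cpu.user"))
    return len(by_container), len(by_service)


print(count_series(releases=5, containers_per_release=4))
# prints: (20, 1)
```

Keying by service name alone merges all instances into one series, so it answers "is the service healthy" but not "which container is misbehaving"; adding a small, reused instance number to the name keeps instances apart without unbounded growth.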

On Fri, Jul 8, 2016 at 1:18 AM, Michał Łowicki <ml...@gmail.com> wrote:

> Hi,
>
> Before introducing Mesos we're using mainly Graphite / Grafana. Ideally we
> would like to have metrics per container as an easy way to detect if
> problem touches only single, subset of containers or it's global.
>
> Unfortunately using Graphite for that is far from being perfect. Having
> container identifier as a part of metric has many negative implications
> like having tons of new metrics every release on Marathon (new containers =
> new identifiers).
>
> Investigated InfluxDB so far but project isn't mature enough as still
> components like
> https://github.com/influxdata/telegraf/blob/master/plugins/inputs/statsd/README.md#influx-statsd
> have major blockers:
>
> COMING SOON: there will be a way to specify multiple fields.
>
>
> What do you use to monitor your Mesos clusters and f.ex. to detect that
> some containers are having issues?
>
> --
> BR,
> Michał Łowicki
>

-- 
Regards,
Pradeep Chhetri