
Posted to dev@helix.apache.org by Kanak Biscuitwala <ka...@hotmail.com> on 2013/11/18 21:18:00 UTC

Design for a Helix Monitoring Framework

Hi,

We've sketched out an initial high-level design for a new monitoring framework for Helix. The primary goal is to decrease our time to detect soft failures, but this is really just a design to help propagate statistics around in any Helix-managed system. Any feedback is appreciated.

The document is available on the wiki: https://cwiki.apache.org/confluence/display/HELIX/Helix+Monitoring+Design

Thanks,
Kanak

RE: Design for a Helix Monitoring Framework

Posted by Kanak Biscuitwala <ka...@hotmail.com>.

I spent some time stress-testing a Riemann server running alongside a controller. In some pathological cases, the co-located server can really bog down the controller, but it still seems like a logical place to start.

Eventually, it probably makes the most sense to have monitoring servers be Helix participants themselves and then have Helix controllers, participants, and spectators be spectators on the monitoring server cluster. This allows us to scale up our monitoring servers at will and deal with faults automatically. Then, we can shard our alerts and send them to the appropriate destination with consistent hashing. This will also help with pluggable monitoring services. The big thing here is to abstract away the connection code on the Helix side so that we can swap out different configurations at will.
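
To make the consistent-hashing idea concrete, here is a minimal sketch of a hash ring that maps an alert key to one of several monitoring servers. It is only an illustration under stated assumptions: the class name AlertRing, the server names, and the alert keys are all hypothetical, and nothing here is part of Helix or Riemann. The point is that when a server joins or leaves, only the keys owned by that server move.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Collection;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch (not a Helix API): routes alert keys to monitoring servers on a hash ring. */
final class AlertRing {
  // Ring position -> server name. Each server occupies several positions (virtual nodes).
  private final TreeMap<Long, String> ring = new TreeMap<>();
  private final int virtualNodes;

  AlertRing(Collection<String> servers, int virtualNodes) {
    this.virtualNodes = virtualNodes;
    for (String server : servers) {
      addServer(server);
    }
  }

  /** Places the server on the ring at virtualNodes positions. */
  void addServer(String server) {
    for (int i = 0; i < virtualNodes; i++) {
      ring.put(hash(server + "#" + i), server);
    }
  }

  /** Removes the server's positions; keys owned by other servers stay where they are. */
  void removeServer(String server) {
    for (int i = 0; i < virtualNodes; i++) {
      ring.remove(hash(server + "#" + i));
    }
  }

  /** Returns the server at the first ring position at or after the key's hash, wrapping around. */
  String serverFor(String alertKey) {
    if (ring.isEmpty()) {
      throw new IllegalStateException("no monitoring servers on the ring");
    }
    Map.Entry<Long, String> entry = ring.ceilingEntry(hash(alertKey));
    return (entry != null ? entry : ring.firstEntry()).getValue();
  }

  /** Uses the first 8 bytes of an MD5 digest as the ring position. */
  private static long hash(String key) {
    try {
      byte[] digest = MessageDigest.getInstance("MD5").digest(key.getBytes(StandardCharsets.UTF_8));
      long h = 0;
      for (int i = 0; i < 8; i++) {
        h = (h << 8) | (digest[i] & 0xFF);
      }
      return h;
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException(e);
    }
  }
}
```

In a setup like the one described above, the list of live servers would presumably come from spectating on the monitoring cluster, with addServer and removeServer called as servers come and go. That wiring is an assumption and is left out of the sketch.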

I'm planning to do this in stages:
1. Get the basic monitoring framework working with the monitoring server in the same process as the controller
2. Make the monitoring server a Helix participant in its own cluster
3. Adapt the Helix code to spectate on the monitoring cluster

Stress test findings: https://cwiki.apache.org/confluence/display/HELIX/Stress+Testing
Added alternatives section to: https://cwiki.apache.org/confluence/display/HELIX/Helix+Monitoring+Design

Kanak

----------------------------------------
> From: kanak.b@hotmail.com
> To: dev@helix.incubator.apache.org
> Subject: RE: Design for a Helix Monitoring Framework
> Date: Mon, 18 Nov 2013 18:19:54 -0800
>
> This is interesting. Sirona collectors could work for our purpose as well, from what I've read. Much of the current design focuses on preventing the monitoring framework from being a helix-core dependency by creating new modules and registering things with an interface. I think a cool side effect of this is that we could theoretically plug in any client/server monitoring system.
>
> I still like Riemann for the first cut because we've already invested time in experimenting with it and it seems to do the job. Certainly adding Sirona in the future would be cool, too.
>
> Kanak
> ----------------------------------------
>> From: olamy@apache.org
>> Date: Tue, 19 Nov 2013 09:18:50 +1100
>> Subject: Re: Design for a Helix Monitoring Framework
>> To: dev@helix.incubator.apache.org
>>
>> Sounds interesting.
>> FYI I started a new project @apache related to monitoring.
>> See: http://sirona.incubator.apache.org/
>>
>> It's a plugin-based mechanism, so maybe a Helix plugin can be created.
>> Let me know and feel free to start a thread on dev@sirona
>>
>>
>>
>> On 19 November 2013 07:18, Kanak Biscuitwala <ka...@hotmail.com> wrote:
>>> Hi,
>>>
>>> We've sketched out an initial high-level design for a new monitoring framework for Helix. The primary goal is to decrease our time to detect soft failures, but this is really just a design to help propagate statistics around in any Helix-managed system. Any feedback is appreciated.
>>>
>>> The document is available on the wiki: https://cwiki.apache.org/confluence/display/HELIX/Helix+Monitoring+Design
>>>
>>> Thanks,
>>> Kanak
>>
>>
>>
>> --
>> Olivier Lamy
>> Ecetera: http://ecetera.com.au
>> http://twitter.com/olamy | http://linkedin.com/in/olamy

RE: Design for a Helix Monitoring Framework

Posted by Kanak Biscuitwala <ka...@hotmail.com>.

This is interesting. Sirona collectors could work for our purpose as well, from what I've read. Much of the current design focuses on preventing the monitoring framework from being a helix-core dependency by creating new modules and registering things with an interface. I think a cool side effect of this is that we could theoretically plug in any client/server monitoring system.
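
As a rough illustration of the "register things with an interface" idea, the sketch below shows how a core module could publish statistics without depending on any particular monitoring system. All names here (StatReporter, StatReporterRegistry, InMemoryReporter) are hypothetical and are not taken from the design document or from helix-core. A Riemann- or Sirona-backed reporter would live in its own module and implement the same interface.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.CopyOnWriteArrayList;

/** Hypothetical interface the core module would own; it names no monitoring system. */
interface StatReporter {
  void report(String source, Map<String, Double> stats);
}

/** Hypothetical registry: core code publishes stats, optional modules register reporters. */
final class StatReporterRegistry {
  private final List<StatReporter> reporters = new CopyOnWriteArrayList<>();

  void register(StatReporter reporter) {
    reporters.add(reporter);
  }

  /** Hands the stats to every registered reporter. */
  void publish(String source, Map<String, Double> stats) {
    for (StatReporter reporter : reporters) {
      try {
        reporter.report(source, stats);
      } catch (RuntimeException e) {
        // A failing monitoring plugin should not take the caller down; skip it and continue.
      }
    }
  }
}

/** Stand-in for a module-provided reporter; it only records what it receives in memory. */
final class InMemoryReporter implements StatReporter {
  final List<String> seen = new CopyOnWriteArrayList<>();

  @Override
  public void report(String source, Map<String, Double> stats) {
    seen.add(source + " " + stats);
  }
}
```

With this shape, swapping monitoring backends means registering a different StatReporter implementation, and the core module compiles without any monitoring library on its classpath.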

I still like Riemann for the first cut because we've already invested time in experimenting with it and it seems to do the job. Certainly adding Sirona in the future would be cool, too.

Kanak

Re: Design for a Helix Monitoring Framework

Posted by Olivier Lamy <ol...@apache.org>.

Sounds interesting.
FYI I started a new project @apache related to monitoring.
See: http://sirona.incubator.apache.org/

It's a plugin-based mechanism, so maybe a Helix plugin can be created.
Let me know and feel free to start a thread on dev@sirona



On 19 November 2013 07:18, Kanak Biscuitwala <ka...@hotmail.com> wrote:
> Hi,
>
> We've sketched out an initial high-level design for a new monitoring framework for Helix. The primary goal is to decrease our time to detect soft failures, but this is really just a design to help propagate statistics around in any Helix-managed system. Any feedback is appreciated.
>
> The document is available on the wiki: https://cwiki.apache.org/confluence/display/HELIX/Helix+Monitoring+Design
>
> Thanks,
> Kanak



-- 
Olivier Lamy
Ecetera: http://ecetera.com.au
http://twitter.com/olamy | http://linkedin.com/in/olamy