Posted to dev@dolphinscheduler.apache.org by "felix@thinkingdata.cn" <fe...@thinkingdata.cn> on 2020/08/22 11:29:37 UTC

About the high availability implementation of the Alert service

Hi all,

I would like to point out that the Alert module is not currently designed for high availability. When multiple alert services are started, the same alert is sent repeatedly, and when the alert service goes down, DS alerting stops working.
So far, I have come up with two architectures that prevent duplicate alerts while making the Alert module highly available.

1. The first is a master-slave relationship between the alert services, established through ZK (ZooKeeper). Only the master node is responsible for sending alerts. If the master node goes down, a new master is elected and continues to provide the alert service.
2. The second is a decentralised design in which all alert services work simultaneously and coordinate through exclusive locks, so that each alert is sent only once.
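
To make the two options easier to compare, here is a minimal in-memory sketch. It is not DolphinScheduler code: Election, ClaimStore, run_master_slave and run_decentralised are hypothetical names, the election is simulated with sequence numbers rather than a real ZK client, and the exclusive lock is simulated with a threading.Lock where a real deployment would presumably use a ZK lock or a database row.

```python
import threading


class Election:
    """Hypothetical stand-in for a ZK election: lowest live sequence number leads."""

    def __init__(self):
        self._next_seq = 0
        self._members = {}  # node name -> sequence number

    def join(self, node):
        self._members[node] = self._next_seq
        self._next_seq += 1

    def leave(self, node):
        # In ZK this would be an ephemeral node vanishing when the session dies.
        self._members.pop(node, None)

    def leader(self):
        if not self._members:
            return None
        return min(self._members, key=self._members.get)


def run_master_slave(alert_ids, nodes, fail_after):
    """Design 1: only the current master sends; a new master takes over on failure."""
    election = Election()
    for node in nodes:
        election.join(node)
    sent = []
    for i, alert_id in enumerate(alert_ids):
        if i == fail_after:
            election.leave(election.leader())  # simulate the master going down
        sent.append((alert_id, election.leader()))
    return sent


class ClaimStore:
    """Hypothetical shared store: the first node to claim an alert owns it."""

    def __init__(self):
        self._lock = threading.Lock()  # plays the role of the exclusive lock
        self._owner = {}  # alert id -> node name

    def try_claim(self, alert_id, node):
        with self._lock:
            if alert_id in self._owner:
                return False
            self._owner[alert_id] = node
            return True


def run_decentralised(alert_ids, nodes):
    """Design 2: every node works at once; a claim decides who sends each alert."""
    store = ClaimStore()
    sent = []
    sent_lock = threading.Lock()

    def worker(node):
        for alert_id in alert_ids:
            if store.try_claim(alert_id, node):
                with sent_lock:
                    sent.append((alert_id, node))

    threads = [threading.Thread(target=worker, args=(n,)) for n in nodes]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sent
```

In both functions every alert id appears exactly once in the result. In the first, the sender changes from one node to the next after the simulated failure; in the second, the alerts are spread across whichever nodes win each claim. The sketch ignores what happens if a node crashes between claiming an alert and sending it, which a real design would need to handle.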

If anyone has a better plan, let's discuss it together.

Thx

felix@thinkingdata.cn

Re: About the high availability implementation of the Alert service

Posted by JUN GAO <ga...@gmail.com>.

I think the first one is better.

-- 
DolphinScheduler(Incubator)  PPMC
Jun Gao 高俊
gaojun2048@gmail.com