Posted to commits@dolphinscheduler.apache.org by GitBox <gi...@apache.org> on 2022/05/23 13:41:22 UTC

[GitHub] [dolphinscheduler] iamliangdi opened a new issue, #10211: [Feature][Alert] Alarm availability improvement

iamliangdi opened a new issue, #10211:
URL: https://github.com/apache/dolphinscheduler/issues/10211

   ### Search before asking
   
   - [X] I had searched in the [issues](https://github.com/apache/dolphinscheduler/issues?q=is%3Aissue) and found no similar feature requirement.
   
   
   ### Description
   
   Today I hit a problem in production: if the alarm service itself stops, who will tell me?
   This problem cannot be avoided entirely and no solution is perfect, but how much redundancy to use should be the user's decision, even if they are willing to run 100 servers to make sure they still receive warnings.
   
   ### Use case
   
   Since the other modules can be decentralized, the alarm module should also be able to offer higher availability when users need it.
   However, I checked the code and the alarm sending is not thread-safe. I suggest modifying it to send alarms in a thread-safe way.
   
   ### Related issues
   
   _No response_
   
   ### Are you willing to submit a PR?
   
   - [X] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@dolphinscheduler.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [dolphinscheduler] zhongjiajie commented on issue #10211: [Feature][Alert] Alarm availability improvement

Posted by GitBox <gi...@apache.org>.
zhongjiajie commented on issue #10211:
URL: https://github.com/apache/dolphinscheduler/issues/10211#issuecomment-1140791184

   > Give the power to deploy multiple alert-servers to users, and we are only responsible for ensuring that they will not send duplicate information;
   
   Sounds great, but a single alert-server is enough for most cases. I see you want to submit a PR for it; do you have a plan?




[GitHub] [dolphinscheduler] EricGao888 commented on issue #10211: [Feature][Alert] Alarm availability improvement

Posted by GitBox <gi...@apache.org>.
EricGao888 commented on issue #10211:
URL: https://github.com/apache/dolphinscheduler/issues/10211#issuecomment-1193440780

   Hi @iamliangdi, may I ask whether you have any follow-ups to this issue? I'm working on metrics and will add `alert server` related metrics. See: #11131
   However, in a K8S scenario, HA ability for alert server is necessary as K8S could bring up a new one once the pod of `alert server` is down. WDYT? @SbloodyS @zhongjiajie @iamliangdi




[GitHub] [dolphinscheduler] EricGao888 commented on issue #10211: [Feature][Alert] Alarm availability improvement

Posted by GitBox <gi...@apache.org>.
EricGao888 commented on issue #10211:
URL: https://github.com/apache/dolphinscheduler/issues/10211#issuecomment-1134716372

   > > Hi iamliangdi, thx for opening this issue. It is a good point that ds should have some alert mechanism in case `alert-server` fails itself. May I ask how do u want to improve the availability of `alert-server`? Are u planning to enable users to deploy multiple alert-servers?
   > 
   > hi, Because I just started looking at the code, limited to ability, my idea is
   > 
   > 1. Give the power to deploy multiple alert-servers to users, and we are only responsible for ensuring that they will not send duplicate information;
   > 2. Send email to users in the fatal errors we catch as much as possible;
   > 
   > Because in the process of using it, I found that it stopped serving, but didn't tell me it was leaving
   
   @zhongjiajie @SbloodyS I saw that there are alerting-related tables in the ds metaDB. Just to double-check: if alert-server fails for some reason and is restarted later, will alerts raised during the failure still be stored in the db and sent as soon as alert-server restarts?




[GitHub] [dolphinscheduler] github-actions[bot] commented on issue #10211: [Feature][Alert] Alarm availability improvement

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on issue #10211:
URL: https://github.com/apache/dolphinscheduler/issues/10211#issuecomment-1134696249

   Thank you for your feedback. We have received your issue; please wait patiently for a reply.
   * So that we can understand your request as soon as possible, please provide detailed information, the version, or pictures.
   * If you haven't received a reply for a long time, you can [join our slack](https://join.slack.com/t/asf-dolphinscheduler/shared_invite/zt-omtdhuio-_JISsxYhiVsltmC5h38yfw) and send your question to channel `#troubleshooting`




[GitHub] [dolphinscheduler] EricGao888 commented on issue #10211: [Feature][Alert] Alarm availability improvement

Posted by GitBox <gi...@apache.org>.
EricGao888 commented on issue #10211:
URL: https://github.com/apache/dolphinscheduler/issues/10211#issuecomment-1134703847

   Hi iamliangdi, thx for opening this issue. It is a good point that ds should have some alert mechanism in case `alert-server` fails itself. May I ask how do u want to improve the availability of `alert-server`? Are u planning to enable users to deploy multiple alert-servers?




[GitHub] [dolphinscheduler] SbloodyS commented on issue #10211: [Feature][Alert] Alarm availability improvement

Posted by GitBox <gi...@apache.org>.
SbloodyS commented on issue #10211:
URL: https://github.com/apache/dolphinscheduler/issues/10211#issuecomment-1134724658

   > > > Hi iamliangdi, thx for opening this issue. It is a good point that ds should have some alert mechanism in case `alert-server` fails itself. May I ask how do u want to improve the availability of `alert-server`? Are u planning to enable users to deploy multiple alert-servers?
   > > 
   > > 
   > > hi, Because I just started looking at the code, limited to ability, my idea is
   > > 
   > > 1. Give the power to deploy multiple alert-servers to users, and we are only responsible for ensuring that they will not send duplicate information;
   > > 2. Send email to users in the fatal errors we catch as much as possible;
   > > 
   > > Because in the process of using it, I found that it stopped serving, but didn't tell me it was leaving
   > 
   > @zhongjiajie @SbloodyS I saw there are tables related to alerting in ds metaDB. Just for double-check, if alert-server fails for some reason and restarted later, during the failure, will alerts still get stored in db and sent as soon as alert-server restarts?
   
   Yes. The alert service consumes very few resources, so I recommend the following two measures to ensure production stability:
   1. Deploy the alert server as a single instance.
   2. Use systemd or supervisor to manage the `alert-server` process. @iamliangdi




[GitHub] [dolphinscheduler] iamliangdi commented on issue #10211: [Feature][Alert] Alarm availability improvement

Posted by GitBox <gi...@apache.org>.
iamliangdi commented on issue #10211:
URL: https://github.com/apache/dolphinscheduler/issues/10211#issuecomment-1134710578

   > Hi iamliangdi, thx for opening this issue. It is a good point that ds should have some alert mechanism in case `alert-server` fails itself. May I ask how do u want to improve the availability of `alert-server`? Are u planning to enable users to deploy multiple alert-servers?
   
   Hi, I have only just started looking at the code, so my understanding is limited, but my idea is:
   
   1. Let users deploy multiple alert-servers; we are then only responsible for ensuring that they do not send duplicate messages.
   
   2. Whenever possible, send an email to users when we catch a fatal error.
   
   I ask because, while using it, I found that it had stopped serving without telling me.
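
One way to keep several alert-servers from sending the same message is to have each server atomically claim an alert before sending it. The sketch below is only an illustration of that idea under stated assumptions: the class `AlertClaimSketch`, the `Status` values, and the in-memory map standing in for the shared alert table are all hypothetical and are not taken from the DolphinScheduler code base. In a real deployment the claim would more likely be a conditional update against the database.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch: none of these names come from the DolphinScheduler code base.
final class AlertClaimSketch {
    enum Status { WAITING, CLAIMED, SENT }

    // Stands in for the shared alert table; key = alert id.
    private final Map<Integer, AtomicReference<Status>> alerts = new ConcurrentHashMap<>();
    // Records every delivery so that a duplicate send would be visible.
    final List<String> deliveries = new CopyOnWriteArrayList<>();

    void enqueue(int alertId) {
        alerts.put(alertId, new AtomicReference<>(Status.WAITING));
    }

    // One polling pass by one server: send only the alerts this server manages to claim.
    int pollOnce(String serverName) {
        int sent = 0;
        for (Map.Entry<Integer, AtomicReference<Status>> e : alerts.entrySet()) {
            // Atomic WAITING -> CLAIMED transition; only one caller can win per alert.
            if (e.getValue().compareAndSet(Status.WAITING, Status.CLAIMED)) {
                deliveries.add(serverName + " sent alert " + e.getKey());
                e.getValue().set(Status.SENT);
                sent++;
            }
        }
        return sent;
    }

    // Two servers poll the same 100 alerts concurrently; returns the number of deliveries.
    static int runDemo() {
        AlertClaimSketch store = new AlertClaimSketch();
        for (int id = 1; id <= 100; id++) {
            store.enqueue(id);
        }
        Thread a = new Thread(() -> store.pollOnce("alert-server-a"));
        Thread b = new Thread(() -> store.pollOnce("alert-server-b"));
        a.start();
        b.start();
        try {
            a.join();
            b.join();
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(ex);
        }
        return store.deliveries.size();
    }

    public static void main(String[] args) {
        System.out.println("deliveries: " + runDemo());
    }
}
```

Because the claim is a single compare-and-set, each of the 100 alerts is delivered by exactly one of the two servers, whichever reaches it first. The sketch does not cover a server that crashes between claiming and sending; handling that would need something like a claim timeout, which is outside what this thread discusses.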




[GitHub] [dolphinscheduler] EricGao888 commented on issue #10211: [Feature][Alert] Alarm availability improvement

Posted by GitBox <gi...@apache.org>.
EricGao888 commented on issue #10211:
URL: https://github.com/apache/dolphinscheduler/issues/10211#issuecomment-1134730616

   Also, I think we could include alert-server heartbeats in `DS metrics` so that it can be monitored by an external system.
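
As a rough illustration of the heartbeat idea, the alert server could publish the time of its last heartbeat, and an external monitor could treat the server as unhealthy once that value becomes too old. The following is a minimal sketch under assumptions: `HeartbeatSketch`, `beat`, and `isHealthy` are hypothetical names, and the 30-second threshold is arbitrary. It does not show how DolphinScheduler actually exposes metrics.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: the class and method names are illustrative, not DolphinScheduler APIs.
final class HeartbeatSketch {
    // Epoch milliseconds of the last heartbeat; the value a metrics endpoint would expose.
    private final AtomicLong lastBeatMillis = new AtomicLong();

    // Called by the alert server on every pass of its main loop.
    void beat(Instant now) {
        lastBeatMillis.set(now.toEpochMilli());
    }

    // Run by the external monitor: healthy only if a heartbeat arrived within maxSilence.
    boolean isHealthy(Instant now, Duration maxSilence) {
        long silentFor = now.toEpochMilli() - lastBeatMillis.get();
        return silentFor <= maxSilence.toMillis();
    }

    public static void main(String[] args) {
        HeartbeatSketch hb = new HeartbeatSketch();
        Instant start = Instant.parse("2022-05-23T13:41:22Z");
        hb.beat(start);
        // 10 s after the last beat: still within the 30 s threshold.
        System.out.println(hb.isHealthy(start.plusSeconds(10), Duration.ofSeconds(30)));
        // 90 s after the last beat: the monitor should raise its own alarm.
        System.out.println(hb.isHealthy(start.plusSeconds(90), Duration.ofSeconds(30)));
    }
}
```

The point of the design is that the check runs outside the alert server, so a stopped alert server shows up as a stale heartbeat instead of going unnoticed.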




[GitHub] [dolphinscheduler] SbloodyS commented on issue #10211: [Feature][Alert] Alarm availability improvement

Posted by GitBox <gi...@apache.org>.
SbloodyS commented on issue #10211:
URL: https://github.com/apache/dolphinscheduler/issues/10211#issuecomment-1134751613

   > Also, I think we could include alert-server heartbeats in `DS metrics` so that it can be monitored by external system.
   
   That's a good idea.




[GitHub] [dolphinscheduler] iamliangdi commented on issue #10211: [Feature][Alert] Alarm availability improvement

Posted by GitBox <gi...@apache.org>.
iamliangdi commented on issue #10211:
URL: https://github.com/apache/dolphinscheduler/issues/10211#issuecomment-1134723929

   > > > Hi iamliangdi, thx for opening this issue. It is a good point that ds should have some alert mechanism in case `alert-server` fails itself. May I ask how do u want to improve the availability of `alert-server`? Are u planning to enable users to deploy multiple alert-servers?
   > > 
   > > 
   > > hi, Because I just started looking at the code, limited to ability, my idea is
   > > 
   > > 1. Give the power to deploy multiple alert-servers to users, and we are only responsible for ensuring that they will not send duplicate information;
   > > 2. Send email to users in the fatal errors we catch as much as possible;
   > > 
   > > Because in the process of using it, I found that it stopped serving, but didn't tell me it was leaving
   > 
   > @zhongjiajie @SbloodyS I saw there are tables related to alerting in ds metaDB. Just for double-check, if alert-server fails for some reason and restarted later, during the failure, will alerts still get stored in db and sent as soon as alert-server restarts?
   
   If the alert server is running normally, they will be sent again, because on each cycle it keeps fetching the unsent alerts. But if the alert server stops serving, I cannot learn the health of any service. So I need the alert server to stay healthy as far as possible, or something to tell me when it is unhealthy.
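
The behaviour described here, where alerts are stored first and a loop keeps picking up whatever is still unsent, can be sketched as follows. This is an in-memory stand-in written under assumptions: `UnsentAlertLoopSketch`, `record`, and `sendPending` are hypothetical names, and the real alert table and sender are not shown in this thread.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: an in-memory stand-in for the alert table and the sender loop.
final class UnsentAlertLoopSketch {
    // alert id -> sent flag; insertion order is kept so older alerts go out first.
    private final Map<Integer, Boolean> table = new LinkedHashMap<>();
    private int nextId = 1;

    // Other services record alerts whether or not the alert server is running.
    int record() {
        int id = nextId++;
        table.put(id, false);
        return id;
    }

    // One cycle of the sender: fetch everything still unsent, send it, mark it sent.
    List<Integer> sendPending() {
        List<Integer> sentNow = new ArrayList<>();
        for (Map.Entry<Integer, Boolean> row : table.entrySet()) {
            if (!row.getValue()) {
                row.setValue(true);
                sentNow.add(row.getKey());
            }
        }
        return sentNow;
    }

    public static void main(String[] args) {
        UnsentAlertLoopSketch db = new UnsentAlertLoopSketch();
        db.record();
        System.out.println(db.sendPending()); // alert server running: alert 1 goes out
        db.record();                          // alert server down: rows accumulate
        db.record();
        System.out.println(db.sendPending()); // after restart: the backlog (2 and 3) goes out
    }
}
```

The sketch also shows the gap raised in this comment: while the sender is down nothing is lost, but nothing is delivered either, so the outage itself stays silent unless something else watches the alert server.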




[GitHub] [dolphinscheduler] zhongjiajie commented on issue #10211: [Feature][Alert] Alarm availability improvement

Posted by GitBox <gi...@apache.org>.
zhongjiajie commented on issue #10211:
URL: https://github.com/apache/dolphinscheduler/issues/10211#issuecomment-1229380158

   > Hi @iamliangdi, may I ask whether you have any follow-ups to this issue? I'm working on metrics and will add `alert server` related metrics. see: #11131
   > However, in K8S scenario, HA ability for alert server is necessary as K8S could bring up a new one once the pod of `alert server` is down. WDYT? @SbloodyS @zhongjiajie @iamliangdi
   
   Yeah, I agree for the K8s scenario, but in other deployments we also need some health checks or metrics to tell us.

