Posted to dev@airflow.apache.org by Sai Phanindhra <ph...@gmail.com> on 2018/11/14 15:11:27 UTC

Customised alerts/notifications and enhancements to alerting/notifications on Airflow

Hello Airflow committers and maintainers,
       I came across SLAs in Airflow. It's a very good feature to begin
with, but I feel a few enhancements could be made. These enhancements are
not limited to SLAs; they are gaps I have noticed while using Airflow.
I'm listing a few of them here.

   1. SLA alerts to Slack channel(s) along with emails.
   2. Alerts at the DAG level (start, success and failure).
   3. Custom callbacks at the DAG level, just like `on_failure_callback`,
   `on_retry_callback` and `on_success_callback`.
   4. Alerts if a task completes before a minimum run time. (This is a
   rare case, but there are a few long-running jobs that we know for sure
   run for at least a few hours; if they exit before that, something is
   wrong. We need warning alerts for such cases.)
   5. Default/global alert config (default emails to receive all alerts
   and/or a default Slack channel to send alerts to).

Some of these might already have been solved, or someone may be working on
them. Please share your thoughts and add anything I missed to this
list.

Re: Customised alerts/notifications and enhancements to alerting/notifications on Airflow

Posted by James Meickle <jm...@quantopian.com.INVALID>.
FYI, I am on the Airflow Slack, but I mostly check it only on weekends.

Here is a gist with my implementation of a Slack callback, which attaches
icons/buttons/emoji/etc.:
https://gist.github.com/Eronarn/99408c0e5b0dd964487a5eea64b34f6d
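
The gist itself is not reproduced here. As a rough illustration of the idea, and not the code from the gist, a minimal Slack failure callback might look like the sketch below. It assumes a Slack incoming-webhook URL and relies on the callback context passed by Airflow containing a `task_instance` with `dag_id`, `task_id` and `log_url`. The names `build_slack_message` and `make_slack_failure_callback` are hypothetical.

```python
import json
import urllib.request


def build_slack_message(context):
    """Build a Slack webhook payload from an Airflow callback context dict."""
    ti = context["task_instance"]
    text = (
        ":red_circle: Task failed\n"
        "*DAG*: {dag_id}\n"
        "*Task*: {task_id}\n"
        "*Execution date*: {when}\n"
        "*Log*: {log_url}"
    ).format(
        dag_id=ti.dag_id,
        task_id=ti.task_id,
        when=context.get("execution_date"),
        log_url=ti.log_url,
    )
    return {"text": text}


def _post_json(url, payload):
    """POST a JSON payload; Slack incoming webhooks accept {"text": ...}."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(request, timeout=10)


def make_slack_failure_callback(webhook_url, post=_post_json):
    """Return a callable usable as `on_failure_callback`.

    `post` is injectable (hypothetical design choice) so the callback
    can be exercised without network access.
    """
    def callback(context):
        post(webhook_url, build_slack_message(context))

    return callback
```

One way to use this would be to set `"on_failure_callback": make_slack_failure_callback(url)` in a DAG's `default_args`, so that every task in the DAG reports failures to the same channel.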


Re: Customised alerts/notifications and enhancements to alerting/notifications on Airflow

Posted by Sai Phanindhra <ph...@gmail.com>.
Thanks, James, for the input.
For the problems I described above, I built hacky solutions, such as adding a
`slack_start_notification_operation` at the beginning, a
`slack_end_notification_operator` at the end, and a
`slack_failed_notification_operation` that runs when an upstream task fails.
This addresses the first three issues/feature requests I mentioned. For
point 5, I maintain lists of emails and join all the required ones at the
DAG level. Still, this is manual work that has to be repeated every
time a new DAG is onboarded in Airflow. I think these are common problems
that many Airflow users/developers face.
@James <jm...@quantopian.com> let's catch up sometime on Slack/Hangouts to
discuss how these enhancements can be done.



-- 
Sai Phanindhra,
Ph: +91 9043258999

Re: Customised alerts/notifications and enhancements to alerting/notifications on Airflow

Posted by James Meickle <jm...@quantopian.com.INVALID>.
As the author of the first linked PR, I think your points are good. Here is
my attempt to address them:

1: It is possible to do this today if you write a Slack callback. I would
be happy to share my code for this if you're having trouble integrating
Slack. That being said, it would be great if Airflow provided several
"default" callbacks for common platforms like Slack and PagerDuty.

2/3: Yes, Airflow should add callbacks for the DAG lifecycle, too. DAG
"SLAs", on the other hand, I am not sure would provide any additional value,
and they have a high chance of being misused.

4: That's a great idea. My PR would make adding this very easy, because it
redefines the "SLAMiss" object as having a "type" of SLA miss. This would
involve adding a new type to the enum, and some logic to check when to
create an SLA miss of this type.
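
To make the idea concrete, here is a minimal sketch of what "a new type in the enum plus a check" could look like. It is not taken from the PR: the enum `SlaMissType`, its members, and `check_min_runtime` are all hypothetical names chosen for illustration.

```python
import enum
from datetime import datetime, timedelta


class SlaMissType(enum.Enum):
    """Hypothetical kinds of SLA miss; not the PR's actual enum."""
    TASK_LATE_FINISH = "task_late_finish"
    TASK_EARLY_FINISH = "task_early_finish"  # finished suspiciously fast


def check_min_runtime(start, end, min_runtime):
    """Return TASK_EARLY_FINISH if the task ran for less than `min_runtime`.

    Returns None when the run time is at or above the minimum, meaning
    no SLA miss of this type should be recorded.
    """
    if end - start < min_runtime:
        return SlaMissType.TASK_EARLY_FINISH
    return None
```

With a check like this, a job expected to run for hours that exits after five minutes would yield an "early finish" miss, which could then be routed through the same alerting path as ordinary SLA misses.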

5: My interpretation is that you mean an email address that always gets
notified, regardless of any more specific users that a task says it should
email. (So not a default value to "emails", but instead an additional value
that is always added.) I think this makes a lot of sense and would be easy
to add to email. It would not be even remotely possible for a Slack
integration right now, since there's no unified code for that.
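
The "additional value that is always added" interpretation can be sketched as a small merge step. This is only an illustration under assumptions: `resolve_recipients` and the `always_notify` setting are hypothetical names, and the sketch treats a string as a single address rather than parsing delimited lists.

```python
def resolve_recipients(task_emails, always_notify):
    """Merge task-specific recipients with addresses that are always notified.

    `task_emails` may be None, a single address, or a list of addresses.
    `always_notify` is a hypothetical global setting (a list of addresses).
    Order is preserved and duplicates are dropped.
    """
    if isinstance(task_emails, str):
        task_emails = [task_emails]
    merged = []
    for address in list(task_emails or []) + list(always_notify or []):
        if address not in merged:
            merged.append(address)
    return merged
```

The point of the design is that the global addresses are appended rather than used as a fallback, so a task that names its own recipients cannot accidentally silence the always-notified ones.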

My preferred way of addressing this would be to get my PR merged as a
starting point, which isolates a lot of this functionality from the
scheduler code. Then have a broader AIP created, or possibly a pair of
them: switching to a more general evented system for Airflow model
lifecycles, and implementing pluggable notifiers (right now a lot of the
email functionality is hardcoded) the same way that there is already
pluggable logging.
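
As a rough sketch of what "pluggable notifiers" might mean in practice, the following shows a small notifier interface and a registry that dispatches lifecycle events to whichever notifiers are configured. None of these classes exist in Airflow; `Notifier`, `InMemoryNotifier` and `NotifierRegistry` are hypothetical names used only to illustrate the shape of such a design.

```python
from abc import ABC, abstractmethod


class Notifier(ABC):
    """Hypothetical interface that an email, Slack or PagerDuty backend would implement."""

    @abstractmethod
    def notify(self, event, message):
        """Deliver `message` for a lifecycle `event` such as 'task_failed'."""


class InMemoryNotifier(Notifier):
    """Stand-in backend that records notifications instead of sending them."""

    def __init__(self):
        self.sent = []

    def notify(self, event, message):
        self.sent.append((event, message))


class NotifierRegistry:
    """Maps configured names to notifier instances and fans events out to them."""

    def __init__(self):
        self._notifiers = {}

    def register(self, name, notifier):
        self._notifiers[name] = notifier

    def dispatch(self, event, message, only=None):
        """Send to all registered notifiers, or just the names listed in `only`."""
        names = list(only) if only is not None else list(self._notifiers)
        for name in names:
            self._notifiers[name].notify(event, message)
        return names
```

Under this kind of design, the scheduler would only emit events, and adding a new alerting channel would mean registering another `Notifier` rather than touching scheduler code.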

From an SRE perspective, two other pain points we run into: the statsd
integration is subpar (at least when we ingest it in Datadog it's hard to
actually alert on), and there's no /health or /healthz endpoints for the
scheduler and worker so it's hard to know if they are healthy in a
programmatic way.
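
For illustration, a health endpoint of the kind described could be as small as the sketch below. It assumes the process records a timestamp for its last heartbeat and treats it as unhealthy once that heartbeat is older than a threshold. The names `health_status` and `make_handler`, the `/healthz` path, and the 30-second default are assumptions, not existing Airflow behaviour.

```python
import json
from datetime import datetime, timedelta, timezone
from http.server import BaseHTTPRequestHandler


def health_status(last_heartbeat, now, max_age=timedelta(seconds=30)):
    """Return (HTTP status code, JSON-serialisable body) for a heartbeat check."""
    healthy = last_heartbeat is not None and now - last_heartbeat <= max_age
    code = 200 if healthy else 503
    return code, {"status": "healthy" if healthy else "unhealthy"}


def make_handler(get_last_heartbeat):
    """Build a request handler class that serves GET /healthz.

    `get_last_heartbeat` is a hypothetical callable returning a
    timezone-aware datetime, or None if no heartbeat has been seen.
    """
    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/healthz":
                self.send_error(404)
                return
            code, body = health_status(
                get_last_heartbeat(), datetime.now(timezone.utc)
            )
            payload = json.dumps(body).encode("utf-8")
            self.send_response(code)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)

    return HealthHandler
```

The handler class can be served with `http.server.HTTPServer`, and a load balancer or orchestrator probe can then rely on the status code alone: 200 while heartbeats are fresh, 503 otherwise.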


Re: Customised alerts/notifications and enhancements to alerting/notifications on Airflow

Posted by Niels Zeilemaker <ni...@zeilemaker.nl>.
I had a go once to introduce something similar, but never got it merged.
Maybe you can use it as an inspiration.

https://github.com/apache/incubator-airflow/pull/2412

Niels


Re: Customised alerts/notifications and enhancements to alerting/notifications on Airflow

Posted by Sai Phanindhra <ph...@gmail.com>.
The above-mentioned PR addresses issues/bugs in the current functionality. I
want to add more alerting channels, including for SLA alerts.


-- 
Sai Phanindhra,
Ph: +91 9043258999

Re: Customised alerts/notifications and enhancements to alerting/notifications on Airflow

Posted by airflowuser <ai...@protonmail.com.INVALID>.
There is a pending PR to refactor the SLA:
https://github.com/apache/incubator-airflow/pull/3584

But it requires more reviews from committers.

