Posted to dev@airflow.apache.org by James Meickle <jm...@quantopian.com> on 2018/05/02 19:00:08 UTC

Improving Airflow SLAs

At Quantopian we use Airflow to produce artifacts based on the previous
day's stock market data. These artifacts are required for us to trade on
today's stock market. Therefore, I've been investing time in improving
Airflow notifications (such as writing PagerDuty and Slack integrations).
My attention has turned to Airflow's SLA system, which has some drawbacks
for our use case:

1) Airflow SLAs are not skip-aware, so a task that has an SLA but is
skipped for this execution date will still trigger emails/callbacks. This
is a huge problem for us because we run almost no tasks on weekends (since
the stock market isn't open).

2) Defining SLAs can be awkward because they are relative to the execution
date instead of the task start time. For any non-trivial DAG, there's no way
to alert if a task runs for "more than an hour"; you can only express "more
than an hour from the execution date". The financial data we use varies in
when it arrives and in how long it takes to process (data volume changes
frequently); we also have tight timelines that make retries difficult, so we
want to alert an operator while leaving the task running, rather than
failing and then alerting.

3) SLA miss emails don't have a subject line containing the instance URL
(important for us because we run the same DAGs in both staging and
production) or the execution date they apply to. When opened, they can be
hard to read for even a moderately sized DAG, because they include a flat
list of task instances that is unsorted (neither alphabetically nor
topologically). They also lack any links back to the Airflow instance.

4) SLA emails are not callbacks, and can't be turned off (other than by
removing the SLA or removing the email attribute on the task instance). The
way SLA miss callbacks are defined is not intuitive: in contrast to all
other callbacks, they are DAG-level rather than task-level. Also, the call
signature is poorly defined; for instance, two of the arguments are just
strings produced from the other two arguments.

I have some thoughts about ways to fix these issues:

1) I just consider this one a bug. If a task instance is skipped, that was
intentional, and it should not trigger any alerts.

2) I think that the `sla=` parameter should be split into something like
this:

`expected_start`: Timedelta after the execution date by which this task
must have started.
`expected_finish`: Timedelta after the execution date by which this task
must have finished.
`expected_duration`: Timedelta after the task starts, representing how long
it is expected to run, including all retries.

This would give better operator control over SLAs, particularly for tasks
deeper in larger DAGs where exact ordering may be hard to predict.
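
To make that concrete, here's a sketch of what declaring these could look
like. The `expected_*` keyword arguments are part of this proposal, not an
existing Airflow feature; today only `sla=` exists.

    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.dummy_operator import DummyOperator

    dag = DAG("market_data", start_date=datetime(2018, 5, 1),
              schedule_interval="0 8 * * 1-5")

    # Hypothetical keyword arguments: `expected_*` come from the proposal
    # above, not from any released Airflow version.
    process_prices = DummyOperator(
        task_id="process_prices",
        expected_start=timedelta(hours=2),     # start within 2h of execution date
        expected_finish=timedelta(hours=6),    # finish within 6h of execution date
        expected_duration=timedelta(hours=1),  # once started, run no longer than 1h
        dag=dag,
    )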

3) The emails should be improved to be more operator-friendly, and take
into account that someone may get a callback for a DAG they don't know very
well, or be paged by this notification.

4.1) All Airflow callbacks should support a list, rather than requiring a
single function. (I've written a wrapper that does this, but it would be
better for Airflow to just handle this itself.)
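
As a sketch of the kind of wrapper I mean (`notify_slack` and
`notify_pagerduty` stand in for whatever alerting functions you already
have):

    def as_callback_list(callbacks):
        """Wrap several callbacks into the single callable Airflow expects."""
        def _run_all(context):
            for callback in callbacks:
                callback(context)
        return _run_all

    # e.g. on_failure_callback=as_callback_list([notify_slack, notify_pagerduty])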

4.2) SLA miss callbacks should be task callbacks that receive context, like
all the other callbacks. Having a DAG figure out which tasks have missed
SLAs collectively is fine, but getting SLA failures in a batched callback
doesn't really make much sense. Per-task callbacks can be fired
individually within a batch of failures detected at the same time.
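
As a sketch of the proposed (not current) behaviour, reusing the imports and
`dag` from the earlier sketch; `page_oncall` is a placeholder for whatever
paging function you use:

    # Proposal only: today `sla_miss_callback` is a DAG-level argument with a
    # different signature, and `expected_duration` does not exist.
    def sla_missed(context):
        ti = context["task_instance"]
        page_oncall("%s.%s missed its SLA for %s"
                    % (ti.dag_id, ti.task_id, context["execution_date"]))

    slow_step = DummyOperator(
        task_id="slow_step",
        expected_duration=timedelta(hours=1),
        sla_miss_callback=sla_missed,
        dag=dag,
    )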

4.3) SLA emails should be the default SLA miss callback function, rather
than being hardcoded.

Also, overall, the SLA miss logic is very complicated. It's stuffed into
one overloaded function that is responsible for checking for SLA misses,
creating database objects for them, filtering tasks, selecting emails,
rendering, and sending. Refactoring it would be a good maintainability win.

I am already implementing some of the above in a private branch, but I'd be
curious to hear community feedback as to which of these suggestions might
be desirable upstream. I could have this ready for Airflow 2.0 if there is
interest beyond my own use case.

Re: Improving Airflow SLAs

Posted by James Meickle <jm...@quantopian.com>.
That's a very interesting thought, and I have definitely been bitten
multiple times by bugs related to how closely tied SLAs are to the
scheduler:
https://issues.apache.org/jira/browse/AIRFLOW-2178?jql=project%20%3D%20AIRFLOW%20AND%20text%20~%20smtp

However, I'm not convinced that adding a new process for a monitoring
service would actually be much better architecturally than just improving
the scheduler codebase. To have high confidence you'd likely want an
external, non-Airflow "is this running" check anyway (for example, we
alert if there are no "heartbeating scheduler" log lines).
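
A rough sketch of that kind of external check; the log path and staleness
threshold are assumptions about our own deployment, not anything Airflow
provides:

    #!/usr/bin/env python
    # External liveness check, run by the monitoring system (not by Airflow).
    import os
    import sys
    import time

    LOG_PATH = "/var/log/airflow/scheduler.log"  # assumed log location
    MAX_SILENCE_SECONDS = 300

    age = time.time() - os.path.getmtime(LOG_PATH)
    if age > MAX_SILENCE_SECONDS:
        print("no recent scheduler activity for %.0f seconds" % age)
        sys.exit(2)  # nonzero exit -> monitoring system pages someone
    sys.exit(0)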




Re: Improving Airflow SLAs

Posted by Ananth Durai <va...@gmail.com>.
Since we are talking about the SLA implementation: the current SLA miss
implementation is part of the scheduler code, so in cases where the
scheduler maxes out its processes or isn't running for some reason, we will
miss all the SLA alerts. It would be worth decoupling SLA alerting from the
scheduler path and running it as a separate process.
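
To make the idea concrete, a rough sketch of such a process (the two
callables are placeholders supplied by the deployment, not Airflow APIs; the
point is only that none of this needs to live inside the scheduler):

    import time

    def sla_watchdog(find_overdue_task_instances, notify, poll_interval=60):
        """Poll the metadata DB for overdue task instances and alert on them.

        Both callables are supplied by the deployment: one queries the
        metadata database, the other sends email / PagerDuty / Slack.
        """
        while True:
            for task_instance in find_overdue_task_instances():
                notify(task_instance)
            time.sleep(poll_interval)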


Regards,
Ananth.P,







Re: Improving Airflow SLAs

Posted by David Capwell <dc...@gmail.com>.
We use SLAs as well; they work great for some DAGs and are painful for
others.

We rely on sensors to validate that the data is ready before we run, and
each DAG waits on its sensors for a different amount of time (one DAG waits
for 8 hours, since it expects data at the start of the day but tends to get
it 8 hours later). We also have some nested DAGs that are about 10 tasks
deep.

In these two cases SLA warnings come very late, since the semantics we get
are based on DAG completion time; what we really want is what you were
talking about: expected execution times.

SLAs also trigger on backfills and manual reruns of tasks.

I see this as a critical feature for production monitoring, so I would love
to see it improved.


Re: Improving Airflow SLAs

Posted by James Meickle <jm...@quantopian.com>.
Hm, I'm not sure I understand your question. To my mind this use case
already isn't possible, because while the callback function is currently a
DAG attribute, the SLA value has always been set on specific tasks within
the DAG. Perhaps this gets obscured by the current SLA miss email reporting
downstream tasks, but it's an individual task's SLA (or more than one
task's, if they got batched) that triggers that email, not the DAG's.

I'm not sure it's worth adding DAG-level SLAs too since that would likely
result in confusion with task-level SLAs. I feel like for most use cases,
when the DAGRun starts/finishes is more of an implementation detail, and
the actual concern is for specific tasks (even if it's only the first or
the last tasks in your pipeline). Open to discussion on that though!
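
For example, with the proposed parameters, a whole-pipeline deadline could
be approximated by putting `expected_finish` on the terminal task (again,
`expected_finish` is from this proposal, not a released Airflow parameter):

    from datetime import timedelta
    from airflow.operators.dummy_operator import DummyOperator

    # `dag` is whatever DAG object the pipeline already defines.
    finalize = DummyOperator(
        task_id="finalize",
        expected_finish=timedelta(hours=6),  # entire DAG done within 6h of execution date
        dag=dag,
    )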


Re: Improving Airflow SLAs

Posted by Ace Haidrey <ac...@gmail.com>.
Hi James,
I haven't read everything or looked at the entire PR yet, but one thing I
wanted to ask: you state that you move the SLA miss callback to the task
level. In our org, and I imagine in others, we would like the callback to
stay at the DAG level, so we can see whether the entire pipeline is taking
longer than X hours, not just whether each task is taking more than X
hours. Is that still feasible?

Re: Improving Airflow SLAs

Posted by James Meickle <jm...@quantopian.com>.
Just giving this a bump; it's a pretty major rework so I'd love to know
whether this effort is likely to be accepted if I bring it to a PR-able
state, before I invest more time.


Re: Improving Airflow SLAs

Posted by James Meickle <jm...@quantopian.com>.
Hi folks,

I've created a branch off of v1-10-test; the diff can be found here:
https://github.com/apache/incubator-airflow/compare/v1-10-test...Eronarn:sla_improvements

As a recap, this work is expected to do the following:

- split the "sla" parameter into three independent SLAs: expected duration,
expected start, and expected finish
- move the SLA miss callback to be a task-level attribute rather than
DAG-level (removing a lot of the "batching" functionality)
- convert the SLA miss email to the default SLA miss callback
- add a "type" to SLA misses, which will be part of the primary key, and
can be checked against in the callback to respond appropriately to the type
of SLA that was missed.
- don't send SLA misses for skipped tasks, or for backfill jobs
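
For the "type" point above, a sketch of how a callback might branch on it.
The `sla_miss` context key and the type strings come from this work in
progress, not from released Airflow:

    def sla_miss_alert(context):
        # Replace the prints with real alerting (PagerDuty, Slack, etc.).
        miss = context["sla_miss"]
        if miss.type == "expected_duration":
            print("task is still running but exceeded its expected duration")
        elif miss.type == "expected_start":
            print("task did not start on time (late inputs?)")
        else:  # "expected_finish"
            print("task did not finish by its deadline")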

Before I polish up the remaining TODO functions and write a migration and
tests, I'd appreciate feedback from the maintainers as to whether this
seems to be on the right track, design-wise. (Note that it's definitely not
going to pass tests right now; I am having significant problems getting
Airflow's test suite running locally so I'm not even attempting at the
moment.)

Thanks,

-James M.

On Wed, May 9, 2018 at 12:43 PM, James Meickle <jm...@quantopian.com>
wrote:

> Hi all,
>
> Since the response so far has been positive or neutral, I intend to submit
> one or more PRs targeting 2.0 (I think that some parts will be separable
> from a larger SLA refactor). I intend to address at least the following
> JIRA issues:
>
> https://issues.apache.org/jira/browse/AIRFLOW-2236
> https://issues.apache.org/jira/browse/AIRFLOW-1472
> https://issues.apache.org/jira/browse/AIRFLOW-1360
> https://issues.apache.org/jira/browse/AIRFLOW-557
> https://issues.apache.org/jira/browse/AIRFLOW-133
>
> Regards,
>
> -James M.
>
>
>
> On Thu, May 3, 2018 at 12:13 PM, Maxime Beauchemin <
> maximebeauchemin@gmail.com> wrote:
>
>> About de-coupling the SLA management process, I've had conversations in
>> the
>> direction of renaming the scheduler to "supervisor" to reflect the fact
>> that it's not just scheduling processes, it does a lot more tasks than
>> just
>> that, SLA management being one of them.
>>
>> I still think the default should be to require a single supervisor that
>> would do all the "supervision" work though. I'm generally against
>> requiring
>> more types of nodes on the cluster. But perhaps the supervisor could have
>> switches to be started in modes where it would only do a subset of its
>> tasks, so that people can run multiple specialized supervisor nodes if
>> they
>> want to.
>>
>> For the record, I was thinking that renaming the scheduler to supervisor
>> would likely happen as we re-write it to enable multiple concurrent
>> supervisor processes. It turns out that parallelizing the scheduler hasn't
>> been as critical as I thought it would be originally, especially with the
>> current multi-process scheduler. Sounds like the community is getting a
>> lot
>> of mileage out of this current multi-process scheduler.
>>
>> Max
>>
>> On Thu, May 3, 2018 at 7:31 AM, Jiening Wen <ji...@optiver.com>
>> wrote:
>>
>> > I would love to see this proposal get implemented in Airflow.
>> > In our case a duration-based SLA makes much more sense, and I ended up
>> > adding a decorator to the execute method in our custom operators.
>> >
>> > Best regards,
>> > Jiening

Re: Improving Airflow SLAs

Posted by James Meickle <jm...@quantopian.com>.
Hi all,

Since the response so far has been positive or neutral, I intend to submit
one or more PRs targeting 2.0 (I think some parts can be split out from the
larger SLA refactor). They will address at least the following JIRA issues:

https://issues.apache.org/jira/browse/AIRFLOW-2236
https://issues.apache.org/jira/browse/AIRFLOW-1472
https://issues.apache.org/jira/browse/AIRFLOW-1360
https://issues.apache.org/jira/browse/AIRFLOW-557
https://issues.apache.org/jira/browse/AIRFLOW-133
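
For illustration only, here is a minimal sketch of how the split SLA
parameters proposed in the original message might look on a task, using
the `expected_start` / `expected_finish` / `expected_duration` names from
that proposal. None of these keyword arguments exist in Airflow today
(current releases would warn about or reject them); this is a sketch of
the intended usage, not a working API.

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG(
    dag_id="market_data_pipeline",
    start_date=datetime(2018, 5, 1),
    schedule_interval="0 6 * * 1-5",  # weekdays only
)

load_prices = BashOperator(
    task_id="load_prices",
    bash_command="load_prices.sh",
    dag=dag,
    # Proposed: must have started within 30 minutes of the execution date.
    expected_start=timedelta(minutes=30),
    # Proposed: must have finished within 2 hours of the execution date.
    expected_finish=timedelta(hours=2),
    # Proposed: should not run for longer than 45 minutes, retries included.
    expected_duration=timedelta(minutes=45),
)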

Regards,

-James M.



Re: Improving Airflow SLAs

Posted by Maxime Beauchemin <ma...@gmail.com>.
About de-coupling the SLA management process: I've had conversations about
renaming the scheduler to "supervisor" to reflect the fact that it doesn't
just schedule tasks; it does a lot more than that, SLA management being one
of them.

I still think the default should be a single supervisor that does all of
the "supervision" work, though; I'm generally against requiring more types
of nodes on the cluster. But perhaps the supervisor could have switches to
start it in modes where it only performs a subset of its duties, so that
people who want to can run multiple specialized supervisor nodes.
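
Purely as a sketch of that switch idea (nothing here is an existing
Airflow command or flag; the duty names, functions, and the `--only`
option are invented for illustration), a single supervisor entrypoint
could default to doing everything while letting larger installs start
specialized processes:

import argparse
import time


def schedule_tasks():
    # placeholder for the scheduling loop
    print("scheduling task instances...")


def check_slas():
    # placeholder for SLA checks and miss callbacks
    print("checking SLAs and firing callbacks...")


DUTIES = {
    "scheduling": schedule_tasks,
    "sla": check_slas,
}


def main():
    parser = argparse.ArgumentParser(description="hypothetical supervisor")
    parser.add_argument(
        "--only",
        nargs="*",
        choices=sorted(DUTIES),
        default=sorted(DUTIES),  # default: one process does all duties
        help="restrict this supervisor process to a subset of its duties",
    )
    args = parser.parse_args()

    while True:
        for duty in args.only:
            DUTIES[duty]()
        time.sleep(5)


if __name__ == "__main__":
    main()

In this sketch, starting the process with no arguments covers every duty,
while starting it with "--only sla" would give a dedicated SLA-supervision
node.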

For the record, I was thinking that renaming the scheduler to supervisor
would likely happen as we rewrite it to enable multiple concurrent
supervisor processes. It turns out that parallelizing the scheduler hasn't
been as critical as I originally thought it would be, especially with the
current multi-process scheduler; it sounds like the community is getting a
lot of mileage out of it.

Max

RE: Improving Airflow SLAs

Posted by Jiening Wen <ji...@Optiver.com>.
I would love to see this proposal get implemented in Airflow.
In our case a duration-based SLA makes much more sense, and I ended up adding a decorator to the execute method of our custom operators.
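
A minimal sketch of what such a decorator could look like (this is
illustrative only, not the code referenced above; notify_operators() is a
stand-in for whatever alerting hook is in use). It fires an alert from a
background timer thread if execute() runs past an expected duration, while
leaving the task itself running:

import functools
import threading
from datetime import timedelta


def notify_operators(message):
    # stand-in for PagerDuty/Slack/email alerting
    print(message)


def duration_sla(expected_duration):
    """Alert if the wrapped execute() runs longer than expected_duration."""
    def decorator(execute):
        @functools.wraps(execute)
        def wrapper(self, context):
            def alert():
                notify_operators(
                    "%s exceeded its expected duration of %s"
                    % (self.task_id, expected_duration)
                )

            # Alert if the task is still running past the deadline,
            # without interrupting the task itself.
            timer = threading.Timer(expected_duration.total_seconds(), alert)
            timer.start()
            try:
                return execute(self, context)
            finally:
                timer.cancel()  # no alert if execute() finished in time

        return wrapper
    return decorator


# Example usage on a custom operator (self.task_id comes from BaseOperator):
#
#     class ProcessPricesOperator(BaseOperator):
#         @duration_sla(timedelta(hours=1))
#         def execute(self, context):
#             ...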

Best regards,
Jiening
