Posted to dev@airflow.apache.org by Andrew Gibbs <a....@gmail.com> on 2022/08/04 20:08:52 UTC

[PROPOSAL] I have a fix for the SLA alerts

Hi everyone,

First time post to the dev list ... please be gentle!

I raised a PR to fix SLA alerts (
https://github.com/apache/airflow/pull/25489) but it's not trivial. Jarek
Potiuk asked that I flag it up here where it might get more attention, and
I'm happy to oblige.


*A brief summary of the problem*
From the end-user's perspective, adding SLAs means that the system stops
processing changes to the dag files.

*A brief summary of the cause*
Adding an SLA to a task in a dag means SLA callbacks get raised and passed
back to the DagProcessorManager. The callbacks can be created in
relatively large volumes, enough that the manager's queue never empties.
This in turn causes the manager to stop checking the file system for
changes.
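To illustrate the starvation (a toy model I've sketched for this email, not the actual DagProcessorManager code), here is roughly what happens when callbacks and file-scan requests share one FIFO queue:

```python
from collections import deque

# Toy model of a single shared work queue: SLA callbacks and "rescan the
# dags folder" requests compete in one FIFO, and only one item is
# processed per tick.
def run_manager(events, max_steps):
    queue = deque()
    scans_done = 0
    for step in range(max_steps):
        queue.extend(events(step))   # new SLA callbacks arrive each tick
        if step % 10 == 0:
            queue.append("scan")     # periodic request to rescan dag files
        if queue:
            item = queue.popleft()   # one item processed per tick
            if item == "scan":
                scans_done += 1
    return scans_done, len(queue)

# Two callbacks arrive per tick but only one item is processed, so the
# backlog grows without bound and each "scan" request sinks deeper into
# the queue: only 5 of the 10 scheduled scans ever run in 100 ticks.
scans, backlog = run_manager(lambda step: ["sla_callback"] * 2, 100)
```

The model exaggerates the arrival rate, but the shape of the failure is the same: the rescans aren't lost, they're just delayed further and further behind the callback backlog.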

*A brief summary of the fix*
The changes are confined to the manager.py file. What I have done is:
1. split the manager's queue into two (a standard and a priority queue)
2. track processing of the dags from disk independently of the queue, so
that we'll rescan even if the queue is not empty
3. add a config flag that causes the manager to scan stale dags on disk,
even if there are a lot of priority callbacks.

This means that SLA callbacks are processed and alerts are raised in a
timely fashion, but we continue to periodically scan the file system for
files and process any changes.
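As a rough sketch (illustrative only - this is not the PR diff, and the names are mine), the three changes fit together like this:

```python
from collections import deque

# Sketch of the manager.py changes. Callbacks go into a standard or a
# priority queue, while the dag-folder rescan is driven by its own
# deadline, tracked independently of either queue.
class ManagerSketch:
    def __init__(self, scan_interval):
        self.standard = deque()      # ordinary file-processing requests
        self.priority = deque()      # SLA and other priority callbacks
        self.scan_interval = scan_interval
        self.next_scan = 0           # logical time of the next forced rescan

    def enqueue(self, item, priority=False):
        (self.priority if priority else self.standard).append(item)

    def step(self, now):
        """Do one unit of work; the periodic rescan pre-empts both queues."""
        if now >= self.next_scan:    # rescan fires regardless of backlog
            self.next_scan = now + self.scan_interval
            return "scan"
        if self.priority:
            return self.priority.popleft()
        if self.standard:
            return self.standard.popleft()
        return None

# Even with the priority queue permanently full, the scan still fires
# every scan_interval ticks - that's the behaviour behind changes 2 and 3.
m = ManagerSketch(scan_interval=10)
for t in range(30):
    m.enqueue("sla_callback", priority=True)
work = [m.step(t) for t in range(30)]
```

In 30 ticks the sketch performs 3 rescans (at t=0, 10, 20) and drains 27 callbacks, instead of performing no rescans at all.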

*Other notes*
1. First and foremost, if you're interested, please do have a look at the
PR. I have done my best to document it thoroughly. There are new tests too!
2. The goal here is simply to make it so that adding SLAs doesn't kill the
rest of the system. I haven't changed how they're defined, how the system
raises them, or anything else. It's purely a fix to the queue(s) inside the
manager. It's as low touch as I could make it.
3. I do have a *much* simpler fix (one line change), which works, but isn't
perfect, particularly under certain config settings. This change is more
complicated, but I think solves the problem "properly".

That's it. Thanks for reading!

A

Re: [PROPOSAL] I have a fix for the SLA alerts

Posted by Jarek Potiuk <ja...@potiuk.com>.
Been on holidays and slowly catching up - but I think we really need an AIP,
"Make SLAs better", and a direction for what we want to do overall. Just a
PR to split the queue is not enough; I think we should have a vision of
where we want to go with SLAs. I personally once wanted to get rid of them
(in their current form) - but there were strong voices against it. So I
wonder if someone (maybe you, Andrew) could simply start an AIP - a document
where the general problem with current SLAs is described, outlining our way
out. Short term the PR could be a cool first step, but I think we also need
to see how it fits into the target solution. I think there will be more
comments on the actual outline of a design proposal. I remember the
description in the PR was quite long, but it was mostly focused on this
particular solution. What I think we need is really this:

1) Problem statement (why current SLAs are bad)
2) How we would like to solve them long term
3) What the proposed solutions are - including some alternatives
4) What the pros/cons are

As long as someone presents it as an AIP proposal, it will gain weight.
Often it seems that people's interest is piqued when they see that someone
is serious about the subject, and most comments flow when it is clear that
this **might** be put up for a vote soon (sometimes that's not enough and we
get some comments during voting, which is unfortunate), but writing a doc
draft and announcing it as a soon-to-be-AIP proposal is a good idea.

J.



Re: [PROPOSAL] I have a fix for the SLA alerts

Posted by Andrew Gibbs <an...@gibbs.io>.
Evening all,

It's a Monday, which means it's time to talk about SLAs again!
>>> https://github.com/apache/airflow/pull/25489 <<<
I let this issue lie fallow a bit because of the push to 2.4.0; Jarek
pointed out that people's attention was likely to be elsewhere.

However, now that 2.4.0 is out (congrats all), I'd love to get some
movement on this PR. I patched my local install with this change over a
month ago, and it's been rock solid. SLAs are definitely not perfect (try
backfilling a dag, and just watch the alerts roll in!) but on a day-to-day
basis, it's my experience that this change makes SLAs 100% reliable - they
fire when they're supposed to, and they don't break the rest of the system.

Oh, and they are soooooo useful. Previously I had some external infra that
tried to spot when certain tasks hadn't run. I've been able to fully
decommission that in favour of just letting Airflow tell me if something's
running late. SLAs have actually saved my bacon a couple of times already.
Everyone should have them!

Basically, what Collin M said - the SLA concept is a good one and I think a
lot of people would benefit from it, but don't because it's functionally
broken. This change makes the feature safe to use, and I think that will
start to have an impact. Personally, I expect that once people can start
using it without fear of breaking the system, you'll see a few more PRs
trickle in to start sanding off the rough edges, e.g.
https://github.com/apache/airflow/issues/22532 might get picked up if more
people are using it. The classic virtuous feedback loop, basically.

So, what I'm looking for here is:
1. A review + sign-off of the PR (hashtagstatingtheobvious) - if you have
questions about the change (how / why / etc.), I will happily answer them.
2. Recommendations / requests for things to add to the documentation (e.g.
I have a callback that sends Slack messages I'd happily whack in there) - I
definitely need someone to tell me which documentation to touch; I've only
got a newsfragment in there so far.
3. Anything else I might have missed.

Thanks,
Andrew


Re: [PROPOSAL] I have a fix for the SLA alerts

Posted by Collin McNulty <co...@astronomer.io.INVALID>.
The issue Andrew raised here, that large SLA volumes can totally gunk up
Airflow as a whole, is the main reason I do not make a habit of
recommending the use of SLAs. Fixing this would at least make the feature
serviceable and safe, which would be a big step up.

Collin



Collin McNulty
Lead Airflow Engineer

Email: collin@astronomer.io
Time zone: US Central (CST UTC-6 / CDT UTC-5)


<https://www.astronomer.io/>

Re: [PROPOSAL] I have a fix for the SLA alerts

Posted by Jarek Potiuk <ja...@potiuk.com>.
I think generally that SLAs have been broken for quite some time, and while
we are continuously saying "we need to fix it", there has been no effort to
do so. Not sure if this is because it is unimportant. But maybe the feature
actually is useful in a number of cases and not too broken. And maybe we can
fix it in some incremental steps rather than totally rewriting it.

In this context - this change is supposed to fix one of the problems with
SLAs: if there are too many of those callbacks (which can be generated if
you have some failures), it might actually get to the point that it blocks
the dag file processor from doing the "real" stuff.
Adding a priority queue is a good idea, I think, and it's a good first
step. There could be other ideas in place (de-duplicating the callbacks is,
I think, already implemented), maybe simply dropping some of the callbacks
eventually. But I think this looks promising to start with.
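For what it's worth, the de-duplication idea could be as simple as the following sketch (illustrative only - not the actual Airflow implementation, and the keying on dag file path is my assumption):

```python
from collections import deque

# FIFO queue that drops a callback if an equivalent one (same dag file)
# is already waiting, so duplicates can't grow the backlog.
class DedupQueue:
    def __init__(self):
        self._queue = deque()
        self._pending = set()     # dag files currently waiting in the queue

    def put(self, dag_file):
        if dag_file in self._pending:
            return False          # duplicate dropped
        self._pending.add(dag_file)
        self._queue.append(dag_file)
        return True

    def get(self):
        dag_file = self._queue.popleft()
        self._pending.discard(dag_file)  # allow this file to be queued again
        return dag_file
```

A file can be re-queued as soon as its previous callback has been taken off the queue, so nothing is lost - repeated misses for the same dag just collapse into one pending entry.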

WDYT, everyone? Is there anyone else planning to do any serious work on SLA
callbacks? Should we take a close look at this one? Mateusz Henc, I think
this might also have some positive impact on finalizing AIP-43?

J.

