You are viewing a plain text version of this content. The canonical link for it is here.

Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2019/09/11 16:37:46 UTC

[GitHub] [airflow] Eronarn opened a new pull request #3584: [AIRFLOW-249] Refactor the SLA mechanism

Eronarn opened a new pull request #3584: [AIRFLOW-249] Refactor the SLA mechanism
URL: https://github.com/apache/airflow/pull/3584

### JIRA

AIRFLOW-249 Refactor the SLA mechanism

This is a fairly large patch that should and/or may also address:

AIRFLOW-1360 SLA miss goes to all emails in DAG
AIRFLOW-2236 Airflow SLA is triggered for all backfilled tasks
AIRFLOW-557 SLA notification does not work for manually triggered DAGs
AIRFLOW-133 SLAs don't seem to work with schedule_interval=None
AIRFLOW-1013 airflow/jobs.py:manage_slas() exception for @once dag

### Description

At Quantopian we use Airflow to produce artifacts based on the previous day's stock market data. These artifacts are required for us to trade on today's stock market. Therefore, I've been investing time in improving Airflow notifications (such as writing PagerDuty and Slack integrations). My attention has turned to Airflow's SLA system, which has some drawbacks for our use case:

1) Defining SLAs can be awkward because they are relative to the execution date instead of the task start time. There's no way to alert if a task runs for "more than an hour", for any non-trivial DAG. Instead you can only express "more than an hour from execution date". The financial data we use varies in when it arrives, and how long it takes to process (data volume changes frequently). We also run DAGs with mixed UTC and Eastern events, making Airflow SLA definitions depend on time of year.

2) Execution timeouts can capture "more than an hour" but do not serve the same purpose as SLAs. We have tight timelines on long-duration tasks that make retries difficult, so we want to alert an operator while leaving the original task running, rather than failing and then alerting.

3) The way that SLA miss callbacks are defined is not intuitive, as in contrast to other callbacks, they are defined on the DAG rather than on the task. This has lead to a lot of confusion in JIRA/the mailing list; many people think that SLAs are for DAG completion, when in reality it's a DAG-level attribute that handles batched task-level completions. Also, the call signature is poorly defined: for instance, two of the arguments are just strings produced from the other two arguments.

4) SLA miss emails don't include any links back to the Airflow instance (important for us because we run the same DAGs in both staging/production) or the execution date they apply to. When opened, they be hard to read for even a moderately sized DAG because they include a flat list of task instances that are unsorted (neither alpha nor topo).

5) SLA miss emails are sent to every email address associated with the DAG. This can lead to inadvertent paging of users associated with unrelated "forks" in the DAG from where the SLA miss failed.

6) SLA emails are not callbacks, and can't be turned off (other than either removing the SLA or removing the email attribute on the task instance).

This patch attempts to address the above issues by making some of the following changes:

1) The `sla=` parameter is split into:
`expected_start`: Timedelta after execution date, representing when this task must have started by.
`expected_finish`: Timedelta after execution date, representing when this task must have finished by.
`expected_duration`: Timedelta after task start, representing how long this task is expected to run, including all retries.

These parameters are set on a task (or DAG-level default args), and a task can have any combination of them, though there is some basic validation logic to warn you if you try to set an illogical combination. The SlaMiss object stores the type of SLA miss as a new database field, which is a component of the primary key.

There is logic to convert the existing `sla=` parameter to `expected_finish` (as well as a migration), since that's the closest parallel, so it should be relatively backwards compatible.

2) SLA misses are no longer grouped for checks or callbacks. While there have always been independent per-task SlaMiss objects, there was a lot of logic to poll for all SlaMisses that occurred at the same time, and to batch them into a single email.

3) As a consequence of 2), The `sla_miss_callback` is no longer set on the DAG level, which has been confusing. It now has a context-based signature to be consistent with other task callbacks. This change is not backwards compatible for anyone using custom SLA miss callbacks, but should be a fairly straightforward conversion.

4) The SLA miss email is now the default SLA miss callback on tasks. Previously it was an additional non-overrideable feature.

5) The SLA miss email has some improvements:
- Only one SLA miss per email
- SLA-miss-specific title
- Includes a link to the task instance
- Only includes potentially-blocked downstreams
- Sends to a list of "interested subscribers", which is defined as all email addresses on tasks downstream of the task that missed its SLA.
- Additional ASCII art to help distinguish emails.

6) Move the SLA miss code largely out of the Scheduler code: some into models (DAGs manage their own SlaMiss objects), and some into SLA helper functions.

7) Overall, attempt to reduce the complexity and lack of documentation of the SLA miss logic, given the constraint that the new implementation is a larger feature and more lines of code. The previous implementation was stuffed into one overloaded function that is responsible for checking for SLA misses, creating database objects for them, filtering tasks, selecting emails, rendering, and sending. These are now broken into multiple functions, which attempt to be more single-purpose.

### Tests
I have rewritten the existing tests to match this patch; I am open to adding any required additional tests. (The tests probably also want to be moved out of the scheduler suite.)

### Commits
- [x] My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)"

### Documentation
- [x] In case of new functionality, my PR adds documentation that describes how to use it.

(Sort of - I still need to rewrite the high-level, on-website SLA documentation if this gets approved.)

### Code Quality
- [x] Passes `git diff upstream/master -u -- "*.py" | flake8 --diff`

(except for a part I did not touch, and the core alembic migration template generating flake8 errors...)

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

With regards,
Apache Git Services