Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2021/09/17 07:24:21 UTC

[GitHub] [airflow] thejens opened a new issue #18317: Better Backfill User Experience

thejens opened a new issue #18317:
URL: https://github.com/apache/airflow/issues/18317

### Description

I don't believe backfilling (of data) is well handled in Airflow at the moment.

I believe the current backfill CLI command should have a UI component on the webserver, not least because many deployments of Airflow don't expose a CLI to users, only to admins.

I would also like it to handle the following cases:

1. Backfill a DAG for execution dates where it has already run.
- This is essentially the same as clearing all DAG runs between those dates. The current UI forces me to clear the runs manually: I can toggle "Future" or "Past", but there is nothing like a "Range Between".

2. Backfill a task for execution dates where it has already run.
- This is essentially the same as above, except that instead of clearing the whole DAG it would clear a single task, and potentially everything downstream of that task. A common use case is when a new task is added to an existing DAG and you want to trigger it for historical dates. Currently the new task will have no executions, but the DAG it was added to will have successful runs for those dates.

3. The above, including execution dates where it has not run.
- In both of the above cases, specifying a start_date earlier than the existing DAG runs should insert the missing DAG runs and execute them. It should also insert any DAG runs missing between the dates where it has already run, in case those have gone missing.
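
To make the three cases concrete, here is a minimal sketch of the planning step they share: given a requested date range and the dates that already have DAG runs, decide which runs would be cleared and which would be inserted. The function name `plan_backfill` and the daily step are assumptions for illustration only; this is not existing Airflow code.

```python
from datetime import date, timedelta


def plan_backfill(start, end, existing_runs, step=timedelta(days=1)):
    """Split the inclusive range [start, end] into dates to clear and dates to insert.

    Hypothetical helper: dates that already have a DAG run would be cleared
    (cases 1 and 2), dates without one would get a new DAG run (case 3).
    """
    existing = set(existing_runs)
    to_clear, to_insert = [], []
    current = start
    while current <= end:
        if current in existing:
            to_clear.append(current)
        else:
            to_insert.append(current)
        current += step
    return to_clear, to_insert


# Runs exist for Sept 3, 4 and 6; the user asks for Sept 1 through Sept 7.
existing = [date(2021, 9, 3), date(2021, 9, 4), date(2021, 9, 6)]
to_clear, to_insert = plan_backfill(date(2021, 9, 1), date(2021, 9, 7), existing)
```

Keeping the two lists separate mirrors the request above: clearing and inserting are different operations, but a "Range Between" option would need both.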

I'd be happy to provide some of the functionality in a PR, but I am not a skilled frontend developer.

### Use case/motivation

Altering a DAG or task is common, for instance when some business logic has been improved or a bug has been found and fixed. Being able to easily regenerate the affected datasets is then important.

This covers both the case where you want to re-run existing DAG executions and the case where you want to insert new DAG executions for historical dates whose runs may have been cleaned up to save space in the DB.

### Related issues

I am certain this has been raised and discussed in the past. It is the number one feature I miss from working with Luigi - where I could easily re-trigger historical task-runs.

### Are you willing to submit a PR?

- [X] Yes I am willing to submit a PR!

### Code of Conduct

- [X] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [airflow] kimyen edited a comment on issue #18317: Better Backfill User Experience

Posted by GitBox <gi...@apache.org>.

kimyen edited a comment on issue #18317:
URL: https://github.com/apache/airflow/issues/18317#issuecomment-923415688


   We have implemented 2 versions of backfill tools for our users. I want to share them here and suggest another version that may be more general to add to Airflow.
   
   ### Backfill Queue
   
   In this version, we were using CeleryExecutor.
   
   #### User experience
   
   Users go to a UI page to trigger a backfill. They submit a form specifying the DAG, the list of tasks, the date range, and some flags (run backwards, mark success, ...).
   
   After the submission is accepted, the user goes to another page to watch the status of the backfill (queued, running with % of success, and a list of backfills "aborted" due to unrecoverable errors or backfill deadlock). From this page, the author of a backfill can also abort a backfill that is queued or running, or retry a backfill from where it failed.
   
   #### Extra components added
   
   - Redis queues to hold the queued backfill submissions, running backfill chunks, and aborted backfills.
   - Backfill UI plugin which renders the form and the status page.
   - Backfill API which handles reads and writes to the Redis queues, used by the UI component.
   - Backfill "worker" which dequeues and runs the backfill (via Python) with timeout and retry.
   
   #### Disadvantages
   
   - Only runs 1 backfill at a time, so many backfills get stuck in the queue
   - When a backfill is aborted, all running DAG runs remain in the `running` state and still require manual action
   
   ### Backfill on demand
   
   In this version, we use the KubernetesExecutor.
   
   #### User experience
   
   The user issues a backfill (in our case, in the form of a chatops command, which is an HTTP request to the chatops server. The chatops server has access to our K8s cluster and brings up a pod that runs the backfill.)
   
   The user can get logs and abort the backfill via different chatops commands (chatops server APIs).
   
   #### Extra components added
   
   - K8s backfill template wrapped by `if backfill enabled` (defaults to `False`)
   - chatops server (which we already have to handle deploying DAG changes)
   - Cron job that cleans up completed backfill pods (success or fail)
   
   #### Disadvantages
   
   - When a backfill is aborted, all ongoing DAG runs remain in the `running` state and still require manual action
   
   ### Proposed solution
   
   #### User experience
   
   The user goes to the DAG page and triggers a backfill with a date range and other backfill flags.
   
   #### Extra components
   
   - K8s backfill template as described above. The UI would use this template to bring up a new pod that runs the backfill based on the user's input
   - UI pages that allow the user to select the date range and flags
   - UI pages that show the status of the backfill or error messages from the backfill pod
   - Cron job that cleans up completed backfill pods (success or fail)
   
   
   **_I wanted to share our ideas and hope to converge with Airflow's implementation so that we do not need to maintain our own patch. We are happy to see this issue being assigned. Please let me know if there is more information that I can share to help push this further. Otherwise, we will be happy with whichever implementation Airflow chooses to support a better backfill experience._**


[GitHub] [airflow] potiuk edited a comment on issue #18317: Better Backfill User Experience

Posted by GitBox <gi...@apache.org>.

potiuk edited a comment on issue #18317:
URL: https://github.com/apache/airflow/issues/18317#issuecomment-921840479

Just one comment on that (and a bit of a warning), and possibly an explanation for your surprise and disbelief, @thejens (which probably comes from not seeing the full scope of the task and the impact it has on the distributed Airflow architecture).

The UI backfill requires a bit more than a "simple implementation". This has been discussed several times on the devlist, and the problem here is not the UI but the "control plane". When you use the CLI, the admin user fully controls and manages the terminal, all the errors, and the potentially long-running process that the backfill might be. If you backfill a lot of data, it can take a lot of time, and backfill generally works by running the historical runs sequentially.

Currently, there is no "long-running" process in the web UI. All the webserver runs are gunicorn processes, which are restarted periodically, and none of the worker processes survives across a page refresh. The webserver is stateless and keeps all its state in the database, so it can (and will) be restarted at any time.

Backfill is an entirely different thing. It sometimes has to run for hours and actively monitor the backfilled DAGs/tasks, react to failures, etc. So running it from the webserver is not the best idea. Ideally there should be another component, in the scheduler or a separate one like the triggerer (coming in 2.2), to run the backfill command, and the UI should at most be used to trigger it and display the status. Taking into account that backfill is an "afterthought", not something that is or should be done on a regular basis, having a separate component to serve that case (when there is a CLI for ad-hoc operations) is not the highest priority.
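
The split described above (the UI only records a request and shows its status, while a separate long-running component does the work) could look roughly like the sketch below. Everything in it is hypothetical: the `backfill_request` table, the function names, and the use of SQLite are illustration only and not part of Airflow, and the runner is not safe for several concurrent runners because it does no row locking.

```python
import sqlite3


def init_db(conn):
    """Create the hypothetical table that holds backfill requests."""
    conn.execute(
        "CREATE TABLE backfill_request ("
        "id INTEGER PRIMARY KEY, dag_id TEXT, start_date TEXT, end_date TEXT, "
        "state TEXT DEFAULT 'queued')"
    )


def request_backfill(conn, dag_id, start_date, end_date):
    """Webserver side: store the request and return immediately."""
    cur = conn.execute(
        "INSERT INTO backfill_request (dag_id, start_date, end_date) VALUES (?, ?, ?)",
        (dag_id, start_date, end_date),
    )
    conn.commit()
    return cur.lastrowid


def get_state(conn, request_id):
    """Webserver side: read the current state for the status page."""
    row = conn.execute(
        "SELECT state FROM backfill_request WHERE id = ?", (request_id,)
    ).fetchone()
    return row[0]


def run_next(conn, run_backfill):
    """Runner side: execute the oldest queued request, if there is one."""
    row = conn.execute(
        "SELECT id, dag_id, start_date, end_date FROM backfill_request "
        "WHERE state = 'queued' ORDER BY id LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    request_id, dag_id, start_date, end_date = row
    conn.execute("UPDATE backfill_request SET state = 'running' WHERE id = ?", (request_id,))
    conn.commit()
    try:
        # This call may take hours, which is why it lives outside the webserver.
        run_backfill(dag_id, start_date, end_date)
        state = "success"
    except Exception:
        state = "failed"
    conn.execute("UPDATE backfill_request SET state = ? WHERE id = ?", (state, request_id))
    conn.commit()
    return request_id, state
```

Because the only shared state is a database row, the webserver can be restarted at any time without losing track of a running backfill.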

So in short, this task is more of a backend/architecture change than a UI one, and I think it's quite a complex piece, especially if you want to make sure that it works with multiple schedulers, for example.

And yeah, I think it's a useful one, though IMHO it's not "critical", and weighing the implementation complexity against its "value" (given that you have the CLI backfill), it's not at all surprising to me that we do not have it yet.

[GitHub] [airflow] potiuk edited a comment on issue #18317: Better Backfill User Experience

Posted by GitBox <gi...@apache.org>.

potiuk edited a comment on issue #18317:
URL: https://github.com/apache/airflow/issues/18317#issuecomment-924093846


   > I don't see the need for a dedicated backfill process to run, the scheduler could take care of that I believe, if tasks and dags are idempotent they don't even need to care about execution order, if order matters I guess `depends_on_past` should be set on the tasks(?) and the scheduler should handle it(?)
   
   Not currently, as far as I understand how the scheduler works. The scheduler is currently DAG based, not individual-task based. It looks at the DAGs and task dependencies for the "future" runs, then schedules and executes them. There is no way (as I understand how the scheduler runs) to get it to start, monitor, send for execution, and oversee to completion selected tasks from a selected DAG from the past. The current architecture is that the scheduler only looks ahead at the DAGs (possibly starting from the past if the DAG has never been run), determines which tasks should be run next, and sends them to the executors to execute, but there is no past scheduling for selected tasks. The database queries, the scheduler loop, and the selection of which tasks to run next and when are heavily optimized for that use case, and you would not be able to use them for re-running tasks without a pretty much complete overhaul.
   
   But maybe I am wrong and do not understand well enough how the scheduler works. I am happy to be corrected if I am wrong here, and would love to hear from others who understand better how the scheduler works.
   
   From what I know, this will likely change in the future, when the scheduler becomes more "task based" (this is planned and will likely be implemented in 2.3 or 2.4). Once this is done, the behaviour you describe will be possible, but it is quite a big effort that changes the behaviour of the scheduler as well as allowing DAG versioning. This is yet another reason why implementing backfill now is basically a lost effort, as it will have to be re-implemented. So IMHO either we implement it quickly now as a "tactical" solution, with the backfill process managed separately from the scheduler, with limited effort and reusing code that others have developed (see @kimyen), or we wait until the task-based scheduler becomes a reality and reconsider it then.
   
   I understand it's important for you to run backfill, but the "afterthought" for me is that this is something you have to trigger and oversee manually anyway, and it is usually managed and run by a very small number of users who have special permissions and access, not something that is needed by all the people who write and observe the DAGs. The audience here is far smaller, and this is yet another justification that the CLI is "good enough" for now, I think.
   
   
   


[GitHub] [airflow] thejens commented on issue #18317: Better Backfill User Experience

Posted by GitBox <gi...@apache.org>.

thejens commented on issue #18317:
URL: https://github.com/apache/airflow/issues/18317#issuecomment-924643771


   Hi, I guess the fact that the manual for how to backfill is 4 pages long when printed speaks to the need for a better UX.


[GitHub] [airflow] kimyen commented on issue #18317: Better Backfill User Experience

Posted by GitBox <gi...@apache.org>.

kimyen commented on issue #18317:
URL: https://github.com/apache/airflow/issues/18317#issuecomment-924377251

> What I'm after is a way to insert multiple dag-runs for historical dates in bulk from the UI, possibly with some tasks already marked as complete/skipped, as well as clearing tasks/dagruns between certain dates.

@thejens without knowing the explicit use case, seeing all the things you listed here prompts me to share this doc I wrote. I think there are already multiple ways to "backfill" using the scheduler. See Manual backfill > Airflow built in options.

*Note that the "backfill request view" in the doc below belongs to the first backfill tool I mentioned above.*

# How to backfill your DAG

## Background

There are multiple ways to backfill a DAG in Airflow. We will attempt to describe when to use each option.

### Use case 1: New DAG

A new DAG is created on April 4th, 2020 and we want the DAG to start collecting data since March 1st, 2020.
To achieve this, while writing the DAG definition, we can set `catchup=True` on the DAG and `"start_date": datetime(2020, 3, 1)` in the DAG's `default_args`.

When the code is deployed to production, the backfill (from March 1st to current time) is automatically started by the scheduler.
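
As a rough illustration of what the scheduler would create in this use case, the sketch below lists the logical dates a daily DAG would catch up on. It is a simplified model under stated assumptions: a fixed `timedelta` interval, a run becoming eligible only once its interval has fully elapsed, and no handling of cron expressions, time zones, or `max_active_runs`. The function name `catchup_logical_dates` is hypothetical and not an Airflow API.

```python
from datetime import datetime, timedelta


def catchup_logical_dates(start_date, now, interval=timedelta(days=1)):
    """Return the logical dates whose schedule interval has fully elapsed by `now`."""
    dates = []
    current = start_date
    # A run for `current` covers [current, current + interval) and is only
    # eligible once that whole interval lies in the past.
    while current + interval <= now:
        dates.append(current)
        current += interval
    return dates


# DAG deployed on April 4th, 2020 with start_date March 1st, 2020.
runs = catchup_logical_dates(datetime(2020, 3, 1), datetime(2020, 4, 4))
print(len(runs), runs[0].date(), runs[-1].date())
```

The last logical date is the day before the deployment date, because the run for a given day only starts after that day has ended.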

### Use case 2: Extend DAG runs further in the past

An existing DAG has DAG runs starting from March 1st, 2020. We want to extend it to January 1st, 2020. We can achieve this by:
- ensuring that the DAG has `catchup=True`, and
- changing the start date to January 1st, 2020

When the code is deployed to production, the backfill (from January 1st to March 1st) is automatically started by the scheduler.
**If there are any successful DAG runs after the start date, Airflow is not going to `catchup`. See [Start from Dag Runs view](#start-from-dag-runs-view) to delete the successful DAG run and trigger the catchup process.**

This can also be achieved by manually backfilling the DAG from January 1st to March 1st. See [manual backfill section](#manual-backfill).

### Use case 3: DAG logic change

An existing DAG was used to calculate some metrics. However, the calculations need to be updated, and all past successful DAG runs need to be rerun to update the resulting data for those days.
For example, the change was deployed to production on May 1st, 2020. The May 1st DAG run is then scheduled to run and uses the new DAG logic. However, all prior DAG runs from January 1st, 2020 to April 30th, 2020 need to be re-run.

In this case, a manual backfill needs to be triggered. See [manual backfill section](#manual-backfill).
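
For reference, a manual backfill of this period through the CLI might be assembled as in the sketch below. It assumes the Airflow 2.x command `airflow dags backfill` with its `--start-date`, `--end-date`, and `--reset-dagruns` options; check `airflow dags backfill --help` for your version. The helper `backfill_command` and the DAG id `metrics_dag` are made up for illustration.

```python
import shlex
from datetime import date


def backfill_command(dag_id, start, end, reset_dagruns=True):
    """Build the argv for an Airflow 2.x CLI backfill (hypothetical helper)."""
    argv = [
        "airflow", "dags", "backfill",
        "--start-date", start.isoformat(),
        "--end-date", end.isoformat(),
    ]
    if reset_dagruns:
        # Ask the backfill to re-run dates that already have DAG runs.
        argv.append("--reset-dagruns")
    argv.append(dag_id)
    return argv


# Re-run everything from January 1st to April 30th, 2020 with the new logic.
cmd = backfill_command("metrics_dag", date(2020, 1, 1), date(2020, 4, 30))
print(shlex.join(cmd))
```

Building the command as a list rather than a string makes it straightforward to hand to `subprocess.run` without shell quoting issues.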

### Use case 4: New Task

A new task is added to an existing DAG. Regardless of whether your DAG has `catchup=True`, since the existing DAG runs have been completed, the scheduler will not automatically trigger backfill runs for the new task.

In this case, a manual backfill needs to be triggered. See [manual backfill section](#manual-backfill).

## Manual backfill

### Understand backfill and scheduler

[This page](https://airflow.apache.org/docs/stable/scheduler.html) describes how the Airflow scheduler, catchup, backfill, and external triggers work. We highly recommend that you understand these concepts before performing a manual backfill.

### Things to check before starting a manual backfill

Answering these questions will also help you choose the most suitable option for performing the backfill.

- What: Backfill the entire DAG or a specific task(s)
- When: Time range you want to run the backfill
- Pre-condition: Are there dependencies for your DAG/task? If so, have the dependencies been met or the pre-conditions been satisfied?
- How:
  - Are there existing DAG runs for the period you want to backfill? If so, do you want to re-run or skip them?
  - If you want to backfill a specific task, can the upstream task(s) be ignored?
- Impact:
  - What data does this backfill change?
  - Would it result in duplicated data?

### Airflow built in options

Airflow has multiple built in options to trigger a backfill manually.

#### Start from Tree view

The Tree view option is the best fit if you only want to backfill a handful of specific tasks.

**Only use this option if the task's dependencies have been met.**

Go to your DAG's tree view:

- Click on the task you want to backfill
- On the line with the "Run" button, click "Ignore Task State"
- Click "Run"

#### Start from Task Instance view

The Task Instance view option is the best fit if you want to backfill more than a handful of specific tasks (when clicking on each task and running them manually would take too much time).

**Only use this option if:**
- your DAG is configured with `catchup=True`,
- the tasks' dependencies have been met,
- the backfill period is covered by the DAG's `start_date` - `end_date` range,
- there are existing task runs for the task(s) you want to backfill.

This option is accomplished by clearing out the task instances for the task run(s) you want to backfill. The scheduler will schedule new task runs to fill in the ones that have been cleared out.

Let's imagine we want to backfill `fill_meu_v2_retention` from `key_metrics_cube` DAG between `2019-10-01` and `2019-10-10`. Here are the steps to carry out this option:

* Go to `Browse` -> `Task Instances` and find the tasks you want to backfill.

* In this example, the target backfill task is `fill_meu_v2_retention`. However, we also need to clear the `drop_meu_v2_retention` task to make sure data is not duplicated. Select all the task runs for `fill_meu_v2_retention` and `drop_meu_v2_retention` during the backfill period, and click on `With selected` -> `clear`

![Screen Shot 2019-10-18 at 11 43 31 AM](https://user-images.githubusercontent.com/11540582/67084130-a52ea300-f19c-11e9-993d-13e03f89a65c.png)

* Airflow is not going to `catchup` if there are already completed `DAG Runs`, so we need to clear those to trigger the `catchup` process. See [Start from Dag Runs view](#start-from-dag-runs-view).

You should start seeing tasks running shortly.
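
The same clearing step can also be expressed through the CLI instead of the Task Instances page. The sketch below builds such a command for the example above, assuming the Airflow 2.x command `airflow tasks clear` with its `--start-date`, `--end-date`, `--task-regex`, and `--yes` options; verify them with `airflow tasks clear --help` on your version. The helper `clear_command` is hypothetical.

```python
import re


def clear_command(dag_id, task_ids, start, end):
    """Build the argv that clears the given tasks between two dates (hypothetical helper)."""
    # Anchor the pattern so that only the listed task ids match,
    # not other tasks that merely contain one of the names.
    task_regex = "^(" + "|".join(re.escape(task_id) for task_id in task_ids) + ")$"
    return [
        "airflow", "tasks", "clear",
        "--start-date", start,
        "--end-date", end,
        "--task-regex", task_regex,
        "--yes",  # skip the interactive confirmation prompt
        dag_id,
    ]


cmd = clear_command(
    "key_metrics_cube",
    ["fill_meu_v2_retention", "drop_meu_v2_retention"],
    "2019-10-01",
    "2019-10-10",
)
```

Both tasks are cleared together for the same reason as in the UI steps: clearing only the fill task would risk duplicated data.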

#### Start from Dag Runs view

The DAG Runs view is the best fit to re-run DAG(s)/task(s) that already have DAG runs.

**Only use this option if:**
- your DAG is configured with `catchup=True`,
- the tasks' dependencies have been met,
- the backfill period is covered by the DAG's `start_date` - `end_date` range,
- the task runs for the task you want to backfill do not exist, or have been deleted using the instruction [here](#start-from-task-instance-view). If not, please see [Start from Task Instance view](#start-from-task-instance-view)

In order to do this, you need to:
- pause the DAG
- go to `Browse` -> `DAG Runs`
- `Add Filter` and filter by `Dag Id`
- Once you've found all the DAG Runs within your backfill period, select all of them and delete them.
- `Unpause` your DAG to trigger the `catchup` process.

Once you've deleted the DAG runs, these DAG runs will disappear on the Tree view. However, any task instances for those DAG runs (which existed before you deleted the DAG runs) will still be there. The scheduler will automatically schedule new DAG runs and only run the tasks that have not been completed.

### Backfill Request view

The [backfill request view](https://airflow.githubapp.com/admin/backfillrequest/) is built on top of the Airflow CLI and is the best fit for backfilling a long period of time.

**Only use this option if:**
- The backfill period will not be worked on by the scheduler:
  - The backfill period is not within the DAG's `start_date` - `end_date` range; or
  - There are existing DAG runs for every DAG run in the backfill period; or
  - The DAG has `catchup=False`; or
  - The DAG is paused.

Use the Backfill request UI to submit a backfill request. Backfill requests are first come first served. You can watch the position of your request in the queue by looking at the [Status tab](https://airflow.githubapp.com/admin/backfillrequest#status).
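
The conditions listed above can be read as a single check, sketched below. This is one interpretation, not code from the backfill tool: "not within the range" is taken to mean that every requested date falls outside the DAG's `start_date` - `end_date` range, and the names `DagState` and `scheduler_will_skip` are invented for the example.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional, Set


@dataclass
class DagState:
    """Hypothetical snapshot of the DAG properties the checklist refers to."""
    start_date: date
    end_date: Optional[date] = None
    catchup: bool = True
    is_paused: bool = False
    existing_run_dates: Set[date] = field(default_factory=set)


def scheduler_will_skip(dag, backfill_dates):
    """True if the scheduler would leave the whole backfill period alone."""
    def in_range(day):
        return dag.start_date <= day and (dag.end_date is None or day <= dag.end_date)

    outside_range = all(not in_range(day) for day in backfill_dates)
    all_runs_exist = all(day in dag.existing_run_dates for day in backfill_dates)
    return outside_range or all_runs_exist or not dag.catchup or dag.is_paused


# A period before the DAG's start_date is never picked up by the scheduler.
dag = DagState(start_date=date(2020, 3, 1))
print(scheduler_will_skip(dag, [date(2020, 1, 1), date(2020, 1, 2)]))
```

If the check is false, the scheduler may create runs for the same dates, which is the situation the list above is meant to rule out.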

[GitHub] [airflow] potiuk closed issue #18317: Better Backfill User Experience

Posted by GitBox <gi...@apache.org>.

potiuk closed issue #18317:
URL: https://github.com/apache/airflow/issues/18317


   



[GitHub] [airflow] ephraimbuddy commented on issue #18317: Better Backfill User Experience

Posted by GitBox <gi...@apache.org>.

ephraimbuddy commented on issue #18317:
URL: https://github.com/apache/airflow/issues/18317#issuecomment-921729212


   @thejens I have assigned you to this ticket



[GitHub] [airflow] thejens edited a comment on issue #18317: Better Backfill User Experience

Posted by GitBox <gi...@apache.org>.

thejens edited a comment on issue #18317:
URL: https://github.com/apache/airflow/issues/18317#issuecomment-923758407


   It seems backfilling comes with some baggage in terms of definitions.
   
   What I'm after is a way to insert multiple dag-runs for historical dates in bulk from the UI, possibly with some tasks already marked as complete/skipped, as well as clearing tasks/dagruns between certain dates. 
   
   I don't see the need for a dedicated backfill process; I believe the scheduler could take care of that. If tasks and DAGs are idempotent, they don't even need to care about execution order. If order matters, I guess `depends_on_past` should be set on the tasks(?) and the scheduler should handle it(?)
   
   I also don't consider backfill an afterthought. We often instrument new DAGs and want to execute them for historical dates to create historical data.
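
The "range between" behaviour asked for here comes down to splitting a date range into runs that must be created and runs that must be cleared. A rough sketch, assuming a daily schedule and a hypothetical helper name (`plan_range_backfill` is not an Airflow function):

```python
from datetime import date, timedelta
from typing import Iterable, List, Set, Tuple


def plan_range_backfill(
    start: date, end: date, existing_runs: Iterable[date]
) -> Tuple[List[date], List[date]]:
    """Split an inclusive daily date range into (runs to create, runs to clear)."""
    existing: Set[date] = set(existing_runs)
    to_create: List[date] = []
    to_clear: List[date] = []
    day = start
    while day <= end:
        # Dates with a run get cleared; dates without one get a new run.
        (to_clear if day in existing else to_create).append(day)
        day += timedelta(days=1)
    return to_create, to_clear
```

With both lists in hand, the scheduler could in principle do the rest, which is the point made above; how the runs would actually be inserted or cleared is left open here.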



[GitHub] [airflow] potiuk edited a comment on issue #18317: Better Backfill User Experience

Posted by GitBox <gi...@apache.org>.

potiuk edited a comment on issue #18317:
URL: https://github.com/apache/airflow/issues/18317#issuecomment-921840479

Backfill is an entirely different thing. It sometimes has to run for hours, actively monitor the backfill process, react to failures, etc. So running that from the webserver is not the best idea - ideally there should be another component in the scheduler, or a separate component like the triggerer (coming in 2.2), to run the backfill command, and the UI should at most be used to trigger it and display the status. Taking into account that backfill is an "afterthought" - not something that is or should be done on a regular basis - having a separate component to serve that case (where there is a CLI for ad-hoc operations) is not the highest priority.

So in short, this task is more of a backend/architecture change than a UI one, and I think it's quite a complex piece, especially if you want to make sure that it works with multiple schedulers, for example.


[GitHub] [airflow] potiuk edited a comment on issue #18317: Better Backfill User Experience

Posted by GitBox <gi...@apache.org>.

potiuk edited a comment on issue #18317:
URL: https://github.com/apache/airflow/issues/18317#issuecomment-921840479

The UI backfill requires a bit more than a "simple implementation". This has been discussed several times on the devlist, and the problem here is not the UI but the "control plane". When you use the CLI, the admin user fully controls and manages the terminal, all the errors, and the potentially long-running process that the backfill might be. If you backfill a lot of data, it can take a lot of time, and backfill generally works by running the historical runs sequentially.

Backfill is an entirely different thing. It sometimes has to run for hours, actively monitor the backfill process, react to failures, etc. So running that from the webserver is not the best idea - ideally there should be another component in the scheduler, or a separate component like the triggerer, to run the backfill command, and the UI should at most be used to trigger it and display the status. Taking into account that backfill is an "afterthought" - not something that is or should be done on a regular basis - having a separate component to serve that case (where there is a CLI for ad-hoc operations) is not the highest priority.

So in short, this task is more of a backend/architecture change than a UI one, and I think it's quite a complex piece, especially if you want to make sure that it works with multiple schedulers, for example.


[GitHub] [airflow] kimyen edited a comment on issue #18317: Better Backfill User Experience

Posted by GitBox <gi...@apache.org>.

kimyen edited a comment on issue #18317:
URL: https://github.com/apache/airflow/issues/18317#issuecomment-923415688


   We have implemented 2 versions of backfill tools for our users. I want to share them here and suggest another version that may be more general to add to Airflow.
   
   ### Backfill Queue
   
   In this version, we were using CeleryExecutor.
   
   #### User experience
   
   Users go to a UI page to trigger a backfill. They submit a form specifying the DAG, the list of tasks, the date range, and some flags (run backwards, mark success, ...)
   
   After the submission is accepted, the user goes to another page to watch the status of the backfill (queued, running with % of success, and a list of backfills "aborted" due to unrecoverable errors or backfill deadlock). From this page, the author of a backfill can also abort a backfill that is queued or running, or retry a backfill from where it failed.
   
   #### Extra components added
   
   - Redis queues to hold the queued backfill submissions, running backfill chunks, and aborted backfills.
   - Backfill UI plugins which render the form and the status page.
   - Backfill API which handles reads and writes to the Redis queues, used by the UI component.
   - Backfill "worker" which dequeues and runs the backfill (via Python) with timeout and retry.
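
The worker loop implied by these components might look roughly like the sketch below. It is an illustration only: an in-memory `deque` stands in for the Redis queues, `run_chunk` is a hypothetical callable that performs one backfill, and the timeout handling mentioned above is left out.

```python
from collections import deque
from typing import Callable, Deque, Dict, List


def run_backfill_queue(
    queued: Deque[str],
    run_chunk: Callable[[str], None],
    max_retries: int = 2,
) -> Dict[str, List[str]]:
    """Process backfill requests one at a time, first come first served."""
    result: Dict[str, List[str]] = {"succeeded": [], "aborted": []}
    while queued:
        request_id = queued.popleft()
        for _ in range(max_retries + 1):
            try:
                run_chunk(request_id)
            except Exception:
                continue  # try again until the attempts are used up
            result["succeeded"].append(request_id)
            break
        else:
            # Every attempt failed: park the request in the aborted list.
            result["aborted"].append(request_id)
    return result
```

Because the loop handles a single request at a time, it also shows why later submissions wait behind a long-running backfill.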
   
   #### Disadvantages
   
   - Only runs 1 backfill at a time, so many backfills get stuck in the queue
   - When a backfill is aborted, all running DAG runs remain in the `running` state and still require manual action
   
   ### Backfill on demand
   
   In this version, we use the KubernetesExecutor.
   
   #### User experience
   
   The user issues a backfill. In our case, this takes the form of a chatops command, which is an HTTP request to the chatops server. The chatops server has access to our K8s cluster and brings up a pod that runs the backfill.
   
   The user can get logs and abort the backfill via different chatops commands (chatops server APIs).
   
   #### Extra components added
   
   - K8s Backfill template wrapped by `if backfill enabled` (defaults to `False`)
   - chatops server (which we already have to handle deploying DAG changes)
   
   #### Disadvantages
   
   - When a backfill is aborted, all ongoing DAG runs remain in the `running` state and still require manual action
   
   ### Proposed solution
   
   #### User experience
   
   The user goes to the DAG page and triggers a backfill with a date range and other backfill flags
   
   #### Extra components
   
   - K8s Backfill template as described above. The UI would use this template to bring up a new pod that runs the backfill based on the user's input
   - UI pages that allow the user to select the date range and flags
   - UI pages that show the status of the backfill or error messages from the backfill pod
   - Cron job that cleans up completed backfill pods (success or fail)
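
To make the pod template idea more concrete, here is a minimal sketch of how user input could be turned into a pod definition. It assumes the Airflow 2.x `airflow dags backfill` CLI and its `-s`, `-e`, `--mark-success` and `--run-backwards` options (worth checking against your version); the function names, pod name, label and image are hypothetical, and this is not the template described in the comment.

```python
from typing import Any, Dict, List


def backfill_command(
    dag_id: str,
    start_date: str,
    end_date: str,
    mark_success: bool = False,
    run_backwards: bool = False,
) -> List[str]:
    """Build an `airflow dags backfill` command line (Airflow 2.x CLI assumed)."""
    cmd = ["airflow", "dags", "backfill", "-s", start_date, "-e", end_date]
    if mark_success:
        cmd.append("--mark-success")
    if run_backwards:
        cmd.append("--run-backwards")
    cmd.append(dag_id)
    return cmd


def backfill_pod_manifest(name: str, image: str, command: List[str]) -> Dict[str, Any]:
    """Return a minimal Kubernetes Pod manifest that runs the command once."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        # The label is a made-up convention so a cleanup job could find these pods.
        "metadata": {"name": name, "labels": {"app": "airflow-backfill"}},
        "spec": {
            "restartPolicy": "Never",  # a finished backfill should not restart
            "containers": [{"name": "backfill", "image": image, "command": command}],
        },
    }
```

For example, `backfill_pod_manifest("backfill-my-dag", "my-airflow-image", backfill_command("my_dag", "2021-01-01", "2021-01-31"))` yields a dictionary that could be submitted to the cluster by whatever client the UI uses.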
   
   
   **_I wanted to share our ideas and hope to converge with Airflow's implementation so that we do not need to manage our own patch. We are happy to see this issue being assigned. Please let me know if there is more information that I can share to help push this further. Otherwise, we will be happy with whichever implementation Airflow chooses to support a better backfill experience._**



[GitHub] [airflow] potiuk commented on issue #18317: Better Backfill User Experience

Posted by GitBox <gi...@apache.org>.

potiuk commented on issue #18317:
URL: https://github.com/apache/airflow/issues/18317#issuecomment-924730862

> Hi, I guess the fact that the manual for how to backfill is 4 pages long when printed speaks to the need for a better UX.

I understand your frustration with Airflow being a complex system, but I am afraid this is a fact of life that you have to live with and embrace.

The `4 page long manual needed speaks to the need of better UX` is quite a bold statement, which completely fails to take into account who the users are, what the use cases are, and how complex cases are handled. I am not sure how you could make such a "general" statement, because IMHO it takes an extremely narrow view and is only valid for one small group of users (who are not even likely to be Airflow users).

If you were developing a tool for "non-skilled users", then this statement would be right. But if you create a tool for specialists who know what they are doing and have to be able to execute a number of complex tasks with plenty of variations, then you need to develop a tool that does the job - and sometimes that means some operations require a complex UX.

Take a look at Git. Is the UX simple? Forget it. I know just a few commands by heart (probably a few % of what's there). But whenever I need something, I search Google, possibly even look up the documentation (sometimes many tens of pages long), find the right thing, and execute one command with ~30 parameters which does exactly what I want.

This is the same with Airflow. Its UX is complex because it allows complex operations to be executed.

And since Airflow is developed by the community, if you have the feeling that the UX is too complex, you are absolutely free to propose changes to simplify it. If you have any concrete ideas for how to do it, I would encourage you to do so. We have GitHub discussions and devlist discussions, and we have AIPs that you can write and send to the devlist to propose improvements. You are absolutely free to do it (from me you just got some explanation and a warning that if you want to approach it, you will likely have to think about the consequences and the impact it has on the architecture).

BTW, since this is neither a bug nor a concrete feature proposal, I will convert it into a discussion.
