Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2022/03/01 11:11:11 UTC
[GitHub] [airflow] yeachan153 opened a new issue #21892: Tasks are queued deterministically, deadlocking the scheduler when there are many dag-on-dag dependencies
yeachan153 opened a new issue #21892:
URL: https://github.com/apache/airflow/issues/21892
### Apache Airflow version
2.1.4
### What happened
We are currently running Airflow 2.1.4 with the Celery executor on Kubernetes. In our setup, many individual upstream ETL DAGs are followed by many downstream user-created DAGs, which wait for that day's upstream ETL DAG runs to finish before starting. User DAGs can also depend on other upstream user DAGs. Overall, a downstream user DAG can depend on multiple upstream ETLs, one upstream ETL can be required by many downstream user DAGs, and user DAGs can also depend on one or more upstream user DAGs.
To ensure that a user-created task does not execute before the upstream ETL(s) or other user DAGs have finished and the data it depends on is available, we currently use the `ExternalTaskSensor` and pass the name of the upstream DAG(s) to the `external_dag_id` argument.
We found that the scheduler does not distinguish between tasks that actually do work (e.g. running an ETL task) and the `ExternalTaskSensor` of a dag which just waits for the upstream dag to finish when taking decisions on which tasks to set to the queued state: https://github.com/apache/airflow/blob/2.1.4/airflow/jobs/scheduler_job.py#L322
At some point, only `ExternalTaskSensor`s were up for execution, blocking the upstream ETLs from running. We tried to get around this by increasing the priority weights of the tasks in the ETL DAGs, since these tasks should always run before user-related tasks. However, we still run into the same problem when user DAGs depend on the DAG runs of other user DAGs. In these cases, it is not feasible for us to identify the entire dependency chain each time and set the correct priority weights so that scheduling does not end up blocked.
In the end, we replaced this [line](https://github.com/apache/airflow/blob/614858fb7d443880451e6111b27fdaf942f563a4/airflow/jobs/scheduler_job.py#L331) with `.order_by(func.random())`, so that the scheduler no longer queries the tasks to set to queued in the same deterministic order on every loop.
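As an illustration only, here is a toy model of how a fixed ordering can starve a runnable task while a per-loop random ordering does not. It is not Airflow's scheduler code: `Candidate`, `can_queue`, `max_tis` and the helper functions are hypothetical names, and the model assumes that each scheduler loop examines only the first `max_tis` candidates and skips the ones it cannot queue.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A scheduled task instance in this toy model (not an Airflow class)."""
    name: str
    priority_weight: int
    can_queue: bool  # False stands in for "blocked this loop", e.g. no free slot


def scheduler_loop(candidates, max_tis, order):
    """Examine at most max_tis candidates in the given order; return those queued."""
    window = order(candidates)[:max_tis]
    return [c.name for c in window if c.can_queue]


def fixed_order(candidates):
    # Same result on every loop: highest priority first, ties broken by name.
    return sorted(candidates, key=lambda c: (-c.priority_weight, c.name))


def random_order(candidates, rng=random):
    # Analogue of ORDER BY random(): a fresh permutation on every loop.
    return rng.sample(candidates, k=len(candidates))


def loops_until_queued(candidates, target, max_tis, order, max_loops=1000):
    """Return the 1-based loop in which `target` is first queued, or None."""
    for loop in range(1, max_loops + 1):
        if target in scheduler_loop(candidates, max_tis, order):
            return loop
    return None


# Five blocked sensors sort ahead of the one runnable ETL task.
sensors = [Candidate(f"sensor_{i}", priority_weight=2, can_queue=False) for i in range(5)]
etl = Candidate("etl_load", priority_weight=1, can_queue=True)
candidates = sensors + [etl]

starved = loops_until_queued(candidates, "etl_load", max_tis=3, order=fixed_order)
shuffled = loops_until_queued(candidates, "etl_load", max_tis=3, order=random_order)
print(starved)   # None: the same three sensors fill the window on every loop
print(shuffled)  # a loop number that varies between runs
```

With the fixed ordering the runnable task is never inside the examined window, so it is never queued. With the random ordering it has a 50% chance per loop of landing in the window in this setup.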
### What you expected to happen
The scheduler can queue and run upstream tasks first.
We noticed that dagrun scheduling behaviour also takes into consideration the last scheduling decision. Perhaps something similar can be added when querying task instances so that the result of tasks to queue is not deterministic? https://github.com/apache/airflow/blob/614858fb7d443880451e6111b27fdaf942f563a4/airflow/models/dagrun.py#L248
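A rough sketch of what "taking the last scheduling decision into account" could look like for task instances, again as a hypothetical toy model rather than Airflow code. `Candidate`, `last_examined`, `max_tis` and `examine_least_recent` are made-up names. The idea is to examine the least recently examined candidates first, so that every candidate is looked at within a bounded number of loops without any randomness.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    """A task instance in this toy model (not an Airflow class)."""
    name: str
    can_queue: bool
    last_examined: Optional[int] = None  # loop number; None means never examined


def examine_least_recent(candidates, max_tis, loop):
    """Pick the max_tis least recently examined candidates and stamp them."""
    # Never-examined candidates sort first, then oldest stamp, then name.
    ordered = sorted(
        candidates,
        key=lambda c: (c.last_examined is not None, c.last_examined or 0, c.name),
    )
    window = ordered[:max_tis]
    for c in window:
        c.last_examined = loop
    return [c.name for c in window if c.can_queue]


# The runnable task is named so that it sorts last, which is the worst case.
candidates = [Candidate(f"sensor_{i}", can_queue=False) for i in range(5)]
candidates.append(Candidate("z_etl_load", can_queue=True))

history = [examine_least_recent(candidates, max_tis=3, loop=n) for n in (1, 2, 3)]
print(history)  # [[], ['z_etl_load'], []]
```

In this sketch the runnable task is reached in the second loop, because the three sensors examined in the first loop move to the back of the ordering.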
### How to reproduce
_No response_
### Operating System
Debian GNU/Linux 10 (buster)
### Versions of Apache Airflow Providers
apache-airflow-providers-amazon==2.2.0
apache-airflow-providers-celery==2.0.0
apache-airflow-providers-cncf-kubernetes==2.0.2
apache-airflow-providers-docker==2.1.1
apache-airflow-providers-elasticsearch==2.0.3
apache-airflow-providers-ftp==2.0.1
apache-airflow-providers-google==5.1.0
apache-airflow-providers-grpc==2.0.1
apache-airflow-providers-hashicorp==2.1.0
apache-airflow-providers-http==2.0.1
apache-airflow-providers-imap==2.0.1
apache-airflow-providers-microsoft-azure==3.1.1
apache-airflow-providers-mysql==2.1.1
apache-airflow-providers-odbc==2.0.1
apache-airflow-providers-postgres==2.2.0
apache-airflow-providers-redis==2.0.1
apache-airflow-providers-sendgrid==2.0.1
apache-airflow-providers-sftp==2.1.1
apache-airflow-providers-slack==4.0.1
apache-airflow-providers-sqlite==2.0.1
apache-airflow-providers-ssh==2.1.1
### Deployment
Other Docker-based deployment
### Deployment details
We are using `apache/airflow:2.1.4-python3.7` as the base image, and deploying the components in Kubernetes (GKE):
- Airflow Webserver (x2)
- Airflow Scheduler (x2)
- Airflow Worker (x4, parallelism 200)
- Redis broker
### Anything else
The issue occurs almost daily without the randomisation patch (most of our DAGs are on a daily schedule).
### Are you willing to submit PR?
- [X] Yes I am willing to submit a PR!
### Code of Conduct
- [X] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [airflow] boring-cyborg[bot] commented on issue #21892: Tasks are queued deterministically, deadlocking the scheduler when there are many dag-on-dag dependencies
Posted by GitBox <gi...@apache.org>.
boring-cyborg[bot] commented on issue #21892:
URL: https://github.com/apache/airflow/issues/21892#issuecomment-1055315312
Thanks for opening your first issue here! Be sure to follow the issue template!
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [airflow] tanelk edited a comment on issue #21892: Tasks are queued deterministically, deadlocking the scheduler when there are many dag-on-dag dependencies
Posted by GitBox <gi...@apache.org>.
tanelk edited a comment on issue #21892:
URL: https://github.com/apache/airflow/issues/21892#issuecomment-1055858365
Would using `mode='reschedule'` on the `ExternalTaskSensor`s solve it for you? You might want to increase the `poke_interval` then.
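To make the trade-off concrete, here is a back-of-the-envelope model of how much worker-slot time a single sensor consumes in each mode. It is a simplification and not Airflow code: `slot_seconds` and `poke_cost` are hypothetical, and the model only assumes that a sensor in `poke` mode keeps its slot for the whole wait while a sensor in `reschedule` mode releases it between checks.

```python
import math


def slot_seconds(wait_seconds, poke_interval, mode, poke_cost=1):
    """Worker-slot time one sensor consumes while its upstream takes wait_seconds.

    poke_cost is a hypothetical duration, in seconds, of a single check.
    """
    if mode == "poke":
        # The sensor sleeps inside its slot between checks, so it holds the
        # slot for the whole wait plus the final successful check.
        return wait_seconds + poke_cost
    if mode == "reschedule":
        # The slot is released between checks; only the checks themselves count.
        checks = math.ceil(wait_seconds / poke_interval) + 1
        return checks * poke_cost
    raise ValueError(f"unknown mode: {mode!r}")


one_hour = 3600
print(slot_seconds(one_hour, poke_interval=60, mode="poke"))         # 3601
print(slot_seconds(one_hour, poke_interval=60, mode="reschedule"))   # 61
print(slot_seconds(one_hour, poke_interval=300, mode="reschedule"))  # 13
```

A larger `poke_interval` means fewer checks in `reschedule` mode, and therefore fewer times the task has to be queued again.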
[GitHub] [airflow] yeachan153 commented on issue #21892: Tasks are queued deterministically, deadlocking the scheduler when there are many dag-on-dag dependencies
Posted by GitBox <gi...@apache.org>.
yeachan153 commented on issue #21892:
URL: https://github.com/apache/airflow/issues/21892#issuecomment-1056694471
@tanelk thanks for your reply. I don't think this would work because of bug #10790, and even if it did, it seems like more of a workaround than an actual fix to the scheduling behaviour.
[GitHub] [airflow] tanelk commented on issue #21892: Tasks are queued deterministically, deadlocking the scheduler when there are many dag-on-dag dependencies
Posted by GitBox <gi...@apache.org>.
tanelk commented on issue #21892:
URL: https://github.com/apache/airflow/issues/21892#issuecomment-1057168159
Well, as far as I know, the reschedule mode was made to solve exactly the issue you have - sensors clogging executor slots.
Another alternative would be to use deferrable sensors. There is an `ExternalTaskSensorAsync` in the just-released astronomer-providers: https://github.com/astronomer/astronomer-providers/blob/main/CHANGELOG.rst