Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2022/03/01 11:11:11 UTC

[GitHub] [airflow] yeachan153 opened a new issue #21892: Tasks are queued deterministically, deadlocking the scheduler when there are many dag-on-dag dependencies

yeachan153 opened a new issue #21892:
URL: https://github.com/apache/airflow/issues/21892


   ### Apache Airflow version
   
   2.1.4
   
   ### What happened
   
   We are currently running Airflow 2.1.4 with the Celery executor on Kubernetes. In our use case, many individual upstream ETL DAGs are followed by many downstream user-created DAGs, which wait for that day's upstream ETL DAG runs to finish before starting. User jobs can also depend on other upstream user jobs. Overall, a downstream user DAG can depend on multiple upstream ETLs, one upstream ETL can be required by many downstream user DAGs, and user DAGs can also depend on one or more upstream user DAGs.
   
   To ensure that a user-created task does not execute before the upstream ETL(s) or other user DAGs have finished and the data it depends on is available, we currently use the `ExternalTaskSensor` and pass the name of the upstream DAG(s) to the `external_dag_id` argument.
   
   We found that, when deciding which tasks to set to the queued state, the scheduler does not distinguish between tasks that actually do work (e.g. running an ETL task) and a DAG's `ExternalTaskSensor`, which just waits for the upstream DAG to finish: https://github.com/apache/airflow/blob/2.1.4/airflow/jobs/scheduler_job.py#L322
   
   At some point, only `ExternalTaskSensor`s were up for execution, blocking the upstream ETLs from running. We tried to work around this by increasing the priority weights of the tasks in the ETL DAGs, since these tasks should always run before user-related tasks. However, we still hit the same problem when user DAGs depend on the DAG runs of other user DAGs. In these cases, it is not feasible for us to identify the entire dependency chain each time and set the correct priority weights so that scheduling does not end up blocked.
   
   In the end, we replaced this [line](https://github.com/apache/airflow/blob/614858fb7d443880451e6111b27fdaf942f563a4/airflow/jobs/scheduler_job.py#L331) with `.order_by(func.random())`, so that the scheduler no longer queries the tasks to queue in a deterministic order on every loop.
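
To illustrate why a fixed ordering can starve upstream work, here is a toy model in plain Python (not Airflow code). It assumes a hypothetical fixed sort key (priority, then name), a pool of 4 slots, and sensors that return to the candidate pool after every loop because their upstream has not finished. `Task`, `pick_deterministic`, `pick_random`, and `loops_until_upstream_runs` are illustrative names only.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    priority: int
    is_sensor: bool


def pick_deterministic(candidates, slots):
    # Fixed sort key: highest priority first, then name. Same input, same pick.
    return sorted(candidates, key=lambda t: (-t.priority, t.name))[:slots]


def pick_random(candidates, slots, rng):
    # Analogue of ORDER BY random(): a fresh random pick on every scheduler loop.
    return rng.sample(candidates, slots)


def loops_until_upstream_runs(pick, candidates, slots, max_loops=100):
    """Return the first loop in which a non-sensor task gets a slot, else None."""
    for loop in range(1, max_loops + 1):
        if any(not task.is_sensor for task in pick(candidates, slots)):
            return loop
    return None


# Four sensors wait on one ETL task; all share the same priority.
tasks = [Task(f"a_sensor_{i}", 1, True) for i in range(4)]
tasks.append(Task("z_etl_load", 1, False))
rng = random.Random(0)

stuck = loops_until_upstream_runs(pick_deterministic, tasks, slots=4)
unstuck = loops_until_upstream_runs(
    lambda candidates, slots: pick_random(candidates, slots, rng), tasks, slots=4
)
print("deterministic:", stuck)  # None: the ETL task never gets a slot
print("random:", unstuck)       # some loop number: the ETL task eventually runs
```

In this toy model the random pick ignores `priority` entirely, which is a trade-off of the simplification rather than a recommendation.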
   
   ### What you expected to happen
   
   The scheduler can queue and run upstream tasks first.
   
   We noticed that DAG run scheduling also takes the last scheduling decision into consideration. Perhaps something similar could be added when querying task instances, so that the set of tasks to queue is not deterministic? https://github.com/apache/airflow/blob/614858fb7d443880451e6111b27fdaf942f563a4/airflow/models/dagrun.py#L248
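
One way to picture that suggestion is a least-recently-considered ordering. The sketch below is plain Python, not Airflow code. The `last_considered` bookkeeping, the function name `pick_least_recently_considered`, and the task names are hypothetical, and it assumes that candidates which were never considered sort first.

```python
def pick_least_recently_considered(names, last_considered, slots, loop):
    """Pick up to `slots` names, preferring those considered longest ago.

    Names never considered (missing from `last_considered`) sort first;
    ties fall back to the name so the order stays stable.
    """
    ordered = sorted(names, key=lambda n: (last_considered.get(n, -1), n))
    picked = ordered[:slots]
    for name in picked:
        last_considered[name] = loop  # remember this scheduling decision
    return picked


names = ["a_sensor_0", "a_sensor_1", "a_sensor_2", "a_sensor_3", "z_etl_load"]
last_considered = {}

first = pick_least_recently_considered(names, last_considered, slots=4, loop=1)
second = pick_least_recently_considered(names, last_considered, slots=4, loop=2)
print(first)   # the four sensors fill every slot
print(second)  # the ETL task now sorts first because it was never considered
```

Unlike a random order, this keeps each loop's pick reproducible while still rotating through all candidates.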
   
   ### How to reproduce
   
   _No response_
   
   ### Operating System
   
   Debian GNU/Linux 10 (buster)
   
   ### Versions of Apache Airflow Providers
   
   apache-airflow-providers-amazon==2.2.0
   apache-airflow-providers-celery==2.0.0
   apache-airflow-providers-cncf-kubernetes==2.0.2
   apache-airflow-providers-docker==2.1.1
   apache-airflow-providers-elasticsearch==2.0.3
   apache-airflow-providers-ftp==2.0.1
   apache-airflow-providers-google==5.1.0
   apache-airflow-providers-grpc==2.0.1
   apache-airflow-providers-hashicorp==2.1.0
   apache-airflow-providers-http==2.0.1
   apache-airflow-providers-imap==2.0.1
   apache-airflow-providers-microsoft-azure==3.1.1
   apache-airflow-providers-mysql==2.1.1
   apache-airflow-providers-odbc==2.0.1
   apache-airflow-providers-postgres==2.2.0
   apache-airflow-providers-redis==2.0.1
   apache-airflow-providers-sendgrid==2.0.1
   apache-airflow-providers-sftp==2.1.1
   apache-airflow-providers-slack==4.0.1
   apache-airflow-providers-sqlite==2.0.1
   apache-airflow-providers-ssh==2.1.1
   
   ### Deployment
   
   Other Docker-based deployment
   
   ### Deployment details
   
   We are using `apache/airflow:2.1.4-python3.7` as the base image, and deploying the components in Kubernetes (GKE):
   - Airflow Webserver (x2)
   - Airflow Scheduler (x2)
   - Airflow Worker (x4, parallelism 200)
   - Redis broker
   
   ### Anything else
   
   The issue occurs almost daily without the randomisation patch (most of our DAGs are on a daily schedule).
   
   ### Are you willing to submit PR?
   
   - [X] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] boring-cyborg[bot] commented on issue #21892: Tasks are queued deterministically, deadlocking the scheduler when there are many dag-on-dag dependencies

Posted by GitBox <gi...@apache.org>.
boring-cyborg[bot] commented on issue #21892:
URL: https://github.com/apache/airflow/issues/21892#issuecomment-1055315312


   Thanks for opening your first issue here! Be sure to follow the issue template!
   





[GitHub] [airflow] tanelk edited a comment on issue #21892: Tasks are queued deterministically, deadlocking the scheduler when there are many dag-on-dag dependencies

Posted by GitBox <gi...@apache.org>.
tanelk edited a comment on issue #21892:
URL: https://github.com/apache/airflow/issues/21892#issuecomment-1055858365


   Would using `mode='reschedule'` on the `ExternalTaskSensor`s solve it for you? You might want to increase the `poke_interval` then.
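
As a rough illustration of why reschedule mode relieves slot pressure, the plain-Python model below compares the worker-slot time one sensor consumes in each mode. It is a sketch under simplifying assumptions, not Airflow code: `slot_seconds` and `poke_duration` are hypothetical names, and each poke is assumed to cost a fixed few seconds of slot time.

```python
def slot_seconds(wait_seconds, poke_interval, poke_duration, mode):
    """Worker-slot time one sensor uses while its upstream takes `wait_seconds`."""
    if mode == "poke":
        # The sensor keeps its slot, sleeping between pokes, for the whole wait.
        return wait_seconds
    if mode == "reschedule":
        # The slot is released between pokes; only the pokes themselves cost time.
        pokes = wait_seconds // poke_interval + 1  # one poke at t=0, then one per interval
        return pokes * poke_duration
    raise ValueError(f"unknown mode: {mode!r}")


one_hour = 3600
print(slot_seconds(one_hour, 60, 5, "poke"))         # 3600
print(slot_seconds(one_hour, 60, 5, "reschedule"))   # 305
print(slot_seconds(one_hour, 300, 5, "reschedule"))  # 65
```

In this model a longer `poke_interval` means fewer pokes, which is one reason to raise it when switching to reschedule mode.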









[GitHub] [airflow] yeachan153 commented on issue #21892: Tasks are queued deterministically, deadlocking the scheduler when there are many dag-on-dag dependencies

Posted by GitBox <gi...@apache.org>.
yeachan153 commented on issue #21892:
URL: https://github.com/apache/airflow/issues/21892#issuecomment-1056694471


   @tanelk thanks for your reply. I don't think this would work because of bug #10790, and even if it did, it seems more like a workaround than an actual fix to the scheduling behaviour.





[GitHub] [airflow] tanelk commented on issue #21892: Tasks are queued deterministically, deadlocking the scheduler when there are many dag-on-dag dependencies

Posted by GitBox <gi...@apache.org>.
tanelk commented on issue #21892:
URL: https://github.com/apache/airflow/issues/21892#issuecomment-1057168159


   Well, as far as I know, the reschedule mode is made to solve exactly the issue you have: sensors clogging executor slots.
   
   Another alternative would be using deferrable sensors. There is `ExternalTaskSensorAsync` in the just-released astronomer-providers: https://github.com/astronomer/astronomer-providers/blob/main/CHANGELOG.rst
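
The idea behind deferrable sensors can be sketched with plain `asyncio`: many waits share one event loop instead of each holding a worker slot. This is an illustrative model only, not the Airflow triggerer or the astronomer-providers API. `wait_for_upstream` and `upstream_done` are hypothetical names.

```python
import asyncio


async def wait_for_upstream(name, upstream_done):
    # Suspends without blocking a thread until the upstream signals completion.
    await upstream_done.wait()
    return f"{name} resumed"


async def main(sensor_count):
    upstream_done = asyncio.Event()
    waiters = [
        asyncio.create_task(wait_for_upstream(f"sensor_{i}", upstream_done))
        for i in range(sensor_count)
    ]
    await asyncio.sleep(0)  # let every waiter start and park on the event
    upstream_done.set()     # the upstream DAG run "finishes"
    return await asyncio.gather(*waiters)


results = asyncio.run(main(100))
print(len(results), results[0])  # 100 sensor_0 resumed
```

All 100 waits run on a single thread here, which is the property that keeps waiting sensors from occupying executor slots.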

