You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2022/08/26 12:16:32 UTC

[GitHub] [airflow] deepchand opened a new issue, #25981: Increased scheduling loop time which is causing missing heartbeat of scheduler

deepchand opened a new issue, #25981:
URL: https://github.com/apache/airflow/issues/25981

   ### Apache Airflow version
   
   Other Airflow 2 version
   
   ### What happened
   
   Significant increase in scheduling loop time when we are crossing following parameters 
   ```
   parallelism > 60
   max_active_tasks_per_dag > 60 
   max_active_runs_per_dag > 60
   ```
   We are running around 60 DAG Run of a single dag which have 10 tasks, once we cross around 60 DAG Run scheduling loop time increased upto 300 sec which is causing below warning
   `The scheduler does not appear to be running. Last heartbeat was received X minutes ago`
   and also not schedules new tasks which degrades the performance of airflow. We have are running two scheduler and both have enough resources available. We create all the dag run via rest api only. 
   This is happening due to below function which is updating the state of the each task separately 
   [https://github.com/apache/airflow/blob/9ac742885ffb83c15f7e3dc910b0cf9df073407a/airflow/executors/celery_executor.py#L312](url) 
   Some of the other config settings are below which we are using
   ```
   job_heartbeat_sec = 5
   scheduler_heartbeat_sec = 5
   num_runs = -1
   scheduler_idle_sleep_time = 1
   scheduler_health_check_threshold = 30
   max_tis_per_query = 128
   use_row_level_locking = True
   max_dagruns_to_create_per_loop = 5
   max_dagruns_per_loop_to_schedule = 20
   ```
   <img width="1323" alt="Screenshot 2022-08-26 at 5 03 06 PM" src="https://user-images.githubusercontent.com/18530606/186894944-3337e824-2345-4078-abca-0a293eda52e3.png">
   <img width="834" alt="Screenshot 2022-08-26 at 5 45 28 PM" src="https://user-images.githubusercontent.com/18530606/186901378-d2cc21ba-d320-4d3d-a8c7-254e8b0ccb8f.png">
   
   
   ### What you think should happen instead
   
   As per my debugging i have seen the below function is taking significant time while updating the state of task 
   [https://github.com/apache/airflow/blob/9ac742885ffb83c15f7e3dc910b0cf9df073407a/airflow/executors/celery_executor.py#L312](url)
   
   Although we are updating the state of tasks from below function in bulk
   [https://github.com/apache/airflow/blob/9ac742885ffb83c15f7e3dc910b0cf9df073407a/airflow/executors/celery_executor.py#L330](url)
   
   So my question is can we avoid updating the state of tasks from [this](https://github.com/apache/airflow/blob/9ac742885ffb83c15f7e3dc910b0cf9df073407a/airflow/executors/celery_executor.py#L312) function as it is consuming almost 95% time of a scheduling loop because anyway we are updating the state of tasks from [this]([https://github.com/apache/airflow/blob/9ac742885ffb83c15f7e3dc910b0cf9df073407a/airflow/executors/celery_executor.py#L330]) function in bulk 
   
   When i did the same i have found significant decrease in scheduling loop
   <img width="1316" alt="Screenshot 2022-08-26 at 5 22 31 PM" src="https://user-images.githubusercontent.com/18530606/186897601-b2b23c9c-baec-40fe-bdf6-f3c126009097.png">
   
   
   ### How to reproduce
   
   We can reproduce this by increasing the below parameter to  
   parallelism > 60
   max_active_tasks_per_dag > 60 
   max_active_runs_per_dag > 60
   
   Dag should have atleast more then 10 tasks
   Note : In our case our dag is doing ssh on some remote machines and performs some package upgrade tasks on the machines
   
   ### Operating System
   
   Linux (centos 8.2.2004)
   
   ### Versions of Apache Airflow Providers
   
   ```
   apache-airflow-providers-celery==2.1.0
   apache-airflow-providers-ftp==2.0.1
   apache-airflow-providers-http==2.0.1
   apache-airflow-providers-imap==2.0.1
   apache-airflow-providers-microsoft-azure==3.3.0
   apache-airflow-providers-redis==2.0.1
   apache-airflow-providers-slack==4.1.0
   apache-airflow-providers-sqlite==2.0.1
   ```
   
   ### Deployment
   
   Other
   
   ### Deployment details
   
   ```
   Airflow v2.2.2
   Postgres 11
   CeleryExecutor
   Python 3.6
   ```
   Constraints used  - [https://raw.githubusercontent.com/apache/airflow/constraints-2.2.2/constraints-3.6.txt](url)
   Cluster Details - 
   Schedulers VM (Having 16vcpu, 64GB memory)- 2
   Wroker VM(16vcpu, 64GB memory) - 4
   Webserver VM - 1
   
   Some of the core config parameters 
   ```
   sql_alchemy_pool_size = 200
   executor = CeleryExecutor
   ```
   
   ### Anything else
   
   _No response_
   
   ### Are you willing to submit PR?
   
   - [X] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] boring-cyborg[bot] commented on issue #25981: Increased scheduling loop time which is causing missing heartbeat of scheduler

Posted by GitBox <gi...@apache.org>.
boring-cyborg[bot] commented on issue #25981:
URL: https://github.com/apache/airflow/issues/25981#issuecomment-1228417117

   Thanks for opening your first issue here! Be sure to follow the issue template!
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk commented on issue #25981: Increased scheduling loop time which is causing missing heartbeat of scheduler

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #25981:
URL: https://github.com/apache/airflow/issues/25981#issuecomment-1228719824

   As discussed in https://github.com/apache/airflow/discussions/25991 - this is is caused by super-slow database which needs to be investigated by the user.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk closed issue #25981: Increased scheduling loop time which is causing missing heartbeat of scheduler

Posted by GitBox <gi...@apache.org>.
potiuk closed issue #25981: Increased scheduling loop time which is causing missing heartbeat of scheduler
URL: https://github.com/apache/airflow/issues/25981


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org