Posted to commits@airflow.apache.org by "Andy Cooper (JIRA)" <ji...@apache.org> on 2018/08/05 04:30:00 UTC

[jira] [Closed] (AIRFLOW-1828) Scheduler Performance Degrades Over Time

     [ https://issues.apache.org/jira/browse/AIRFLOW-1828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andy Cooper closed AIRFLOW-1828.
--------------------------------
    Resolution: Fixed

> Scheduler Performance Degrades Over Time
> ----------------------------------------
>
>                 Key: AIRFLOW-1828
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-1828
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: scheduler
>    Affects Versions: 1.8.1, 1.8.2
>            Reporter: Andy Cooper
>            Priority: Major
>
> Team,
> We use Airflow very heavily internally, across multiple instances. Over time, as we have added more and more tasks to Airflow, we have begun to notice a degradation in scheduler performance. On our heaviest system, throughput eventually drops from 15,000 tasks per hour to fewer than 100. One other note about that particular system: its ~30 DAGs are generated dynamically from a single DAG file.
> We have also begun to see this on the less heavily used instances we host.
> As a company we are happy to take a look at this ourselves, and have in fact dug into the scheduler quite a bit. I am posting here mainly to gather more insight into what is happening.
> - What causes this decrease in scheduler performance?
> - Is restarting the scheduler regularly the only known way to combat it?
> - Is restarting the scheduler on a time interval still the recommended way to handle this?
> - In most cases the bottleneck seems to be in moving tasks from the null state to scheduled, and from scheduled to queued. We also often see only a portion of DAGs having tasks picked up from the queued state. What causes this, and why does it only get worse over time?
> Once we fully understand this problem, we are happy to contribute to the documentation or code base to resolve it, or to make it clearer for people going forward.
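
For readers unfamiliar with the "restart the scheduler on a time interval" workaround asked about above, the following is a minimal sketch of what such a supervisor loop might look like. It is not part of Airflow and not taken from the issue: the function name `run_with_periodic_restart`, its parameters, and the one-hour interval in the usage comment are all hypothetical, and a real deployment would more likely rely on a process manager.

```python
import subprocess
import time


def run_with_periodic_restart(cmd, interval_seconds, max_restarts=None,
                              pause_seconds=1.0):
    """Run `cmd`, restarting it every `interval_seconds`.

    Hypothetical helper for illustration only. Returns the number of
    times the command was started (only returns if max_restarts is set).
    """
    starts = 0
    while max_restarts is None or starts < max_restarts:
        proc = subprocess.Popen(cmd)
        starts += 1
        try:
            # Wait up to interval_seconds for the process to exit on its own.
            proc.wait(timeout=interval_seconds)
        except subprocess.TimeoutExpired:
            # Interval elapsed: ask the process to stop, then force-kill
            # if it has not exited after a grace period.
            proc.terminate()
            try:
                proc.wait(timeout=30)
            except subprocess.TimeoutExpired:
                proc.kill()
                proc.wait()
        # Brief pause so a command that exits immediately is not respawned
        # in a tight loop.
        time.sleep(pause_seconds)
    return starts


# Assumed usage: restart the scheduler once an hour, indefinitely.
# run_with_periodic_restart(["airflow", "scheduler"], interval_seconds=3600)
```

A loop like this only masks the slowdown by resetting the scheduler process; it does not address whatever causes the degradation described in the report.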


