Posted to commits@airflow.apache.org by "Gerard Toonstra (JIRA)" <ji...@apache.org> on 2017/04/24 19:19:04 UTC

[jira] [Commented] (AIRFLOW-1139) Scheduler runs very slowly when many DAGs in DAG directory

    [ https://issues.apache.org/jira/browse/AIRFLOW-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15981718#comment-15981718 ] 

Gerard Toonstra commented on AIRFLOW-1139:
------------------------------------------

Hi David,

That's because the reprocessing of a DAG is tied to the scheduler cycle. DAGs can be dynamic, so there are cases where you don't know in advance what task instances are going to be in there. What basically happens:

- the scheduler starts threads to process DAG files.
- each thread picks one DAG to process from the available DAGs.
- when a DAG is instantiated, all code at global level runs (so it actually creates the task objects).
- if the DAG's schedule interval has passed, it creates a dagrun db object and task instance db objects, basically scheduling the dagrun.
- it is the file processing thread that does this, not the main scheduler cycle itself.
- now that the database contains new dagruns and new task instances to schedule, the main scheduler loop will discover them when it checks for new task instances to run.
- they get sent to an executor.
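
The steps above can be sketched as a toy model in plain Python. This is not Airflow's code: ToyDag, ToyDb, process_dag_file and scheduler_loop_once are hypothetical names invented for illustration, and the "interval passed" check is deliberately simplified. It only shows the division of work, where the file processor creates the db records and the main loop merely picks them up.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class ToyDag:
    # Hypothetical stand-in for a parsed DAG (not an Airflow class).
    dag_id: str
    interval: timedelta
    task_ids: list
    last_run: datetime


@dataclass
class ToyDb:
    # Hypothetical stand-in for the metadata database.
    dag_runs: list = field(default_factory=list)
    task_instances: list = field(default_factory=list)


def process_dag_file(dag, db, now):
    """Toy version of what a file-processing thread does for one DAG."""
    if now - dag.last_run >= dag.interval:
        # The interval has passed: record a dagrun and its task instances.
        db.dag_runs.append((dag.dag_id, now))
        for task_id in dag.task_ids:
            db.task_instances.append(
                {"dag_id": dag.dag_id, "task_id": task_id, "state": "scheduled"}
            )
        dag.last_run = now


def scheduler_loop_once(db, executor_queue):
    """Toy version of the main loop: hand schedulable task instances to the executor."""
    for ti in db.task_instances:
        if ti["state"] == "scheduled":
            ti["state"] = "queued"
            executor_queue.append((ti["dag_id"], ti["task_id"]))
```

In this model, calling scheduler_loop_once before process_dag_file finds nothing to run, which is the coupling described above: the main loop can only discover work after a file-processing pass has written it to the database.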

The min_file_process_interval setting is one way to try to manage this, but increasing it also lengthens the delay before each DAG is analyzed for scheduling again, so new dagruns and task instances get created later.

In your case it may be better to reduce max_threads, which is set to 2 by default. This setting controls the number of threads allocated to DAG file processors.
That could leave you with a single thread continuously analyzing DAGs, but it frees up one thread for task execution.
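
As a rough illustration of the two settings discussed here, the snippet below parses a sample config fragment with Python's standard configparser. The placement of both options under a [scheduler] section and the sample values are assumptions for illustration; check them against your own airflow.cfg before changing anything.

```python
import configparser

# Sample fragment only. Section name and values are assumptions, not
# copied from a real installation.
SAMPLE_CFG = """
[scheduler]
min_file_process_interval = 0
max_threads = 1
"""

parser = configparser.ConfigParser()
parser.read_string(SAMPLE_CFG)

# Number of threads used for DAG file processing (reduced from 2 to 1 here).
max_threads = parser.getint("scheduler", "max_threads")
# Minimum number of seconds between two processing passes of the same file.
min_interval = parser.getint("scheduler", "min_file_process_interval")

print(f"max_threads={max_threads}, min_file_process_interval={min_interval}")
```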

I'll raise this on the dev list with a link back here. This way, committers can verify my explanation and there may be a smarter way to improve processing performance.


> Scheduler runs very slowly when many DAGs in DAG directory
> ----------------------------------------------------------
>
>                 Key: AIRFLOW-1139
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-1139
>             Project: Apache Airflow
>          Issue Type: Improvement
>    Affects Versions: 1.8.0
>         Environment: macOS Sierra, v10.12.2, MacBook Pro, 2.5 GHz Intel Core i7, 16 GB RAM
>            Reporter: David Vaughan
>            Priority: Minor
>              Labels: performance
>
> When we have several (10-15) DAGs in our DAG directory, and each of them is pretty large (~900 tasks on average), Airflow's periodic re-processing of the DAGs in our DAG directory takes a long time and takes resources away from running DAGs.
> Almost always we only have one DAG actually running at any given time, and the rest are paused. The one running DAG, however, crawls along noticeably slower than if we only have one or two DAGs total in the DAG directory.
> I think it would be nice to have an option to turn off re-processing of DAGs completely, after the initial processing.
> The way we use Airflow right now, we don't edit our existing DAGs frequently, so we have no need for periodic refresh. We have experimented with the min_file_process_interval option in airflow.cfg, but setting it to small numbers causes no noticeable change, and setting it to very large numbers (to emulate not refreshing at all) actually causes the DAG to run much slower than it already was.
> Is anybody else still experiencing this? Are there existing ways to avoid this problem? Here are some links to people referencing, I believe, this same issue, but they're all from last year:
> https://issues.apache.org/jira/browse/AIRFLOW-160
> https://github.com/apache/incubator-airflow/pull/1636
> https://issues.apache.org/jira/browse/AIRFLOW-435
> http://stackoverflow.com/questions/40466732/apache-airflow-scheduler-slowness
> Thanks in advance for any discussion or help.


