Posted to commits@airflow.apache.org by "Michael Smith (JIRA)" <ji...@apache.org> on 2019/06/07 15:40:00 UTC

[jira] [Created] (AIRFLOW-4747) Airflow Scheduling and DAG Parsing

Michael Smith created AIRFLOW-4747:
--------------------------------------

Summary: Airflow Scheduling and DAG Parsing
Key: AIRFLOW-4747
URL: https://issues.apache.org/jira/browse/AIRFLOW-4747
Project: Apache Airflow
Issue Type: Wish
Components: scheduler
Affects Versions: 1.10.2
Reporter: Michael Smith

I read somewhere that there was going to be an attempt to decouple Airflow's DAG parsing from its scheduling function. My assumption is that this could be achieved, for example, by driving scheduler actions (almost?) entirely from the Airflow database. Would this eliminate the need for a continuously running DAG parse process?

At present we observe significant lag and overhead with the current (1.10.2) scheduling model, which appears to be tightly coupled to DAG parsing. In our environment, DAG parse times are typically >1 sec per DAG, which means a single DAG parse cycle can take several minutes. DAG parsing is also a large CPU overhead (for example, on a single-node cloud VM we've been forced to allocate 2 CPU nodes). In addition, production jobs suffer from fairly large lag times between tasks (the time between one task ending and its follow-on task starting). This can be on the order of minutes, even when task slots are available.
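
To make the arithmetic behind "several minutes" concrete, here is a rough back-of-the-envelope sketch. The DAG count (200), the per-DAG time (1.5 sec), and the function name are hypothetical values chosen for illustration, not measurements from our environment, and the model assumes DAG files are simply parsed in equal batches across the available parser processes. The real scheduler has additional overheads that this ignores.

```python
import math


def estimate_parse_cycle_seconds(num_dags, seconds_per_dag, parser_processes=1):
    """Estimate one full DAG parse cycle under a simplified batch model.

    Hypothetical helper: assumes every DAG takes the same time to parse and
    that DAGs are parsed in batches of `parser_processes` at a time.
    """
    if num_dags < 0 or seconds_per_dag < 0:
        raise ValueError("num_dags and seconds_per_dag must be non-negative")
    if parser_processes < 1:
        raise ValueError("parser_processes must be at least 1")
    # Each batch parses up to `parser_processes` DAGs concurrently and
    # takes roughly `seconds_per_dag` to finish.
    batches = math.ceil(num_dags / parser_processes)
    return batches * seconds_per_dag


if __name__ == "__main__":
    # Illustrative numbers only: 200 DAGs at 1.5 sec each.
    for procs in (1, 2, 4):
        secs = estimate_parse_cycle_seconds(200, 1.5, procs)
        print(f"{procs} parser process(es): ~{secs / 60:.1f} min per cycle")
```

Under these assumed numbers, a single parser process needs about five minutes per cycle, so a finished task could wait a comparable time before the next parse pass considers its follow-on task.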

Is anyone working on this enhancement, or could anyone provide guidance on resolving it? (It is possibly a configuration issue on our side, but I have experimented extensively with the configuration options.)
