Posted to commits@airflow.apache.org by "Gero Vermaas (Jira)" <ji...@apache.org> on 2019/08/22 07:57:00 UTC

[jira] [Created] (AIRFLOW-5283) Separate scheduling jobs from executing jobs

Gero Vermaas created AIRFLOW-5283:
-------------------------------------

             Summary: Separate scheduling jobs from executing jobs
                 Key: AIRFLOW-5283
                 URL: https://issues.apache.org/jira/browse/AIRFLOW-5283
             Project: Apache Airflow
          Issue Type: Bug
          Components: scheduler
    Affects Versions: 1.10.4, 1.9.0
            Reporter: Gero Vermaas


Currently, Airflow does not schedule new DAG runs once the number of active runs has reached `max_active_runs`; see [this check in jobs.py|https://github.com/apache/airflow/blob/d760d63e1a141a43a4a43daee9abd54cf11c894b/airflow/jobs.py#L768] for Airflow 1.9 (Airflow 1.10 behaves the same).
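A hypothetical paraphrase of that check (not the actual Airflow source, which ties several other conditions into the same branch): new DAG runs are simply not created once the active-run count reaches the limit.

```python
# Simplified, hypothetical paraphrase of the scheduler gate described above.
# In real Airflow this logic lives in jobs.py; names here are illustrative.
def should_create_dag_run(active_runs: int, max_active_runs: int) -> bool:
    # Scheduling is gated on the same limit that caps concurrent execution,
    # which is exactly the coupling this issue asks to break apart.
    return active_runs < max_active_runs
```

With `max_active_runs=1`, a single long-running DAG run makes this return False for every schedule tick until it finishes.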

As a result, if a DAG run (incidentally) takes longer than the interval between scheduled runs, some runs are skipped entirely: the next DAG run to be scheduled is the first one planned after the long-running one finishes.

For example, imagine a DAG that runs every hour with `max_active_runs` set to 1, and suppose the 02:00 DAG run takes (for some reason) 2 hours and 45 minutes to complete instead of the usual 15 minutes.
This would mean that:
 * The DAG run of 02:00 finishes at 04:45
 * The DAG runs of 03:00 and 04:00 are not scheduled because a DAG run is already active and `max_active_runs` is 1
 * The next DAG run to be scheduled is the one for 05:00
 * The DAG runs of 03:00 and 04:00 are never scheduled
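The timeline above can be reproduced with a small standalone simulation (an illustration only, not Airflow code): an hourly schedule, `max_active_runs=1`, and one run that overruns.

```python
# Hypothetical simulation of the behaviour described above: hourly schedule,
# max_active_runs=1, and the 02:00 run taking 2h45m instead of 15 minutes.
from datetime import datetime, timedelta

def simulate(schedule_times, run_duration, max_active_runs=1):
    """Return the schedule times that actually get a DAG run created.

    A run is only created if fewer than max_active_runs runs are still
    active at its scheduled time (mirroring the coupling described above).
    """
    active = []   # (start, end) pairs for currently running DAG runs
    created = []
    for t in schedule_times:
        # drop runs that have finished by time t
        active = [(s, e) for (s, e) in active if e > t]
        if len(active) < max_active_runs:
            end = t + run_duration(t)
            active.append((t, end))
            created.append(t)
    return created

hours = [datetime(2019, 8, 22, h) for h in range(2, 7)]  # 02:00 .. 06:00
duration = lambda t: (timedelta(hours=2, minutes=45) if t.hour == 2
                      else timedelta(minutes=15))
print([t.strftime("%H:%M") for t in simulate(hours, duration)])
# → ['02:00', '05:00', '06:00']  (03:00 and 04:00 are silently skipped)
```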

The problem is that scheduling and execution of DAG runs are both tied to the `max_active_runs` setting. These should be separated so that runs are scheduled at all planned times, but only `max_active_runs` of them execute concurrently.
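The proposed separation could be sketched as follows (a hypothetical design sketch, not an implementation proposal for the actual scheduler): every planned run is created and queued, and `max_active_runs` throttles only how many leave the queue at once.

```python
# Hypothetical sketch of the proposed separation: runs are *created* for
# every planned schedule time, and max_active_runs only throttles execution.
from collections import deque

def process(schedule_times, max_active_runs=1):
    queued = deque(schedule_times)   # every planned run is scheduled
    running = []
    # Execution loop: start queued runs only while under the concurrency cap.
    while queued and len(running) < max_active_runs:
        running.append(queued.popleft())
    return running, list(queued)

running, backlog = process(["02:00", "03:00", "04:00"])
# 02:00 executes; 03:00 and 04:00 stay queued instead of being skipped.
```

The key difference from the current behaviour is that the backlog survives: once the long 02:00 run finishes, 03:00 and 04:00 are still waiting to execute rather than having never been scheduled.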



--
This message was sent by Atlassian Jira
(v8.3.2#803003)