Posted to commits@airflow.apache.org by "Jong Kim (JIRA)" <ji...@apache.org> on 2016/10/25 00:06:58 UTC

[jira] [Created] (AIRFLOW-593) Tasks do not get backfilled sequentially

Jong Kim created AIRFLOW-593:
--------------------------------

             Summary: Tasks do not get backfilled sequentially
                 Key: AIRFLOW-593
                 URL: https://issues.apache.org/jira/browse/AIRFLOW-593
             Project: Apache Airflow
          Issue Type: Bug
          Components: DagRun, scheduler
    Affects Versions: Airflow 1.7.1.3
            Reporter: Jong Kim
            Priority: Minor


I need the tasks within a DAG to complete in order when running backfills. I am running locally on my Mac using SequentialExecutor.

Let's say I have a DAG scheduled daily at 11:00 UTC (0 11 * * *) with start_date: datetime(2016, 10, 20, 11, 0, 0). The DAG consists of 3 tasks, which must complete in order: task0 -> task1 -> task2. These dependencies are set using .set_downstream().

Today (2016/10/22) I reset the database, turned on the DAG using the on/off toggle in the webserver, and ran "airflow scheduler", which automatically backfills starting from start_date.

It backfills 2016/10/20 and 2016/10/21. I expect the backfill to run sequentially, date by date, like this:
datetime(2016, 10, 20, 11, 0, 0) task0
datetime(2016, 10, 20, 11, 0, 0) task1
datetime(2016, 10, 20, 11, 0, 0) task2
datetime(2016, 10, 21, 11, 0, 0) task0
datetime(2016, 10, 21, 11, 0, 0) task1
datetime(2016, 10, 21, 11, 0, 0) task2

With 'depends_on_past': False, Airflow runs the tasks grouped by task rather than by date, something like this, which is not what I want:
datetime(2016, 10, 20, 11, 0, 0) task0
datetime(2016, 10, 21, 11, 0, 0) task0
datetime(2016, 10, 20, 11, 0, 0) task1
datetime(2016, 10, 21, 11, 0, 0) task1
datetime(2016, 10, 20, 11, 0, 0) task2
datetime(2016, 10, 21, 11, 0, 0) task2
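
To make the difference between the two orderings above concrete, here is a minimal sketch in plain Python (no Airflow involved). The names TASKS, DATES, date_major and task_major are made up for illustration; the code only enumerates the two orders and says nothing about how the scheduler actually picks tasks.

```python
from datetime import datetime, timedelta
from itertools import product

# Illustrative values taken from the description above.
TASKS = ["task0", "task1", "task2"]
START = datetime(2016, 10, 20, 11, 0, 0)
DATES = [START + timedelta(days=i) for i in range(2)]  # 10/20 and 10/21


def date_major(dates, tasks):
    """Finish every task of one execution date before starting the next date."""
    return [(d, t) for d, t in product(dates, tasks)]


def task_major(dates, tasks):
    """Run one task for every execution date before moving to the next task."""
    return [(d, t) for t, d in product(tasks, dates)]


if __name__ == "__main__":
    for label, order in (("expected", date_major(DATES, TASKS)),
                         ("observed", task_major(DATES, TASKS))):
        print(label)
        for d, t in order:
            print("  {} {}".format(d.isoformat(), t))
```

Both functions return the same six task instances; only the iteration order differs.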

With 'depends_on_past': True and 'wait_for_downstream': True, I expected it to run in the order I need, but instead it runs some tasks out of order, like this:
datetime(2016, 10, 20, 11, 0, 0) task0
datetime(2016, 10, 20, 11, 0, 0) task1
datetime(2016, 10, 21, 11, 0, 0) task0   <- out of order!
datetime(2016, 10, 20, 11, 0, 0) task2   <- out of order!
datetime(2016, 10, 21, 11, 0, 0) task1
datetime(2016, 10, 21, 11, 0, 0) task2
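
One way to reason about this ordering is to model the constraints in plain Python. The sketch below is a simplified model, not Airflow's scheduler code. It assumes that depends_on_past requires the previous instance of the same task to have succeeded, and that wait_for_downstream additionally requires only the tasks immediately downstream of that previous instance to have succeeded, which is how the Airflow documentation describes the option. All names here (UPSTREAM, DOWNSTREAM, can_run, is_valid_order) are hypothetical.

```python
from datetime import datetime, timedelta

# Hypothetical model of the DAG task0 -> task1 -> task2.
UPSTREAM = {"task0": [], "task1": ["task0"], "task2": ["task1"]}
DOWNSTREAM = {"task0": ["task1"], "task1": ["task2"], "task2": []}

D20 = datetime(2016, 10, 20, 11, 0, 0)
D21 = datetime(2016, 10, 21, 11, 0, 0)
FIRST_DATE = D20
INTERVAL = timedelta(days=1)


def can_run(date, task, done):
    """Return True if (date, task) may start, given the set of finished instances."""
    # Ordinary dependency: upstream tasks of the same execution date are done.
    if any((date, up) not in done for up in UPSTREAM[task]):
        return False
    # The first execution date has no previous instance to wait for.
    if date == FIRST_DATE:
        return True
    prev = date - INTERVAL
    # Assumed depends_on_past rule: previous instance of this task is done.
    if (prev, task) not in done:
        return False
    # Assumed wait_for_downstream rule: tasks IMMEDIATELY downstream of the
    # previous instance are done (not the whole downstream chain).
    return all((prev, down) in done for down in DOWNSTREAM[task])


def is_valid_order(order):
    """Check whether a full run order respects the modelled constraints."""
    done = set()
    for date, task in order:
        if not can_run(date, task, done):
            return False
        done.add((date, task))
    return True


OBSERVED = [(D20, "task0"), (D20, "task1"), (D21, "task0"),
            (D20, "task2"), (D21, "task1"), (D21, "task2")]

if __name__ == "__main__":
    print(is_valid_order(OBSERVED))
```

Under these assumptions the observed order passes the check: 10/21 task0 only waits for 10/20 task0 and 10/20 task1, not for 10/20 task2. If the model is right, the two flags alone would not force strictly date-by-date execution.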

Is this a bug? If not, am I understanding 'depends_on_past' and 'wait_for_downstream' correctly? What do I need to do?

The only remedy I can think of is to backfill each date manually.
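
As a rough sketch of that workaround, the snippet below builds one "airflow backfill" command line per execution date, so the dates can be run one after another. The DAG id "my_dag" and the helper name backfill_commands are placeholders, and passing a full ISO timestamp to -s and -e is an assumption about how the CLI parses dates; adjust it to whatever format works in your installation.

```python
from datetime import datetime, timedelta


def backfill_commands(dag_id, first, last, step=timedelta(days=1)):
    """Build one 'airflow backfill' command line per execution date."""
    commands = []
    current = first
    while current <= last:
        stamp = current.isoformat()
        # Same start and end date, so each command covers a single run.
        commands.append("airflow backfill -s {0} -e {0} {1}".format(stamp, dag_id))
        current += step
    return commands


if __name__ == "__main__":
    # "my_dag" is a placeholder DAG id, not taken from the gist.
    for cmd in backfill_commands("my_dag",
                                 datetime(2016, 10, 20, 11, 0, 0),
                                 datetime(2016, 10, 21, 11, 0, 0)):
        print(cmd)
```

Running the printed commands in sequence (for example from a shell script that stops on the first failure) would finish all tasks for one date before the next date starts.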

Public gist of DAG: https://gist.github.com/jong-eatsa/cba1bf3c182b38e966696da47164faf1



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)