Posted to commits@airflow.apache.org by "Jong Kim (JIRA)" <ji...@apache.org> on 2016/10/25 00:06:58 UTC
[jira] [Created] (AIRFLOW-593) Tasks do not get backfilled sequentially
Jong Kim created AIRFLOW-593:
--------------------------------
Summary: Tasks do not get backfilled sequentially
Key: AIRFLOW-593
URL: https://issues.apache.org/jira/browse/AIRFLOW-593
Project: Apache Airflow
Issue Type: Bug
Components: DagRun, scheduler
Affects Versions: Airflow 1.7.1.3
Reporter: Jong Kim
Priority: Minor
I need the tasks within a DAG to complete in order when running backfills. I am running locally on my Mac using SequentialExecutor.
Let's say I have a DAG running daily at 11AM UTC (0 11 * * *) with a start_date of datetime(2016, 10, 20, 11, 0, 0). The DAG consists of 3 tasks, which must complete in order: task0 -> task1 -> task2. This dependency is set using .set_downstream().
Today (2016/10/22) I reset the database, turn on the DAG using the on/off toggle in the webserver, and run "airflow scheduler", which automatically backfills starting from start_date.
It backfills 2016/10/20 and 2016/10/21. I expect the backfill to run sequentially, like this:
datetime(2016, 10, 20, 11, 0, 0) task0
datetime(2016, 10, 20, 11, 0, 0) task1
datetime(2016, 10, 20, 11, 0, 0) task2
datetime(2016, 10, 21, 11, 0, 0) task0
datetime(2016, 10, 21, 11, 0, 0) task1
datetime(2016, 10, 21, 11, 0, 0) task2
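The expected order above amounts to finishing every task of one execution date before starting the next date. A minimal sketch in plain Python (no Airflow imports) that reproduces it; the helper names and the `now` value are assumptions for illustration, since the report gives only the date 2016/10/22, and a fixed one-day interval stands in for the cron expression:

```python
from datetime import datetime, timedelta

START_DATE = datetime(2016, 10, 20, 11, 0, 0)
INTERVAL = timedelta(days=1)          # stands in for the cron "0 11 * * *"
TASKS = ["task0", "task1", "task2"]   # chain: task0 -> task1 -> task2


def execution_dates(start, interval, now):
    """Yield each execution date whose schedule period has fully elapsed."""
    current = start
    while current + interval <= now:
        yield current
        current += interval


def expected_order(start, interval, now, tasks):
    """Return (execution_date, task) pairs, completing one date before the next."""
    return [(d, t) for d in execution_dates(start, interval, now) for t in tasks]


# Assumed "now": some time after 11:00 UTC on 2016/10/22 (hypothetical value).
now = datetime(2016, 10, 22, 12, 0, 0)
order = expected_order(START_DATE, INTERVAL, now, TASKS)
for execution_date, task in order:
    print(execution_date, task)
```

Under these assumptions the loop yields two execution dates (2016/10/20 and 2016/10/21), giving the six task instances listed above in the same order.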
With 'depends_on_past': False, I see Airflow running tasks grouped by their position in the sequence rather than by date, something like this, which is not what I want:
datetime(2016, 10, 20, 11, 0, 0) task0
datetime(2016, 10, 21, 11, 0, 0) task0
datetime(2016, 10, 20, 11, 0, 0) task1
datetime(2016, 10, 21, 11, 0, 0) task1
datetime(2016, 10, 20, 11, 0, 0) task2
datetime(2016, 10, 21, 11, 0, 0) task2
With 'depends_on_past': True and 'wait_for_downstream': True, I expect it to run the way I need, but instead it runs some tasks out of order, like this:
datetime(2016, 10, 20, 11, 0, 0) task0
datetime(2016, 10, 20, 11, 0, 0) task1
datetime(2016, 10, 21, 11, 0, 0) task0 <- out of order!
datetime(2016, 10, 20, 11, 0, 0) task2 <- out of order!
datetime(2016, 10, 21, 11, 0, 0) task1
datetime(2016, 10, 21, 11, 0, 0) task2
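To make "out of order" concrete, here is a small sketch that compares an observed run log against the strict date-then-task order. It is plain Python, not an Airflow API; the function and variable names are hypothetical, and the two logs are copied from the sequences reported in this issue:

```python
from datetime import datetime

D20 = datetime(2016, 10, 20, 11, 0, 0)
D21 = datetime(2016, 10, 21, 11, 0, 0)


def out_of_order_positions(observed):
    """Return indices where the log differs from strict date-then-task order."""
    # Sorting (execution_date, task) tuples orders by date first, then by task
    # name; task0..task2 sort lexicographically in chain order.
    expected = sorted(observed)
    return [i for i, (seen, want) in enumerate(zip(observed, expected))
            if seen != want]


# Run order reported with 'depends_on_past': False.
grouped_by_task = [
    (D20, "task0"), (D21, "task0"),
    (D20, "task1"), (D21, "task1"),
    (D20, "task2"), (D21, "task2"),
]

# Run order reported with 'depends_on_past': True and 'wait_for_downstream': True.
with_wait_for_downstream = [
    (D20, "task0"), (D20, "task1"),
    (D21, "task0"), (D20, "task2"),
    (D21, "task1"), (D21, "task2"),
]

print(out_of_order_positions(grouped_by_task))
print(out_of_order_positions(with_wait_for_downstream))
```

For the second log the function returns positions 2 and 3 (zero-based), which are the two entries marked "out of order!" above.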
Is this a bug? If not, am I understanding 'depends_on_past' and 'wait_for_downstream' correctly? What do I need to do?
The only remedy I can think of is to backfill each date manually.
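A rough sketch of what backfilling each date manually could look like: generate one "airflow backfill" command per execution date and run them one after another. The DAG id "my_dag" and the helper name are hypothetical, and this assumes that -s and -e accept full timestamps and that a window with identical start and end selects exactly one run; neither has been checked against 1.7.1.3.

```python
from datetime import datetime, timedelta


def backfill_commands(dag_id, start, end, interval=timedelta(days=1)):
    """Build one 'airflow backfill' command string per execution date."""
    commands = []
    current = start
    while current <= end:
        stamp = current.isoformat()  # e.g. 2016-10-20T11:00:00
        # Same start and end, so each command covers a single execution date.
        commands.append(
            "airflow backfill -s {0} -e {0} {1}".format(stamp, dag_id)
        )
        current += interval
    return commands


# "my_dag" is a placeholder DAG id, not the one from the gist.
for command in backfill_commands(
    "my_dag",
    datetime(2016, 10, 20, 11, 0, 0),
    datetime(2016, 10, 21, 11, 0, 0),
):
    print(command)
```

The commands are only printed here; running them sequentially from a shell script would force one date to finish before the next begins.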
Public gist of DAG: https://gist.github.com/jong-eatsa/cba1bf3c182b38e966696da47164faf1