Posted to commits@airflow.apache.org by "Bolke de Bruin (JIRA)" <ji...@apache.org> on 2016/04/29 15:59:13 UTC

[jira] [Commented] (AIRFLOW-20) Align start_date with the schedule_interval

    [ https://issues.apache.org/jira/browse/AIRFLOW-20?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15264082#comment-15264082 ] 

Bolke de Bruin commented on AIRFLOW-20:
---------------------------------------

The DagRun.previous handling is used to connect previous related DagRuns in order to make depends_on_past work with a moving interval or an automatically aligned start_date.

Imagine the following

Use Case 1
Dag1 has an assigned start_date "2016-04-24 00:00:00" with a cron schedule of “10 1 * * *”, i.e. run every day at 01:10. depends_on_past is true.

The current scheduler kicks off a run for “2016-04-24 00:00:00” and then stops, because the interval preceding “2016-04-24 01:10:00” is “2016-04-23 01:10:00”. No run can be found for that date, as it never took place, so the whole thing grinds to a halt and the TaskInstances refuse to run.

What the user expects here is that the first run is “2016-04-24 01:10:00”, i.e. start_date + interval, unless the start_date falls on the interval, in which case the start_date itself is the first run. This is what I address with start_date normalization in the PR. However, the second issue then kicks in, as the “previous” run still cannot be found.
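
As a rough illustration of that normalization rule, here is a minimal sketch restricted to a daily schedule. The helper first_run_date is hypothetical and is not the code in the PR; a real implementation would have to evaluate arbitrary cron expressions.

```python
from datetime import datetime, timedelta


def first_run_date(start_date: datetime, hour: int, minute: int) -> datetime:
    """Return the first run date for a daily schedule firing at hour:minute.

    Hypothetical helper (an assumption for illustration, not Airflow's API).
    If start_date already lies on the schedule it is returned unchanged;
    otherwise the next schedule point after start_date is returned.
    """
    # Schedule point on the same calendar day as start_date.
    candidate = start_date.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if candidate < start_date:
        # start_date is already past today's schedule point: use tomorrow's.
        candidate += timedelta(days=1)
    return candidate


# Use case 1: start_date is off the interval, so the first run moves to 01:10.
print(first_run_date(datetime(2016, 4, 24, 0, 0), 1, 10))   # 2016-04-24 01:10:00
# Use case 2: start_date is on the interval, so it is the first run itself.
print(first_run_date(datetime(2016, 4, 24, 1, 10), 1, 10))  # 2016-04-24 01:10:00
```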

Use Case 2
Dag2 has an assigned start_date "2016-04-24 01:10:00" with a cron schedule of “10 1 * * *”. depends_on_past is true.

The scheduler happily goes through a few runs, but then the dag is updated and the schedule adjusted. Because the TaskInstance (emphasis) cannot find the previous interval, tasks get stuck again, requiring a manual override.

What the user expects here is that the scheduler is smart enough to figure out that we are still running the same dag and that it needs to look up the previous run for that dag and make sure dependencies are met with that previous dagrun in mind.

I don’t think those two use cases are edge cases, considering the number of questions we get on these subjects.

To resolve the remaining issues (aside from start_date normalization) I first made DagRun aware of its predecessors. Then I strengthened the relationship between TaskInstances and DagRuns slightly, by optionally including a dagrun_id in the TaskInstance. Now a TaskInstance can look up its predecessors in the previous DagRun and know for sure that it is either the first run or it has a predecessor somewhere in time, instead of guessing.
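
The linkage described above can be sketched roughly as follows. These are simplified stand-in classes, not Airflow's actual models: the task_states mapping, the dag_run attribute and the methods previous_state and is_first_run are assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, Optional


@dataclass
class DagRun:
    dag_id: str
    execution_date: datetime
    # Link to the preceding run of the same dag; None for the first run.
    previous: Optional["DagRun"] = None
    # Illustrative only: task_id -> final state of that task in this run.
    task_states: Dict[str, str] = field(default_factory=dict)


@dataclass
class TaskInstance:
    task_id: str
    # Optional, so a task can still be run without a DagRun.
    dag_run: Optional[DagRun] = None

    def is_first_run(self) -> bool:
        """True when the owning DagRun has no predecessor."""
        return self.dag_run is not None and self.dag_run.previous is None

    def previous_state(self) -> Optional[str]:
        """State of this task in the preceding DagRun, or None if unknown."""
        if self.dag_run is None or self.dag_run.previous is None:
            return None
        return self.dag_run.previous.task_states.get(self.task_id)


# The schedule changes between the two runs, but the link still holds.
run1 = DagRun("dag2", datetime(2016, 4, 24, 1, 10), task_states={"load": "success"})
run2 = DagRun("dag2", datetime(2016, 4, 25, 3, 0), previous=run1)
ti = TaskInstance("load", dag_run=run2)
print(ti.is_first_run(), ti.previous_state())  # False success
```

Because the predecessor is found by following the link rather than by subtracting an interval from the execution date, a changed schedule does not break the lookup in this sketch.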

What I am unsure of is what to consider the unit of work: 1) Is a TaskInstance dependent on its past predecessor, ignoring the outcome of the DagRun (the previous TaskInstance can be successful while the previous DagRun as a whole fails)? 2) Is it dependent on the outcome of the DagRun? 3) Can it be both? In cases 1 and 3 my logic needs to be updated slightly, but that should not be a big deal. However, I have difficulty imagining why you would want to do that.
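
To make the three options concrete, here is a small sketch. The function past_dependency_met and the policy names "task", "dagrun" and "both" are hypothetical, and reading option 3 as "both must have succeeded" is an assumption rather than something settled in this discussion.

```python
SUCCESS = "success"


def past_dependency_met(policy: str, prev_task_state: str, prev_run_state: str) -> bool:
    """Decide whether depends_on_past is satisfied under a given policy.

    Hypothetical helper for illustration:
      "task"   - option 1: only the previous TaskInstance counts
      "dagrun" - option 2: only the previous DagRun counts
      "both"   - option 3: both must have succeeded (assumed reading)
    """
    task_ok = prev_task_state == SUCCESS
    run_ok = prev_run_state == SUCCESS
    if policy == "task":
        return task_ok
    if policy == "dagrun":
        return run_ok
    if policy == "both":
        return task_ok and run_ok
    raise ValueError(f"unknown policy: {policy!r}")


# The case from the text: previous TaskInstance succeeded, previous DagRun failed.
for policy in ("task", "dagrun", "both"):
    print(policy, past_dependency_met(policy, "success", "failed"))
```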

> Align start_date with the schedule_interval
> -------------------------------------------
>
>                 Key: AIRFLOW-20
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-20
>             Project: Apache Airflow
>          Issue Type: Improvement
>            Reporter: Bolke de Bruin
>              Labels: backfill, database, scheduler
>
> The need to align the start_date with the interval is counterintuitive
> and leads to a lot of questions and issue creation, although it is in the documentation. If we are
> able to fix this with few or no consequences for current setups, that should be preferred, I think.
> The dependency explainer is really great work, but it doesn’t address the core issue.
> If you consider a DAG a description of cohesion between work items (in OOP java terms
> a class), then a DagRun is the instantiation of a DAG in time (in OOP java terms an instance). 
> Tasks are then the description of a work item and a TaskInstance the instantiation of the Task in time.
> In my opinion issues pop up due to the current paradigm of considering the TaskInstance
> the smallest unit of work and asking it to maintain its own state in relation to other TaskInstances
> in a DagRun and in a previous DagRun of which it has no (real) perception. Tasks are instantiated
> by a Cartesian product with the dates of DagRuns instead of the DagRuns themselves.
> The very loose coupling between DagRuns and TaskInstances can be improved while maintaining
> flexibility to run tasks without a DagRun. This would help with a couple of things:
> 1. start_date can be used as an ‘execution_date’ or a point in time when to start looking
> 2. a new interval for a dag will maintain depends_on_past
> 3. paused dags do not give trouble
> 4. tasks will be executed in order 
> 5. the ignore_first_depend_on_past could be removed as a task will now know if it is really the first
> In PR-1431 a lot of this work has been done by:
> 1. Adding a “previous” field to a DagRun allowing it to connect to its predecessor
> 2. Adding a dag_run_id to TaskInstances so a TaskInstance knows about the DagRun if needed
> 3. Using start_date + interval as the first run date, unless start_date is on the interval, in which case start_date is the first run date



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)