Posted to commits@airflow.apache.org by "Daniel Imberman (Jira)" <ji...@apache.org> on 2020/03/29 18:46:00 UTC

[jira] [Closed] (AIRFLOW-20) Improving the scheduler by making dag runs more coherent

     [ https://issues.apache.org/jira/browse/AIRFLOW-20?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Imberman closed AIRFLOW-20.
----------------------------------
    Resolution: Auto Closed

> Improving the scheduler by making dag runs more coherent
> --------------------------------------------------------
>
>                 Key: AIRFLOW-20
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-20
>             Project: Apache Airflow
>          Issue Type: Improvement
>          Components: scheduler
>            Reporter: Bolke de Bruin
>            Assignee: zgl
>            Priority: Major
>              Labels: backfill, database, scheduler
>
> The need to align the start_date with the interval is counterintuitive
> and leads to many questions and new issues, even though it is documented. If we can
> fix this with few or no consequences for current setups, that should be preferred, I think.
> The dependency explainer is really great work, but it doesn’t address the core issue.
> If you consider a DAG a description of cohesion between work items (in OOP terms, as in Java,
> a class), then a DagRun is the instantiation of a DAG in time (in OOP terms, an instance).
> A Task is then the description of a work item, and a TaskInstance is the instantiation of the Task in time.
> In my opinion, issues pop up due to the current paradigm of considering the TaskInstance
> the smallest unit of work and asking it to maintain its own state in relation to other TaskInstances
> in a DagRun and in a previous DagRun, of which it has no (real) perception. Tasks are instantiated
> by a Cartesian product with the dates of DagRuns instead of with the DagRuns themselves.
> The very loose coupling between DagRuns and TaskInstances can be improved while maintaining
> flexibility to run tasks without a DagRun. This would help with a couple of things:
> 1. start_date can be used as an ‘execution_date’ or as a point in time from which to start looking
> 2. a new interval for a dag will maintain depends_on_past
> 3. paused dags do not give trouble
> 4. tasks will be executed in order
> 5. ignore_first_depend_on_past could be removed, since a task will now know whether it really is the first
> In PR-1431 a lot of this work has been done by:
> 1. Adding a “previous” field to a DagRun allowing it to connect to its predecessor
> 2. Adding a dag_run_id to TaskInstances so a TaskInstance knows about the DagRun if needed
> 3. Using start_date + interval as the first run date, unless start_date is on the interval, in which case start_date itself is the first run date
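
The three changes listed above can be illustrated with a minimal sketch. This is not the code from PR-1431 and not Airflow's API: the class layouts, the has_predecessor helper, the first_run_date function, and the fixed timedelta grid anchored at 1970-01-01 are all assumptions made for illustration only.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class DagRun:
    """Hypothetical DagRun carrying a link to its predecessor."""
    dag_id: str
    execution_date: datetime
    previous: Optional["DagRun"] = None  # the "previous" field from item 1


@dataclass
class TaskInstance:
    """Hypothetical TaskInstance that may know its DagRun (item 2)."""
    task_id: str
    dag_run: Optional[DagRun] = None  # stands in for dag_run_id; None = run without a DagRun

    def has_predecessor(self) -> bool:
        # True only when this instance belongs to a DagRun that has a previous run.
        return self.dag_run is not None and self.dag_run.previous is not None


def first_run_date(start_date: datetime, interval: timedelta,
                   anchor: datetime = datetime(1970, 1, 1)) -> datetime:
    """Item 3, assuming 'on the interval' means aligned to a grid of
    `interval` steps counted from `anchor` (an assumption of this sketch)."""
    if (start_date - anchor) % interval == timedelta(0):
        return start_date
    return start_date + interval


daily = timedelta(days=1)
aligned = first_run_date(datetime(2016, 1, 1), daily)
unaligned = first_run_date(datetime(2016, 1, 1, 6), daily)

run1 = DagRun("example_dag", aligned)
run2 = DagRun("example_dag", aligned + daily, previous=run1)
first_ti = TaskInstance("load", run1)
second_ti = TaskInstance("load", run2)
adhoc_ti = TaskInstance("load")  # a task run without any DagRun
```

In this sketch a task attached to run1 can tell that it is the first one because its DagRun has no predecessor, which is the kind of knowledge the description says would make ignore_first_depend_on_past unnecessary. A real schedule may be a cron expression rather than a fixed timedelta, in which case the alignment check would look different.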



--
This message was sent by Atlassian Jira
(v8.3.4#803005)