You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@airflow.apache.org by "Kaxil Naik (JIRA)" <ji...@apache.org> on 2019/08/07 16:25:04 UTC

[jira] [Updated] (AIRFLOW-4415) skip status stops propagation randomly.

     [ https://issues.apache.org/jira/browse/AIRFLOW-4415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kaxil Naik updated AIRFLOW-4415:
--------------------------------
    Fix Version/s:     (was: 1.10.4)
                   2.0.0

> skip status stops propagation randomly.
> ---------------------------------------
>
>                 Key: AIRFLOW-4415
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-4415
>             Project: Apache Airflow
>          Issue Type: Improvement
>          Components: scheduler
>    Affects Versions: 1.8.1
>            Reporter: Feng Mao
>            Priority: Major
>             Fix For: 2.0.0
>
>
> Issue: skip status stop propogation to down streams and get randomly stopped with the dag status marked as failed.
> The issue is located in the version 1.8.1.
> In version 1.8.0 there is a temp fix but removed after this version.
> https://github.com/apache/airflow/commit/4077c6de297566a4c598065867a9a27324ae6eb1
> https://github.com/apache/airflow/commit/92965e8275c6f2ec2282ad46c09950bab10c1cb2
>  
> root casue:
>   In a loop, the scheduler evaluate each dag and all its task dependcies around by around.
>   Each round evaluation happens twice in the context of flag_upstream_failed = false and true.
>  
>   The dag run update method mark the dag run deadlocked which stops the dag and all its tasks from be processed furture.
>   https://github.com/apache/airflow/blob/1.8.1/airflow/models.py#L4184
>   It is due to in no_dependencies_met.  All_sccucess trigger rule misses skipped status check and marks the task as failed when upstream only has skipped tasks.
>   https://github.com/apache/airflow/blob/1.8.1/airflow/models.py#L4152
>   https://github.com/apache/airflow/blob/1.8.1/airflow/ti_deps/deps/trigger_rule_dep.py#L165
>  
>   Each dag update will checks all its task deps and sent ready tasks to run in the context of flag_upstream_failed=false (defalt)
>   https://github.com/apache/airflow/blob/1.8.1/airflow/models.py#L4156   which wont handle skip status propogation.
>  
>   After dag update, dag will checks all its task deps and sent ready tasks to run in the context of flag_upstream_failed=true
>   https://github.com/apache/airflow/blob/1.8.1/airflow/jobs.py#L904
>   which handles skip status propogration.
>   https://github.com/apache/airflow/blob/1.8.1/airflow/ti_deps/deps/trigger_rule_dep.py#L138
>  
>   Two potential causes that will trigger dag update detect a deadlock.
>   The skip status proprogatation rely on detected skipped upstreams (which happens asyncly by other nodes writing to db).
>   If the tasks been evaluated  are not following topoloy order(random order) by priority_weigth. It requried multipe loop rounds to propogate skip statue to all downsteam tasks.
>   Depending on how close the topoloy order the tasks fetched, the proprogation may go further or shorter.
>  
>   The deadlock detetion can be avoid only the following  conditions happen at the same time:
>   1. the skip task (shortcurit operation async process) update db with skipped task status, right after dag update (flag_upstream_failed=false )before dag task checks(flag_upstream_failed=true) in scheduler process.
>   2. dag checks(flag_upstream_failed=true) have all tasks fectch/evaluated in the topology order that skip status can propogate in one evaluations round.
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)