You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@airflow.apache.org by "jack (Jira)" <ji...@apache.org> on 2019/11/05 13:16:00 UTC

[jira] [Comment Edited] (AIRFLOW-4881) Zombie collection fails task instances that should be scheduled for retry

    [ https://issues.apache.org/jira/browse/AIRFLOW-4881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16967518#comment-16967518 ] 

jack edited comment on AIRFLOW-4881 at 11/5/19 1:15 PM:
--------------------------------------------------------

Resolved in  [https://github.com/apache/airflow/pull/5511] ?


was (Author: jackjack10):
Resolved in [https://github.com/apache/airflow/pull/5514] ?

> Zombie collection fails task instances that should be scheduled for retry
> -------------------------------------------------------------------------
>
>                 Key: AIRFLOW-4881
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-4881
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: scheduler
>    Affects Versions: 1.10.3
>            Reporter: Tamas Flamich
>            Priority: Major
>
> In case a task instance
>  * has more attempts than retries (it can happen when the state of the task instance is explicitly cleared) and
>  * task instance is prematurely terminated (without graceful shutdown)
> then zombie collection process of the scheduler can mark the task instance failed instead of retrying it. 
> Steps to reproduce:
> 1 - The task is scheduled for a particular executed date and the following records gets created in the database.
> ||task_id||retries||try_number||max_tries||state||
> |emr_sensor|2|1|2|running|
> 2 - The job owners would like to schedule the task again therefore they clear that state of the task instance. {{try_number}} and {{max_retries}} gets updated.
> ||task_id||retries||try_number||max_tries||state||
> |emr_sensor|2|2|3|running|
> 3 - The Airlflow scheduler gets killed and a new scheduler instance starts looking for zombie tasks. Since {{try_number < max_tries}}, the new state is {{up_for_retry}}. However, there is a bug in the [state update logic|https://github.com/apache/airflow/blob/d5a5b9d9f1f1efb67ffed4d8e6ef3e0a06467bed/airflow/models/dagbag.py#L295] that will revert the {{max_tries}} value to the initial value ({{retries}}).
> ||task_id||retries||try_number||max_tries||state||
> |emr_sensor|2|2|2|up_for_retry|
> 4 - During the next iteration of the scheduler, the task instance gets picked up. However, since {{try_number >= max_tries}}, the new state is {{failed}}.
> ||task_id||retries||try_number||max_tries||state||
> |emr_sensor|2|2|2|failed|



--
This message was sent by Atlassian Jira
(v8.3.4#803005)