Posted to commits@airflow.apache.org by "Ash Berlin-Taylor (Jira)" <ji...@apache.org> on 2019/10/08 12:03:00 UTC

[jira] [Assigned] (AIRFLOW-5102) Workers fail to shutdown jobs after failed heartbeats

     [ https://issues.apache.org/jira/browse/AIRFLOW-5102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ash Berlin-Taylor reassigned AIRFLOW-5102:
------------------------------------------

    Assignee: Ash Berlin-Taylor  (was: Shantanu)

> Workers fail to shutdown jobs after failed heartbeats
> -----------------------------------------------------
>
>                 Key: AIRFLOW-5102
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-5102
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: worker
>    Affects Versions: 1.10.3
>            Reporter: Shantanu
>            Assignee: Ash Berlin-Taylor
>            Priority: Major
>
> If a LocalTaskJob fails to heartbeat for longer than scheduler_zombie_task_threshold, it should shut itself down: [https://github.com/apache/airflow/blob/f34e13a/airflow/jobs/local_task_job.py#L109]
>  
> However, at some point, a change was made to catch exceptions inside the heartbeat: [https://github.com/apache/airflow/blob/f34e13a/airflow/jobs/base_job.py#L194]
> As a result, LocalTaskJob never sees a failed heartbeat and now thinks heartbeats always succeed.
>  
> This effectively means that zombie tasks don't shut themselves down. When the scheduler then reschedules the task, two instances of it could be running concurrently.
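
The failure mode described above can be illustrated with a minimal sketch. This is not Airflow's code: HeartbeatError, heartbeat_raising, heartbeat_swallowing and run_loop are hypothetical names, and the loop uses integer ticks instead of wall-clock time. It only assumes what the report states, namely that the caller treats a heartbeat call that returns normally as a successful heartbeat.

```python
class HeartbeatError(Exception):
    """Hypothetical error standing in for a failed heartbeat write."""


def heartbeat_raising(db_up):
    # Propagates the failure, so the caller can notice it.
    if not db_up:
        raise HeartbeatError("could not record heartbeat")


def heartbeat_swallowing(db_up):
    # Catches the failure internally, as the report describes.
    try:
        heartbeat_raising(db_up)
    except HeartbeatError:
        pass  # the caller never learns that the heartbeat failed


def run_loop(heartbeat, db_states, threshold):
    """Run one heartbeat per tick.

    Returns the tick at which the job would shut itself down,
    or None if it never does.
    """
    last_ok = 0
    for tick, db_up in enumerate(db_states, start=1):
        try:
            heartbeat(db_up)
            last_ok = tick  # reached whenever heartbeat() returns normally
        except HeartbeatError:
            pass
        # Shut down once too many ticks have passed without a success.
        if tick - last_ok > threshold:
            return tick
    return None


# The database is reachable for one tick, then goes away.
states = [True, False, False, False, False]
print(run_loop(heartbeat_raising, states, threshold=2))
print(run_loop(heartbeat_swallowing, states, threshold=2))
```

With the raising heartbeat the loop notices the outage and stops; with the swallowing heartbeat the "last successful" marker is refreshed on every tick, so the threshold check can never fire. One possible remedy, stated here only as an assumption, is to update that marker inside the heartbeat itself and only when the write actually succeeds.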



--
This message was sent by Atlassian Jira
(v8.3.4#803005)