Posted to commits@airflow.apache.org by "Shantanu (JIRA)" <ji...@apache.org> on 2019/08/02 22:19:00 UTC

[jira] [Created] (AIRFLOW-5102) Workers fail to shutdown jobs after failed heartbeats

Shantanu created AIRFLOW-5102:
---------------------------------

             Summary: Workers fail to shutdown jobs after failed heartbeats
                 Key: AIRFLOW-5102
                 URL: https://issues.apache.org/jira/browse/AIRFLOW-5102
             Project: Apache Airflow
          Issue Type: Bug
          Components: worker
    Affects Versions: 1.10.3
            Reporter: Shantanu
            Assignee: Shantanu


If a LocalTaskJob fails to heartbeat for longer than scheduler_zombie_task_threshold, it is supposed to shut itself down: [https://github.com/apache/airflow/blob/f34e13a/airflow/jobs/local_task_job.py#L109]

However, at some point, a change was made to catch exceptions inside the heartbeat: [https://github.com/apache/airflow/blob/f34e13a/airflow/jobs/base_job.py#L194]

Because those exceptions no longer propagate to the caller, LocalTaskJob now treats every heartbeat as successful.
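
The interaction can be illustrated with a minimal, self-contained sketch. Every name below (HeartbeatError, heartbeat_raising, heartbeat_swallowing, run_loop) is a hypothetical stand-in and not Airflow's actual code; only the control flow is meant to mirror the description above, with integer ticks standing in for wall-clock time.

```python
class HeartbeatError(Exception):
    """Hypothetical stand-in for the error a failed heartbeat raises."""


def heartbeat_raising():
    """Heartbeat that always fails and lets the error propagate."""
    raise HeartbeatError("could not reach the metadata database")


def heartbeat_swallowing():
    """Heartbeat that fails the same way but catches the error itself."""
    try:
        heartbeat_raising()
    except HeartbeatError:
        # The failure is dropped here, so the caller never learns about it.
        pass


def run_loop(heartbeat, ticks, limit):
    """Simplified caller loop (assumed shape, not the real LocalTaskJob).

    Returns True if the loop decides to shut itself down because no
    heartbeat has succeeded for more than `limit` ticks.
    """
    last_success = 0
    for now in range(1, ticks + 1):
        try:
            heartbeat()
            # Reached whenever heartbeat() returns without raising.
            last_success = now
        except HeartbeatError:
            pass
        if now - last_success > limit:
            return True
    return False


shut_down_when_raising = run_loop(heartbeat_raising, ticks=10, limit=5)
shut_down_when_swallowing = run_loop(heartbeat_swallowing, ticks=10, limit=5)
```

In this sketch the raising variant trips the limit and the loop shuts down, while the swallowing variant keeps resetting last_success even though no heartbeat ever reached the database, so the limit is never exceeded.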

As a result, zombie tasks no longer shut themselves down. When the scheduler reschedules the job, two instances of the task can end up running concurrently.
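
One way a job could still detect the condition while continuing to catch heartbeat errors is to record the time of the last heartbeat that actually succeeded and compare against that timestamp. The sketch below is an assumption for illustration only, not necessarily the fix adopted in Airflow; SketchJob, latest_heartbeat, and should_shut_down are hypothetical names, and times are passed in explicitly to keep the example deterministic.

```python
class SketchJob:
    """Hypothetical job that remembers when a heartbeat last succeeded."""

    def __init__(self, now):
        self.latest_heartbeat = now

    def heartbeat(self, now, db_available):
        try:
            if not db_available:
                raise ConnectionError("metadata database unreachable")
            # Only a heartbeat that got through advances the timestamp.
            self.latest_heartbeat = now
        except ConnectionError:
            # The error is still swallowed, but the stale timestamp
            # keeps a record that the heartbeat did not succeed.
            pass

    def should_shut_down(self, now, limit_seconds):
        """True once no heartbeat has succeeded for longer than the limit."""
        return now - self.latest_heartbeat > limit_seconds


failing = SketchJob(now=0.0)
healthy = SketchJob(now=0.0)
for t in range(1, 11):
    failing.heartbeat(now=float(t), db_available=False)
    healthy.heartbeat(now=float(t), db_available=True)
```

Here the caller no longer depends on an exception reaching it: the failing job reports that it should shut down once the limit has passed, and the healthy job does not.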


