Posted to commits@airflow.apache.org by "Ash Berlin-Taylor (Jira)" <ji...@apache.org> on 2019/10/08 11:58:00 UTC

[jira] [Resolved] (AIRFLOW-2271) Undead tasks are heartbeating and not getting killed

     [ https://issues.apache.org/jira/browse/AIRFLOW-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ash Berlin-Taylor resolved AIRFLOW-2271.
----------------------------------------
    Resolution: Done

I think this has been fixed in 1.10.0 as part of AIRFLOW-1623

> Undead tasks are heartbeating and not getting killed
> ----------------------------------------------------
>
>                 Key: AIRFLOW-2271
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-2271
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: worker
>    Affects Versions: 1.9.0
>            Reporter: Greg
>            Priority: Major
>         Attachments: Airflow_zombies_masked.png
>
>
> Background: We had a resource leak in some of our Airflow operators: after a task completed, the connection pool was not disposed and its processes kept running (see attached screenshot). This caused the execution pool (size=16) to be exhausted after a couple of days.
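> For illustration, a minimal sketch of the kind of cleanup that was missing, assuming a SQLAlchemy-style engine whose connection pool must be disposed explicitly (the operator name and connection string are hypothetical, not from our code):
> {code:python}
> from sqlalchemy import create_engine
>
> from airflow.models import BaseOperator
>
>
> class BackupTablesOperator(BaseOperator):
>     """Hypothetical operator illustrating the missing cleanup."""
>
>     def execute(self, context):
>         # Hypothetical connection string, for illustration only.
>         engine = create_engine("postgresql://user:pass@host/db")
>         try:
>             with engine.connect() as conn:
>                 conn.execute("SELECT 1")  # placeholder for the real backup work
>         finally:
>             # Without this, pooled connections (and the processes behind
>             # them) can outlive the task -- the leak described above.
>             engine.dispose()
> {code}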
>  Investigation:
>  We checked those task instances and the related jobs in the database, and found a mismatch:
>  SQL:
> {code:sql}
> SELECT
>     ti.execution_date,
>     ti.state AS task_state,
>     ti.start_date AS task_start_dt,
>     ti.end_date AS task_end_dt,
>     j.id AS job_id,
>     j.state AS job_state,
>     j.start_date AS job_start_dt,
>     j.end_date AS job_end_dt,
>     j.latest_heartbeat
> FROM task_instance ti
> JOIN job j ON j.id = ti.job_id
> WHERE ti.task_id = 'backup_data_tables'
> ORDER BY task_start_dt DESC
> {code}
> ||execution_date||task_state||task_start_dt||task_end_dt||job_id||job_state||job_start_dt||job_end_dt||latest_heartbeat||
> |2018-03-23 23:00:00|success|2018-03-27 08:42:12.846058|2018-03-27 08:42:17.408723|10925|success|2018-03-27 08:42:12.768759|2018-03-27 08:42:22.815474|2018-03-27 08:42:12.768773|
> |2018-03-22 23:00:00|success|2018-03-23 23:02:44.079996|2018-03-24 01:08:52.842612|9683|running|2018-03-23 23:02:44.010813| |2018-03-26 11:29:15.928836|
> |2018-03-21 23:00:00|success|2018-03-22 23:02:14.254779|2018-03-23 01:07:58.322927|9075|running|2018-03-22 23:02:14.199652| |2018-03-26 11:29:16.570076|
> |2018-03-20 23:00:00|success|2018-03-21 23:02:33.417882|2018-03-22 01:16:56.695002|8475|running|2018-03-21 23:02:33.33754| |2018-03-26 11:29:16.529516|
> |2018-03-19 23:00:00|success|2018-03-21 13:20:36.084062|2018-03-21 15:32:51.263954|8412|running|2018-03-21 13:20:36.026206| |2018-03-26 11:29:16.529413|
> As the result set above shows, the jobs of completed tasks were still running and heartbeating several days after the actual tasks completed, and they stopped only after we killed them manually.
>  In the task log files we see a bunch of entries like the ones below, which show that the _kill_process_tree()_ method is invoked every ~5 seconds:
> {code}
> [2018-03-28 13:03:33,013] {helpers.py:269} DEBUG - There are no descendant processes to kill
> [2018-03-28 13:03:38,211] {helpers.py:269} DEBUG - There are no descendant processes to kill
> [2018-03-28 13:03:43,290] {helpers.py:269} DEBUG - There are no descendant processes to kill
> [2018-03-28 13:03:48,416] {helpers.py:269} DEBUG - There are no descendant processes to kill
> [2018-03-28 13:03:53,604] {helpers.py:269} DEBUG - There are no descendant processes to kill
> {code}
> After some debugging we found that the _LocalTaskJob.terminating_ flag is set to _True_, but the processes are still not killed; moreover, the job keeps heartbeating.
>  Expected result: Airflow is responsible for shutting down its processes without leaving undead ones behind, even if a force kill is needed.
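> For context, a simplified, self-contained sketch of the loop behaviour we observed (a stand-in for illustration, not the actual Airflow source):
> {code:python}
> class LocalTaskJobSketch(object):
>     """Simplified stand-in for LocalTaskJob, for illustration only."""
>
>     def __init__(self):
>         self.terminating = False
>         self.heartbeats = 0
>
>     def heartbeat(self):
>         # Updates latest_heartbeat in the job table. This runs even after
>         # terminating is set, which matches what we saw in the database.
>         self.heartbeats += 1
>
>     def kill_process_tree(self):
>         # The root process is excluded, so once the descendants are gone
>         # there is nothing left to kill -- hence the repeated
>         # "There are no descendant processes to kill" log lines.
>         pass
>
>     def _execute(self, loops=3):
>         for _ in range(loops):  # stands in for the real ~5 s monitoring loop
>             self.heartbeat()
>             if self.terminating:
>                 self.kill_process_tree()  # never kills the root process
> {code}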
> Possible fix:
>  We made the following two changes in the code (already fixed in our fork); a sketch of both changes follows below:
>  - _LocalTaskJob._execute_ - do not heartbeat if the task is terminating
>  - _kill_process_tree_ - add a boolean argument _kill_root_, and kill the root process after its descendants when it is set to _True_
> After that, all the tasks that had the resource leak shut down correctly, without leaving any "undead" processes.
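> A minimal sketch of both changes, assuming psutil for the process handling (names and signatures simplified; the actual patch lives in our fork):
> {code:python}
> import psutil
>
>
> def kill_process_tree(logger, pid, kill_root=False, timeout=5):
>     """Terminate all descendants of pid; if kill_root, kill pid itself last."""
>     root = psutil.Process(pid)
>     descendants = root.children(recursive=True)
>     if not descendants:
>         logger.debug("There are no descendant processes to kill")
>     for proc in descendants:
>         proc.terminate()
>     _, alive = psutil.wait_procs(descendants, timeout=timeout)
>     for proc in alive:
>         proc.kill()  # force-kill anything that ignored SIGTERM
>     if kill_root:
>         # New behaviour: take the root down as well, so that no undead
>         # process is left behind.
>         root.kill()
>
>
> # Sketch of the LocalTaskJob._execute change: guard the heartbeat call.
> #     if not self.terminating:
> #         self.heartbeat()
> {code}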
> Would love to get some feedback from experts about this issue and the fix.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)