Posted to commits@airflow.apache.org by "Greg (JIRA)" <ji...@apache.org> on 2018/03/30 10:14:00 UTC
[jira] [Created] (AIRFLOW-2271) Undead tasks are heartbeating and not getting killed
Greg created AIRFLOW-2271:
-----------------------------
Summary: Undead tasks are heartbeating and not getting killed
Key: AIRFLOW-2271
URL: https://issues.apache.org/jira/browse/AIRFLOW-2271
Project: Apache Airflow
Issue Type: Bug
Components: worker
Affects Versions: 1.9.0
Reporter: Greg
Attachments: Airflow_zombies_masked.png
Background: We had a resource leak in some of our Airflow operators: after a task completed, the connection pool was not disposed of and the processes kept running (see attached screenshot). This caused the execution pool (size=16) to be exhausted after a couple of days.
Investigation:
We checked those task instances and their related jobs in the database, and found a mismatch:
SQL:
{code:sql}
select
    ti.execution_date,
    ti.state AS task_state,
    ti.start_date AS task_start_dt,
    ti.end_date AS task_end_dt,
    j.id AS job_id,
    j.state AS job_state,
    j.start_date AS job_start_dt,
    j.end_date AS job_end_dt,
    j.latest_heartbeat
from task_instance ti
join job j
    on j.id = ti.job_id
where ti.task_id = 'backup_data_tables'
order by task_start_dt DESC
{code}
||execution_date||task_state||task_start_dt||task_end_dt||job_id||job_state||job_start_dt||job_end_dt||latest_heartbeat||
|2018-03-23 23:00:00|success|2018-03-27 08:42:12.846058|2018-03-27 08:42:17.408723|10925|success|2018-03-27 08:42:12.768759|2018-03-27 08:42:22.815474|2018-03-27 08:42:12.768773|
|2018-03-22 23:00:00|success|2018-03-23 23:02:44.079996|2018-03-24 01:08:52.842612|9683|running|2018-03-23 23:02:44.010813| |2018-03-26 11:29:15.928836|
|2018-03-21 23:00:00|success|2018-03-22 23:02:14.254779|2018-03-23 01:07:58.322927|9075|running|2018-03-22 23:02:14.199652| |2018-03-26 11:29:16.570076|
|2018-03-20 23:00:00|success|2018-03-21 23:02:33.417882|2018-03-22 01:16:56.695002|8475|running|2018-03-21 23:02:33.33754| |2018-03-26 11:29:16.529516|
|2018-03-19 23:00:00|success|2018-03-21 13:20:36.084062|2018-03-21 15:32:51.263954|8412|running|2018-03-21 13:20:36.026206| |2018-03-26 11:29:16.529413|
As the result set above shows, the jobs of completed tasks were still in the running state and heartbeating several days after the tasks themselves finished; they stopped only after we killed them manually.
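The consistency check the SQL above performs can be sketched in plain Python. This is a hedged illustration, not Airflow code: a job is "undead" when its task instance is already in a terminal state but the job row still says "running". The helper name and row shape are hypothetical; the keys mirror the query's column aliases.

```python
# Hedged sketch: flag "undead" jobs where the task finished but the
# corresponding job row is still marked "running".
def find_undead_jobs(rows):
    """Return rows whose task is done but whose job is still running."""
    return [r for r in rows
            if r["task_state"] == "success" and r["job_state"] == "running"]

# Sample rows abbreviated from the result set above.
rows = [
    {"job_id": 10925, "task_state": "success", "job_state": "success"},
    {"job_id": 9683,  "task_state": "success", "job_state": "running"},
]

print([r["job_id"] for r in find_undead_jobs(rows)])  # [9683]
```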
In the task log files we see a bunch of entries like the ones below, which show that the _kill_process_tree()_ method is invoked every ~5 seconds:
{code}
[2018-03-28 13:03:33,013] {helpers.py:269} DEBUG - There are no descendant processes to kill
[2018-03-28 13:03:38,211] {helpers.py:269} DEBUG - There are no descendant processes to kill
[2018-03-28 13:03:43,290] {helpers.py:269} DEBUG - There are no descendant processes to kill
[2018-03-28 13:03:48,416] {helpers.py:269} DEBUG - There are no descendant processes to kill
[2018-03-28 13:03:53,604] {helpers.py:269} DEBUG - There are no descendant processes to kill
{code}
After some debugging we found that the _LocalTaskJob.terminating_ flag is set to _True_, but the processes are still not killed; moreover, the job keeps heartbeating.
Expected result: Airflow should be responsible for shutting down its processes and not leave undead ones behind, even if a forced kill is needed.
Possible fix:
We made the following two changes in the code (already fixed in our fork):
- _LocalTaskJob._execute_ - do not heartbeat if the task is terminating
- _kill_process_tree_ - add a boolean argument _kill_root_ and, if it is _True_, kill the root process after its descendants
After that, all the tasks affected by the resource leak shut down correctly, without leaving any "undead" processes.
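The two changes can be sketched as below. This is a simplified, self-contained model of our fork's approach, not the upstream Airflow implementation: the real code operates on OS processes (via psutil) and on _LocalTaskJob_; here fake process objects and a stripped-down job class stand in so the logic is visible.

```python
# Stand-in for a psutil-style process; the real fix operates on OS processes.
class FakeProcess:
    def __init__(self, pid, children=None):
        self.pid = pid
        self._children = children or []
        self.alive = True

    def children(self, recursive=True):
        out = []
        for c in self._children:
            out.append(c)
            if recursive:
                out.extend(c.children(recursive=True))
        return out

    def kill(self):
        self.alive = False

def kill_process_tree(root, kill_root=False):
    # Change 2: after killing the descendants, optionally kill the root
    # process too, so a leaked root cannot keep the job "undead".
    for child in root.children(recursive=True):
        child.kill()
    if kill_root:
        root.kill()

class LocalTaskJobSketch:
    def __init__(self):
        self.terminating = False
        self.heartbeats = 0

    def heartbeat(self):
        self.heartbeats += 1

    def execute_step(self):
        # Change 1: stop heartbeating once the job is terminating,
        # so the scheduler no longer sees the job as healthy.
        if not self.terminating:
            self.heartbeat()

root = FakeProcess(100, [FakeProcess(101)])
kill_process_tree(root, kill_root=True)
print(root.alive)  # False: the root process is killed as well

job = LocalTaskJobSketch()
job.terminating = True
job.execute_step()
print(job.heartbeats)  # 0: no heartbeat while terminating
```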
Would love to get some feedback from experts about this issue and the fix.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)