Posted to dev@airflow.apache.org by "EKC (Erik Cederstrand)" <EK...@novozymes.com> on 2017/02/28 15:10:04 UTC

Failing jobs not properly terminated

Hi folks,


We're experiencing an issue where a failing DAG is not properly terminated. The log file contains a Python stack trace of the failure and then ends with:


    {base_task_runner.py:112} INFO - Running: ['bash', '-c', 'airflow run my_dag my_task 2017-02-22T00:00:00 --job_id 127022 --raw -sd DAGS_FOLDER/my_dag.py']

          [...]

    {jobs.py:2127} WARNING - State of this instance has been externally set to failed. Taking the poison pill. So long.


But the Celery process of the worker is never terminated:


    $ ps aux | grep 127022
    airflow  [...] python3 airflow run my_dag my_task 2017-02-22T00:00:00 --job_id 127022 --raw -sd DAGS_FOLDER/my_dag.py


Somehow, the scheduler does not see the job as failed, so the Celery queue quickly fills up with dead jobs and nothing else gets worked on.


I had a look at jobs.py, and I suspect that the self.terminating flag is never persisted to the database (see https://github.com/apache/incubator-airflow/blob/master/airflow/jobs.py#L2122). Is this correct? Everything else I can see in that file does a session.merge() to persist values.
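

To illustrate what I mean, here is roughly the shape of the change I have in mind. This is my own paraphrase of the LocalTaskJob heartbeat callback around jobs.py#L2122, not the actual source; the session handling and the exact job state to set are guesses on my part:


    # My paraphrase of the heartbeat callback, not the real code.
    import logging

    from airflow.jobs import BaseJob
    from airflow.utils.db import provide_session
    from airflow.utils.state import State

    class LocalTaskJob(BaseJob):  # heavily simplified sketch

        @provide_session
        def heartbeat_callback(self, session=None):
            if self.terminating:
                # We already asked the task runner to stop.
                return
            ti = self.task_instance
            ti.refresh_from_db()
            if ti.state != State.RUNNING:
                logging.warning(
                    "State of this instance has been externally set to %s. "
                    "Taking the poison pill. So long.", ti.state)
                self.task_runner.terminate()
                self.terminating = True
                # Suspected missing step: persist the change so the job row
                # does not stay 'running' forever (the target state is a guess).
                self.state = State.FAILED
                session.merge(self)
                session.commit()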


Kind regards,

Erik Cederstrand


Re: Failing jobs not properly terminated

Posted by "EKC (Erik Cederstrand)" <EK...@novozymes.com>.
Here's a follow-up with more information.

I think my suspicion is supported by the Airflow database. It has the task marked as failed in the task_instance table, and the PID matches the dead job:

airflow=> select job_id, task_id, dag_id, state, pid from task_instance where job_id=127022;
 job_id | task_id | dag_id | state  |  pid
--------+---------+--------+--------+--------
 127022 | my_task | my_dag | failed | 146117
(1 row)

But in the job table, the job is apparently still running and is still being heartbeated by something:

airflow=> select id, state, start_date, latest_heartbeat from job where id=127022;
   id   |  state  |         start_date         | latest_heartbeat
--------+---------+----------------------------+---------------------------
 127022 | running | 2017-02-22 14:11:21.447172 | 2017-02-28 16:33:42.254843
(1 row)


But the job's log file clearly stated that it was taking the poison pill. Since the job is never marked as failed, the query in DagBag.kill_zombies() (https://github.com/apache/incubator-airflow/blob/master/airflow/models.py#L330) does not find the job and it is never killed.
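
In case it helps anyone check whether their cluster is affected, here is a quick diagnostic sketch of my own (not part of Airflow; it just encodes the same join as the two queries above) that lists job rows still marked running whose task instance has already failed:

    # Diagnostic sketch only: find job rows still 'running' whose matching
    # task_instance row has already been set to 'failed'.
    from airflow import settings
    from airflow.jobs import BaseJob
    from airflow.models import TaskInstance
    from airflow.utils.state import State

    session = settings.Session()
    stuck = (
        session.query(BaseJob, TaskInstance)
        .filter(TaskInstance.job_id == BaseJob.id)
        .filter(TaskInstance.state == State.FAILED)
        .filter(BaseJob.state == State.RUNNING)
        .all()
    )
    for job, ti in stuck:
        print(job.id, ti.dag_id, ti.task_id, job.latest_heartbeat)
    session.close()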

Kind regards,
Erik

