Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2021/04/08 00:54:43 UTC

[GitHub] [airflow] kaxil edited a comment on issue #14422: on_failure_callback does not seem to fire on pod deletion/eviction

kaxil edited a comment on issue #14422:
URL: https://github.com/apache/airflow/issues/14422#issuecomment-815368180


   I think this happened in 2.0.1 mainly because of the following sequence of events.
   
   When you delete the pod, the KubernetesExecutor executes the following:
   
   1)
   https://github.com/apache/airflow/blob/beb8af5ac6c438c29e2c186145115fb1334a3735/airflow/executors/kubernetes_executor.py#L195-L200
   
   i.e. it tried to reschedule your pod, as is also evident from the logs in the issue description.
   
   2) 
   
   This then executes the following and puts the TaskInstance key on `result_queue`:
   
   https://github.com/apache/airflow/blob/beb8af5ac6c438c29e2c186145115fb1334a3735/airflow/executors/kubernetes_executor.py#L350-L359
   
   3)
   
   The TI is then marked with state `RESCHEDULE` (at least according to the executor events) in:
   
   https://github.com/apache/airflow/blob/beb8af5ac6c438c29e2c186145115fb1334a3735/airflow/executors/kubernetes_executor.py#L522-L542
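   
   Steps 1 to 3 can be pictured with a toy model. This is only a sketch under assumptions, not the actual `KubernetesExecutor` code: the function names, the event dictionary shape, and the `"reschedule"` label are all hypothetical stand-ins for what the linked lines do.

```python
import queue

RESCHEDULE = "reschedule"  # toy state label, not Airflow's real State constant


def process_pod_event(event, watcher_queue):
    """Step 1 (toy): a deleted pod is treated as something to reschedule."""
    if event["type"] == "DELETED":
        watcher_queue.put((event["key"], RESCHEDULE))


def forward_results(watcher_queue, result_queue):
    """Step 2 (toy): move every pending (key, state) pair onto result_queue."""
    while True:
        try:
            item = watcher_queue.get_nowait()
        except queue.Empty:
            return
        result_queue.put(item)


def sync(result_queue, event_buffer):
    """Step 3 (toy): record the state the executor reports for each key."""
    while True:
        try:
            key, state = result_queue.get_nowait()
        except queue.Empty:
            return
        event_buffer[key] = state


def simulate_pod_deletion(key):
    """Run the three toy steps for one deleted pod and return the events."""
    watcher_queue, result_queue, event_buffer = queue.Queue(), queue.Queue(), {}
    process_pod_event({"type": "DELETED", "key": key}, watcher_queue)
    forward_results(watcher_queue, result_queue)
    sync(result_queue, event_buffer)
    return event_buffer
```

   Note that nothing in this path looks at `on_failure_callback`; the executor only reports a state for the key.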
   
   4)
   All three events above happen from the `KubernetesExecutor`'s point of view.
   
   At the same time, when the pod was killed (sent SIGTERM), the task pod receives the SIGTERM and runs the following handler, since we override the SIGTERM handler so that it raises `AirflowException` (which matches your logs and stack trace):
   
   https://github.com/apache/airflow/blob/beb8af5ac6c438c29e2c186145115fb1334a3735/airflow/models/taskinstance.py#L1238-L1241
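   
   The general pattern of turning SIGTERM into an exception can be sketched with the standard library alone. This is a minimal illustration, not Airflow's code: `TaskKilled` is a hypothetical stand-in for `AirflowException`, and `run_task` and `simulated_pod_kill` are made-up helpers.

```python
import signal
import time


class TaskKilled(Exception):
    """Hypothetical stand-in for AirflowException."""


def signal_handler(signum, frame):
    # Convert the signal into an exception raised inside the running task.
    raise TaskKilled("Task received SIGTERM signal")


def run_task(work):
    """Run `work` with SIGTERM overridden; report how the task ended."""
    previous = signal.signal(signal.SIGTERM, signal_handler)
    try:
        work()
        return "success"
    except TaskKilled as exc:
        # This is where failure handling would take over.
        return f"failed: {exc}"
    finally:
        # Restore whatever handler was installed before.
        signal.signal(signal.SIGTERM, previous)


def simulated_pod_kill():
    # Send SIGTERM to the current process, as a pod deletion would.
    signal.raise_signal(signal.SIGTERM)
    # Give the interpreter a chance to run the Python-level handler.
    time.sleep(0.1)
```

   Because the exception surfaces inside the task's own code, what happens next depends entirely on how the surrounding `except` block handles it.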
   
   This `AirflowException` is handled here:
   
   https://github.com/apache/airflow/blob/beb8af5ac6c438c29e2c186145115fb1334a3735/airflow/models/taskinstance.py#L1142-L1150
   
   which then calls into `handle_failure`:
   
   https://github.com/apache/airflow/blob/beb8af5ac6c438c29e2c186145115fb1334a3735/airflow/models/taskinstance.py#L1484-L1490
   
   which does not run the `on_failure_callback`. This bug might have been introduced in https://github.com/apache/airflow/commit/efe163a1fddfd66fa402231906e96733efddf8af, where we moved running the callbacks into `LocalTaskJob`:
   
   https://github.com/apache/airflow/blob/beb8af5ac6c438c29e2c186145115fb1334a3735/airflow/jobs/local_task_job.py#L123-L126
   
   https://github.com/apache/airflow/blob/beb8af5ac6c438c29e2c186145115fb1334a3735/airflow/jobs/local_task_job.py#L144-L153
   
   I don't see `LocalTaskJob` exit logs in your trace so I am not sure why that happened.
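   
   To make the suspected failure mode concrete, here is a toy model, assuming that the failure path only records the state and that the callback is fired solely by the job's exit handling. The classes `ToyTaskInstance` and `ToyLocalTaskJob` and the `simulate` helper are hypothetical and only mimic the shape of the real code.

```python
class ToyTaskInstance:
    """Hypothetical task instance: failure handling only records state."""

    def __init__(self, on_failure_callback=None):
        self.state = "running"
        self.on_failure_callback = on_failure_callback

    def handle_failure(self, error):
        # The callback is deliberately NOT run here in this model.
        self.state = "failed"


class ToyLocalTaskJob:
    """Hypothetical job wrapper that owns the callback execution."""

    def __init__(self, ti):
        self.ti = ti

    def handle_task_exit(self, return_code, error):
        # Callbacks fire only if this exit handler actually runs.
        if return_code != 0 and self.ti.on_failure_callback is not None:
            self.ti.on_failure_callback(error)


def simulate(job_exit_handler_runs):
    """Return the list of errors passed to the failure callback."""
    fired = []
    ti = ToyTaskInstance(on_failure_callback=fired.append)
    job = ToyLocalTaskJob(ti)
    ti.handle_failure("SIGTERM")
    if job_exit_handler_runs:
        job.handle_task_exit(1, "SIGTERM")
    return fired
```

   In this model `simulate(False)` leaves the task marked failed with no callback fired, which is the behaviour described in the issue.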
   
   Secondly, @jedcunningham recently changed step (3) in https://github.com/apache/airflow/pull/14810 so that the task is marked FAILED instead of RESCHEDULED, which means at least the logging around that will be taken care of. We still need to investigate why LocalTaskJob was not executed.

