You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2021/04/16 01:16:37 UTC

[GitHub] [airflow] houqp edited a comment on issue #14422: on_failure_callback does not seem to fire on pod deletion/eviction

houqp edited a comment on issue #14422:
URL: https://github.com/apache/airflow/issues/14422#issuecomment-820839115


   Here is what I think the trace is. Every pod starts with the local task job (monitoring/manager task), which runs the actual task code (`_run_raw_task`) in a subprocess. When the pod is force killed, the local task job will receive the SIGTERM first:
   
   https://github.com/apache/airflow/blob/d7bc217957b65471ca5f2e259bba15c71a2b0c41/airflow/jobs/local_task_job.py#L75-L79
   
   This line prints the first `Received SIGTERM. Terminating subprocesses` log output. Then in `self.on_kill`, it sends another SIGTERM to the subprocess that runs that actual task, and which hits the following code:
   
   
   https://github.com/apache/airflow/blob/d7bc217957b65471ca5f2e259bba15c71a2b0c41/airflow/models/taskinstance.py#L1263-L1266
   
   This line handler prints the second `Received SIGTERM. Terminating subprocesses` log output.
   
   
   The design in https://github.com/apache/airflow/pull/10917 is to move all success/failure callback invocations into local task job so we can handle https://github.com/apache/airflow/issues/11086. We can't rely on the actual task process (`_run_raw_task`) to execute the callback because that process itself runs arbitrary user code and can be force killed by the kernel due to OOM or it could crash by itself due to other bugs. The only safe way to make sure we execute callbacks no matter how the task subprocess exits is to execute them from the monitoring/manager task (local task job), which doesn't run user code so we can tightly control its behavior and memory foot print.
   
   I think the source of this bug is because local job task only calls `self.on_kill` in the SIGTERM handler, which is only responsible for killing the task subprocess using SIGTERM:
   
   https://github.com/apache/airflow/blob/d7bc217957b65471ca5f2e259bba15c71a2b0c41/airflow/jobs/local_task_job.py#L155-L157
   
   Here is a potential fix. In local task job's `signal_handler` after we called `self.on_kill`, we can explicitly call `self.handle_task_exit`, which will check the task state and execute callbacks accordingly.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org