Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2022/03/01 15:32:55 UTC

[GitHub] [airflow] yeachan153 opened a new issue #21900: Enable kubernetes_pod_operator to reattach_on_restart when the worker dies

yeachan153 opened a new issue #21900:
URL: https://github.com/apache/airflow/issues/21900


   ### Description
   
   The `kubernetes_pod_operator` currently has a `reattach_on_restart` parameter that attempts to reattach to running pods instead of creating a new pod in case a scheduler dies while the task is running.
   
   We would like this feature to also work when the worker dies. Currently, a dying worker receives a SIGTERM, which triggers the `on_kill` method:
   https://github.com/apache/airflow/blob/ace8c6e942ff5554639801468b971915b7c0e9b9/airflow/models/taskinstance.py#L1425
   
   This ends up deleting the pod that was created:
   https://github.com/apache/airflow/blob/ace8c6e942ff5554639801468b971915b7c0e9b9/airflow/providers/cncf/kubernetes/operators/kubernetes_pod.py#L438
   
   We have worked around this problem by removing the `on_kill` call made upon receiving a SIGTERM and instead pushing an xcom indicating that the worker was killed. We then enabled retries for the `kubernetes_pod_operator` and modified the [is_eligible_to_retry](https://github.com/apache/airflow/blob/ace8c6e942ff5554639801468b971915b7c0e9b9/airflow/models/taskinstance.py#L1825) function to check for the presence of this xcom and retry only if it is found, so that retries happen only when the worker was killed.
   
   Unfortunately, this is not a perfect solution because clearing a task / stopping a task via the UI triggers the same signal handler as when a worker is killed externally. Therefore, stopping the task now does not kill the pod, and clearing the task causes a reattach when we would ideally like a restart.
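
The workaround described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not Airflow code: `SketchTaskInstance`, `handle_sigterm`, `WORKER_KILLED_KEY` and the dictionary standing in for xcom storage are all hypothetical names, and the retry check is only a simplified stand-in for the real `is_eligible_to_retry`.

```python
import signal
from dataclasses import dataclass, field

# Hypothetical xcom key; the actual key used in the workaround is not given.
WORKER_KILLED_KEY = "worker_killed"


@dataclass
class SketchTaskInstance:
    """Simplified stand-in for a task instance (assumption, not Airflow's class)."""

    max_tries: int = 1
    try_number: int = 1
    xcom: dict = field(default_factory=dict)  # stands in for xcom storage
    pod_deleted: bool = False

    def on_kill(self) -> None:
        # Models the operator's on_kill, which ends up deleting the pod.
        self.pod_deleted = True

    def handle_sigterm(self, signum, frame) -> None:
        # Workaround: skip on_kill so the pod survives, and record
        # that the worker was killed so a retry can reattach.
        self.xcom[WORKER_KILLED_KEY] = True

    def is_eligible_to_retry(self) -> bool:
        # Retry only if the marker is present and tries remain.
        worker_killed = bool(self.xcom.get(WORKER_KILLED_KEY, False))
        return worker_killed and self.try_number <= self.max_tries


# Simulate the handler being invoked for a SIGTERM (no real signal is sent).
ti = SketchTaskInstance()
ti.handle_sigterm(signal.SIGTERM, None)
```

Because the same handler runs whether the worker is killed externally or the task is stopped or cleared via the UI, the sketch also shows the limitation: nothing in `handle_sigterm` can tell those cases apart.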
   
   
   ### Use case/motivation
   
   Since the pod itself may fail for a valid reason, we don't just want to add more retries. In that situation, a retry would not reattach anyway; it would start a completely new pod, since the original pod would have been cleaned up.
   
   We specifically want the reattaching to happen when the worker dies for infrastructure-related reasons. This is useful, for instance, during deployment updates in Kubernetes. Such updates are currently quite disruptive because all the running pods are killed first, and if retries are not enabled (for the reasons mentioned above), we have to restart all of them again (and potentially lose all the progress on any expensive operations that were running pre-deployment).
   
   ### Related issues
   
   _No response_
   
   ### Are you willing to submit a PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [airflow] potiuk commented on issue #21900: Enable kubernetes_pod_operator to reattach_on_restart when the worker dies

Posted by GitBox <gi...@apache.org>.

potiuk commented on issue #21900:
URL: https://github.com/apache/airflow/issues/21900#issuecomment-1060051267


   Do you have a proposal to change the behaviour? Opening a PR for that would be useful. Airflow has ~2000 contributors, so you can become one of them. How do you think it can be improved?

