Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2022/09/22 10:06:41 UTC

[GitHub] [airflow] bharatk-meesho opened a new issue, #26587: Airflow tasks failing with SIGTERM when worker pod downscaling.

bharatk-meesho opened a new issue, #26587:
URL: https://github.com/apache/airflow/issues/26587

   ### Official Helm Chart version
   
   1.6.0 (latest released)
   
   ### Apache Airflow version
   
   2.3.2
   
   ### Kubernetes Version
   
   4.5.7
   
   ### Helm Chart configuration
   
   ```
   celery:
       ## if celery worker Pods are gracefully terminated
       ## - consider defining a `workers.podDisruptionBudget` to prevent there not being
       ##   enough available workers during graceful termination waiting periods
       ##
       ## graceful termination process:
       ##  1. prevent worker accepting new tasks
       ##  2. wait AT MOST `workers.celery.gracefullTerminationPeriod` for tasks to finish
       ##  3. send SIGTERM to worker
       ##  4. wait AT MOST `workers.terminationPeriod` for kill to finish
       ##  5. send SIGKILL to worker
       ##
       gracefullTermination: true
   
       ## how many seconds to wait for tasks to finish before SIGTERM of the celery worker
       ##
       gracefullTerminationPeriod: 180
   
     ## how many seconds to wait after SIGTERM before SIGKILL of the celery worker
     ## - [WARNING] tasks that are still running during SIGKILL will be orphaned, this is important
     ##   to understand with KubernetesPodOperator(), as Pods may continue running
     ##
     terminationPeriod: 120
   ```
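
A rough sketch of the timeline these two settings imply, using the values from the config above. The function names are hypothetical and the model is simplified (it assumes each signal is sent exactly when its period expires); it is not code from the chart.

```python
# Values copied from the Helm config above (seconds).
GRACEFUL_TERMINATION_PERIOD = 180
TERMINATION_PERIOD = 120


def seconds_until_sigkill(graceful=GRACEFUL_TERMINATION_PERIOD,
                          termination=TERMINATION_PERIOD):
    """Worst-case time a task has left once its worker starts terminating."""
    return graceful + termination


def outcome(task_seconds_left, graceful=GRACEFUL_TERMINATION_PERIOD):
    """Say whether a task finishes before the worker is sent SIGTERM."""
    if task_seconds_left <= graceful:
        return "completes"
    return "running at SIGTERM"


print(seconds_until_sigkill())  # 180 + 120
print(outcome(60))              # a one-minute task
print(outcome(2 * 3600))        # a two-hour task
```

With these values, any task needing more than 180 seconds is still running when the worker gets SIGTERM, which matches the failures described below.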
   
   ### Docker Image customisations
   
   _No response_
   
   ### What happened
   
   I am running an Airflow cluster on EKS on AWS and have set up autoscaling for the workers: if CPU/memory usage goes above 70%, Airflow spins up a new worker pod. However, I am facing an issue when these worker pods scale down. When worker pods start scaling down, they terminate within a few minutes, regardless of whether tasks are still running on them.
   
   Is there any way to configure this so that a worker pod terminates only when the tasks running on it have finished executing? Tasks in my DAGs can run anywhere from a few minutes to a few hours, so I don't want to set a large value for gracefullTerminationPeriod.
   
   Generally the long-running task is a Python operator that runs either a Presto SQL query via PrestoHook or a Databricks job via DatabricksOperator. I don't want these to receive SIGTERM before they complete when a worker pod scales down.
   
   ### What you think should happen instead
   
   Either of the following two things should happen:
   1) There is an option so that a worker pod doesn't terminate until all tasks running on that particular worker have completed.
   2) The task is terminated gracefully and restarted on another worker node.
   
   ### How to reproduce
   
   It can be reproduced by running multiple DAGs with different execution times, set up so that the workers scale up first and then scale down. One simple way is to run multiple copies of a DAG with a Python operator that sleeps for a random amount of time.
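
A minimal sketch of such a reproduction DAG, assuming Airflow 2.3.x. The DAG ids, the number of copies, and the sleep bounds are made up for illustration; the Airflow import is guarded only so the callable can also be tried outside an Airflow install.

```python
import random
import time
from datetime import datetime


def random_sleep(min_seconds=60, max_seconds=3600, sleep_fn=time.sleep):
    """Sleep for a random duration to mimic tasks of very different lengths."""
    duration = random.randint(min_seconds, max_seconds)
    sleep_fn(duration)
    return duration


try:
    from airflow import DAG
    from airflow.operators.python import PythonOperator
except ImportError:
    # Airflow is not installed here; only the callable above is usable.
    DAG = None

if DAG is not None:
    for i in range(5):  # hypothetical: five copies of the same DAG
        dag_id = f"random_sleep_{i}"
        with DAG(
            dag_id=dag_id,
            start_date=datetime(2022, 1, 1),
            schedule_interval=None,  # trigger manually
            catchup=False,
        ) as dag:
            PythonOperator(task_id="sleep", python_callable=random_sleep)
        # Expose each DAG at module level so the scheduler picks it up.
        globals()[dag_id] = dag
```

Triggering all copies at once should push the workers to scale up; as the shorter sleeps finish, scale-down starts while the longer ones are still running.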
   
   ### Anything else
   
   I am looking for any solution that doesn't mark a task as failed when a worker pod scales down.
   
   ### Are you willing to submit PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk closed issue #26587: Airflow tasks failing with SIGTERM when worker pod downscaling.

Posted by GitBox <gi...@apache.org>.
potiuk closed issue #26587: Airflow tasks failing with SIGTERM when worker pod downscaling. 
URL: https://github.com/apache/airflow/issues/26587




[GitHub] [airflow] potiuk commented on issue #26587: Airflow tasks failing with SIGTERM when worker pod downscaling.

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #26587:
URL: https://github.com/apache/airflow/issues/26587#issuecomment-1256813890

   Thanks @thesuperzapper! Interesting approach indeed. It looks pretty complex, but let's see how it plays out.




[GitHub] [airflow] thesuperzapper commented on issue #26587: Airflow tasks failing with SIGTERM when worker pod downscaling.

Posted by GitBox <gi...@apache.org>.
thesuperzapper commented on issue #26587:
URL: https://github.com/apache/airflow/issues/26587#issuecomment-1256800972

   @potiuk @bharatk-meesho I am making lots of progress on a "task aware" worker autoscaler that will be implemented in the next big release of the [Airflow Helm Chart (User Community)](https://github.com/airflow-helm/charts/tree/main/charts/airflow). You can follow when this happens by subscribing to https://github.com/airflow-helm/charts/issues/339.
   
   Note, the actual code is still private for now (as it's being actively worked on and would be unsafe for people to use), but I am getting close.
   
   




[GitHub] [airflow] potiuk commented on issue #26587: Airflow tasks failing with SIGTERM when worker pod downscaling.

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #26587:
URL: https://github.com/apache/airflow/issues/26587#issuecomment-1254832076

   This is currently not possible; it is a K8S limitation, not a problem on our side. The only possible approach to avoid it is:
   
   1) use CeleryKubernetesExecutor
   2) assign all your long-running tasks to Kubernetes queue
   3) set gracefulTerminationPeriod to be longer than the longest possible running task that you run via Celery Executor
   
   With this approach, workers being downscaled are put into an offline state and have enough time to complete all their tasks before they are killed.
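
Steps 2 and 3 can be sketched as a small routing helper. This is only an illustration: `pick_queue` is a hypothetical name, `"kubernetes"` is the default value of the `kubernetes_queue` setting for CeleryKubernetesExecutor, and `"default"` is assumed to be the Celery queue name in your deployment.

```python
KUBERNETES_QUEUE = "kubernetes"  # default [celery_kubernetes_executor] kubernetes_queue
CELERY_QUEUE = "default"         # assumption: the Celery queue your workers listen on


def pick_queue(expected_max_runtime_s, graceful_period_s):
    """Route tasks that may outlive the worker's graceful period to Kubernetes."""
    if expected_max_runtime_s < graceful_period_s:
        return CELERY_QUEUE
    return KUBERNETES_QUEUE


# Hypothetical usage inside a DAG file, with a 180-second graceful period:
#   PythonOperator(task_id="long_query", python_callable=run_query,
#                  queue=pick_queue(4 * 3600, 180))
print(pick_queue(60, 180))
print(pick_queue(4 * 3600, 180))
```

Tasks routed to the Kubernetes queue run in their own pods, so scaling down the Celery workers does not interrupt them.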
   
   Longer explanation: Currently "stock" Kubernetes does not allow downscaling a selected Pod from a ReplicaSet or Deployment. It will randomly pick one, and there is no way to change that and, for example, kill the Pod that should be killed. The K8S team is opposed to implementing a solution despite a number of people trying to convince them. The latest attempt (which was actually originated by @thesuperzapper, largely because of his Airflow Helm Chart) is here: https://github.com/kubernetes/kubernetes/issues/107598. It is actively discussed, but even if implemented, it will take multiple months to be released in a new version of Kubernetes.
   
   




[GitHub] [airflow] boring-cyborg[bot] commented on issue #26587: Airflow tasks failing with SIGTERM when worker pod downscaling.

Posted by GitBox <gi...@apache.org>.
boring-cyborg[bot] commented on issue #26587:
URL: https://github.com/apache/airflow/issues/26587#issuecomment-1254808398

   Thanks for opening your first issue here! Be sure to follow the issue template!
   

