Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2022/08/29 16:17:16 UTC

[GitHub] [airflow] arkadiusz-bach commented on issue #21225: Tasks stuck in queued state

arkadiusz-bach commented on issue #21225:
URL: https://github.com/apache/airflow/issues/21225#issuecomment-1230535103

   @V0lantis a task might get stuck in the queued state for at least these reasons:
   - Redis crashed after receiving the task and had no time to save its state to disk (or was not configured to), so the task was lost, but the scheduler still thinks the task is waiting to be picked up by workers.
   - `terminationGracePeriodSeconds` on your worker pods is too low, or it is not set at all (the Kubernetes default is 30 seconds).
    
   The message `worker: Warm shutdown` means that Celery received a SIGTERM signal and started a graceful shutdown: it will not pick up any more tasks from the Redis queue, and it will wait for all of the running tasks to finish.
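
   To make the warm shutdown behaviour concrete, here is a minimal toy sketch in plain Python. It is not Celery's implementation; `ToyWorker` and its methods are hypothetical names used only for illustration. The only point it shows is that SIGTERM flips a flag instead of exiting, so the task already running finishes and no new task is taken.

```python
import signal


class ToyWorker:
    """Toy stand-in for a worker that supports a warm (graceful) shutdown."""

    def __init__(self):
        self.accepting = True   # cleared once SIGTERM has been received
        self.completed = []     # results of tasks that ran to completion

    def handle_sigterm(self, signum, frame):
        # Warm shutdown: do not exit immediately, only stop taking new tasks.
        self.accepting = False

    def install(self):
        # Register the handler; signal.signal must be called from the main thread.
        signal.signal(signal.SIGTERM, self.handle_sigterm)

    def run(self, queue):
        # Take tasks one by one; a task that has started always runs to the end.
        for task in queue:
            if not self.accepting:
                break  # remaining tasks are left in the queue
            self.completed.append(task())
        return self.completed
```

   If SIGTERM arrives while the second of three tasks is running, `run` returns the results of the first two tasks and leaves the third untouched.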
   
   But if some of your tasks may run for longer than `terminationGracePeriodSeconds`, Kubernetes might send SIGKILL before they finish, and then:
   - Celery will not be able to wait for all of the running tasks to finish (those will end with a failed status).
   - It may manage to pick a task from the queue but not be able to change its state to running (maybe your case).
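
   The timing argument above can be sketched as simple arithmetic. This assumes SIGKILL follows exactly one grace period after SIGTERM; the function name, task names and durations are all hypothetical.

```python
def classify_tasks(remaining_seconds, grace_period_seconds):
    """Split running tasks by whether they finish before SIGKILL arrives.

    remaining_seconds maps a task name to the seconds it still needs when
    SIGTERM is received; SIGKILL is assumed to follow exactly
    grace_period_seconds later.
    """
    finished, killed = [], []
    for name, remaining in remaining_seconds.items():
        if remaining <= grace_period_seconds:
            finished.append(name)
        else:
            killed.append(name)
    return finished, killed


if __name__ == "__main__":
    running = {"short_task": 20, "long_task": 900}  # hypothetical durations
    print(classify_tasks(running, grace_period_seconds=120))
    print(classify_tasks(running, grace_period_seconds=1200))
```

   With a grace period of 120 seconds only `short_task` finishes; with 1200 seconds both do. In other words, the grace period should be at least as long as your longest-running task.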
   
   Also, some of the Celery workers might receive a SIGKILL signal when there is not enough memory allocated, which may lead to the same behaviour. Unfortunately, you may not see OOM events in the Kubernetes cluster when this happens: when more than one process is running in a container, one of the processes within the container is chosen and sent SIGKILL, and that may be a child process rather than the main one (Celery runs with a main process and child worker processes).
   
   If the main process receives SIGKILL, you will probably see an OOM event. But if a child process receives it, the tasks it was processing will fail (in the logs you will be able to see that it received a SIGKILL signal) or get stuck in the queued state if it had managed to pick one up.
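
   As an illustration of how a parent process can tell that a child was killed by a signal, here is a small sketch using only Python's standard library. It is not how Celery itself reports this; `killed_by` and `demo` are hypothetical helper names. It relies on the POSIX convention that `subprocess` reports a child terminated by signal N with a return code of -N.

```python
import signal
import subprocess
import sys


def killed_by(returncode):
    """Return the name of the signal that ended a child process, or None.

    On POSIX, subprocess reports a child terminated by signal N as
    returncode == -N.
    """
    if returncode is None or returncode >= 0:
        return None  # still running, or exited on its own
    try:
        return signal.Signals(-returncode).name
    except ValueError:
        return "signal %d" % -returncode  # number without a known name


def demo():
    # Start a child that would sleep for a minute, then kill it the hard way.
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    child.kill()  # sends SIGKILL on POSIX
    child.wait()
    return killed_by(child.returncode)
```

   On a POSIX system, calling `demo()` should report that the child ended with SIGKILL, which is the same kind of evidence to look for in the worker logs.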
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org