Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2022/04/29 20:56:19 UTC

[GitHub] [airflow] repl-chris commented on pull request #19769: Handle stuck queued tasks in Celery for db backend

repl-chris commented on PR #19769:
URL: https://github.com/apache/airflow/pull/19769#issuecomment-1113721608

   @ephraimbuddy we can reproduce this reliably in an isolated test environment, and would love to help test this fix and get it merged. We're running a Redis broker and a PostgreSQL result backend on k8s with KEDA autoscaling; the problem occurs under load during a scale-in event (shutting down a worker).
   
   We've spent a bit of time tracking it down, and (at least in our case) it looks to be a problem in celery (possibly https://github.com/celery/celery/issues/7266). Airflow hands the task to celery and it simply never executes and never makes it into the `celery_taskmeta` table. Airflow gets a celery task_id back and assumes it's running, but polling celery for updated status returns `celery_states.PENDING` indefinitely, since the task has vanished and `PENDING` is celery's default response for an unknown task. It seems like celery's warm shutdown on SIGTERM somehow loses tasks sometimes. We've run some tests with [this patch](https://paste.ee/p/YdNAl) and it seems to recover from this state nicely - I believe it's (almost exactly) what you had implemented in [pull 21556](https://github.com/apache/airflow/pull/21556).
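
   To make the failure mode concrete, here's a rough sketch (not the actual patch, and the timeout name is made up) of the check involved: because celery reports `PENDING` for a task_id it has never seen, a task that stays `PENDING` well past the time it was queued can be treated as lost and sent back for re-scheduling.

   ```python
   import time

   # Assumed grace period; not a real Airflow config option.
   STALLED_TASK_TIMEOUT = 300  # seconds

   def find_stuck_tasks(queued_tasks, celery_states, now=None):
       """Return task ids that celery still reports as PENDING past the timeout.

       queued_tasks:  {task_id: queued_at_epoch_seconds}
       celery_states: {task_id: state_string}, as gathered by polling
                      AsyncResult.state; a vanished task never appears here,
                      so a missing entry is treated as PENDING.
       """
       now = time.time() if now is None else now
       stuck = []
       for task_id, queued_at in queued_tasks.items():
           state = celery_states.get(task_id, "PENDING")
           if state == "PENDING" and now - queued_at > STALLED_TASK_TIMEOUT:
               stuck.append(task_id)
       return stuck
   ```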
   
   I do have some concerns around `self.tasks` - in your implementation, when a task is set back to `SCHEDULED` it is never cleared out of `self.tasks`, so if the re-scheduled task happens to be picked up by a different scheduler instance it will just stay in there forever (and in `self.running`, come to think of it - they'd potentially accumulate over time and eat up all the open slots). I think it would be desirable to add a filter so each executor only "re-schedules" tasks which it "owns" - that way the executor can clean up its own internal structures at the same time, and any tasks which are not owned by a running executor will get picked up by the adoption code and then "re-scheduled" by the new owner. I tried taking a stab at that change but it didn't quite work and I haven't figured out why yet, so maybe there's something going on I don't fully understand; I'm going to keep poking around at it though. Let me know if I should open a new PR, or if you'd like to resurrect your existing one...
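
   For illustration, the ownership filter and cleanup I have in mind would look roughly like this - a toy model, not Airflow's actual `CeleryExecutor` (the attribute names just mirror the internals discussed above):

   ```python
   class ExecutorSketch:
       """Hypothetical sketch: re-schedule only tasks this executor owns,
       clearing them from its internal bookkeeping as it goes."""

       def __init__(self):
           self.tasks = {}      # task_key -> celery task_id (stands in for AsyncResult)
           self.running = set() # task_keys this executor believes are running

       def reschedule_stuck(self, stuck_keys):
           rescheduled = []
           for key in stuck_keys:
               # Skip tasks another executor owns; the adoption path will
               # pick those up and their new owner re-schedules them.
               if key not in self.tasks:
                   continue
               # Clean up our own structures so slots aren't leaked.
               self.tasks.pop(key)
               self.running.discard(key)
               rescheduled.append(key)  # caller sets these back to SCHEDULED
           return rescheduled
   ```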


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org