Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2022/02/10 18:27:08 UTC

[GitHub] [airflow] kristoffern opened a new issue #19699: task_instances stuck in "queued" and are missing corresponding celery_taskmeta entries

kristoffern opened a new issue #19699:
URL: https://github.com/apache/airflow/issues/19699


   ### Apache Airflow version
   
   2.2.2 (latest released)
   
   ### Operating System
   
   Linux Mint 20.2
   
   ### Versions of Apache Airflow Providers
   
   apache-airflow-providers-celery = "2.1.0"
   apache-airflow-providers-papermill = "^2.1.0"
   apache-airflow-providers-postgres = "^2.2.0"
   apache-airflow-providers-google = "^6.1.0"
   
   ### Deployment
   
   Docker-Compose
   
   ### Deployment details
   
   Docker-compose deploys into our GCP k8s cluster
   
   ### What happened
   
   Hi,
   
   We're running Airflow for our ETL pipelines.
   
   Our DAGs run in parallel, and we spawn a fair number of parallel DAGs and tasks every morning for our pipelines.
   We run Airflow in a k8s cluster in GCP and we use the Celery executor.
   We also use Autopilot to dynamically scale the cluster up and down as the workload increases or decreases, which sometimes tears down Airflow workers.
   
   Ever since upgrading to Airflow 2.0 we've had a lot of problems with tasks getting stuck in "queued" or "running", and we've had to clean up by manually failing the stuck tasks and re-running the DAGs.
   Following the discussions here over the last few months, it looks like we've not been alone :-)
   
   But after upgrading to Airflow 2.2.1 we saw a significant decrease in the number of tasks getting stuck (yay!), something we had hoped for given the bug fixes for the scheduler in that release.
   However, on most mornings we still have a few tasks getting stuck (stuck = task stays in "queued") that require the same manual intervention.
   
   I've started digging in the Airflow DB trying to see where there's a discrepancy, and every time a task gets stuck it's missing a corresponding entry in the table "celery_taskmeta".
   This is a consistent pattern for the tasks that get stuck for us at this point. The task has rows in the tables "task_instance", "job", and "dag_run" with IDs referencing each other.
   
   But the "external_executor_id" in "task_instance" is missing a corresponding entry in the "celery_taskmeta" table. So nothing ever gets executed and the task_instance is forever stuck in "queued" and never cleaned up by the scheduler.
   
   I can see in "dag_run::last_scheduling_decision" that the scheduler is continuously re-evaluating this task, since the timestamp keeps getting updated, so it is at least inspecting it, but it leaves everything in the "queued" state.
   
   The other day I bumped our Airflow to 2.2.2, but we still get the same behavior.
   And finally, whenever we get tasks stuck in "queued" in this way, they usually all have timestamps within the same few seconds, and that time correlates with a moment when Autopilot scaled down the number of Airflow workers.
   
   If the tasks end up in this orphaned/queued state then they never get executed and are stuck until we fail them. The longest I've seen so far is a task sitting in this state for a few days before it was discovered.
   Restarting the scheduler does not resolve this issue and tasks are still stuck in "queued" afterwards.
   
   Would it be possible (and a good idea?) to include in the scheduler a check that a "task_instance" row has a corresponding row in "celery_taskmeta", and to clean it up if it is still missing from "celery_taskmeta" after a given amount of time?
   After reading about and watching Ash Berlin-Taylor's most excellent deep-dive video into the Airflow scheduler, this does seem like exactly the kind of check that could be added to the scheduler?
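   
   A rough sketch of that kind of check, expressed directly against the metadata DB (hypothetical code, not something that exists in the scheduler today; it assumes Postgres holds both the metadata DB and the Celery result backend, the connection string is a placeholder, and that resetting "state" to NULL hands the task back to the scheduler, which is what clearing a task does):
   
   ```python
   # Hypothetical cleanup sweep: hand back to the scheduler any task instance that
   # has been "queued" for longer than a grace period and whose
   # external_executor_id never showed up in celery_taskmeta.
   import psycopg2
   
   GRACE_PERIOD = "5 minutes"  # how long to wait before treating a queued task as orphaned
   
   conn = psycopg2.connect("postgresql://airflow:airflow@localhost:5432/airflow")  # placeholder
   with conn, conn.cursor() as cur:
       cur.execute(
           """
           UPDATE task_instance AS ti
           SET state = NULL, external_executor_id = NULL
           WHERE ti.state = 'queued'
             AND ti.queued_dttm < now() - %s::interval
             AND ti.external_executor_id IS NOT NULL
             AND NOT EXISTS (
                   SELECT 1 FROM celery_taskmeta ctm
                   WHERE ctm.task_id = ti.external_executor_id
             );
           """,
           (GRACE_PERIOD,),
       )
       print(f"reset {cur.rowcount} orphaned task instance(s)")
   ```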
   
   Also if there's any data I can dig out and provide for this, don't hesitate to let me know.
   
   ### What you expected to happen
   
   I expect orphaned tasks that are in the "queued" state and missing a corresponding entry in celery_taskmeta to be cleaned up and re-executed by the scheduler.
   
   ### How to reproduce
   
   Currently there is no deterministic way to reproduce it, other than running a large number of tasks and removing a worker at just the right time.
   It occurs every morning for a handful of tasks.
   
   ### Anything else
   
   _No response_
   
   ### Are you willing to submit PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] ephraimbuddy commented on issue #19699: task_instances stuck in "queued" and are missing corresponding celery_taskmeta entries

Posted by GitBox <gi...@apache.org>.
ephraimbuddy commented on issue #19699:
URL: https://github.com/apache/airflow/issues/19699#issuecomment-976448181


   @kristoffern When this task gets stuck, does the task_instance have a value in the `external_executor_id` column, or is it null?





[GitHub] [airflow] kristoffern commented on issue #19699: task_instances stuck in "queued" and are missing corresponding celery_taskmeta entries

Posted by GitBox <gi...@apache.org>.
kristoffern commented on issue #19699:
URL: https://github.com/apache/airflow/issues/19699#issuecomment-984477473


   @ephraimbuddy No, I haven't noticed those error messages, but I'll keep an eye out for them.
   Usually by the time I come into work the tasks have already been stuck a couple of times, so I would have to go back through the logs and look the next time it occurs.





[GitHub] [airflow] kristoffern commented on issue #19699: task_instances stuck in "queued" and are missing corresponding celery_taskmeta entries

Posted by GitBox <gi...@apache.org>.
kristoffern commented on issue #19699:
URL: https://github.com/apache/airflow/issues/19699#issuecomment-974013578


   @ashb Yes that sounds like the ideal solution to this problem.
   I tried looking in the code base but I'm too unfamiliar with it to find where and how it would be best implemented.
   
   From your description it sounds like it's on the easier end of the spectrum to fix, though?
   
   Personally I would be fine with a 5 min timeout to avoid risking killing tasks by mistake. But we want to keep the system responsive and fast, which would be more in line with a ~1 min timeout. Then again, the tasks get stuck so rarely compared to the overall number of tasks that a 5 min timeout shouldn't be too much of a problem (again, just speaking for our own specific situation).
   
   Thank you for the fast reply, and again, your deep dive into the scheduler was really, really good :+1:





[GitHub] [airflow] boring-cyborg[bot] commented on issue #19699: task_instances stuck in "queued" and are missing corresponding celery_taskmeta entries

Posted by GitBox <gi...@apache.org>.
boring-cyborg[bot] commented on issue #19699:
URL: https://github.com/apache/airflow/issues/19699#issuecomment-973803593


   Thanks for opening your first issue here! Be sure to follow the issue template!
   





[GitHub] [airflow] ashb commented on issue #19699: task_instances stuck in "queued" and are missing corresponding celery_taskmeta entries

Posted by GitBox <gi...@apache.org>.
ashb commented on issue #19699:
URL: https://github.com/apache/airflow/issues/19699#issuecomment-973929940


   Hmmmm, yeah, the "fix" for this might be to have a CeleryExecutor-specific timeout (say 5 mins? 1 min even?) where, if a task has been in the queued state for longer than this and _doesn't_ have an external_executor_id set, it gets re-submitted to celery?
   
   (Due to the timing of things it is possible that a _running_ task won't have an external_executor_id set, but that is okay, it's only queued tasks we care about.)
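   
   A minimal sketch of that idea in isolation (the record type, the timeout value and the `resubmit` callback are all hypothetical stand-ins, not the real CeleryExecutor internals):
   
   ```python
   # Hypothetical illustration of the proposed check: re-submit tasks that have sat
   # in "queued" past a timeout without ever getting an external_executor_id.
   from dataclasses import dataclass
   from datetime import datetime, timedelta, timezone
   from typing import Callable, List, Optional
   
   QUEUED_TASK_TIMEOUT = timedelta(minutes=5)  # the "5 mins? 1 min even?" knob
   
   @dataclass
   class QueuedTask:
       key: str                             # e.g. "dag_id.task_id/run_id"
       queued_at: datetime
       external_executor_id: Optional[str]  # None until celery acknowledges the send
   
   def resubmit_stalled(tasks: List[QueuedTask],
                        resubmit: Callable[[QueuedTask], None]) -> int:
       """Re-send every task that is past the timeout and still has no celery id."""
       now = datetime.now(timezone.utc)
       count = 0
       for task in tasks:
           if task.external_executor_id is None and now - task.queued_at > QUEUED_TASK_TIMEOUT:
               resubmit(task)  # in a real executor this would send the task to celery again
               count += 1
       return count
   ```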





[GitHub] [airflow] ephraimbuddy commented on issue #19699: task_instances stuck in "queued" and are missing corresponding celery_taskmeta entries

Posted by GitBox <gi...@apache.org>.
ephraimbuddy commented on issue #19699:
URL: https://github.com/apache/airflow/issues/19699#issuecomment-985314148


   > @ephraimbuddy I had a task stuck this morning and confirmed, by checking the DB, that it was stuck as described in this bug. It had been stuck for a few hours when I woke up. However, Docker had pruned the logs, so the earliest entry was still ~40 min after the task had gotten stuck. I will continue to monitor for it if that's important?
   
   No need for monitoring. Just wanted to confirm that it's not the same case as this: https://github.com/apache/airflow/issues/13542#issuecomment-983261840 





[GitHub] [airflow] kristoffern commented on issue #19699: task_instances stuck in "queued" and are missing corresponding celery_taskmeta entries

Posted by GitBox <gi...@apache.org>.
kristoffern commented on issue #19699:
URL: https://github.com/apache/airflow/issues/19699#issuecomment-976612263


   @ephraimbuddy The `external_executor_id` has a value attached to it, but that ID can't be found in the `celery_taskmeta` table.
   I did confirm that the `external_executor_id` values of other, successful tasks had matching entries in the `celery_taskmeta` table :-)
   
   It's almost as if (and this is just speculation) a task is created and an `external_executor_id` is written, but somewhere before the system hears back from the worker, that worker is killed by Autopilot in GCP, and therefore no row is ever added to `celery_taskmeta`.
   
   But again, this last part is just speculation, as I haven't found a 100% deterministic way of reproducing it. 9 mornings out of 10 we have this problem though, and most of the time the tasks that are stuck in "queued" all have roughly the same starting time, within a second or so.





[GitHub] [airflow] ashb commented on issue #19699: task_instances stuck in "queued" and are missing corresponding celery_taskmeta entries

Posted by GitBox <gi...@apache.org>.
ashb commented on issue #19699:
URL: https://github.com/apache/airflow/issues/19699#issuecomment-976624512


   The UUID value is generated by the celery _client_, so I guess what you suggest is possible.
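   
   For illustration, a minimal standalone example of that (broker/result-backend URLs are placeholders): the id comes back from `apply_async` on the client immediately, before any worker has touched the task or any `celery_taskmeta` row exists.
   
   ```python
   # Minimal demo: the task id is assigned client-side by apply_async, long before a
   # worker runs the task or the result backend writes a celery_taskmeta row.
   from celery import Celery
   
   # placeholder broker and result-backend URLs
   app = Celery("demo",
                broker="redis://localhost:6379/0",
                backend="db+postgresql://airflow:airflow@localhost:5432/airflow")
   
   @app.task
   def add(x, y):
       return x + y
   
   if __name__ == "__main__":
       result = add.apply_async((1, 2))
       print(result.id)  # UUID generated by the client; celery_taskmeta only gets a row later
   ```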






[GitHub] [airflow] ephraimbuddy closed issue #19699: task_instances stuck in "queued" and are missing corresponding celery_taskmeta entries

Posted by GitBox <gi...@apache.org>.
ephraimbuddy closed issue #19699:
URL: https://github.com/apache/airflow/issues/19699


   





[GitHub] [airflow] kristoffern commented on issue #19699: task_instances stuck in "queued" and are missing corresponding celery_taskmeta entries

Posted by GitBox <gi...@apache.org>.
kristoffern commented on issue #19699:
URL: https://github.com/apache/airflow/issues/19699#issuecomment-985293147


   @ephraimbuddy I had a task stuck this morning and confirmed, by checking the DB, that it was stuck as described in this bug.
   It had been stuck for a few hours when I woke up. However, Docker had pruned the logs, so the earliest entry was still ~40 min after the task had gotten stuck. I will continue to monitor for it if that's important?





[GitHub] [airflow] ephraimbuddy commented on issue #19699: task_instances stuck in "queued" and are missing corresponding celery_taskmeta entries

Posted by GitBox <gi...@apache.org>.
ephraimbuddy commented on issue #19699:
URL: https://github.com/apache/airflow/issues/19699#issuecomment-983789798


   @kristoffern Do you see logs like `ERROR: could not queue task ... ` in the scheduler logs when this happens?




