Posted to commits@airflow.apache.org by "potiuk (via GitHub)" <gi...@apache.org> on 2023/03/05 20:21:32 UTC

[GitHub] [airflow] potiuk commented on issue #29833: Celery tasks stuck in queued state after worker crash (Set changed size during iteration)

potiuk commented on issue #29833:
URL: https://github.com/apache/airflow/issues/29833#issuecomment-1455194597

> I'm not sure if we check for workers heartbeats, where I didn't find this check in the method `_process_executor_events`, @potiuk can you please confirm this or send a link to the part which checks if the worker is still alive?

I am not THAT knowledgeable about this part, so take this with a grain of salt, but let me explain how I understand what's going on. @ephraimbuddy @ashb - maybe you can take a look and correct me if my understanding is wrong?

Everything related to managing Celery task state happens in the Celery Executor.
I don't think we are monitoring workers in any way. Each executor monitors its tasks for their state and checks whether they have stalled or whether they need adoption (when they were previously monitored by another executor).

Eventually - if the task does not update its state (for example, when the worker crashed) - it should be rescheduled as stalled (by its own executor) or adopted (by another one). That's as much detail as I know off the top of my head.
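
To make the stalled/adopted idea a bit more concrete, here is a toy sketch of how I picture it. This is NOT Airflow's actual code: `ToyTask`, `ToyExecutor`, `check_stalled`, `adopt_orphans` and the timeout value are all hypothetical names made up for illustration, and the real executor may well work differently in the details.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set


@dataclass
class ToyTask:
    """Hypothetical stand-in for a task instance tracked by an executor."""
    key: str
    state: str = "queued"
    last_update: float = 0.0     # time of the last state change
    owner: Optional[str] = None  # id of the executor currently tracking the task


@dataclass
class ToyExecutor:
    """Hypothetical executor that only looks at task state, never at workers."""
    executor_id: str
    stalled_timeout: float = 300.0  # assumed value, purely illustrative

    def check_stalled(self, tasks: Iterable[ToyTask], now: float) -> List[str]:
        """Hand back own tasks that sat in 'queued' for longer than the timeout."""
        rescheduled = []
        for task in tasks:
            if task.owner != self.executor_id or task.state != "queued":
                continue
            if now - task.last_update > self.stalled_timeout:
                task.state = "scheduled"  # back to the scheduler
                task.owner = None
                task.last_update = now
                rescheduled.append(task.key)
        return rescheduled

    def adopt_orphans(
        self, tasks: Iterable[ToyTask], live_executor_ids: Set[str], now: float
    ) -> List[str]:
        """Take over tasks whose previous executor is no longer alive."""
        adopted = []
        for task in tasks:
            if task.state not in ("queued", "running"):
                continue
            if task.owner is not None and task.owner not in live_executor_ids:
                task.owner = self.executor_id
                task.last_update = now  # the stall clock restarts on adoption
                adopted.append(task.key)
        return adopted


# Usage: exec-2 has gone away, exec-1 is the only live executor.
tasks = [
    ToyTask("a", owner="exec-1"),
    ToyTask("b", owner="exec-2"),
]
exec1 = ToyExecutor("exec-1")
adopted = exec1.adopt_orphans(tasks, live_executor_ids={"exec-1"}, now=100.0)
rescheduled = exec1.check_stalled(tasks, now=400.0)
print(adopted, rescheduled)
```

Note that nothing in this sketch asks whether a worker is alive. A crashed worker only shows up indirectly, as a task whose state stops changing.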

There are a few race conditions that might occur (no distributed system is ever foolproof), and I think the original design intent is that even if a very nasty race condition happens, the tasks will eventually be rescheduled.
