Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2022/06/29 09:39:13 UTC

[GitHub] [airflow] vivek-zeta opened a new issue, #24731: Celery Executor : After killing Redis or Airflow Worker Pod, queued Tasks not getting executed even after pod is up.

vivek-zeta opened a new issue, #24731:
URL: https://github.com/apache/airflow/issues/24731

   ### Apache Airflow version
   
   2.2.2
   
   ### What happened
   
   We are using the Celery executor with Redis as the broker.
   We are using the default settings for Celery.
   We are trying to test the following case:
   - What happens when the Redis or worker pod goes down?
   
   Observations:
   - We tried killing the Redis pod while one task was in the `queued` state.
   - The task that was in the `queued` state stayed in `queued` even after the Redis pod came back up.
   - From the Airflow UI we tried clearing the task and running it again, but it still got stuck in the `queued` state.
   - The task message was received by the `celery worker`, but the worker did not start executing the task.
   - Let us know if we can try changing any Celery or Airflow config to avoid this issue.
   
   
   - Please help us avoid this case, as it is very critical if it happens in production.
   
   ### What you think should happen instead
   
   The task must not get stuck in the `queued` state; it should start executing.
   
   ### How to reproduce
   
   While a task is in the `queued` state, kill the Redis pod.
   
   ### Operating System
   
   k8s
   
   ### Versions of Apache Airflow Providers
   
   _No response_
   
   ### Deployment
   
   Other 3rd-party Helm chart
   
   ### Deployment details
   
   _No response_
   
   ### Anything else
   
   _No response_
   
   ### Are you willing to submit PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] stepanof commented on issue #24731: Celery Executor : After killing Redis or Airflow Worker Pod, queued Tasks not getting executed even after pod is up.

Posted by GitBox <gi...@apache.org>.
stepanof commented on issue #24731:
URL: https://github.com/apache/airflow/issues/24731#issuecomment-1322123564

   @potiuk Thank you for such an extended comment, I see your point.
   
   I have one more question.
   Sometimes, while the virtual IP and the Postgres endpoint are changing, the airflow-worker tries to restart by itself (not via the 'autoheal' service).
   But it can't restart because of this error:
   ```
   [2022-11-18 18:39:34 +0300] [42] [INFO] Starting gunicorn 20.1.0
   [2022-11-18 18:39:34 +0300] [42] [INFO] Listening at: http://[::]:8793 (42)
   [2022-11-18 18:39:34 +0300] [42] [INFO] Using worker: sync
   [2022-11-18 18:39:35 +0300] [43] [INFO] Booting worker with pid: 43
   [2022-11-18 18:39:35 +0300] [44] [INFO] Booting worker with pid: 44
   [2022-11-18 18:39:35 +0300] [42] [INFO] Handling signal: term
   ERROR: Pidfile (/opt/airflow/airflow-worker.pid) already exists.
   Seems we're already running? (pid: 1)
   [2022-11-18 18:39:35 +0300] [43] [INFO] Worker exiting (pid: 43)
   [2022-11-18 18:39:35 +0300] [44] [INFO] Worker exiting (pid: 44)
   [2022-11-18 18:39:35 +0300] [42] [INFO] Shutting down: Master
   ```
   
   The airflow-worker restart loop can go on endlessly, and each time there will be this error.
   A manual restart of the worker (`docker-compose down && docker-compose up`) fixes the problem (`/opt/airflow/airflow-worker.pid` gets deleted).
   
   Why isn't `/opt/airflow/airflow-worker.pid` deleted during the automatic worker restart?
   
   During such an endless automatic restart, the worker container never reaches the "unhealthy" state (because it dies immediately), so 'autoheal' doesn't realize the worker should be rebooted.
   
   Is it possible to fix it?


[GitHub] [airflow] potiuk commented on issue #24731: Celery Executor : After killing Redis or Airflow Worker Pod, queued Tasks not getting executed even after pod is up.

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #24731:
URL: https://github.com/apache/airflow/issues/24731#issuecomment-1257252087

   I am not 100% sure it is fixed, but it is likely fixed by https://github.com/apache/airflow/issues/24498 (see above; it is referenced in this thread).
   
   Since @vivek-zeta and @NaveenGokavarapu19 experienced this in earlier versions, the easiest way to see whether it is fixed is for them to try it. Actually, we never know for sure when we have no detailed logs. We can always re-open it if it is not fixed.
   
   


[GitHub] [airflow] potiuk closed issue #24731: Celery Executor : After killing Redis or Airflow Worker Pod, queued Tasks not getting executed even after pod is up.

Posted by GitBox <gi...@apache.org>.
potiuk closed issue #24731: Celery Executor : After killing Redis or Airflow Worker Pod, queued Tasks not getting executed even after pod is up.
URL: https://github.com/apache/airflow/issues/24731


Re: [I] Celery Executor : After killing Redis or Airflow Worker Pod, queued Tasks not getting executed even after pod is up. [airflow]

Posted by "Sathya-omnius (via GitHub)" <gi...@apache.org>.
Sathya-omnius commented on issue #24731:
URL: https://github.com/apache/airflow/issues/24731#issuecomment-1748919301

   Hi Team,
   
   We have Airflow (version 2.5.3) with the Celery executor and a Redis queue. In one of our environments the Redis health check failed; after some time Redis started working again, but the Celery worker stopped working and was not processing anything, so documents got stuck in the queue. I don't see any logs in the Airflow worker, only the following message:
   Connected
   Wed, Oct 4 2023 4:49:39 pm[2023-10-04 14:49:39,203: INFO/MainProcess] sync with celery@xxxx-pipelines-worker-79b7578f8b-qkmp4
   
   The airflow-worker stopped working; once I restarted the worker pod, all the queued documents started processing. Can someone provide a fix for this scenario?


[GitHub] [airflow] malthe commented on issue #24731: Celery Executor : After killing Redis or Airflow Worker Pod, queued Tasks not getting executed even after pod is up.

Posted by GitBox <gi...@apache.org>.
malthe commented on issue #24731:
URL: https://github.com/apache/airflow/issues/24731#issuecomment-1256852005

   @potiuk what specifically in 2.4.0rc1 addresses this issue? Is that the liveness probe suggested by @jedcunningham or something else?


[GitHub] [airflow] potiuk commented on issue #24731: Celery Executor : After killing Redis or Airflow Worker Pod, queued Tasks not getting executed even after pod is up.

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #24731:
URL: https://github.com/apache/airflow/issues/24731#issuecomment-1174453268

   Assigned you


[GitHub] [airflow] potiuk commented on issue #24731: Celery Executor : After killing Redis or Airflow Worker Pod, queued Tasks not getting executed even after pod is up.

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #24731:
URL: https://github.com/apache/airflow/issues/24731#issuecomment-1250394664

   Can you please check, @vivek-zeta @NaveenGokavarapu19, whether 2.4.0rc1 solves it? I am closing it provisionally, unless you test it and see that it is not fixed. We can always re-open it in that case.


[GitHub] [airflow] potiuk commented on issue #24731: Celery Executor : After killing Redis or Airflow Worker Pod, queued Tasks not getting executed even after pod is up.

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #24731:
URL: https://github.com/apache/airflow/issues/24731#issuecomment-1318520766

   Also: you can configure keep-alives in your connections to make such fail-over faster. Postgres, Redis and PGBouncer all have ways to configure keep-alives (look at the SQLAlchemy documentation etc.), and with them broken connections are detected sooner, so Airflow components can naturally restart due to "broken pipe" kinds of errors much faster. A hedged sketch of such settings follows.
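   
   As an illustration: the option names below are real psycopg2 and Celery/Redis transport settings, but how you wire them into your deployment (e.g. via Airflow's `sql_alchemy_connect_args` or the Celery config) varies, so treat this as a sketch only:
   
   ```
   # Hedged sketch, not official Airflow configuration.
   # psycopg2 accepts TCP keep-alive settings as connect args; a dict like this
   # can be fed to SQLAlchemy via Airflow's [database] sql_alchemy_connect_args.
   postgres_keepalive_kwargs = {
       "keepalives": 1,           # enable TCP keep-alives
       "keepalives_idle": 30,     # seconds idle before the first probe
       "keepalives_interval": 5,  # seconds between probes
       "keepalives_count": 5,     # failed probes before the link is declared dead
   }

   # Celery's Redis transport exposes similar socket-level options; these would
   # go into broker_transport_options in the Celery configuration Airflow uses.
   redis_broker_transport_options = {
       "socket_keepalive": True,
       "socket_timeout": 30,  # seconds; fail fast on a dead broker connection
   }
   ```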


[GitHub] [airflow] potiuk commented on issue #24731: Celery Executor : After killing Redis or Airflow Worker Pod, queued Tasks not getting executed even after pod is up.

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #24731:
URL: https://github.com/apache/airflow/issues/24731#issuecomment-1322552034

   BTW, I believe there is something very wrong with your restart scenario and configuration in general - some mistakes or a misunderstanding of how the image entrypoint works.
   
   ```
   ERROR: Pidfile (/opt/airflow/airflow-worker.pid) already exists.
   Seems we're already running? (pid: 1)
   ```
   
   I think there are a few things you are doing wrong here, and they compound:
   
   1) It seems that you run airflow as the init process in your container. This is possible, but you need to understand the consequences for signal propagation and do it properly. There are many traps you can fall into if you do it wrong, so I recommend reading why the airflow image uses dumb-init as the init process and what consequences that has (especially for Celery): https://airflow.apache.org/docs/docker-stack/entrypoint.html#signal-propagation
   
   The .pid file will only contain '1' if your process is started as the "init" process, which means the container lives and dies with that process. When you use dumb-init, as we do by default in our image, dumb-init has process id 1; but in your case your airflow process will always have process id 1, and that is the original root cause of the problem you have.
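   
   For illustration, here is a minimal Python sketch of what an init shim like dumb-init does - it runs as PID 1, forwards signals to the real workload (which therefore gets a PID other than 1), and reaps it. This is a sketch of the idea, not dumb-init's actual code:
   
   ```
   #!/usr/bin/env python3
   """Minimal init-shim sketch: PID 1 forwards signals to the real workload."""
   import os
   import signal
   import sys

   def main() -> int:
       child = os.fork()
       if child == 0:
           # Child: becomes the real workload, with a PID other than 1.
           os.execvp(sys.argv[1], sys.argv[1:])
       # Parent (PID 1): forward termination signals to the child.
       for sig in (signal.SIGTERM, signal.SIGINT, signal.SIGHUP):
           signal.signal(sig, lambda signum, frame: os.kill(child, signum))
       # Reap the child and propagate its exit status.
       _, status = os.waitpid(child, 0)
       return os.waitstatus_to_exitcode(status)

   if __name__ == "__main__":
       sys.exit(main())  # e.g. python init_shim.py airflow celery worker
   ```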
   
   2) Then the problem is most likely that you write the .pid file to a shared volume, which makes the pid file remain after the container is killed. This is very, very wrong. If you rely on restarting the container and your process has PID 1, you should never save the .pid file in a shared volume that can survive the container, because you will get exactly the problem you have. Your airflow webserver will always start as the init process with PID 1. So even if the old process has been killed, the mere fact of restarting creates a new process with ID 1, and airflow is really checking the PID file created by the previous "1" process against itself (which also runs with PID 1), so it will never start.
   
   
   This is very much against the container philosophy. The .pid file should always be stored in the ephemeral container volume, so that when your container is stopped, the .pid file is gone. Make sure that you do not keep the .pid file in a shared volume, especially if you run your airflow command as the entrypoint.
   
   In general, if you restart whole containers rather than processes, the .pid file should NEVER be stored in a shared volume. It should always be stored in the ephemeral container volume so that it gets automatically deleted when the whole container is killed.
   
   So I think you should really rethink the way the entrypoint works in your images, the way the .pid files get created and stored, and the way the restart of a failed container works. It seems all three points are custom-done by you, and they compound into the problem you experience. When you use the docker-compose approach, you need to realise how all of this works, how those elements interact, and how to make it production-robust.
   
   It seems you have chosen a pretty hard path to walk; going down the beaten Helm + Kubernetes path, without diverging too much from the approach we propose, would have solved most of this.


[GitHub] [airflow] stepanof commented on issue #24731: Celery Executor : After killing Redis or Airflow Worker Pod, queued Tasks not getting executed even after pod is up.

Posted by GitBox <gi...@apache.org>.
stepanof commented on issue #24731:
URL: https://github.com/apache/airflow/issues/24731#issuecomment-1318461607

   @potiuk Hello Jarek.
   I'm using a custom airflow image based on `apache/airflow:2.4.1-python3.8`.
   Recently I built HA clusters for the Postgres database and Redis. Both are used by the airflow cluster (1 webserver, 2 schedulers, 2 workers).
   I have faced a problem in the scheduler and worker at the moment the virtual IP of the Redis or Postgres cluster moves to another node: tasks get stuck in 'queued' or 'scheduled' status.
   I attach the logs of a worker which was stuck when the Redis master moved to another node.
   [airflow_logs_err.txt](https://github.com/apache/airflow/files/10030701/airflow_logs_err.txt)
   Restarting the airflow-worker solves the problem.
   
   To solve this problem I have added one more service on each airflow instance, called '[autoheal](https://hub.docker.com/r/willfarrell/autoheal/)'. It restarts a docker container when it becomes 'unhealthy'.
   We are using it in production, but it is a workaround. I think the airflow scheduler and worker should be able to react to such situations without any additional services.
   
   I am ready to help you debug this problem and find a solution; just tell me what I can do for the Airflow developers.


[GitHub] [airflow] lozuwa commented on issue #24731: Celery Executor : After killing Redis or Airflow Worker Pod, queued Tasks not getting executed even after pod is up.

Posted by GitBox <gi...@apache.org>.
lozuwa commented on issue #24731:
URL: https://github.com/apache/airflow/issues/24731#issuecomment-1244561170

   If it helps, this one has been an issue for me as well. I might take a deeper look; maybe I can help.
   
   https://github.com/apache/airflow/discussions/25651


[GitHub] [airflow] anu251989 commented on issue #24731: Celery Executor : After killing Redis or Airflow Worker Pod, queued Tasks not getting executed even after pod is up.

Posted by GitBox <gi...@apache.org>.
anu251989 commented on issue #24731:
URL: https://github.com/apache/airflow/issues/24731#issuecomment-1176123995

   @NaveenGokavarapu19 @potiuk, this issue is similar to the one I raised below:
   "Celery worker tasks in queued status when airflow-redis-master restarted" #24498


[GitHub] [airflow] vivek-zeta closed issue #24731: Celery Executor : After killing Redis or Airflow Worker Pod, queued Tasks not getting executed even after pod is up.

Posted by GitBox <gi...@apache.org>.
vivek-zeta closed issue #24731: Celery Executor : After killing Redis or Airflow Worker Pod, queued Tasks not getting executed even after pod is up.
URL: https://github.com/apache/airflow/issues/24731


[GitHub] [airflow] vivek-zeta commented on issue #24731: Celery Executor : After killing Redis or Airflow Worker Pod, queued Tasks not getting executed even after pod is up.

Posted by GitBox <gi...@apache.org>.
vivek-zeta commented on issue #24731:
URL: https://github.com/apache/airflow/issues/24731#issuecomment-1169772317

   @potiuk @jedcunningham 


[GitHub] [airflow] NaveenGokavarapu19 commented on issue #24731: Celery Executor : After killing Redis or Airflow Worker Pod, queued Tasks not getting executed even after pod is up.

Posted by GitBox <gi...@apache.org>.
NaveenGokavarapu19 commented on issue #24731:
URL: https://github.com/apache/airflow/issues/24731#issuecomment-1169851378

   Hi, I am interested in taking up this issue. Can I work on it?


[GitHub] [airflow] potiuk commented on issue #24731: Celery Executor : After killing Redis or Airflow Worker Pod, queued Tasks not getting executed even after pod is up.

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #24731:
URL: https://github.com/apache/airflow/issues/24731#issuecomment-1318516581

   Thanks for the diagnosis, but I think you applied one of the "good" solutions, and there is not much we can or will do in Airflow about it.
   
   I think what you did is the right approach (one of them), not a workaround. This is expected. Airflow has no support for an active/active setup for Redis or Postgres and expects to talk to one database server only. There is no way for airflow components to recover when there is an established connection and the IP address of the component it talks to changes in such a way that Airflow does not even know the other party has changed address. This is really a deployment issue; I think airflow should not really take such changes into account.
   
   Airflow is not a "critical/real-time" service that should react to and reconfigure its networking dynamically, and we have no intention of turning it into such a service. Developing such an 'auto-healing' service is far more costly, and unless someone comes up with an idea, creates an Airflow Improvement Proposal, and implements such auto-healing, this is not something that is going to happen. There are many consequences and complexities in implementing such services, and there is no need to do so for Airflow, because it is perfectly fine to restart and redeploy airflow components from time to time - far easier and less costly for development and maintenance.
   
   This task is put on the deployment; that's why, for example, in our Helm chart we have liveness probes and health checks, and auto-healing in K8S is done exactly the way you did it: when a service becomes unhealthy, you restart it. This is a perfectly OK and perfectly viable solution, especially for things like virtual IP changes, which happen infrequently. A hedged sketch of such a worker liveness check follows.
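   
   For reference, a worker liveness check can be as simple as pinging the worker's own Celery node - similar in spirit to what the Helm chart's probe does. The import path below is the Airflow 2.x location and moved to the Celery provider package in later versions, so treat this as a sketch, not the chart's exact probe:
   
   ```
   """Exit 0 if the local Celery worker answers a ping, 1 otherwise."""
   import socket
   import sys

   # Airflow 2.x import path; newer versions moved the Celery app to the
   # Celery provider package, so adjust accordingly.
   from airflow.executors.celery_executor import app

   # Default Celery worker node names look like "celery@<hostname>".
   replies = app.control.ping(destination=[f"celery@{socket.gethostname()}"], timeout=10)
   sys.exit(0 if replies else 1)
   ```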
   
   An even better solution for you would be to react to the IP-change event itself and restart the services immediately. This is the kind of thing that usually should and can be done at the deployment level: Airflow has no knowledge of such events and cannot react to them, but your deployment can. And should. This will help you recover much faster. A sketch of such a watcher is below.
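   
   As a deployment-level sketch (the hostname and the restart command are hypothetical, assuming a docker-compose setup like the one discussed in this thread):
   
   ```
   """Watch a broker endpoint's resolved address; restart the worker on change."""
   import socket
   import subprocess
   import time

   BROKER_HOST = "redis-master.internal"  # hypothetical virtual-IP-backed name
   previous_ip = None

   while True:
       try:
           current_ip = socket.gethostbyname(BROKER_HOST)
       except socket.gaierror:
           current_ip = None  # treat failed resolution as "no change yet"
       if previous_ip is not None and current_ip not in (None, previous_ip):
           # The virtual IP moved: restart the worker so it reconnects cleanly.
           subprocess.run(["docker-compose", "restart", "airflow-worker"], check=False)
       if current_ip is not None:
           previous_ip = current_ip
       time.sleep(30)
   ```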
   
   Another option, if you want to avoid such restarts, would be to avoid changing the virtual IP and use static IP addresses allocated to each component. Changing virtual IP addresses is usually not something that happens in an enterprise setup; it is safe to assume you can come up with an approach where the IP addresses are static. Even if you have some dynamically changing public IP addresses, you can usually have static private ones and configure your deployment to use them.
   


[GitHub] [airflow] potiuk commented on issue #24731: Celery Executor : After killing Redis or Airflow Worker Pod, queued Tasks not getting executed even after pod is up.

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #24731:
URL: https://github.com/apache/airflow/issues/24731#issuecomment-1322343146

   No idea how your liveness probe works. But generally, all software that manages other running software (i.e. a deployment manager like Kubernetes) follows the usual sequence of events:
   
   - check if the software is running and responding to some kind of liveness probe (see how the liveness probe is defined in our Helm chart, for example)
   - when the liveness probe fails for some time (usually several times in a row), declare the component unhealthy and attempt to stop it
   - usually that happens via SIGTERM and other "soft" signals that allow the component to shut itself down and clean up; if the software is able to shut down cleanly, it will remove the .pid file and the like (a sketch of such a clean-up handler follows this list)
   - when that does not succeed, escalate the signal (SIGTERM -> SIGHUP -> SIGKILL), giving the process time to actually react and clean up. SIGKILL cannot be handled; it shuts the process down immediately, and some things (like the .pid file) remain
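   
   A minimal sketch of the clean shutdown the "soft" signals rely on - the pidfile path is taken from the logs above, the rest is illustrative:
   
   ```
   """Write a pidfile at startup; remove it on SIGTERM so restarts stay clean."""
   import os
   import signal
   import sys

   PID_FILE = "/opt/airflow/airflow-worker.pid"  # path from the error above

   def handle_sigterm(signum, frame):
       try:
           os.unlink(PID_FILE)  # clean up so the next start does not trip over it
       except FileNotFoundError:
           pass
       sys.exit(0)

   with open(PID_FILE, "w") as f:
       f.write(str(os.getpid()))
   signal.signal(signal.SIGTERM, handle_sigterm)
   # ... the long-running work would go here ...
   ```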
   
   
   ONLY AFTER that SEQUENCE, knowing that your component's process is down, should the "restart" happen.
   
   If this is fulfilled, it does not matter that the worker .pid file was not deleted, because the process is not running any more (at worst it was SIGKILLed). When airflow starts the next time and the .pid file still exists, it will check whether the process specified in the .pid file is running; if not, it will delete the pid file and run. Only when the process in the .pid file is still running will it refuse to start.
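   
   That check boils down to something like this sketch (an assumed reconstruction, not Airflow's actual code). It also shows the failure mode from earlier in this thread: if the restarted process is itself PID 1 and the stale file contains `1`, the check finds a "live" process - itself - and refuses to start:
   
   ```
   """Stale-pidfile check: refuse to start only if the recorded PID is alive."""
   import os

   PID_FILE = "/opt/airflow/airflow-worker.pid"

   def pidfile_owner_alive(path: str) -> bool:
       try:
           with open(path) as f:
               pid = int(f.read().strip())
       except (FileNotFoundError, ValueError):
           return False
       try:
           os.kill(pid, 0)  # signal 0 probes for existence; nothing is delivered
       except ProcessLookupError:
           return False     # stale file: the old process is gone
       except PermissionError:
           return True      # some process with that PID exists (another user's)
       return True

   if pidfile_owner_alive(PID_FILE):
       raise SystemExit(f"Pidfile ({PID_FILE}) already exists. Seems we're already running?")
   if os.path.exists(PID_FILE):
       os.unlink(PID_FILE)  # stale leftover from an unclean shutdown
   # ... normal startup continues here ...
   ```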
   
   And this is general advice: this is how a .pid file works for any process, nothing Airflow-specific. All software should be managed this way.
   
   I have no idea how docker-compose handles killing, but it should do the same, and you should configure docker-compose in exactly this way (this is what, for example, Kubernetes does). But you should look at the internals of docker-compose's behaviour when restarting airflow in such a case. I honestly don't know how to do it with docker-compose. Maybe it is possible, maybe not, maybe it requires some tricks to make it work.
   
   I personally think of docker-compose as a very poor deployment that lacks a lot of the features and stability that a "real" production deployment like Kubernetes has. In my opinion it lacks some of the automation and deployment features - precisely the kind you observe when you want to do "real production stuff" with the software. Maybe it is because I do not know it well, maybe because it is hard, maybe because it is impossible. But I believe it is a very poor cousin of K8S when it comes to running "real/serious" production deployments. When you choose it, you take on the responsibility, as deployment manager, of sometimes doing manual recovery where docker-compose will not do it for you. It's one of the responsibilities you take on your shoulders.
   
   And we as a community decided not to spend our time on making a "production-ready" docker-compose deployment, because we do not know what advice to give, and those who decide to go down this path have to solve these problems on their own, in the way that is best for them.
   
   In contrast, the Helm chart, which we maintain, is able to solve a lot of those problems (including liveness probes, restarts, etc.). It is much closer to something that runs "out of the box": once you have resources sorted out, a lot of the management is handled for you by the helm/kubernetes combo we prepared.
   
   I am afraid you made the choice to use docker-compose. We warned that the one we have is not suitable for production (it's a quick-start); it requires a lot of work to make it so, and you need to become a docker-compose expert to solve these problems.
   
   Also, you can take a look here, where we explain what kinds of skills you need to have:
   
   https://airflow.apache.org/docs/apache-airflow/stable/installation/index.html#using-production-docker-images
   
   If you want to stick with docker-compose, good luck; you will hit a lot of things like this. If you find solutions, you can contribute them back to our docs as "good practices" (but we will never turn them into "this is how you run a docker-compose deployment", as that is impossible to make into a general set of advice - at most it might be "if you get into this trouble, maybe this solution will work").
   

