Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2022/10/13 12:33:16 UTC
[GitHub] [airflow] AutomationDev85 opened a new issue, #27032: Worker sometimes does not reconnect to redis/celery queue after crash
AutomationDev85 opened a new issue, #27032:
URL: https://github.com/apache/airflow/issues/27032
### Apache Airflow version
2.4.1
### What happened
We are running an Airflow deployment and hit an issue where the redis POD died and some tasks then got stuck in the queued state. Only after killing the worker POD were the tasks consumed by the worker again. I wanted to analyse this in more detail and saw that this behavior only occurs sometimes!
To me it looks like the worker sometimes does not detect that the connection to the redis POD broke:
1) If I do not see any error in the log file, the worker does NOT reconnect once redis is back!
2) If I see this error in the log of the worker, it is WORKING and the worker automatically reconnects:
[2022-10-13 06:29:55,967: WARNING/MainProcess] consumer: Connection to broker lost. Trying to re-establish the connection...
Traceback (most recent call last):
File "/home/airflow/.local/lib/python3.8/site-packages/celery/worker/consumer/consumer.py", line 332, in start
blueprint.start(self)
File "/home/airflow/.local/lib/python3.8/site-packages/celery/bootsteps.py", line 116, in start
step.start(parent)
File "/home/airflow/.local/lib/python3.8/site-packages/celery/worker/consumer/consumer.py", line 628, in start
c.loop(*c.loop_args())
File "/home/airflow/.local/lib/python3.8/site-packages/celery/worker/loops.py", line 97, in asynloop
next(loop)
File "/home/airflow/.local/lib/python3.8/site-packages/kombu/asynchronous/hub.py", line 362, in create_loop
cb(*cbargs)
File "/home/airflow/.local/lib/python3.8/site-packages/kombu/transport/redis.py", line 1326, in on_readable
self.cycle.on_readable(fileno)
File "/home/airflow/.local/lib/python3.8/site-packages/kombu/transport/redis.py", line 562, in on_readable
chan.handlers[type]()
File "/home/airflow/.local/lib/python3.8/site-packages/kombu/transport/redis.py", line 906, in _receive
ret.append(self._receive_one(c))
File "/home/airflow/.local/lib/python3.8/site-packages/kombu/transport/redis.py", line 916, in _receive_one
response = c.parse_response()
File "/home/airflow/.local/lib/python3.8/site-packages/redis/client.py", line 3505, in parse_response
response = self._execute(conn, conn.read_response)
File "/home/airflow/.local/lib/python3.8/site-packages/redis/client.py", line 3479, in _execute
return command(*args, **kwargs)
File "/home/airflow/.local/lib/python3.8/site-packages/redis/connection.py", line 739, in read_response
response = self._parser.read_response()
File "/home/airflow/.local/lib/python3.8/site-packages/redis/connection.py", line 324, in read_response
raw = self._buffer.readline()
File "/home/airflow/.local/lib/python3.8/site-packages/redis/connection.py", line 256, in readline
self._read_from_socket()
File "/home/airflow/.local/lib/python3.8/site-packages/redis/connection.py", line 201, in _read_from_socket
raise ConnectionError(SERVER_CLOSED_CONNECTION_ERROR)
redis.exceptions.ConnectionError: Connection closed by server.
### What you think should happen instead
Expected behavior is that the worker reconnects to redis automatically and starts consuming queued tasks.
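As a possible mitigation (a hedged sketch, not a verified fix for this issue), detection of a dead broker connection can often be tightened via Celery's broker transport options, so blocked reads fail with an error instead of hanging silently. The option names below are standard kombu/redis-py settings; the concrete values are illustrative assumptions:

```python
# Hedged sketch: tighter Redis transport options so a dead broker
# connection is detected via timeouts/keepalive instead of a blocked
# read hanging forever. Values here are illustrative assumptions.
broker_transport_options = {
    "socket_keepalive": True,     # enable TCP keepalive on broker sockets
    "socket_timeout": 30,         # seconds before a blocked read errors out
    "retry_on_timeout": True,     # retry commands that hit the timeout
    "health_check_interval": 25,  # redis-py pings the server periodically
}
```

In Airflow such settings would typically be passed through the Celery configuration; whether they cover the silent-hang case reported here is untested.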
### How to reproduce
1) Create a DAG with 2 tasks that run one after the other.
2) Start the DAG and, while the first task is executing, force-kill the redis POD (kubectl delete pod redis-0 -n ??? --grace-period=0 --force) to simulate a crashing POD.
3) Check whether the worker reconnects automatically and executes the next task, or whether the task gets stuck in the queued state and the worker must be killed to fix it.
### Operating System
AKSUbuntu-1804gen2
### Versions of Apache Airflow Providers
_No response_
### Deployment
Official Apache Airflow Helm Chart
### Deployment details
Using an AKS cluster in Azure to host Airflow.
### Anything else
_No response_
### Are you willing to submit PR?
- [ ] Yes I am willing to submit a PR!
### Code of Conduct
- [X] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [airflow] jens-scheffler-bosch commented on issue #27032: Worker sometimes does not reconnect to redis/celery queue after crash
Posted by GitBox <gi...@apache.org>.
jens-scheffler-bosch commented on issue #27032:
URL: https://github.com/apache/airflow/issues/27032#issuecomment-1288431435
> > As I see that the new Helm Chart 1.7.0 was released, does somebody know or expect if the new liveness probe on Celery Worker will fix this problem implicitly? (#25561)
>
> I think the only way to check is to try it (maybe you can try it @jens-scheffler-bosch - that would help us to close the issue).
We are on it - we just deployed via Helm chart 1.7.0 - but as this problem only appeared randomly, it is hard to predict whether it is resolved. I'd be okay with closing under the (positive) assumption that it is fixed, and we can come back if we see another problem.
--
[GitHub] [airflow] potiuk commented on issue #27032: Worker sometimes does not reconnect to redis/celery queue after crash
Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #27032:
URL: https://github.com/apache/airflow/issues/27032#issuecomment-1291388879
Maybe just keep it running for a while and let us know after ~a few days of running (depending on the previously observed frequency) - if you do not see it after 2x the 'average' observation time, we might assume it works :)
--
[GitHub] [airflow] o-nikolas commented on issue #27032: Worker sometimes does not reconnect to redis/celery queue after crash
Posted by GitBox <gi...@apache.org>.
o-nikolas commented on issue #27032:
URL: https://github.com/apache/airflow/issues/27032#issuecomment-1278142197
Thanks for filing this issue @AutomationDev85! I see that you're willing to submit a PR, let me know if you would like this issue assigned to you :smile:
As a follow-up: are you sure this is a bug in Airflow or is this actually a Celery issue?
--
[GitHub] [airflow] potiuk commented on issue #27032: Worker sometimes does not reconnect to redis/celery queue after crash
Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #27032:
URL: https://github.com/apache/airflow/issues/27032#issuecomment-1306345997
Closing, as 2 weeks have passed. @jens-scheffler-bosch - if you hit any issue, you can still comment here.
--
[GitHub] [airflow] boring-cyborg[bot] commented on issue #27032: Worker sometimes does not reconnect to redis/celery queue after crash
Posted by GitBox <gi...@apache.org>.
boring-cyborg[bot] commented on issue #27032:
URL: https://github.com/apache/airflow/issues/27032#issuecomment-1277537973
Thanks for opening your first issue here! Be sure to follow the issue template!
--
[GitHub] [airflow] jens-scheffler-bosch commented on issue #27032: Worker sometimes does not reconnect to redis/celery queue after crash
Posted by GitBox <gi...@apache.org>.
jens-scheffler-bosch commented on issue #27032:
URL: https://github.com/apache/airflow/issues/27032#issuecomment-1279499452
As I see that the new Helm Chart 1.7.0 was released, does somebody know or expect if the new liveness probe on Celery Worker will fix this problem implicitly? (https://github.com/apache/airflow/pull/25561)
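For context, the worker liveness probe added by chart 1.7.0 is of roughly this shape - a hedged illustration based on the description in #25561, not the chart's exact manifest; the actual probe command and timings may differ:

```yaml
# Hedged illustration of a Celery-worker liveness probe: ping the worker
# via Celery's control channel, which also exercises the broker connection.
livenessProbe:
  exec:
    command:
      - sh
      - -c
      - celery --app airflow.executors.celery_executor.app inspect ping -d "celery@$(hostname)"
  initialDelaySeconds: 10
  periodSeconds: 60
  timeoutSeconds: 20
```

If the ping path goes through the broker, a worker stuck on a dead redis connection would fail the probe and be restarted, which is why such a probe could mask (or mitigate) the reconnect bug described above.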
--
[GitHub] [airflow] potiuk commented on issue #27032: Worker sometimes does not reconnect to redis/celery queue after crash
Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #27032:
URL: https://github.com/apache/airflow/issues/27032#issuecomment-1288222139
> As I see that the new Helm Chart 1.7.0 was released, does somebody know or expect if the new liveness probe on Celery Worker will fix this problem implicitly? (#25561)
I think the only way to check is to try it.
--
[GitHub] [airflow] potiuk closed issue #27032: Worker sometimes does not reconnect to redis/celery queue after crash
Posted by GitBox <gi...@apache.org>.
potiuk closed issue #27032: Worker sometimes does not reconnect to redis/celery queue after crash
URL: https://github.com/apache/airflow/issues/27032
--